Nvidia reserves NVLink for datacenter SKUs. The consumer RTX 5090 and workstation RTX 6000 Pro cards on our dedicated hosting do not have NVLink. Model parallelism still works over PCIe, and the performance characteristics are well understood and usually acceptable.
Why NVLink Matters
NVLink delivers ~900 GB/s between paired cards. Megatron-style tensor parallelism runs one all-reduce after the attention block and another after the MLP, so every transformer layer hits the interconnect twice. With NVLink, those all-reduces barely register. With PCIe (32-64 GB/s), they are a visible fraction of each forward pass.
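To put rough numbers on that, here is a back-of-envelope sketch. The shapes are assumptions for illustration (a Llama-3-70B-like hidden size of 8192, a 4096-token prefill, fp16 activations, two ring all-reduces per layer on two GPUs), not measurements from our hardware:

```python
# Back-of-envelope: time spent in tensor-parallel all-reduces during one
# prefill forward pass, at NVLink vs PCIe speeds. All shapes are assumptions.

def allreduce_seconds(n_gpus, nbytes, link_gbs):
    # A ring all-reduce moves 2*(n-1)/n of the buffer through each link.
    return 2 * (n_gpus - 1) / n_gpus * nbytes / (link_gbs * 1e9)

hidden, seq, layers = 8192, 4096, 80       # Llama-3-70B-like prefill
nbytes = hidden * seq * 2                  # fp16 activations, ~64 MB
for name, bw in [("NVLink", 900), ("PCIe Gen 5 x16", 64)]:
    per_layer = 2 * allreduce_seconds(2, nbytes, bw)  # two all-reduces/layer
    print(f"{name}: {per_layer * 1e3:.2f} ms/layer, "
          f"{per_layer * layers * 1e3:.0f} ms of comms per forward pass")
```

At batch-1 decode the payloads shrink to kilobytes per step, where link latency matters more than bandwidth, which is why the batch-1 gap in the table below is smaller than the raw bandwidth ratio would suggest.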
For datacenter training, NVLink saves hours. For consumer-card inference, PCIe is adequate.
PCIe Alternatives
PCIe Gen 4 x16: ~32 GB/s per direction. Gen 5 x16: ~64 GB/s. With both cards direct-to-CPU at x16 Gen 5, NCCL all-reduce hits 40-50 GB/s of aggregate bus bandwidth. That is an order of magnitude below NVLink, but still fast enough for interactive inference.
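You can check what your own links deliver with a small torch.distributed probe like the one below; the script name, message size, and iteration counts are arbitrary choices:

```python
# Minimal NCCL all-reduce bandwidth probe for a two-GPU box.
# Launch with: torchrun --nproc_per_node=2 allreduce_probe.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    x = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32

    for _ in range(5):                               # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    stop.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(stop) / 1000 / iters   # elapsed_time is in ms

    # NCCL "bus bandwidth": a ring all-reduce moves 2*(n-1)/n of the buffer.
    n = dist.get_world_size()
    bus_gbs = 2 * (n - 1) / n * x.numel() * 4 / secs / 1e9
    if dist.get_rank() == 0:
        print(f"all-reduce bus bandwidth: {bus_gbs:.1f} GB/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```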
Inference Numbers
Llama 3 70B INT4 tensor-parallel on two 5090s:
| Setup | Batch 1 (tokens/s) | Batch 16 aggregate (tokens/s) |
|---|---|---|
| Hypothetical NVLink | ~35 | ~450 |
| Actual PCIe Gen 5 x16 | ~28 | ~420 |
| PCIe Gen 4 x16 | ~24 | ~380 |
| PCIe Gen 4 x4 (pinched) | ~15 | ~200 |
Full x16 at Gen 4 or 5 is the practical target. Anything less starves the link.
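Nothing PCIe-specific is needed to get these numbers; it is an ordinary tensor-parallel launch. A minimal vLLM sketch, assuming an AWQ INT4 checkpoint of Llama 3 70B (the model ID below is a placeholder; substitute the quantized checkpoint you actually run):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # placeholder INT4 checkpoint
    quantization="awq",      # must match the checkpoint's scheme (or "gptq")
    tensor_parallel_size=2,  # shard every layer across both 5090s
)
outputs = llm.generate(
    ["Why does tensor parallelism need full x16 PCIe links?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The tensor-parallel all-reduces run over the same PCIe links measured above, so the bandwidth figures apply directly.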
Full x16 Multi-GPU Chassis
Every GPU on our multi-card servers gets full bandwidth – no quietly pinched lanes.
Workarounds
For training workloads that genuinely suffer without NVLink, consider these:
- Use one bigger GPU instead of two smaller ones (avoids the problem).
- Switch from tensor parallel to ZeRO-3 / FSDP, whose collectives overlap with compute instead of blocking every layer the way tensor parallel's all-reduce does.
- Use gradient accumulation to reduce how often the gradient all-reduce fires (a minimal sketch follows this list).
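A minimal sketch of the gradient-accumulation pattern, assuming a DDP-style data-parallel setup; the model, batch shape, and accumulation factor are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()          # stand-in for your network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
K = 8                                         # micro-batches per optimizer step

opt.zero_grad()
for step in range(64):
    x = torch.randn(16, 1024, device="cuda")  # stand-in micro-batch
    loss = model(x).pow(2).mean() / K         # scale so grads average over K
    # Under DDP, wrap the first K-1 backwards in model.no_sync() so the
    # gradient all-reduce only fires on the K-th micro-batch.
    loss.backward()
    if (step + 1) % K == 0:
        opt.step()
        opt.zero_grad()
```

With no_sync(), the interconnect sees one all-reduce per K micro-batches instead of one per batch, which is exactly the traffic reduction this workaround is after.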
For inference, the PCIe setup is almost always fine. See the NCCL tuning and PCIe lanes guides.