Multi-GPU dedicated servers need a way to move tensors between cards, and the options vary by platform. Consumer Nvidia cards on our dedicated hosting do not have NVLink in 2026; Nvidia dropped it from consumer SKUs after the RTX 30 series. The practical interconnect is PCIe. Here is how that plays out.
Sections
- The three interconnect paths
- PCIe peer-to-peer in detail
- Performance implications
- Which matters for your workload
The Three Paths
NVLink / NVSwitch: Datacenter GPUs (H100, A100). Not available on consumer 5090/6000 Pro in 2026.
PCIe peer-to-peer: Direct GPU-to-GPU transfers over the PCIe bus, bypassing the CPU. Works on most modern dedicated servers when BIOS ACS is configured appropriately.
CPU-staged transfers: Tensor goes GPU -> CPU RAM -> GPU. Slowest path. Used when peer-to-peer is blocked by IOMMU or ACS settings.
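A quick way to see which path a given GPU pair will take is CUDA's peer-access query, exposed in PyTorch. A minimal probe, assuming PyTorch with CUDA and at least two visible GPUs:

```python
import torch

# Check every GPU pair: True means direct PCIe peer-to-peer is available,
# False means transfers between that pair fall back to the CPU-staged path.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            p2p = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: {'peer-to-peer' if p2p else 'CPU-staged'}")
```

`nvidia-smi topo -m` gives the same picture from the driver's side, showing how each pair is connected through the PCIe topology.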
PCIe Peer-to-Peer
A Gen 4 x16 PCIe link provides ~32 GB/s theoretical per direction (16 GT/s × 16 lanes with 128b/130b encoding ≈ 31.5 GB/s), roughly 24-28 GB/s in practice. Gen 5 doubles this to ~64 GB/s. For two cards in a dedicated server both running at x16 Gen 4, NCCL all-reduce hits 40-50 GB/s aggregate (summing both directions), which is good enough for tensor-parallel inference at interactive speeds.
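To verify what your link actually delivers, time repeated device-to-device copies. A minimal sketch (the function name and buffer sizes are ours; whether the copy goes direct or through host RAM depends on the driver and ACS configuration):

```python
import time
import torch

def copy_bandwidth_gbs(src: int = 0, dst: int = 1,
                       size_mb: int = 256, iters: int = 20) -> float:
    """Sustained GPU-to-GPU copy bandwidth between two devices."""
    src_buf = torch.empty(size_mb * 2**20, dtype=torch.uint8, device=f"cuda:{src}")
    dst_buf = torch.empty(size_mb * 2**20, dtype=torch.uint8, device=f"cuda:{dst}")
    dst_buf.copy_(src_buf)            # warm-up: maps peer memory if P2P is available
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    start = time.perf_counter()
    for _ in range(iters):
        dst_buf.copy_(src_buf)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - start
    return size_mb * iters / 1024 / elapsed   # GiB/s; close enough to GB/s here

if __name__ == "__main__":
    print(f"GPU 0 -> GPU 1: {copy_bandwidth_gbs():.1f} GB/s")
```

On a Gen 4 x16 pair, a result in the 24-28 GB/s range indicates a working direct path; single-digit numbers usually mean the copies are being staged through CPU RAM.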
PCIe P2P is not guaranteed. The PCIe root complex must allow direct peer transfers: with ACS (Access Control Services) enabled, peer traffic is redirected upstream through the root complex and IOMMU, which blocks or badly slows the direct path. Getting P2P usually means disabling ACS on the relevant ports, or an ACS override, in BIOS. Our dedicated servers ship with these settings configured for GPU workloads by default.
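NCCL logs which transport it negotiated. Running a small all-reduce with `NCCL_DEBUG=INFO` shows whether it picked P2P or fell back to shared memory, and `NCCL_P2P_DISABLE=1` forces the fallback so you can compare timings. A two-GPU sketch (the port number is arbitrary):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"            # any free local port
    os.environ.setdefault("NCCL_DEBUG", "INFO")    # logs the chosen transport
    # os.environ["NCCL_P2P_DISABLE"] = "1"         # uncomment to force the staged path
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.ones(64 * 2**20, device=f"cuda:{rank}")  # 64 Mi fp32 elements = 256 MB
    dist.all_reduce(x)                                 # sums the tensor across GPUs
    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```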
Performance
| Workload | NVLink (900 GB/s) | PCIe Gen 5 x16 (~64 GB/s) | PCIe Gen 4 x16 (~32 GB/s) |
|---|---|---|---|
| Tensor-parallel inference, 70B model | Baseline | ~10-20% slower | ~25-35% slower |
| FSDP training, 13B model | Baseline | ~30% slower | ~50% slower |
| Data parallel (no tensor sync) | No benefit | No benefit | No benefit |
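Where do these gaps come from? Sync time scales with all-reduce volume over link bandwidth. The standard ring all-reduce estimate is that each GPU sends and receives roughly 2(N-1)/N times the buffer size. A back-of-envelope helper (the formula is standard; the numbers below are illustrative, not measurements):

```python
def allreduce_ms(buffer_mb: float, gpus: int, link_gb_s: float) -> float:
    """Ring all-reduce time estimate: each GPU moves 2*(N-1)/N of the buffer."""
    traffic_gb = (buffer_mb / 1024) * 2 * (gpus - 1) / gpus
    return traffic_gb / link_gb_s * 1000

# A 256 MB gradient bucket on 2 GPUs over Gen 4 at ~25 GB/s effective:
print(f"{allreduce_ms(256, 2, 25):.0f} ms per bucket")   # ~10 ms
```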
Which Matters
If you are running data parallel (independent replicas), interconnect does not matter: there is no tensor sync between cards. If you are running tensor-parallel inference, PCIe Gen 4 x16 is adequate for 2-GPU servers; Gen 5 helps at 4+ GPUs, where every all-reduce crosses more links. If you are training, interconnect matters most; consider whether a single larger GPU avoids the problem entirely. See the PCIe lanes guide and NCCL tuning.