
Scaling vLLM Across Two GPUs – What Actually Changes

Moving from single-GPU vLLM to two-GPU tensor parallel changes throughput, latency, memory layout, and a few knobs you would not expect.

Most vLLM users start on one GPU. The jump to two is not just doubling the box – it changes how requests flow through the engine and which parameters actually affect throughput. On our dedicated hosting we have taken dozens of deployments through this transition. Here is what to expect.


Why You Scale

Two reasons. First, the model no longer fits on one GPU at the precision or context length you need. Second, you are throughput-limited on one GPU and need more aggregate capacity. The second reason is rarer than people think – vLLM's continuous batching extracts a lot from a single card before throughput plateaus.

Config Changes

Flip --tensor-parallel-size from 1 to 2. That is the headline change. Two others matter in practice:

--tensor-parallel-size 2
--gpu-memory-utilization 0.90
# leave --enforce-eager unset (the default) so CUDA graphs stay enabled

Memory utilisation can stay high because the KV cache is now split across both cards. Enforce-eager should stay off – CUDA graphs help more on multi-GPU, where every forward pass pays kernel launch overhead on two cards instead of one.
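Put together, a two-GPU launch looks something like the sketch below. The model name and port are placeholders for whatever you serve, and it assumes a recent vLLM build with the vllm serve entrypoint on an otherwise idle pair of cards:

# Sketch of a two-GPU launch; model name and port are placeholders.
CUDA_VISIBLE_DEVICES=0,1 vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --port 8000
# --enforce-eager is deliberately left off so CUDA graphs stay enabled.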

Latency

Single-request latency gets slightly worse. Every forward pass now includes an all-reduce across PCIe. Time-to-first-token on Llama 3 70B INT4 goes from ~650 ms on a single 6000 Pro to ~800 ms on dual 5090s. Inter-token latency goes from ~28 ms to ~35 ms per token. Not huge, but measurable.
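To sanity-check those numbers on your own deployment, one crude but dependency-free approach is to time completions against vLLM's OpenAI-compatible endpoint: a one-token request approximates time-to-first-token, and the gap to a longer request divided by the extra tokens approximates inter-token latency. The endpoint and model name below are placeholders, and if the model stops early on the longer request the second number undercounts:

# Rough TTFT: wall-clock time for a single generated token.
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "Hello", "max_tokens": 1}' \
  -o /dev/null

# Rough inter-token latency: (time for 129 tokens minus time for 1 token) / 128.
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "Hello", "max_tokens": 129}' \
  -o /dev/null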

Throughput

This is where two GPUs pay back. At 64 concurrent requests, aggregate throughput roughly doubles versus a single card that can fit the same model. If you have room for the model on one card and just want to go faster, two cards running data-parallel (separate vLLM instances) is usually a better architecture than tensor parallel – no interconnect tax. See data vs tensor parallel in vLLM.
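For completeness, the data-parallel layout is just two independent vLLM instances pinned to separate GPUs and ports, with whatever load balancer or client-side logic you already have spreading requests between them. A minimal sketch, with the model name and ports as placeholders:

# Two independent instances, one per GPU – no tensor parallel, no all-reduce.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8001 &
# Round-robin requests across ports 8000 and 8001 from your load balancer.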


Gotchas

Unusual model shapes can trip tensor parallel, most commonly when the model's attention head count does not divide evenly by the TP size. Llama 3 works fine at TP=2. Some fine-tuned derivatives with unusual configs do not – vLLM will refuse to start with a clear error. Read the error, check config.json, and if the head counts do not divide cleanly you need to pad the model or pick a different TP size.
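A quick pre-flight version of that check, assuming the model is already downloaded locally and its config.json uses the standard Hugging Face field names (num_attention_heads, num_key_value_heads); the path is a placeholder:

TP=2
CFG=/path/to/model/config.json   # placeholder path
heads=$(jq '.num_attention_heads' "$CFG")
kv_heads=$(jq '.num_key_value_heads // .num_attention_heads' "$CFG")
if (( heads % TP == 0 && kv_heads % TP == 0 )); then
  echo "OK: $heads heads / $kv_heads KV heads divide evenly by TP=$TP"
else
  echo "TP=$TP does not divide $heads heads / $kv_heads KV heads cleanly"
fi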

Quantised models sometimes need specific quantisation flags that single-GPU users never think about. AWQ works well at TP=2 in 2026. GPTQ used to have issues at TP>1; most are resolved. GGUF is generally single-GPU only in vLLM – use llama.cpp instead for GGUF multi-GPU.
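For AWQ at TP=2, the only addition over the earlier launch is the quantisation flag. The checkpoint name below is a placeholder, and recent vLLM builds can usually infer the quantisation method from the checkpoint's own config, so the explicit flag is mostly belt and braces:

# Placeholder AWQ checkpoint; substitute the repo or local path you actually serve.
vllm serve your-org/Llama-3-70B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --gpu-memory-utilization 0.90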

See NCCL tuning and dual 5090 Llama 70B deployment for tuning-specific detail.


