Most vLLM users start on one GPU. The jump to two is not just doubling the box – it changes how requests flow through the engine and which parameters actually affect throughput. On our dedicated hosting we have taken dozens of deployments through this transition. Here is what to expect.
Why You Scale
Two reasons. First, the model no longer fits on one GPU at the precision or context length you need. Second, you are throughput-limited on one GPU and need more aggregate capacity. The second reason is rarer than people think – vLLM's continuous batching extracts a lot from a single card before throughput plateaus.
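A rough back-of-envelope check for the first case – the bytes-per-parameter figures and headroom fraction below are approximations I am assuming for illustration, not vLLM internals:

```python
def fits_on_one_gpu(params_b: float, bytes_per_param: float,
                    vram_gb: float, headroom_frac: float = 0.2) -> bool:
    """Rough check: do the weights fit with margin left for KV cache,
    activations, and CUDA context?

    params_b: parameter count in billions.
    bytes_per_param: ~2.0 for FP16, ~0.6 for INT4 with group scales
    (approximate figures assumed for this sketch).
    """
    weights_gb = params_b * bytes_per_param  # 1e9 params x bytes ~ GB
    return weights_gb <= vram_gb * (1 - headroom_frac)

# Llama 3 70B at INT4 is ~42 GB of weights:
print(fits_on_one_gpu(70, 0.6, 32))  # one 32 GB 5090 -> False
print(fits_on_one_gpu(70, 0.6, 96))  # one 96 GB card -> True
```

If the INT4 weights alone blow past a single card's headroom, you are in the first scenario and tensor parallel is not optional.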
Config Changes
Flip --tensor-parallel-size from 1 to 2. That is the headline change. Two others matter in practice:
--tensor-parallel-size 2
--gpu-memory-utilization 0.90
Memory utilisation can stay high because the KV cache is now split across cards. Leave --enforce-eager unset – it is a boolean switch that takes no value, and leaving it off keeps CUDA graphs enabled, which helps more on multi-GPU where kernel launch overhead is doubled.
Latency
Single-request latency gets slightly worse. Every forward pass now includes an all-reduce across PCIe. Time-to-first-token on Llama 3 70B INT4 goes from ~650 ms on a single 6000 Pro to ~800 ms on dual 5090s. Inter-token latency goes from ~28 ms to ~35 ms per token. Not huge, but measurable.
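To see what those per-token deltas mean end to end, here is the standard decomposition of request latency applied to the figures above (the 256-token output length is an assumed example):

```python
def request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """End-to-end generation time: time-to-first-token plus
    inter-token latency for each remaining token."""
    return ttft_ms + itl_ms * (output_tokens - 1)

# Figures from the measurements above, for a 256-token completion:
single = request_latency_ms(650, 28, 256)  # single 6000 Pro
dual = request_latency_ms(800, 35, 256)    # dual 5090s at TP=2
print(f"{single:.0f} ms vs {dual:.0f} ms")  # 7790 ms vs 9725 ms
```

A ~7 ms inter-token gap compounds to roughly two extra seconds over a long completion – still "slightly worse", but worth budgeting for in latency SLOs.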
Throughput
This is where two GPUs pay back. At 64 concurrent requests, aggregate throughput roughly doubles versus a single card that can fit the same model. If the model fits on one card and you just want to go faster, two cards running data-parallel (separate vLLM instances) is usually a better architecture than tensor parallel – no interconnect tax. See data vs tensor parallel in vLLM.
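To make the data-parallel option concrete, a minimal round-robin dispatcher over two independent instances – the ports and launch commands in the comments are illustrative assumptions, not something vLLM provides for you:

```python
from itertools import cycle

# Two independent vLLM instances, one per GPU (assumed setup):
#   CUDA_VISIBLE_DEVICES=0 vllm serve <model> --port 8000
#   CUDA_VISIBLE_DEVICES=1 vllm serve <model> --port 8001
backends = cycle(["http://localhost:8000", "http://localhost:8001"])

def next_backend() -> str:
    """Alternate requests between the two instances."""
    return next(backends)

print([next_backend() for _ in range(4)])
# alternates between port 8000 and port 8001
```

In production you would put a real load balancer (nginx, or a router aware of prefix-cache locality) in front instead, but the principle is the same: each request touches exactly one GPU, so there is no all-reduce on the critical path.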
Gotchas
Awkward model dimensions can trip tensor parallel: the attention head count (and the vocabulary size, for sharded embeddings) must divide evenly by the TP size. Llama 3 works fine at TP=2. Some fine-tuned derivatives with unusual config do not – vLLM will refuse to start with a clear error. Read the error, check config.json, and if the dimensions do not divide cleanly you need to pad or pick a different TP size.
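You can run this check before ever launching the server. A small sketch over the relevant config.json fields – the fields inspected are my assumption about what commonly needs to shard, not an exhaustive list of vLLM's checks:

```python
def check_tp_divisibility(config: dict, tp: int) -> list[str]:
    """Return the config fields that do not shard evenly at this TP size."""
    problems = []
    for field in ("num_attention_heads", "num_key_value_heads", "vocab_size"):
        value = config.get(field)
        if value is not None and value % tp != 0:
            problems.append(f"{field}={value} not divisible by tp={tp}")
    return problems

# Values copied from Llama 3 70B's config.json:
cfg = {"num_attention_heads": 64, "num_key_value_heads": 8, "vocab_size": 128256}
print(check_tp_divisibility(cfg, 2))  # [] -> clean at TP=2
print(check_tp_divisibility({"num_attention_heads": 21}, 2))  # flags the odd head count
```

An empty list is no guarantee the model loads, but a non-empty one predicts the startup error before you burn a deploy cycle.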
Quantised models sometimes need specific quantisation flags that single-GPU users never think about. AWQ works well at TP=2 in 2026. GPTQ used to have issues at TP>1; most are resolved. GGUF is generally single-GPU only in vLLM – use llama.cpp instead for GGUF multi-GPU.
See NCCL tuning and dual 5090 Llama 70B deployment for tuning-specific detail.