On a two-GPU dedicated GPU server, there are two ways to use both cards with vLLM. Data parallel means running two separate vLLM processes, each serving the full model on one card. Tensor parallel means one vLLM process splitting the model across both cards. They sound similar. The performance characteristics are very different.
Data Parallel
Start two vLLM instances on ports 8001 and 8002, one bound to GPU 0 with CUDA_VISIBLE_DEVICES=0 and one bound to GPU 1 with CUDA_VISIBLE_DEVICES=1. Put nginx or HAProxy in front as a round-robin load balancer. Each request lands on one card, so you get close to 2x throughput with no interconnect overhead. The downside: each card must hold the full model, so per-card VRAM caps the model size you can serve.
Tensor Parallel
One vLLM instance with --tensor-parallel-size 2 splits the model weights across both cards. You can serve models that are too large for one card, but every forward pass must synchronise the GPUs over PCIe. At batch 1 it is slightly slower per token than data parallel; at high concurrency the gap narrows but rarely closes when data parallel is feasible.
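As a sketch, using the same Llama 3.1 8B model that appears later in this article (the port number here is an arbitrary choice):

```shell
# One process owns both GPUs; vLLM shards the weights across them
# and handles the inter-GPU all-reduce internally.
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000
```

No load balancer is needed: clients see a single OpenAI-compatible endpoint on one port.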
When Each Wins
| Condition | Winner |
|---|---|
| Model fits on one card | Data parallel |
| Model does not fit on one card | Tensor parallel (the only option) |
| Latency-sensitive chat | Data parallel if possible |
| High-batch throughput | Close at scale; data parallel usually edges ahead |
| Mixed concurrency | Data parallel; better scheduling isolation |
| Single endpoint required | Tensor parallel (no LB) |
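The first two rows of the table come down to arithmetic: weights take roughly parameter-count times bytes-per-parameter, plus headroom for KV cache and activations. A rough sketch (the 20% headroom figure is an assumption for illustration, not a vLLM constant):

```python
def fits_on_one_card(params_b: float, bytes_per_param: float,
                     vram_gb: float, headroom: float = 0.20) -> bool:
    """Rough check: model weights plus headroom for KV cache/activations."""
    weights_gb = params_b * bytes_per_param  # 1e9 params * bytes/param ~ GB
    return weights_gb * (1 + headroom) <= vram_gb

# An 8B model in FP8 (1 byte/param) on a 16 GB card:
print(fits_on_one_card(8, 1.0, 16))   # True  -> data parallel is on the table
# The same model in FP16 (2 bytes/param):
print(fits_on_one_card(8, 2.0, 16))   # False -> tensor parallel (or quantise)
```

In practice, leave more headroom the longer your context lengths, since KV cache grows with both concurrency and sequence length.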
The Practical Stack
For a data parallel setup on two 5080s serving Llama 3 8B:
```shell
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8001 &
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8002 &
```
Front with nginx:
```nginx
upstream vllm {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}
```
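The upstream block alone is not a complete config; a minimal sketch of the server block that proxies to it might look like the following (the listen port and timeout are assumptions, and proxy_buffering is disabled so streamed tokens are not held back):

```nginx
server {
    listen 8000;
    location / {
        proxy_pass http://vllm;
        # Streaming completions arrive token by token; don't buffer them.
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```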
See load balancer in front of vLLM for the full config.
For the tensor parallel alternative see scaling vLLM across two GPUs. For the architectural question when you have four cards see four-GPU architecture patterns.