AI Hosting & Infrastructure

Data Parallel vs Tensor Parallel in vLLM

When to run two vLLM instances versus one vLLM instance split across two GPUs - the decision framework.

On a two-GPU dedicated GPU server, there are two ways to use both cards with vLLM. Data parallel means running two separate vLLM processes, each serving the full model on one card. Tensor parallel means one vLLM process splitting the model across both cards. They sound similar. The performance characteristics are very different.

Data Parallel

Start two vLLM instances, e.g. on ports 8001 and 8002, one bound to GPU 0 with CUDA_VISIBLE_DEVICES=0 and one bound to GPU 1 with CUDA_VISIBLE_DEVICES=1. Put nginx or HAProxy in front as a round-robin load balancer. Each request lands on one card, so you get roughly 2x throughput with no interconnect overhead. The downside: each card must hold the full model plus its KV cache, so per-card VRAM limits the model size you can serve.
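If you skip the load balancer during testing, the round-robin can live in the client instead. A minimal sketch (the ports are assumptions matching the example setup below):

```python
from itertools import cycle

# Client-side round-robin over the two vLLM instances.
# In production, nginx or HAProxy does this job instead.
backends = cycle(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])

def next_backend() -> str:
    """Return the base URL to use for the next request."""
    return next(backends)
```

Each call to next_backend() alternates between the two instances, which approximates what the round-robin balancer does for real traffic.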

Tensor Parallel

One vLLM instance launched with --tensor-parallel-size 2. The model weights are split across both cards, so you can serve models too large for a single card. The cost is communication: every forward pass synchronises activations between the cards over PCIe. At batch size 1 it is slightly slower per token than data parallel. At high concurrency the gap narrows, but it rarely matches data parallel throughput when data parallel is feasible.
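The launch looks like this; one process, no load balancer (a sketch, with the model name assumed from the data parallel example below):

```shell
# One vLLM process spanning both GPUs; weights are sharded across cards.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000
```

Clients talk to a single endpoint on port 8000; vLLM handles the cross-GPU synchronisation internally.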

When Each Wins

Condition | Winner
--- | ---
Model fits on one card | Data parallel
Model does not fit on one card | Tensor parallel (by default)
Latency-sensitive chat | Data parallel if possible
High-batch throughput | Both similar at scale
Mixed concurrency | Data parallel (better scheduling isolation)
Single endpoint required | Tensor parallel (no load balancer)

Two-GPU Servers Configured for Your Workload

We configure data-parallel or tensor-parallel topologies based on your model and SLA.

Browse GPU Servers

The Practical Stack

For a data parallel setup on two RTX 5080s serving Llama 3.1 8B:

# Instance 0: pinned to GPU 0, listening on port 8001
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8001 &

# Instance 1: pinned to GPU 1, listening on port 8002
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8002 &

Front with nginx:

upstream vllm {
  server 127.0.0.1:8001;
  server 127.0.0.1:8002;
}
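For that upstream to receive traffic, nginx also needs a server block proxying to it. A minimal sketch (the listen port is an assumption; proxy_buffering is disabled so streamed token responses are not held back by nginx):

```nginx
server {
  listen 8080;
  location / {
    proxy_pass http://vllm;
    # Forward streamed (SSE) completions immediately instead of buffering
    proxy_buffering off;
  }
}
```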

See load balancer in front of vLLM for the full config.

For the tensor parallel alternative see scaling vLLM across two GPUs. For the architectural question when you have four cards see four-GPU architecture patterns.
