AI Hosting & Infrastructure

Data Parallel vs Tensor Parallel in vLLM

When to run two vLLM instances versus one vLLM instance split across two GPUs - the decision framework.

On a two-GPU dedicated GPU server, there are two ways to use both cards with vLLM. Data parallel means running two separate vLLM processes, each serving the full model on one card. Tensor parallel means one vLLM process splitting the model across both cards. They sound similar. The performance characteristics are very different.

Data Parallel

Start two vLLM instances, e.g. on ports 8001 and 8002, one bound to GPU 0 with CUDA_VISIBLE_DEVICES=0 and one bound to GPU 1 with CUDA_VISIBLE_DEVICES=1. Put nginx or HAProxy in front as a round-robin load balancer. Each request lands on one card, so you get roughly 2x throughput with no interconnect overhead. The downside: each card must hold the full model plus its KV cache, so per-card VRAM limits the model size you can serve.
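If you skip the load balancer during testing, the round-robin can live in the client instead. A minimal sketch (the ports are assumptions matching the example setup below):

```python
from itertools import cycle

# Client-side round-robin over the two vLLM instances.
# In production, nginx or HAProxy does this job instead.
backends = cycle(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])

def next_backend() -> str:
    """Return the base URL to use for the next request."""
    return next(backends)
```

Each call to next_backend() alternates between the two instances, which approximates what the round-robin balancer does for real traffic.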

Tensor Parallel

One vLLM instance launched with --tensor-parallel-size 2. The model weights are split across both cards, so you can serve models too large for a single card. The cost is communication: every forward pass synchronises activations between the cards over PCIe. At batch size 1 it is slightly slower per token than data parallel. At high concurrency the gap narrows, but it rarely matches data parallel throughput when data parallel is feasible.
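The launch looks like this; one process, no load balancer (a sketch, with the model name assumed from the data parallel example below):

```shell
# One vLLM process spanning both GPUs; weights are sharded across cards.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000
```

Clients talk to a single endpoint on port 8000; vLLM handles the cross-GPU synchronisation internally.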

When Each Wins

Condition | Winner
--- | ---
Model fits on one card | Data parallel
Model does not fit on one card | Tensor parallel (by default)
Latency-sensitive chat | Data parallel if possible
High-batch throughput | Both similar at scale
Mixed concurrency | Data parallel (better scheduling isolation)
Single endpoint required | Tensor parallel (no load balancer)

Two-GPU Servers Configured for Your Workload

We configure data-parallel or tensor-parallel topologies based on your model and SLA.

Browse GPU Servers

The Practical Stack

For a data parallel setup on two RTX 5080s serving Llama 3.1 8B:

# Instance 0: pinned to GPU 0, listening on port 8001
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8001 &

# Instance 1: pinned to GPU 1, listening on port 8002
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8002 &

Front with nginx:

upstream vllm {
  server 127.0.0.1:8001;
  server 127.0.0.1:8002;
}
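For that upstream to receive traffic, nginx also needs a server block proxying to it. A minimal sketch (the listen port is an assumption; proxy_buffering is disabled so streamed token responses are not held back by nginx):

```nginx
server {
  listen 8080;
  location / {
    proxy_pass http://vllm;
    # Forward streamed (SSE) completions immediately instead of buffering
    proxy_buffering off;
  }
}
```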

See load balancer in front of vLLM for the full config.

For the tensor parallel alternative see scaling vLLM across two GPUs. For the architectural question when you have four cards see four-GPU architecture patterns.
