
vLLM Tensor Parallelism Not Working: Fix Guide

Fix vLLM tensor parallelism failures including NCCL errors, GPU visibility issues, uneven memory distribution, and configuration problems on multi-GPU servers.

Tensor Parallelism Failures in vLLM

You set --tensor-parallel-size 2 on your multi-GPU server and vLLM crashes at startup:

RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:123, unhandled system error
ValueError: tensor-parallel-size (4) must divide the number of attention heads (32)
torch.cuda.OutOfMemoryError on device cuda:1

Tensor parallelism splits model layers across GPUs so that models too large for a single GPU can still be served. When it works, a 70B model runs across four GPUs as if it were a single larger device. When it fails, the errors are often cryptic and multi-layered.

Step 1: Verify All GPUs Are Visible

# Check GPU count
nvidia-smi -L

# Verify vLLM sees all GPUs
python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"

If the GPU count is less than your tensor-parallel-size, vLLM cannot distribute the model. Common causes: CUDA_VISIBLE_DEVICES is set restrictively, or the Docker container was launched with --gpus '"device=0"' instead of --gpus all.
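As a sanity check, the effect of a restrictive CUDA_VISIBLE_DEVICES can be reproduced with plain Python (the function and values here are illustrative, not part of vLLM):

```python
def visible_gpu_count(env_value, physical):
    """GPUs PyTorch will see given CUDA_VISIBLE_DEVICES (None = unrestricted)."""
    if env_value is None:
        return physical
    return len([d for d in env_value.split(",") if d.strip() != ""])

# A 4-GPU host where someone exported CUDA_VISIBLE_DEVICES=0,1:
print(visible_gpu_count("0,1", 4))  # → 2, so --tensor-parallel-size 4 will fail
print(visible_gpu_count(None, 4))   # → 4
```

If the printed count is below your tensor-parallel-size, unset the variable (or widen it) before launching vLLM.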

Step 2: Fix NCCL Communication Errors

Tensor parallelism uses NCCL for inter-GPU communication. If NCCL fails:

# Set explicit network interface
export NCCL_SOCKET_IFNAME=eth0

# Increase NCCL timeout for large model loading
export NCCL_TIMEOUT=1800

# Enable debug logging to find the specific failure
export NCCL_DEBUG=INFO

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

Check GPU topology to ensure peer-to-peer communication is possible:

nvidia-smi topo -m

GPUs connected via NVLink show “NV” connections, which are fastest. GPUs on different PCIe switches show “PHB” or “SYS”, which are slower but still functional.
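A TP group is only as fast as its slowest pairwise link, so it helps to scan the topology matrix for the worst link type. This sketch uses a simplified, illustrative ranking of the nvidia-smi link labels (real output shows NVLink entries as NV1, NV2, etc.):

```python
# Rough ranking of nvidia-smi topo link types, fastest (0) to slowest (5).
# Illustrative only; consult nvidia-smi's legend for the full list.
LINK_RANK = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

def slowest_link(links):
    """Return the worst link type among the TP group's GPU pairs."""
    return max(links, key=lambda l: LINK_RANK[l])

print(slowest_link(["NV", "NV", "PHB"]))  # → "PHB": one PCIe-host-bridge hop
```

If the slowest link is SYS (cross-socket), expect noticeably lower TP throughput even when everything works.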

Step 3: Fix Attention Head Count Errors

The tensor parallel size must evenly divide the model’s number of attention heads. Smaller models commonly have 32 or 40 heads; larger models such as Llama 3.1 70B use 64 attention heads (with 8 KV heads):

# 32 heads → valid TP sizes: 1, 2, 4, 8, 16, 32
# 40 heads → valid TP sizes: 1, 2, 4, 5, 8, 10, 20, 40

Check your model’s head count:

python -c "
from transformers import AutoConfig
config = AutoConfig.from_pretrained('meta-llama/Meta-Llama-3.1-70B-Instruct')
print(f'Attention heads: {config.num_attention_heads}')
print(f'KV heads: {config.num_key_value_heads}')
"

For GQA models (where KV heads differ from attention heads), the TP size must divide the KV head count.
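Putting both constraints together, the valid TP sizes are the divisors shared by the attention head count and the KV head count. A small helper (illustrative, not a vLLM API) makes this concrete:

```python
def valid_tp_sizes(num_heads, num_kv_heads=None):
    """TP sizes that evenly divide the attention heads (and KV heads for GQA)."""
    kv = num_kv_heads if num_kv_heads is not None else num_heads
    return [tp for tp in range(1, num_heads + 1)
            if num_heads % tp == 0 and kv % tp == 0]

# Llama-3.1-70B-style config: 64 attention heads, 8 KV heads (GQA)
print(valid_tp_sizes(64, 8))  # → [1, 2, 4, 8]

# Non-GQA model with 40 heads
print(valid_tp_sizes(40))     # → [1, 2, 4, 5, 8, 10, 20, 40]
```

This is why a GQA model with 8 KV heads rejects --tensor-parallel-size 16 even though 16 divides its 64 attention heads.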

Step 4: Fix Uneven Memory Distribution

If one GPU runs out of memory while others have headroom, the memory distribution is uneven. This happens when GPUs have different amounts of VRAM or when another process occupies memory on one GPU:

# Check per-GPU memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# List processes holding GPU memory, then kill any strays
nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv
kill <pid>   # replace <pid> with the offending process ID

All GPUs in a tensor parallel group must have equal free VRAM. On your dedicated GPU server, ensure no other workloads are using the GPUs assigned to vLLM.
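To see what an imbalance looks like, here is how you might interpret the CSV from the nvidia-smi query above (the sample output is hypothetical):

```python
# Hypothetical output from: nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
sample = """index, memory.used [MiB], memory.total [MiB]
0, 312 MiB, 24576 MiB
1, 11890 MiB, 24576 MiB"""

free = {}
for line in sample.strip().splitlines()[1:]:
    idx, used, total = [field.strip() for field in line.split(",")]
    free[int(idx)] = int(total.split()[0]) - int(used.split()[0])

print(free)  # free MiB per GPU: {0: 24264, 1: 12686}
# GPU 1 has roughly half the free VRAM of GPU 0, so a symmetric TP
# allocation will OOM on cuda:1 long before cuda:0 fills up.
```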

Optimal Tensor Parallel Configuration

# 70B model across 4 GPUs (48 GB each; fp16 weights alone total ~140 GB)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --dtype float16

For best performance on vLLM:

  • Use the smallest TP size that fits the model. Serving a 70B model at TP=2 on two 80 GB cards beats TP=4 on four smaller cards, because fewer ranks means less all-reduce communication overhead.
  • Prefer NVLink-connected GPUs over PCIe-connected ones for TP communication.
  • If the model fits on a single GPU, do not use tensor parallelism at all. Run separate instances per GPU with a load balancer for better aggregate throughput.
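The single-GPU-per-instance pattern from the last bullet might look like this (ports, model, and the load balancer choice are all illustrative):

```shell
# Hypothetical layout: one vLLM server per GPU, no tensor parallelism
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 &

CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8001 &

# Front both with any HTTP load balancer (nginx, HAProxy, etc.)
# round-robining between :8000 and :8001.
```

Each instance keeps its full KV cache local, so aggregate throughput scales almost linearly with GPU count.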

Verification

# Check that all GPUs are active
nvidia-smi
# All GPUs in the TP group should show vLLM memory usage

# Send a test request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3.1-70B-Instruct","prompt":"Hello","max_tokens":50}'

For production deployment, follow our vLLM production guide. Tune memory settings using the optimization guide, and configure monitoring across all GPUs in the parallel group. Secure the endpoint with API authentication.

Multi-GPU Servers for Large Models

GigaGPU offers multi-GPU dedicated servers with NVLink interconnects, ideal for tensor-parallel vLLM deployments.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
