When You Need Multi-GPU Inference
Large language models like Llama 3 70B, DeepSeek-R1 671B, and Mixtral 8x22B require more VRAM than a single GPU provides. When your model does not fit on one card, multi-GPU configurations split the model across multiple GPUs, enabling inference on models that would otherwise be impossible to run. GigaGPU provides dedicated GPU servers with 2, 4, and 8 GPU configurations specifically designed for large model inference.
Multi-GPU is not just about fitting larger models. Even for models that fit on a single GPU, distributing across multiple GPUs can increase throughput by processing more requests concurrently. For teams running high-volume inference APIs, the throughput gains from multi-GPU setups can be more valuable than the ability to run larger models.
| Model | Parameters | Weight VRAM | Min GPUs (RTX 6000 Pro 96 GB) | Min GPUs (RTX 6000 Pro 48 GB) |
|---|---|---|---|---|
| Llama 3 70B | 70B | ~140 GB (FP16) | 2 | 4 |
| Mixtral 8x22B | 141B (MoE) | ~280 GB (FP16) | 4 | 8 |
| DeepSeek-V3 | 671B (MoE) | ~640 GB (native FP8) | 8 | N/A |
| DeepSeek-R1 | 671B (MoE) | ~640 GB (native FP8) | 8 | N/A |
| Llama 3.1 405B | 405B | ~810 GB (FP16) | Cluster | N/A |
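As a rough cross-check on the table, minimum GPU count can be estimated from parameter count times bytes per parameter, plus headroom for KV cache and activations. The 10% headroom used below is an assumption; real requirements depend on context length and batch size.

```python
import math

def min_gpus(params_b: float, bytes_per_param: float, gpu_vram_gb: float,
             overhead: float = 1.1) -> int:
    """Rough minimum GPU count: weight bytes * overhead / per-GPU VRAM, rounded up."""
    total_gb = params_b * bytes_per_param * overhead  # 1B params at 1 byte/param ~ 1 GB
    return math.ceil(total_gb / gpu_vram_gb)

# Llama 3 70B in FP16 (2 bytes/param) on 96 GB cards
print(min_gpus(70, 2.0, 96))   # 2
# DeepSeek-R1 671B in native FP8 (1 byte/param) on 96 GB cards
print(min_gpus(671, 1.0, 96))  # 8
```

This back-of-envelope agrees with the table above for the 96 GB configurations; quantized deployments (e.g. 4-bit) reduce the byte count accordingly.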
For models in the 7-30B range that fit on a single GPU, see our single-GPU selection guide. This guide focuses on workloads that demand multiple GPUs.
Multi-GPU Hardware Selection
The interconnect between GPUs is as important as the GPUs themselves. When model layers are split across GPUs, they communicate constantly during inference. Slow interconnects create bottlenecks that negate the benefits of additional GPUs.
| Configuration | Interconnect | Bandwidth | Best For |
|---|---|---|---|
| 2x RTX 5090 | PCIe 4.0 | 32 GB/s per GPU | Budget 70B inference |
| 2x RTX 6000 Pro | PCIe 4.0 + NVLink bridge | 112 GB/s | 70B with better throughput |
| 4x RTX 6000 Pro | NVLink + NVSwitch | 600 GB/s | Mixtral, large MoE models |
| 8x RTX 6000 Pro | NVLink + NVSwitch | 600 GB/s | DeepSeek 671B, largest models |
| 8x RTX 6000 Pro | NVLink 4.0 | 900 GB/s | Maximum performance |
NVLink-connected GPUs deliver significantly better multi-GPU performance than PCIe-only setups. For 70B models, the difference is 20-40% higher throughput. For 671B MoE models, NVLink is essentially mandatory. Our cheapest GPU for AI inference guide covers cost-performance trade-offs across these configurations.
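To see why interconnect bandwidth matters, compare raw transfer times for the bandwidths in the table. This is a simplified model that ignores link latency and protocol overhead:

```python
def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Milliseconds to move size_gb over a link sustaining bandwidth_gbps GB/s."""
    return size_gb / bandwidth_gbps * 1000

# Moving 1 GB of activations between GPUs:
print(transfer_ms(1.0, 32))   # PCIe 4.0 x16: ~31 ms
print(transfer_ms(1.0, 600))  # NVLink + NVSwitch: ~1.7 ms
```

Tensor parallelism triggers inter-GPU communication at every layer of every forward pass, so a ~20x gap in per-transfer time compounds quickly.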
Tensor Parallelism vs Pipeline Parallelism
Two strategies exist for distributing a model across GPUs:
Tensor Parallelism (TP) splits individual layers across GPUs. Each GPU holds a slice of every layer, and the GPUs communicate at every layer during inference. This requires fast interconnects (NVLink preferred) but provides the lowest latency, because all GPUs work on the same request simultaneously.
Pipeline Parallelism (PP) assigns different layers to different GPUs. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, and so on. Communication only happens between pipeline stages, so it tolerates slower interconnects. However, GPUs are idle while waiting for the previous stage, creating pipeline bubbles.
```bash
# Tensor Parallelism: All GPUs process every request together
# Lower latency, requires fast interconnect
# vLLM: --tensor-parallel-size 4

# Pipeline Parallelism: GPUs process different layers sequentially
# Higher throughput with batching, tolerates slower interconnect
# vLLM: --pipeline-parallel-size 4

# Combined: Use both for very large models on many GPUs
# vLLM: --tensor-parallel-size 4 --pipeline-parallel-size 2 (8 GPUs total)
```
For most inference deployments, tensor parallelism is preferred because it minimises per-request latency. Use pipeline parallelism when your interconnect bandwidth is limited or when running very large models across more than 4 GPUs.
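The pipeline-bubble cost mentioned above can be quantified with the standard GPipe-style estimate: with p stages and m microbatches, the idle fraction is (p − 1) / (m + p − 1). This simplification assumes uniform per-stage times:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

# With 4 pipeline stages and a single microbatch, 75% of GPU time is idle;
# batching 16 microbatches shrinks the bubble to ~16%.
print(bubble_fraction(4, 1))
print(round(bubble_fraction(4, 16), 2))
```

This is why pipeline parallelism only pays off with large batches: the bubble is amortised across microbatches, which suits throughput-oriented workloads rather than latency-sensitive ones.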
Deploying vLLM with Multi-GPU
vLLM has first-class support for multi-GPU inference. Deploying a 70B model across 2 RTX 6000 Pros is straightforward:
```bash
# Verify all GPUs are detected
nvidia-smi

# Install vLLM
pip install vllm

# Deploy Llama 3 70B with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype auto
```
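The `--max-model-len` and `--gpu-memory-utilization` flags interact with KV-cache size, which you can estimate from the model architecture. The Llama 3 70B figures below (80 layers, 8 KV heads via grouped-query attention, head dimension 128) are published architecture details; the formula is a sketch that ignores vLLM's paged-block allocation granularity:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama 3 70B at FP16, one sequence at the full 8192-token context
print(round(kv_cache_gb(80, 8, 128, 8192, 1), 2))  # ~2.68 GB per sequence
```

Multiply by your expected concurrent batch size to gauge whether the memory left over after weights (roughly 96 GB x 2 x 0.90 minus ~140 GB of weights here) can hold the cache.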
For DeepSeek-R1 671B on 8 GPUs:
```bash
# Deploy DeepSeek-R1 (full) on 8x RTX 6000 Pro
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --enforce-eager
```
Validate the deployment with a test request and check that all GPUs show utilization:
```bash
# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
    "max_tokens": 256
  }'

# Verify all GPUs are active
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```
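The same endpoint can be exercised from application code. A minimal standard-library sketch that builds the request body the server expects (mirroring the curl example above):

```python
import json
from urllib import request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """JSON body for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def chat(url: str, model: str, prompt: str) -> str:
    """POST a chat request and return the first completion's text."""
    req = request.Request(url, data=chat_payload(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server from above running:
# chat("http://localhost:8000/v1/chat/completions",
#      "meta-llama/Meta-Llama-3-70B-Instruct",
#      "Explain tensor parallelism in one paragraph.")
```

In production you would typically use the official `openai` client pointed at `base_url="http://localhost:8000/v1"`, but the raw request shows exactly what crosses the wire.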
For detailed vLLM configuration tuning, see our vLLM production setup guide. For the full DeepSeek deployment walkthrough, see our DeepSeek deployment guide.
Multi-GPU with Ollama
Ollama automatically distributes models across available GPUs when a single GPU lacks sufficient VRAM. For a detailed Ollama setup, see our Ollama on dedicated GPU server tutorial.
```bash
# Ollama detects multi-GPU automatically
ollama pull llama3:70b

# Check which GPUs are being used
ollama ps
nvidia-smi

# Force specific GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```
Ollama’s multi-GPU support is less configurable than vLLM’s. For production multi-GPU deployments where you need fine control over parallelism strategy, batch sizes, and memory allocation, vLLM is the better choice. See our vLLM vs Ollama comparison for detailed guidance.
Monitoring and Debugging Multi-GPU Setups
Multi-GPU deployments require monitoring each GPU independently to catch imbalances and bottlenecks:
```bash
# Monitor all GPUs simultaneously
watch -n 2 nvidia-smi

# Detailed per-GPU metrics
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu \
  --format=csv -l 5

# Check NVLink status and bandwidth
nvidia-smi nvlink --status
nvidia-smi topo --matrix
```
Common issues and solutions:
- Uneven GPU utilization: Normal with pipeline parallelism. Consider switching to tensor parallelism
- OOM on one GPU: The KV cache may not be evenly distributed. Reduce `--max-model-len` or lower `--gpu-memory-utilization`
- Slow inter-GPU communication: Check `nvidia-smi topo --matrix` for NVLink connectivity. PCIe-only connections are significantly slower
- Low throughput despite high GPU count: Communication overhead may dominate. Ensure batch size is large enough to amortise the overhead
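Imbalance checks like the first item above can be automated by parsing nvidia-smi's CSV output. The 20-point tolerance below is an arbitrary threshold to tune for your workload:

```python
import subprocess

def gpu_utilization() -> list[tuple[int, int]]:
    """Return (index, utilization %) per GPU by parsing nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(", "))
            for line in out.strip().splitlines()]

def is_balanced(utils: list[tuple[int, int]], tolerance: int = 20) -> bool:
    """Flag setups where the busiest and idlest GPU differ by > tolerance points."""
    values = [u for _, u in utils]
    return max(values) - min(values) <= tolerance

# Example with mocked readings: GPU 3 lagging far behind the others
print(is_balanced([(0, 95), (1, 93), (2, 94), (3, 40)]))  # False
```

Run `gpu_utilization()` on the host itself and wire `is_balanced` into whatever alerting you already have.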
Use the tokens-per-second benchmark to validate your multi-GPU setup is delivering expected performance.
Scaling Strategies for Production
As demand grows, you have several scaling options:
Vertical scaling: Add more GPUs to serve larger models or increase throughput on a single server. Move from 2x RTX 6000 Pro to 4x RTX 6000 Pro, or from the 48 GB variant to the 96 GB variant for more VRAM per card.
Horizontal scaling: Deploy the same model across multiple servers and load balance requests. Each server runs an independent vLLM instance, and Nginx distributes traffic across them.
```nginx
# Nginx load balancer for multiple multi-GPU servers
upstream llm_cluster {
    least_conn;
    server 10.0.1.10:8000;  # Server 1: 2x RTX 6000 Pro with Llama 70B
    server 10.0.1.11:8000;  # Server 2: 2x RTX 6000 Pro with Llama 70B
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    # ssl_certificate / ssl_certificate_key directives omitted for brevity

    location /v1/ {
        proxy_pass http://llm_cluster;
        proxy_read_timeout 120s;
    }
}
```
Hybrid scaling: Run high-volume, predictable traffic on dedicated multi-GPU servers while routing overflow to API endpoints during peak periods. This gives you the cost efficiency of self-hosting with the elasticity of cloud APIs. For cost modelling, our self-hosting breakeven analysis covers the maths behind hybrid architectures.
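The overflow decision in a hybrid setup can be as simple as a saturation check. A hypothetical sketch, where the endpoints and capacity value are illustrative placeholders rather than real services:

```python
def pick_backend(active_requests: int, capacity: int,
                 primary: str = "http://10.0.1.10:8000/v1",
                 overflow: str = "https://api.example-cloud.com/v1") -> str:
    """Route to the dedicated server until it is saturated, then spill to a cloud API."""
    return primary if active_requests < capacity else overflow

print(pick_backend(12, 32))  # under capacity: dedicated server
print(pick_backend(40, 32))  # saturated: overflow endpoint
```

A production router would track in-flight requests per backend (or read vLLM's metrics endpoint) rather than a single counter, but the routing logic stays this shape.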
Explore the AI hosting and infrastructure category for more advanced deployment patterns and architectural guidance.
Deploy Multi-GPU Servers for Large Model Inference
GigaGPU provides multi-GPU dedicated servers with NVLink connectivity, from 2x RTX 6000 Pro to 8x RTX 6000 Pro configurations. Purpose-built for large model inference with the interconnect bandwidth your workload demands.
Browse GPU Servers