AI Hosting & Infrastructure

Multi-GPU Server Setup for Large Model Inference

A complete guide to setting up multi-GPU servers for large model inference. Covers tensor parallelism, pipeline parallelism, hardware selection, vLLM multi-GPU deployment, and scaling strategies.

When You Need Multi-GPU Inference

Large language models like Llama 3 70B, DeepSeek-R1 671B, and Mixtral 8x22B require more VRAM than a single GPU provides. When your model does not fit on one card, multi-GPU configurations split the model across multiple GPUs, enabling inference on models that would otherwise be impossible to run. GigaGPU provides dedicated GPU servers with 2, 4, and 8 GPU configurations specifically designed for large model inference.

Multi-GPU is not just about fitting larger models. Even for models that fit on a single GPU, distributing across multiple GPUs can increase throughput by processing more requests concurrently. For teams running high-volume inference APIs, the throughput gains from multi-GPU setups can be more valuable than the ability to run larger models.

| Model | Parameters | Weights VRAM | Min GPUs (RTX 6000 Pro 96 GB) | Min GPUs (RTX 6000 Pro 48 GB) |
|---|---|---|---|---|
| Llama 3 70B | 70B | ~140 GB (FP16) | 2 | 4 |
| Mixtral 8x22B | 141B (MoE) | ~280 GB (FP16) | 4 | 8 |
| DeepSeek-V3 | 671B (MoE) | ~640 GB (FP8) | 8 | N/A |
| DeepSeek-R1 | 671B (MoE) | ~640 GB (FP8) | 8 | N/A |
| Llama 3 405B | 405B | ~810 GB (FP16) | Cluster | N/A |

For models in the 7-30B range that fit on a single GPU, see our single-GPU selection guide. This guide focuses on workloads that demand multiple GPUs.
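
The sizing rule behind the table above can be sketched in a few lines: weights take roughly parameters times bytes per parameter, plus headroom for the KV cache and CUDA context, rounded up to a GPU count tensor parallelism accepts. The 15% overhead figure is an illustrative assumption, not a vendor spec.

```python
import math

def weights_vram_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only VRAM in GB: parameters (billions) x bytes per parameter (FP16 = 2)."""
    return params_b * bytes_per_param

def min_gpus(total_gb: float, vram_per_gpu_gb: float, overhead: float = 0.15) -> int:
    """Smallest power-of-two GPU count whose combined VRAM covers weights
    plus headroom for KV cache, activations, and CUDA context."""
    needed = total_gb * (1 + overhead)
    gpus = math.ceil(needed / vram_per_gpu_gb)
    return 1 << (gpus - 1).bit_length()  # round up to 1, 2, 4, 8, ...

# Llama 3 70B at FP16: ~140 GB of weights
print(weights_vram_gb(70))                 # 140.0
print(min_gpus(weights_vram_gb(70), 96))   # 2 (matches the table)
print(min_gpus(weights_vram_gb(70), 48))   # 4
```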

Multi-GPU Hardware Selection

The interconnect between GPUs is as important as the GPUs themselves. When model layers are split across GPUs, they communicate constantly during inference. Slow interconnects create bottlenecks that negate the benefits of additional GPUs.

| Configuration | Interconnect | Bandwidth | Best For |
|---|---|---|---|
| 2x RTX 5090 | PCIe 4.0 | 32 GB/s per GPU | Budget 70B inference |
| 2x RTX 6000 Pro | PCIe 4.0 + NVLink bridge | 112 GB/s | 70B with better throughput |
| 4x RTX 6000 Pro | NVLink + NVSwitch | 600 GB/s | Mixtral, large MoE models |
| 8x RTX 6000 Pro | NVLink + NVSwitch | 600 GB/s | DeepSeek 671B, largest models |
| 8x RTX 6000 Pro | NVLink 4.0 | 900 GB/s | Maximum performance |

NVLink-connected GPUs deliver significantly better multi-GPU performance than PCIe-only setups. For 70B models, the difference is 20-40% higher throughput. For 671B MoE models, NVLink is essentially mandatory. The cheapest GPU for AI inference guide covers cost-performance trade-offs across these configurations.
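
A back-of-envelope calculation shows why interconnect bandwidth matters: during tensor-parallel inference, GPUs exchange activations at every layer. The hidden size and bandwidth figures below are illustrative, not measurements from these servers.

```python
def transfer_us(bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Microseconds to move `bytes_moved` over a link of `bandwidth_gb_s` GB/s."""
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e6

hidden = 8192          # hidden size of a 70B-class model (illustrative)
payload = hidden * 2   # FP16 activations: bytes exchanged per token per layer

pcie = transfer_us(payload, 32)     # PCIe-class link
nvlink = transfer_us(payload, 600)  # NVLink/NVSwitch-class fabric
print(f"PCIe: {pcie:.3f} us, NVLink: {nvlink:.4f} us per exchange")
```

Repeated across 80 layers and thousands of tokens per second, that per-exchange gap compounds into the 20-40% throughput difference noted above.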

Tensor Parallelism vs Pipeline Parallelism

Two strategies exist for distributing a model across GPUs:

Tensor Parallelism (TP) splits individual layers across GPUs. Each GPU holds a slice of every layer and they communicate at every layer during inference. This requires fast interconnects (NVLink preferred) but provides the lowest latency because all GPUs work on the same request simultaneously.

Pipeline Parallelism (PP) assigns different layers to different GPUs. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, and so on. Communication only happens between pipeline stages, so it tolerates slower interconnects. However, GPUs are idle while waiting for the previous stage, creating pipeline bubbles.

# Tensor Parallelism: All GPUs process every request together
# Lower latency, requires fast interconnect
# vLLM: --tensor-parallel-size 4

# Pipeline Parallelism: GPUs process different layers sequentially
# Higher throughput with batching, tolerates slower interconnect
# vLLM: --pipeline-parallel-size 4

# Combined: Use both for very large models on many GPUs
# vLLM: --tensor-parallel-size 4 --pipeline-parallel-size 2 (8 GPUs total)

For most inference deployments, tensor parallelism is preferred because it minimises per-request latency. Use pipeline parallelism when your interconnect bandwidth is limited or when running very large models across more than 4 GPUs.
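
The core idea of tensor parallelism can be illustrated with NumPy standing in for the GPUs: each "GPU" holds a column slice of a layer's weight matrix, computes a partial output, and the slices are gathered back together. Sizes here are toy values; vLLM performs the equivalent with real device buffers and collective operations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))     # one token's hidden state
W = rng.standard_normal((512, 2048))  # one layer's weight matrix

tp = 4
shards = np.split(W, tp, axis=1)      # each "GPU" holds a 512 x 512 column slice

# Each GPU computes its partial result independently...
partials = [x @ shard for shard in shards]
# ...then an all-gather concatenates the slices into the full output.
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)  # identical to the single-GPU result
```

The all-gather at every layer is exactly the communication that makes NVLink so valuable for tensor parallelism.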

Deploying vLLM with Multi-GPU

vLLM has first-class support for multi-GPU inference. Deploying a 70B model across 2 RTX 6000 Pros is straightforward:

# Verify all GPUs are detected
nvidia-smi

# Install vLLM
pip install vllm

# Deploy Llama 3 70B with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype auto

For DeepSeek-R1 671B on 8 GPUs:

# Deploy DeepSeek-R1 (full) on 8x RTX 6000 Pro
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --enforce-eager

Validate the deployment with a test request and check that all GPUs show utilization:

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
    "max_tokens": 256
  }'

# Verify all GPUs are active
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

For detailed vLLM configuration tuning, see our vLLM production setup guide. For the full DeepSeek deployment walkthrough, see our DeepSeek deployment guide.

Multi-GPU with Ollama

Ollama automatically distributes models across available GPUs when a single GPU lacks sufficient VRAM. For a detailed Ollama setup, see our Ollama on dedicated GPU server tutorial.

# Ollama detects multi-GPU automatically
ollama pull llama3:70b

# Check which GPUs are being used
ollama ps
nvidia-smi

# Force specific GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve

Ollama’s multi-GPU support is less configurable than vLLM’s. For production multi-GPU deployments where you need fine control over parallelism strategy, batch sizes, and memory allocation, vLLM is the better choice. See our vLLM vs Ollama comparison for detailed guidance.

Monitoring and Debugging Multi-GPU Setups

Multi-GPU deployments require monitoring each GPU independently to catch imbalances and bottlenecks:

# Monitor all GPUs simultaneously
watch -n 2 nvidia-smi

# Detailed per-GPU metrics
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu \
  --format=csv -l 5

# Check NVLink status and bandwidth
nvidia-smi nvlink --status
nvidia-smi topo --matrix

Common issues and solutions:

  • Uneven GPU utilization: Normal with pipeline parallelism. Consider switching to tensor parallelism
  • OOM on one GPU: The KV cache may not be evenly distributed. Reduce max-model-len or lower gpu-memory-utilization
  • Slow inter-GPU communication: Check nvidia-smi topo --matrix for NVLink connectivity. PCIe-only connections are significantly slower
  • Low throughput despite high GPU count: Communication overhead may dominate. Ensure batch size is large enough to amortise the overhead
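
The imbalance check above can be automated by parsing the CSV output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. This is a sketch; the threshold is an arbitrary starting point, not a tuned value.

```python
def imbalance(csv_lines, threshold=30):
    """Return GPU indices whose utilization trails the busiest GPU by more
    than `threshold` percentage points — a sign of pipeline bubbles or a
    misconfigured parallelism strategy."""
    rows = [line.split(", ") for line in csv_lines if line.strip()]
    util = {int(idx): int(u) for idx, u, *_ in rows}
    peak = max(util.values())
    return [i for i, u in util.items() if peak - u > threshold]

sample = [             # example nvidia-smi CSV output: index, util %, memory MiB
    "0, 95, 43000",
    "1, 93, 42800",
    "2, 12, 43100",    # lagging GPU — worth investigating
    "3, 94, 42900",
]
print(imbalance(sample))  # [2]
```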

Use the tokens-per-second benchmark to validate your multi-GPU setup is delivering expected performance.

Scaling Strategies for Production

As demand grows, you have several scaling options:

Vertical scaling: Add more GPUs to serve larger models or increase throughput on a single server. Move from 2x RTX 6000 Pro to 4x RTX 6000 Pro, or from the 48 GB RTX 6000 Pro to the 96 GB variant for more VRAM per card.

Horizontal scaling: Deploy the same model across multiple servers and load balance requests. Each server runs an independent vLLM instance, and Nginx distributes traffic across them.

# Nginx load balancer for multiple multi-GPU servers
upstream llm_cluster {
    least_conn;
    server 10.0.1.10:8000;  # Server 1: 2x RTX 6000 Pro with Llama 70B
    server 10.0.1.11:8000;  # Server 2: 2x RTX 6000 Pro with Llama 70B
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    location /v1/ {
        proxy_pass http://llm_cluster;
        proxy_read_timeout 120s;
    }
}

Hybrid scaling: Run high-volume, predictable traffic on dedicated multi-GPU servers while routing overflow to API endpoints during peak periods. This gives you the cost efficiency of self-hosting with the elasticity of cloud APIs. For cost modelling, our self-hosting breakeven analysis covers the maths behind hybrid architectures.
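
The routing decision at the heart of a hybrid setup can be sketched in a few lines. The URLs and queue threshold below are hypothetical placeholders; a production router would read queue depth from vLLM's metrics endpoint.

```python
# Hybrid routing sketch: prefer the self-hosted endpoint until its queue
# saturates, then overflow to an external API. All values are illustrative.
LOCAL_URL = "http://10.0.1.10:8000/v1"    # hypothetical self-hosted vLLM server
CLOUD_URL = "https://api.example.com/v1"  # hypothetical overflow API
MAX_LOCAL_QUEUE = 32                      # illustrative saturation threshold

def pick_backend(queue_depth: int) -> str:
    """Route to the dedicated server unless its request queue is saturated."""
    return LOCAL_URL if queue_depth < MAX_LOCAL_QUEUE else CLOUD_URL

print(pick_backend(5))   # off-peak: stays on the dedicated server
print(pick_backend(64))  # peak: overflows to the external API
```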

Explore the AI hosting and infrastructure category for more advanced deployment patterns and architectural guidance.

Deploy Multi-GPU Servers for Large Model Inference

GigaGPU provides multi-GPU dedicated servers with NVLink connectivity, from 2x RTX 6000 Pro to 8x RTX 6000 Pro configurations. Purpose-built for large model inference with the interconnect bandwidth your workload demands.

Browse GPU Servers
