Why 70B Models Need Sharding
A 70B-parameter model at FP16 requires approximately 140 GB of VRAM for weights alone. Even at 4-bit quantisation, the weights occupy roughly 35 GB, exceeding every consumer GPU currently available. Deploying these models on a dedicated GPU server requires splitting the model across multiple GPUs — a process called model sharding.
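That arithmetic generalises to any model size. A back-of-envelope sketch (weights only; KV cache, activations, and runtime overhead come on top):

```python
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone."""
    bytes_per_param = bits_per_param / 8
    # 1B params at 1 byte/param is ~1 GB
    return params_billions * bytes_per_param

print(weight_vram_gb(70, 16))  # 140.0 -> FP16
print(weight_vram_gb(70, 4))   # 35.0  -> 4-bit quantisation
```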
Models at this scale deliver substantially better quality than their 7-13B counterparts, making them attractive for production use. The challenge is engineering an efficient multi-GPU deployment that maintains acceptable latency. For teams hosting open-source LLMs, sharding is the gateway to serving the most capable models.
VRAM Requirements by Model Size
Here is approximately how much VRAM different model sizes need, with example GPU configurations (RTX 3090: 24 GB; RTX 6000 Pro: 96 GB).
| Model | Quant | Weight Size | KV Cache (batch 4) | Total VRAM | GPUs Needed |
|---|---|---|---|---|---|
| Llama 3 70B | AWQ 4-bit | ~35 GB | ~10 GB | ~45 GB | 2x RTX 3090 |
| Llama 3 70B | FP16 | ~140 GB | ~10 GB | ~150 GB | 2x RTX 6000 Pro 96 GB |
| Mixtral 8x22B | AWQ 4-bit | ~70 GB | ~12 GB | ~82 GB | 4x RTX 3090 |
| Llama 3.1 405B | AWQ 4-bit | ~200 GB | ~20 GB | ~220 GB | 4x RTX 6000 Pro 96 GB |
| Qwen 2.5 72B | GPTQ 4-bit | ~36 GB | ~10 GB | ~46 GB | 2x RTX 3090 |
Use the LLM cost calculator to estimate monthly costs for different GPU configurations.
Sharding Strategies for Inference
Tensor parallelism (TP). Splits each layer’s weights across GPUs. All GPUs participate in every token generation. Best for low-latency inference with high-bandwidth GPU interconnects. This is the most common strategy for 2-4 GPU setups.
Pipeline parallelism (PP). Assigns different layers to different GPUs. Data flows through GPUs sequentially. Lower communication overhead but higher per-request latency. Useful when GPU interconnect bandwidth is limited.
Combined TP + PP. For 4+ GPUs, combine both: tensor parallel within a pair of GPUs, pipeline parallel across pairs. This balances communication overhead with scaling efficiency.
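To make the combined layout concrete, the two degrees multiply to the GPU count. A hypothetical helper (the function and its checks are ours, not a vLLM API) that validates a TP x PP plan:

```python
def plan_parallelism(n_gpus: int, tp: int, pp: int, n_heads: int) -> bool:
    """Check that a TP x PP layout is feasible for a given GPU count."""
    if tp * pp != n_gpus:
        return False  # every GPU must belong to exactly one TP group
    if n_heads % tp != 0:
        return False  # attention heads must split evenly across TP ranks
    return True

# 4 GPUs: TP=2 within each pair, PP=2 across pairs, on a 64-head model
print(plan_parallelism(4, tp=2, pp=2, n_heads=64))  # True
```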
For a deeper comparison of these approaches, see our AI hosting infrastructure articles. Both strategies require multi-GPU clusters with fast inter-GPU communication.
Setup Guide: Sharding with vLLM
vLLM handles model sharding automatically once you specify the parallelism degree. Here is how to serve a 70B model across 2 GPUs using vLLM.
# Shard Llama 3 70B across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-70b-instruct-awq \
    --tensor-parallel-size 2 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --max-num-seqs 8 \
    --port 8000
# Verify both GPUs are loaded
nvidia-smi
# Each GPU should show ~21-22 GB used (vLLM pre-allocates 90% of the 24 GB)
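Once the server is up it speaks the OpenAI-compatible HTTP API, so any client can query it. A minimal stdlib-only sketch (the endpoint path is vLLM's; the helper names are ours):

```python
import json
import urllib.request

def completion_payload(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    """JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()

def complete(model: str, prompt: str, host: str = "http://localhost:8000") -> str:
    """POST a completion request to a running vLLM server."""
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=completion_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Pass the same model name you gave `--model`; vLLM rejects requests whose model field does not match.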
# For 4 GPUs with Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x22B-AWQ \
    --tensor-parallel-size 4 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096
The tensor-parallel-size must divide evenly into the model’s attention head count. For Llama 3 70B (64 heads), valid TP sizes are 2, 4, and 8. For complete setup, follow our vLLM production setup guide.
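That divisibility rule is easy to check up front. A small sketch (the function name is ours; you only need the model's attention head count from its config):

```python
def valid_tp_sizes(n_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """TP sizes up to max_gpus that split the attention heads evenly."""
    return [tp for tp in range(1, max_gpus + 1) if n_attention_heads % tp == 0]

print(valid_tp_sizes(64))  # [1, 2, 4, 8]       -> a 64-head model
print(valid_tp_sizes(48))  # [1, 2, 3, 4, 6, 8] -> a 48-head model
```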
Scaling Benchmarks: 2, 3, and 4 GPUs
We measured Llama 3 70B (AWQ 4-bit) throughput across different GPU counts on RTX 3090 clusters.
| Configuration | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Scaling Eff. (batch 1) |
|---|---|---|---|---|
| 2x RTX 3090 (TP=2) | 18 | 38 | 52 | Baseline |
| 3x RTX 3090 (TP=3) | 24 | 50 | N/A (VRAM) | ~89% |
| 4x RTX 3090 (TP=4) | 28 | 62 | 85 | ~78% |
Scaling efficiency decreases with more GPUs due to increasing communication overhead. The jump from 2 to 4 GPUs provides 1.6x throughput, not 2x. However, 4 GPUs also unlock much larger batch sizes thanks to the additional KV cache VRAM. Verify expected performance on the tokens per second benchmark.
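The efficiency column is just the measured speedup divided by the ideal linear speedup, computed here from the batch-1 column:

```python
def scaling_efficiency(tok_s: float, baseline_tok_s: float,
                       n_gpus: int, baseline_gpus: int = 2) -> float:
    """Measured speedup relative to perfect linear scaling."""
    speedup = tok_s / baseline_tok_s
    ideal = n_gpus / baseline_gpus
    return speedup / ideal

print(round(scaling_efficiency(24, 18, 3), 2))  # 0.89 -> 3 GPUs vs 2
print(round(scaling_efficiency(28, 18, 4), 2))  # 0.78 -> 4 GPUs vs 2
```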
Production Deployment Tips
Start with the minimum GPUs needed. More GPUs means more communication overhead. If 2 GPUs fit your model and batch requirements, do not use 4.
Use quantisation aggressively. AWQ 4-bit reduces a 70B model from 140 GB to 35 GB, cutting your GPU count from 8+ to 2. The quality trade-off is minimal for most applications. Read the vLLM memory optimisation guide for quantisation details.
Monitor per-GPU utilisation. In a sharded setup, imbalanced GPU utilisation indicates a problem. Both GPUs should show similar compute usage. Use the monitoring techniques in our GPU monitoring guide.
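One way to automate that check is to diff per-GPU utilisation from nvidia-smi's CSV output (the helper names and the 15-point threshold are our own choices, not a standard):

```python
import subprocess

def gpu_utilizations(csv_text: str) -> list[int]:
    """Parse output of nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def is_balanced(utils: list[int], max_spread: int = 15) -> bool:
    """In a healthy TP setup all GPUs sit within a few percent of each other."""
    return max(utils) - min(utils) <= max_spread

def read_utilizations() -> list[int]:
    """Query live per-GPU utilisation (requires an NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return gpu_utilizations(out)
```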
Consider the economics. A 2x RTX 3090 setup serving 70B at ~18 tok/s costs roughly $260/mo. Compare this against API pricing for equivalent models using the GPU vs API cost comparison. At moderate volumes, self-hosted sharded models are significantly cheaper.
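The comparison reduces to dollars per million tokens. A sketch assuming the server generates continuously at the quoted rate (real-world utilisation will be lower, raising the effective cost):

```python
def cost_per_million_tokens(monthly_cost_usd: float, tok_per_s: float,
                            utilization: float = 1.0) -> float:
    """USD per 1M generated tokens for a self-hosted server."""
    tokens_per_month = tok_per_s * utilization * 86_400 * 30
    return monthly_cost_usd / (tokens_per_month / 1_000_000)

# 2x RTX 3090 at $260/mo serving ~18 tok/s, fully utilised
print(round(cost_per_million_tokens(260, 18), 2))  # ~5.57 USD per 1M tokens
```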
Multi-GPU Servers for 70B+ Models
GigaGPU multi-GPU clusters make sharding simple. NVLink-connected GPUs, UK-hosted, with full root access for production LLM serving.
Browse GPU Servers