
Model Sharding: Run 70B+ Models Across Multiple GPUs

A practical guide to sharding 70B+ parameter models across multiple GPUs, covering VRAM requirements, sharding strategies, configuration examples, and performance scaling.

Why 70B Models Need Sharding

A 70B-parameter model at FP16 requires approximately 140 GB of VRAM for weights alone. Even at 4-bit quantisation, the weights occupy roughly 35 GB, exceeding every consumer GPU currently available. Deploying these models on a dedicated GPU server requires splitting the model across multiple GPUs — a process called model sharding.
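The arithmetic behind these figures is simple: parameter count times bytes per parameter. A quick sketch (using decimal gigabytes, and treating "70B" as a nominal count):

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Estimate VRAM needed for model weights alone, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

fp16 = weight_vram_gb(70, 16)  # 140.0 GB -- weights alone at FP16
int4 = weight_vram_gb(70, 4)   # 35.0 GB  -- same model at 4-bit
```

KV cache, activations, and framework overhead come on top of these weight figures, which is why the totals in the table below are higher.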

Models at this scale deliver substantially better quality than their 7-13B counterparts, making them attractive for production use. The challenge is engineering an efficient multi-GPU deployment that maintains acceptable latency. For teams hosting open-source LLMs, sharding is the gateway to serving the most capable models.

VRAM Requirements by Model Size

Here is roughly how much VRAM each model needs, with an example GPU configuration that fits it.

| Model | Quant | Weight Size | KV Cache (batch 4) | Total VRAM | GPUs Needed |
|---|---|---|---|---|---|
| Llama 3 70B | AWQ 4-bit | ~35 GB | ~10 GB | ~45 GB | 2x RTX 3090 |
| Llama 3 70B | FP16 | ~140 GB | ~10 GB | ~150 GB | 4x RTX 6000 Pro 96 GB |
| Mixtral 8x22B | AWQ 4-bit | ~44 GB | ~12 GB | ~56 GB | 3x RTX 3090 |
| Llama 3.1 405B | AWQ 4-bit | ~200 GB | ~20 GB | ~220 GB | 4x RTX 6000 Pro 96 GB |
| Qwen 2.5 72B | GPTQ 4-bit | ~36 GB | ~10 GB | ~46 GB | 2x RTX 3090 |

Use the LLM cost calculator to estimate monthly costs for different GPU configurations.
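The KV cache column can be estimated from the model's architecture. A sketch assuming Llama 3 70B's published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache); note this gives the strict minimum, while vLLM typically preallocates more as paged cache blocks:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: a K and a V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(80, 8, 128)  # 327,680 bytes (~0.33 MB) per token
batch_gb = per_tok * 4 * 4096 / 1e9       # batch 4, 4096-token context: ~5.4 GB
```

The gap between this minimum and the ~10 GB figure in the table is the headroom vLLM reserves for concurrent sequences.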

Sharding Strategies for Inference

Tensor parallelism (TP). Splits each layer’s weights across GPUs. All GPUs participate in every token generation. Best for low-latency inference with high-bandwidth GPU interconnects. This is the most common strategy for 2-4 GPU setups.

Pipeline parallelism (PP). Assigns different layers to different GPUs. Data flows through GPUs sequentially. Lower communication overhead but higher per-request latency. Useful when GPU interconnect bandwidth is limited.

Combined TP + PP. For 4+ GPUs, combine both: tensor parallel within a pair of GPUs, pipeline parallel across pairs. This balances communication overhead with scaling efficiency.
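One way to picture the combined layout is a grid that maps each GPU to a (pipeline stage, tensor shard) coordinate. A sketch for 4 GPUs with TP=2 and PP=2 (illustrative only, not vLLM's internal rank ordering):

```python
def rank_grid(n_gpus: int, tp: int) -> dict:
    """Map flat GPU ranks to (pipeline_stage, tensor_shard) pairs."""
    assert n_gpus % tp == 0, "GPU count must be divisible by the TP degree"
    return {gpu: (gpu // tp, gpu % tp) for gpu in range(n_gpus)}

print(rank_grid(4, tp=2))
# {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
```

GPUs 0-1 hold the first half of the layers (sharded between them), GPUs 2-3 the second half.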

For a deeper comparison of these approaches, see our AI hosting infrastructure articles. All of these strategies require multi-GPU clusters with fast inter-GPU communication.

Setup Guide: Sharding with vLLM

vLLM handles model sharding automatically once you specify the parallelism degree. Here is how to serve a 70B model across 2 GPUs using vLLM.

```bash
# Shard Llama 3 70B across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-70B-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-seqs 8 \
  --port 8000

# Verify both GPUs are loaded
nvidia-smi
# Both GPUs should show ~18-20 GB used each
```
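Once the server reports ready, it exposes an OpenAI-compatible API on the chosen port. A minimal stdlib client sketch (the model field must match whatever you passed to --model; uncomment the last line with the server actually running):

```python
import json
import urllib.request

def completion_request(prompt: str, model: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/completions request payload."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

payload = completion_request("Hello", "TheBloke/Llama-3-70B-AWQ")
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))
```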

```bash
# For 3 GPUs with Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x22B-AWQ \
  --tensor-parallel-size 3 \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```

The tensor-parallel-size must divide evenly into the model's attention head count. Llama 3 70B has 64 query heads (and 8 KV heads under grouped-query attention), so practical TP sizes are 2, 4, and 8. For complete setup, follow our vLLM production setup guide.
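A quick divisibility check, assuming Llama 3 70B's 64 query heads and 8 KV heads (this conservative version also requires the KV head count to divide evenly, since TP degrees beyond it force KV-head replication):

```python
def valid_tp_sizes(n_heads: int, n_kv_heads: int, max_gpus: int = 8) -> list:
    """TP degrees that divide both the query and KV head counts evenly."""
    return [tp for tp in range(1, max_gpus + 1)
            if n_heads % tp == 0 and n_kv_heads % tp == 0]

print(valid_tp_sizes(64, 8))  # [1, 2, 4, 8]
```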

Scaling Benchmarks: 2, 3, and 4 GPUs

We measured Llama 3 70B (AWQ 4-bit) throughput across different GPU counts on RTX 3090 clusters.

| Configuration | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Scaling Eff. |
|---|---|---|---|---|
| 2x RTX 3090 (TP=2) | 18 | 38 | 52 | Baseline |
| 3x RTX 3090 (TP=3) | 24 | 50 | N/A (VRAM) | ~89% |
| 4x RTX 3090 (TP=4) | 28 | 62 | 85 | ~78% |

Scaling efficiency decreases with more GPUs due to increasing communication overhead. The jump from 2 to 4 GPUs provides 1.6x throughput, not 2x. However, 4 GPUs also unlock much larger batch sizes thanks to the additional KV cache VRAM. Verify expected performance on the tokens per second benchmark.
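The efficiency column follows directly from throughput relative to perfect linear scaling off the 2-GPU baseline:

```python
def scaling_efficiency(base_tps: float, base_gpus: int,
                       tps: float, gpus: int) -> float:
    """Observed speedup as a fraction of ideal linear scaling."""
    return (tps / base_tps) / (gpus / base_gpus)

three = scaling_efficiency(18, 2, 24, 3)  # ~0.89 for TP=3
four = scaling_efficiency(18, 2, 28, 4)   # ~0.78 for TP=4
```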

Production Deployment Tips

Start with the minimum GPUs needed. More GPUs means more communication overhead. If 2 GPUs fit your model and batch requirements, do not use 4.

Use quantisation aggressively. AWQ 4-bit reduces a 70B model from 140 GB to 35 GB, cutting your GPU count from 8+ to 2. The quality trade-off is minimal for most applications. Read the vLLM memory optimisation guide for quantisation details.

Monitor per-GPU utilisation. In a sharded setup, imbalanced GPU utilisation indicates a problem. Both GPUs should show similar compute usage. Use the monitoring techniques in our GPU monitoring guide.
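A simple imbalance check can be scripted over nvidia-smi's CSV query output. A sketch (the 15-point tolerance and the sample output are illustrative):

```python
import subprocess

def gpu_utilizations(raw_csv: str) -> list:
    """Parse output of: nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"""
    return [int(line.strip().rstrip(" %"))
            for line in raw_csv.splitlines() if line.strip()]

def imbalanced(utils: list, tolerance: int = 15) -> bool:
    """Flag when GPUs diverge by more than `tolerance` utilization points."""
    return max(utils) - min(utils) > tolerance

# raw = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
#     text=True)
sample = "96 %\n93 %\n"  # healthy TP=2: near-identical load on both GPUs
assert not imbalanced(gpu_utilizations(sample))
```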

Consider the economics. A 2x RTX 3090 setup serving 70B at ~18 tok/s costs roughly $260/mo. Compare this against API pricing for equivalent models using the GPU vs API cost comparison. At moderate volumes, self-hosted sharded models are significantly cheaper.
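A back-of-envelope cost-per-token calculation from the figures above ($260/mo, ~18 tok/s single-stream); real deployments run below 100% duty cycle, so treat this as a floor:

```python
def cost_per_million_tokens(monthly_usd: float, tok_per_s: float,
                            utilization: float = 1.0) -> float:
    """USD per million generated tokens for a flat-rate server."""
    seconds_per_month = 30 * 24 * 3600
    tokens = tok_per_s * seconds_per_month * utilization
    return monthly_usd / (tokens / 1e6)

ideal = cost_per_million_tokens(260, 18)  # ~$5.57 per million tokens at 100% duty
```

Batching pushes the effective rate well past 18 tok/s, so loaded servers come in cheaper per token than this single-stream figure.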

Multi-GPU Servers for 70B+ Models

GigaGPU multi-GPU clusters make sharding simple. NVLink-connected GPUs, UK-hosted, with full root access for production LLM serving.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
