Why 70B Models Need Sharding
A 70B-parameter model at FP16 requires approximately 140 GB of VRAM for weights alone. Even at 4-bit quantisation, the weights occupy roughly 35 GB, exceeding every consumer GPU currently available. Deploying these models on a dedicated GPU server requires splitting the model across multiple GPUs — a process called model sharding.
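That arithmetic generalises to any model size. A back-of-envelope sketch (weights only; KV cache, activations, and runtime overhead come on top):

```python
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone."""
    bytes_per_param = bits_per_param / 8
    # 1B params at 1 byte/param is ~1 GB
    return params_billions * bytes_per_param

print(weight_vram_gb(70, 16))  # 140.0 -> FP16
print(weight_vram_gb(70, 4))   # 35.0  -> 4-bit quantisation
```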
Models at this scale deliver substantially better quality than their 7-13B counterparts, making them attractive for production use. The challenge is engineering an efficient multi-GPU deployment that maintains acceptable latency. For teams hosting open-source LLMs, sharding is the gateway to serving the most capable models.
VRAM Requirements by Model Size
Here is approximately how much VRAM different model sizes need, with example GPU configurations (RTX 3090: 24 GB; RTX 6000 Pro: 96 GB).
| Model | Quant | Weight Size | KV Cache (batch 4) | Total VRAM | GPUs Needed |
|---|---|---|---|---|---|
| Llama 3 70B | AWQ 4-bit | ~35 GB | ~10 GB | ~45 GB | 2x RTX 3090 |
| Llama 3 70B | FP16 | ~140 GB | ~10 GB | ~150 GB | 2x RTX 6000 Pro 96 GB |
| Mixtral 8x22B | AWQ 4-bit | ~70 GB | ~12 GB | ~82 GB | 4x RTX 3090 |
| Llama 3.1 405B | AWQ 4-bit | ~200 GB | ~20 GB | ~220 GB | 4x RTX 6000 Pro 96 GB |
| Qwen 2.5 72B | GPTQ 4-bit | ~36 GB | ~10 GB | ~46 GB | 2x RTX 3090 |
Use the LLM cost calculator to estimate monthly costs for different GPU configurations.
Sharding Strategies for Inference
Tensor parallelism (TP). Splits each layer’s weights across GPUs. All GPUs participate in every token generation. Best for low-latency inference with high-bandwidth GPU interconnects. This is the most common strategy for 2-4 GPU setups.
Pipeline parallelism (PP). Assigns different layers to different GPUs. Data flows through GPUs sequentially. Lower communication overhead but higher per-request latency. Useful when GPU interconnect bandwidth is limited.
Combined TP + PP. For 4+ GPUs, combine both: tensor parallel within a pair of GPUs, pipeline parallel across pairs. This balances communication overhead with scaling efficiency.
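To make the combined layout concrete, the two degrees multiply to the GPU count. A hypothetical helper (the function and its checks are ours, not a vLLM API) that validates a TP x PP plan:

```python
def plan_parallelism(n_gpus: int, tp: int, pp: int, n_heads: int) -> bool:
    """Check that a TP x PP layout is feasible for a given GPU count."""
    if tp * pp != n_gpus:
        return False  # every GPU must belong to exactly one TP group
    if n_heads % tp != 0:
        return False  # attention heads must split evenly across TP ranks
    return True

# 4 GPUs: TP=2 within each pair, PP=2 across pairs, on a 64-head model
print(plan_parallelism(4, tp=2, pp=2, n_heads=64))  # True
```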
For a deeper comparison of these approaches, see our AI hosting infrastructure articles. Both strategies require multi-GPU clusters with fast inter-GPU communication.
Setup Guide: Sharding with vLLM
vLLM handles model sharding automatically once you specify the parallelism degree. Here is how to serve a 70B model across 2 GPUs using vLLM.
# Shard Llama 3 70B across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-70b-instruct-awq \
    --tensor-parallel-size 2 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --max-num-seqs 8 \
    --port 8000
# Verify both GPUs are loaded
nvidia-smi
# Each GPU should show ~21-22 GB used (vLLM pre-allocates 90% of the 24 GB)
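Once the server is up it speaks the OpenAI-compatible HTTP API, so any client can query it. A minimal stdlib-only sketch (the endpoint path is vLLM's; the helper names are ours):

```python
import json
import urllib.request

def completion_payload(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    """JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()

def complete(model: str, prompt: str, host: str = "http://localhost:8000") -> str:
    """POST a completion request to a running vLLM server."""
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=completion_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Pass the same model name you gave `--model`; vLLM rejects requests whose model field does not match.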
# For 4 GPUs with Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x22B-AWQ \
    --tensor-parallel-size 4 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096
The tensor-parallel-size must divide evenly into the model’s attention head count. For Llama 3 70B (64 heads), valid TP sizes are 2, 4, and 8. For complete setup, follow our vLLM production setup guide.
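That divisibility rule is easy to check up front. A small sketch (the function name is ours; you only need the model's attention head count from its config):

```python
def valid_tp_sizes(n_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """TP sizes up to max_gpus that split the attention heads evenly."""
    return [tp for tp in range(1, max_gpus + 1) if n_attention_heads % tp == 0]

print(valid_tp_sizes(64))  # [1, 2, 4, 8]       -> a 64-head model
print(valid_tp_sizes(48))  # [1, 2, 3, 4, 6, 8] -> a 48-head model
```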
Scaling Benchmarks: 2, 3, and 4 GPUs
We measured Llama 3 70B (AWQ 4-bit) throughput across different GPU counts on RTX 3090 clusters.
| Configuration | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Scaling Eff. (batch 1) |
|---|---|---|---|---|
| 2x RTX 3090 (TP=2) | 18 | 38 | 52 | Baseline |
| 3x RTX 3090 (TP=3) | 24 | 50 | N/A (VRAM) | ~89% |
| 4x RTX 3090 (TP=4) | 28 | 62 | 85 | ~78% |
Scaling efficiency decreases with more GPUs due to increasing communication overhead. The jump from 2 to 4 GPUs provides 1.6x throughput, not 2x. However, 4 GPUs also unlock much larger batch sizes thanks to the additional KV cache VRAM. Verify expected performance on the tokens per second benchmark.
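The efficiency column is just the measured speedup divided by the ideal linear speedup, computed here from the batch-1 column:

```python
def scaling_efficiency(tok_s: float, baseline_tok_s: float,
                       n_gpus: int, baseline_gpus: int = 2) -> float:
    """Measured speedup relative to perfect linear scaling."""
    speedup = tok_s / baseline_tok_s
    ideal = n_gpus / baseline_gpus
    return speedup / ideal

print(round(scaling_efficiency(24, 18, 3), 2))  # 0.89 -> 3 GPUs vs 2
print(round(scaling_efficiency(28, 18, 4), 2))  # 0.78 -> 4 GPUs vs 2
```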
Production Deployment Tips
Start with the minimum GPUs needed. More GPUs means more communication overhead. If 2 GPUs fit your model and batch requirements, do not use 4.
Use quantisation aggressively. AWQ 4-bit reduces a 70B model from 140 GB to 35 GB, cutting your GPU count from 8+ to 2. The quality trade-off is minimal for most applications. Read the vLLM memory optimisation guide for quantisation details.
Monitor per-GPU utilisation. In a sharded setup, imbalanced GPU utilisation indicates a problem. Both GPUs should show similar compute usage. Use the monitoring techniques in our GPU monitoring guide.
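One way to automate that check is to diff per-GPU utilisation from nvidia-smi's CSV output (the helper names and the 15-point threshold are our own choices, not a standard):

```python
import subprocess

def gpu_utilizations(csv_text: str) -> list[int]:
    """Parse output of nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def is_balanced(utils: list[int], max_spread: int = 15) -> bool:
    """In a healthy TP setup all GPUs sit within a few percent of each other."""
    return max(utils) - min(utils) <= max_spread

def read_utilizations() -> list[int]:
    """Query live per-GPU utilisation (requires an NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return gpu_utilizations(out)
```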
Consider the economics. A 2x RTX 3090 setup serving 70B at ~18 tok/s costs roughly $260/mo. Compare this against API pricing for equivalent models using the GPU vs API cost comparison. At moderate volumes, self-hosted sharded models are significantly cheaper.
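The comparison reduces to dollars per million tokens. A sketch assuming the server generates continuously at the quoted rate (real-world utilisation will be lower, raising the effective cost):

```python
def cost_per_million_tokens(monthly_cost_usd: float, tok_per_s: float,
                            utilization: float = 1.0) -> float:
    """USD per 1M generated tokens for a self-hosted server."""
    tokens_per_month = tok_per_s * utilization * 86_400 * 30
    return monthly_cost_usd / (tokens_per_month / 1_000_000)

# 2x RTX 3090 at $260/mo serving ~18 tok/s, fully utilised
print(round(cost_per_million_tokens(260, 18), 2))  # ~5.57 USD per 1M tokens
```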
Multi-GPU Servers for 70B+ Models
GigaGPU multi-GPU clusters make sharding simple. NVLink-connected GPUs, UK-hosted, with full root access for production LLM serving.
Browse GPU Servers