Llama 3.1 70B is the most-asked “can I run this?” model of the last twelve months. Quantised to INT4, it is tantalisingly close to fitting on consumer hardware, but close is not close enough. This guide nails down the exact weight sizes for AWQ, GPTQ, and GGUF Q4_K_M; explains why a 16 GB RTX 5060 Ti cannot host 70B; and lists the 48 GB+ cards that can on our UK dedicated GPU hosting.
Contents
- Exact weight sizes at INT4
- KV cache and activation overhead
- Which GPUs can host it
- CPU/disk offload options
- When to upgrade
- Alternatives that fit on 16 GB
Exact weight sizes at INT4
70 billion parameters at a nominal 4 bits per weight works out to 35 GB before any metadata. Each quantiser then adds overhead for group scales, zero points, and in some cases retained high-precision outlier weights; the sketch after the table shows the raw arithmetic:
| Format | Effective bits/weight | Weight size | Notes |
|---|---|---|---|
| AWQ 4-bit, g=128 | 4.25 | ~35.0 GB | Best quality at 4 bits, vLLM native |
| GPTQ 4-bit, g=128, act-order | 4.40 | ~36.1 GB | Slightly larger, broad tooling |
| GGUF Q4_K_M | 4.82 | ~40.5 GB | llama.cpp, mixed precision per tensor |
| GGUF Q4_K_S | 4.58 | ~38.5 GB | Smaller Q4 variant |
| GGUF IQ3_M | 3.66 | ~30.8 GB | Lower quality, fits tighter |
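If you want to sanity-check a quant before downloading it, the arithmetic is easy to script. Below is a minimal Python sketch using the effective bits-per-weight figures from the table; it prints the raw weight footprint only, so expect published checkpoint files to differ by a gigabyte or two once metadata and any tensors kept at higher precision are included:

```python
def quant_weight_size(params: float, bits_per_weight: float) -> tuple[float, float]:
    """Return (decimal GB, binary GiB) for a model's quantised weights."""
    size_bytes = params * bits_per_weight / 8
    return size_bytes / 1e9, size_bytes / 2**30

# Effective bits/weight figures from the table above, applied to 70B parameters
formats = {
    "AWQ 4-bit g=128": 4.25,
    "GPTQ 4-bit g=128": 4.40,
    "GGUF Q4_K_M": 4.82,
    "GGUF IQ3_M": 3.66,
}
for name, bpw in formats.items():
    gb, gib = quant_weight_size(70e9, bpw)
    print(f"{name:17s} ~{gb:.1f} GB ({gib:.1f} GiB)")
```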
KV cache and activation overhead
Weights are only part of the story. Llama 3.1 70B has 80 layers, 8 KV heads, and a head dimension of 128, so the FP16 KV cache costs 2 bytes × 2 (K+V) × 80 × 8 × 128 ≈ 327 KB per token. Multiply by your context length:
| Context length | KV per sequence | KV for 4 concurrent | Extra activation |
|---|---|---|---|
| 4,096 | 1.3 GB | 5.2 GB | ~1 GB |
| 8,192 | 2.6 GB | 10.4 GB | ~1.5 GB |
| 16,384 | 5.2 GB | 20.8 GB | ~2 GB |
| 32,768 | 10.4 GB | 41.6 GB | ~3 GB |
Total minimum for AWQ Llama 3.1 70B with 4 concurrent sequences at 8k context: 35 + 10.4 + 1.5 = ~47 GB. This is why 48 GB is the practical floor.
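The same budget is easy to script. Here is a minimal sketch of the arithmetic, reusing the figures above (80 layers, 8 KV heads, head dim 128, FP16 cache, ~35 GB AWQ weights, ~1.5 GB of activations at 8k); nothing here is measured, it is just the back-of-envelope sum:

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """FP16 K+V cache in bytes for one sequence of `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def total_vram_gb(weights_gb: float, context: int, concurrency: int,
                  activations_gb: float) -> float:
    """Rough serving budget: weights + KV for all sequences + activations."""
    return weights_gb + kv_cache_bytes(context) * concurrency / 1e9 + activations_gb

# AWQ weights (~35 GB), 4 concurrent sequences at 8k context, ~1.5 GB activations
print(f"~{total_vram_gb(35.0, 8192, 4, 1.5):.0f} GB")  # ~47 GB
```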
Which GPUs can host it
| GPU | VRAM | Memory BW | Fits 70B AWQ + 8k ctx? | Notes |
|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | 448 GB/s | No | Not even close |
| RTX 3090 24GB | 24 GB | 936 GB/s | No | Weights alone don’t fit |
| RTX 4090 24GB | 24 GB | 1008 GB/s | No | Weights alone don’t fit |
| RTX 5090 32GB | 32 GB | 1.8 TB/s | No | AWQ weights (~35 GB) exceed 32 GB; IQ3_M fits with no KV headroom |
| 2x RTX 3090 (NVLink) | 48 GB | 936 GB/s each | Yes, 8k ctx | Tensor parallelism needed |
| RTX 6000 Pro 96GB | 96 GB | 1.4 TB/s | Yes, 128k ctx, bs=8+ | Ideal single-card |
| A100 40GB | 40 GB | 1.55 TB/s | Tight at 2k ctx | AWQ only, small batch |
| A100 80GB | 80 GB | 2.0 TB/s | Yes, 32k ctx, bs=4 | Production-grade |
| H100 80GB | 80 GB | 3.35 TB/s | Yes, 32k ctx, bs=8 | Throughput leader |
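For the dual-RTX 3090 route, tensor parallelism is a one-line setting in vLLM. A minimal sketch, assuming both cards are visible to the process and an AWQ export of Llama 3.1 70B; the model ID below is illustrative, so substitute whichever AWQ checkpoint you actually use:

```python
from vllm import LLM, SamplingParams

# Split the ~35 GB of AWQ weights across two 24 GB cards.
# The model ID is illustrative; any 4-bit AWQ export of Llama 3.1 70B works.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    tensor_parallel_size=2,       # two RTX 3090s over NVLink
    max_model_len=8192,           # 8k context keeps the KV cache within budget
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["Explain KV-cache growth in one paragraph."],
                       SamplingParams(max_tokens=200))
print(outputs[0].outputs[0].text)
```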
CPU/disk offload options
llama.cpp can offload layers to system RAM. With a 32-core EPYC and DDR5, a 70B Q4_K_M with 40 layers on GPU and 40 on CPU runs at roughly 3-5 tokens/s per sequence. That is fine for batch summarisation but unusable for chat. Disk offload (DeepSpeed ZeRO-Infinity) drops to sub-1 t/s and is rarely worth the complexity.
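If you do take the offload route, the GPU/CPU split is a single knob. A minimal sketch with llama-cpp-python, assuming a CUDA-enabled build; the GGUF path is illustrative, and n_gpu_layers picks how many of the 80 layers stay in VRAM:

```python
from llama_cpp import Llama

# Keep half of the 80 layers on the GPU, the rest in system RAM.
# The path is illustrative; point it at your local Q4_K_M GGUF file.
llm = Llama(
    model_path="./llama-3.1-70b-instruct-Q4_K_M.gguf",
    n_gpu_layers=40,   # 40 layers on GPU, 40 on CPU, as in the example above
    n_ctx=8192,
)
resp = llm("Summarise the trade-offs of CPU offload in two sentences.",
           max_tokens=128)
print(resp["choices"][0]["text"])
```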
When to upgrade
The honest answer: if you genuinely need 70B quality, skip the 16 GB class entirely. The cleanest single-card path is the RTX 6000 Pro 96GB, which holds 70B AWQ with 128k context and runs it at roughly 35 tokens/s. The budget path is dual RTX 3090s with NVLink, which hold 70B AWQ at 8k context but sacrifice newer features like FP8.
Alternatives that fit on 16 GB
In many workloads a well-tuned 8B or 14B model beats a clumsily offloaded 70B on latency and cost. Consider:
- Llama 3.1 8B: 100 t/s FP8 on a 5060 Ti.
- Qwen 2.5 14B: often matches 70B for coding and maths tasks.
- Gemma 2 9B: strong English chat quality.
For the full VRAM reference see 8B LLM VRAM requirements and Qwen 2.5 32B VRAM requirements.
Need 70B? Start with the right card.
48 GB or more, FP8, production-ready. UK dedicated hosting.
Browse dedicated GPU hosting
See also: upgrade to RTX 6000 Pro, upgrade to RTX 5090, 32B VRAM requirements, 8B VRAM requirements, max model size.