The 8B class (Llama 3.1 8B, Mistral 7B v0.3, Qwen 2.5 7B, Gemma 2 9B) is where most self-hosted LLM traffic sits. It is small enough to fit on a 16 GB card like the RTX 5060 Ti and big enough to do real work. This article gives the exact weight math, the KV cache formula, and a GPU compatibility table, all on our UK dedicated GPU hosting.
Contents
- Weight size at each precision
- KV cache formula
- Context length impact
- Which GPUs fit
- Throughput per card
- Picking the right model
Weight size at each precision
A nominal 8B model is 8 billion parameters; Llama 3.1 8B is 8.03B, Mistral 7B is 7.24B, Qwen 2.5 7B is 7.62B, Gemma 2 9B is 9.24B. Weight footprint follows directly from bytes per parameter:
| Model | Params | FP16 | FP8 | AWQ INT4 |
|---|---|---|---|---|
| Llama 3.1 8B | 8.03B | 16.1 GB | 8.1 GB | 5.4 GB |
| Mistral 7B v0.3 | 7.24B | 14.5 GB | 7.3 GB | 4.9 GB |
| Qwen 2.5 7B | 7.62B | 15.2 GB | 7.7 GB | 5.1 GB |
| Gemma 2 9B | 9.24B | 18.4 GB | 9.2 GB | 6.1 GB |
Note: Llama 3.1 8B at FP16 is already 16.1 GB of weights alone, so it cannot run on a 16 GB card at full precision once activations and CUDA overhead are counted. FP8 or AWQ is mandatory on the 5060 Ti. Gemma 2 9B is the borderline case: even at FP8 (9.2 GB) it leaves only ~6 GB of KV headroom on a 16 GB card.
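The table values are a straight bytes-per-parameter multiplication. A minimal sketch, using the parameter counts from the table above (GB here means 10^9 bytes; AWQ is omitted because its effective bytes/param depends on group size and which layers stay unquantized):

```python
# Weight footprint = parameter count × bytes per parameter.
# Parameter counts are taken from the table above.
PARAMS = {
    "Llama 3.1 8B":    8.03e9,
    "Mistral 7B v0.3": 7.24e9,
    "Qwen 2.5 7B":     7.62e9,
    "Gemma 2 9B":      9.24e9,
}

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (10^9 bytes) at a given precision."""
    return params * bytes_per_param / 1e9

for name, p in PARAMS.items():
    print(f"{name}: FP16 {weight_gb(p, 2):.1f} GB, FP8 {weight_gb(p, 1):.1f} GB")
```

Running this reproduces the FP16 and FP8 columns to within rounding (served checkpoints add a small amount of non-quantized tensors, which is why published sizes can land ~0.1 GB higher).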
KV cache formula
The formula is: KV per token = 2 * bytes * num_layers * num_kv_heads * head_dim. The first 2 is for K and V. At FP16, bytes = 2.
| Model | Layers | KV heads (GQA) | Head dim | KV/token (FP16) |
|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | 131 KB |
| Mistral 7B v0.3 | 32 | 8 | 128 | 131 KB |
| Qwen 2.5 7B | 28 | 4 | 128 | 57 KB |
| Gemma 2 9B | 42 | 8 | 128 | 172 KB |
Qwen 2.5 7B is notably KV-efficient thanks to aggressive GQA: 57 KB/token vs 131 KB for Llama 3.1 8B. This means Qwen supports 2.3x the context per GB of KV cache.
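The formula and the table above can be checked in a few lines; a minimal sketch with the layer/head/dim values from the table:

```python
def kv_per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache per token: 2 (K and V) × bytes × layers × kv_heads × head_dim.
    bytes_per_elem = 2 for FP16, 1 for FP8 KV cache."""
    return 2 * bytes_per_elem * layers * kv_heads * head_dim

llama = kv_per_token_bytes(32, 8, 128)   # 131,072 bytes ≈ 131 KB
qwen = kv_per_token_bytes(28, 4, 128)    # 57,344 bytes ≈ 57 KB
print(llama / qwen)                      # ≈ 2.29 — the 2.3× context-per-GB gap
```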
Context length impact
Take Llama 3.1 8B with FP8 weights (8.1 GB) on a 16 GB card. After ~500 MB of CUDA overhead, roughly 7.4 GB remains for KV cache:
| Context | KV/sequence (FP16) | Concurrent sequences (Llama 3.1 8B) | Concurrent (Qwen 2.5 7B) |
|---|---|---|---|
| 4,096 | 0.52 GB | ~14 | ~32 |
| 8,192 | 1.05 GB | ~7 | ~16 |
| 16,384 | 2.10 GB | ~3 | ~8 |
| 32,768 | 4.20 GB | ~1 | ~4 |
| 65,536 | 8.40 GB | Does not fit | ~2 |
FP8 KV cache (available in vLLM 0.6+) halves the per-sequence KV footprint, roughly doubling the concurrency figures above, with a < 0.1 MMLU drop. For a deep dive see our context budget article.
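The concurrency column is just the KV budget divided by KV per sequence. A minimal sketch, assuming the article's ~500 MB CUDA overhead figure (small rounding differences against the table are expected):

```python
def max_sequences(vram_gb: float, weights_gb: float,
                  kv_per_token: int, ctx: int,
                  overhead_gb: float = 0.5) -> int:
    """Concurrent sequences = (VRAM − weights − overhead) / (KV per token × context)."""
    budget_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return int(budget_bytes // (kv_per_token * ctx))

# Llama 3.1 8B: FP8 weights (8.1 GB), FP16 KV (131,072 B/token), 16 GB card.
for ctx in (4096, 8192, 16384, 32768):
    print(ctx, max_sequences(16, 8.1, 131_072, ctx))
```

Swapping in Qwen's 57,344 B/token reproduces the second concurrency column the same way.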
Which GPUs fit 8B LLMs
| GPU | VRAM | FP16 + 8k ctx | FP8 + 8k ctx | AWQ + 32k ctx |
|---|---|---|---|---|
| RTX 3050 8GB | 8 GB | No | Tight (bs=1, 2k) | Yes, bs=1 |
| RTX 4060 Ti 16GB | 16 GB | No | Yes (Ada has FP8) | Yes |
| RTX 3090 24GB | 24 GB | Yes, bs=2 | n/a (no FP8) | Yes, bs=8 |
| RTX 5060 Ti 16GB | 16 GB | No | Yes, bs=6 | Yes, bs=4 |
| RTX 5090 32GB | 32 GB | Yes, bs=4 | Yes, bs=16 | Yes, bs=32 |
| RTX 6000 Pro 96GB | 96 GB | Yes, bs=32 | Yes, bs=64+ | Yes, bs=128+ |
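The yes/no cells above all come down to one inequality: weights plus KV for the target batch and context must fit under usable VRAM. A minimal sketch, again assuming ~500 MB overhead (it ignores activation memory, so it is slightly optimistic versus real serving):

```python
def fits(vram_gb: float, weights_gb: float, kv_per_token: int,
         ctx: int, batch: int, overhead_gb: float = 0.5) -> bool:
    """True if weights + KV cache + overhead fit in VRAM."""
    kv_gb = kv_per_token * ctx * batch / 1e9
    return weights_gb + kv_gb + overhead_gb <= vram_gb

# RTX 5060 Ti 16 GB, Llama 3.1 8B FP8 weights (8.1 GB), FP16 KV, 8k context:
print(fits(16, 8.1, 131_072, 8192, batch=6))   # True
print(fits(16, 8.1, 131_072, 8192, batch=8))   # False
```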
Throughput per card
Throughput for Llama 3.1 8B, single-stream (bs=1) and batched (bs=16):
| GPU | Precision | Tokens/s (bs=1) | Tokens/s (bs=16) |
|---|---|---|---|
| RTX 5060 Ti 16GB | FP8 | 100 | ~720 |
| RTX 3090 24GB | FP16 | 90 | ~520 |
| RTX 5090 32GB | FP8 | 175 | ~1,800 |
| RTX 6000 Pro 96GB | FP8 | 140 | ~2,100 |
| H100 80GB | FP8 | 210 | ~3,400 |
Picking the right 8B model
- Llama 3.1 8B: safest default, 128k context, strong tool use. See benchmark.
- Qwen 2.5 7B: best KV efficiency, strong maths and coding.
- Mistral 7B v0.3: compact, sliding window, good for low-VRAM work.
- Gemma 2 9B: highest English MMLU in the class. See Gemma 2 guide.
Host 8B LLMs with headroom for real context
FP8 weights, FP8 KV cache, 16 GB GDDR7. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: Llama 3 8B benchmark, Gemma 9B benchmark, context budget, 32B VRAM requirements, 6 GB VRAM models.