Gemma 2 9B from Google fits comfortably on the RTX 5060 Ti 16GB we host. The full measured numbers:
Setup
- Model: google/gemma-2-9b-it
- 42 layers, 8 KV heads, 256 head dim, alternating sliding-window/global attention
- Native context: 8,192 tokens
- vLLM 0.6.4, FlashAttention 2.6
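For reference, a minimal sketch of how a comparable engine could be brought up with vLLM's offline API. The exact launch flags aren't listed above, so the `gpu_memory_utilization` value here is an assumption, not our production config:

```python
# Sketch: Gemma 2 9B with on-the-fly FP8 weights and an FP8 KV cache
# (the "FP8 + FP8 KV" row below). Flags are assumptions, not our exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-9b-it",
    quantization="fp8",           # dynamic FP8 weight quantization
    kv_cache_dtype="fp8",         # FP8 KV cache
    max_model_len=8192,           # Gemma 2's native context
    gpu_memory_utilization=0.92,  # assumed; leaves headroom on 16 GB
)

out = llm.generate(["Explain KV caching in one paragraph."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```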
Decode Throughput
| Precision | Weights | t/s (batch 1) |
|---|---|---|
| FP16 | 18 GB | Does not fit |
| FP8 | 9.5 GB | 94 |
| FP8 + FP8 KV | 9.5 GB | 98 |
| AWQ INT4 | 6.2 GB | 115 |
| GGUF Q4_K_M | 5.4 GB | 82 |
| EXL2 4.0 bpw | 5.8 GB | 120 |
At the same precision, Gemma 2 9B decodes slower than Llama 3 8B: its head dim is 256 rather than 128, so attention costs more FLOPs per token, and the weights are larger (9.2B vs 8.0B parameters). The back-of-envelope comparison below makes the gap concrete.
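A quick KV-cache comparison against Llama 3 8B (32 layers, 8 KV heads, 128 head dim), ignoring that the sliding-window layers cap their cache at the window size, which would shrink Gemma 2's number somewhat:

```python
# KV-cache bytes per token = layers * 2 (K and V) * kv_heads * head_dim * bytes/elem
def kv_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    return layers * 2 * kv_heads * head_dim * bytes_per_elem

gemma2_9b = kv_per_token(42, 8, 256, 2)  # FP16: 344,064 B ~= 336 KiB/token
llama3_8b = kv_per_token(32, 8, 128, 2)  # FP16: 131,072 B  = 128 KiB/token
print(gemma2_9b / llama3_8b)             # ~2.6x more cache traffic per token
print(gemma2_9b * 8192 / 2**30)          # full 8k context: ~2.6 GiB at FP16
```

Decode is memory-bandwidth bound, so 2.6x the cache traffic per token shows up directly in the batch-1 numbers.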
Prefill Throughput
- FP8: 5,400 t/s
- AWQ INT4: 3,600 t/s
- GGUF Q4_K_M: 2,800 t/s
- EXL2 4.0 bpw: 4,100 t/s
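Prefill figures like these can be approximated by timing a long prompt with a single output token. A hedged sketch, reusing the `llm` engine from the setup sketch above; the 4k-token prompt is an assumption:

```python
import time
from vllm import SamplingParams

prompt = "benchmark " * 4000  # roughly a 4k-token prompt
t0 = time.perf_counter()
out = llm.generate([prompt], SamplingParams(max_tokens=1))
elapsed = time.perf_counter() - t0

n_prompt = len(out[0].prompt_token_ids)
# Includes one decode step plus scheduling overhead, so this slightly
# understates pure prefill throughput.
print(f"{n_prompt / elapsed:.0f} prefill t/s")
```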
Concurrency
FP8 + FP8 KV, 256 tokens in / 512 tokens out (a load-test sketch follows the table):
| Users | Total t/s | Per user |
|---|---|---|
| 1 | 98 | 98 |
| 4 | 305 | 76 |
| 8 | 430 | 54 |
| 16 | 510 | 32 |
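Numbers like these come from firing N simultaneous requests at the OpenAI-compatible endpoint and dividing completed tokens by wall time. A minimal sketch; the endpoint URL and prompt are assumptions:

```python
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="google/gemma-2-9b-it",
        messages=[{"role": "user", "content": "Write 500 words on UK datacentres."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

async def run(users: int) -> None:
    t0 = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(users)])
    elapsed = time.perf_counter() - t0
    total = sum(tokens)
    print(f"{users} users: {total / elapsed:.0f} total t/s, "
          f"{total / elapsed / users:.0f} per user")

asyncio.run(run(8))
```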
Context Note
Gemma 2’s native context is only 8k, and alternate layers use a 4,096-token sliding window, so only the global-attention layers see the full context. For long-document use cases pick Llama 3 8B or Qwen 2.5 14B instead. For general chat or summarisation of short texts, Gemma 2 9B holds its own: strong MMLU scores and particularly good multi-turn dialogue.
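You can confirm the window and context limits straight from the published model config with transformers; a quick check (attribute names as they appear on Gemma 2's config class):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-2-9b-it")
print(cfg.max_position_embeddings)  # 8192 -- native context
print(cfg.sliding_window)           # 4096 -- window used by the local layers
print(cfg.num_hidden_layers)        # 42
```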
Gemma 2 9B on Blackwell 16GB
~100 t/s decode with Google's instruction-tuned model. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: monthly cost, Gemma 2 guide, FP8 deployment, AWQ guide, EXL2 guide.