Google’s Gemma 2 family sits in a useful niche: a permissive (though not Apache) licence, strong English quality, and a 9B model that is competitive with Llama 3.1 8B. On the RTX 5060 Ti 16GB, the 2B and 9B variants fit cleanly at FP8, while the 27B fits only at AWQ INT4 with a tight context. This guide covers VRAM footprints, throughput numbers, and the sliding-window attention caveat that catches out new deployments on our UK dedicated GPU hosting.
Contents
- The Gemma 2 family
- VRAM footprint
- Throughput numbers
- Sliding-window attention caveat
- Use cases and sizing
- Deployment recipe
The Gemma 2 family
Gemma 2 was released in June 2024 and uses a hybrid attention pattern, alternating sliding-window and global layers (sketched in code after the table below). Three model sizes are currently shipping:
| Model | Params | Context | Attention | MMLU (5-shot) |
|---|---|---|---|---|
| Gemma 2 2B | 2.6B | 8,192 | Local+global | 51.3 |
| Gemma 2 9B | 9.2B | 8,192 | Local+global | 71.3 |
| Gemma 2 27B | 27.2B | 8,192 | Local+global | 75.2 |
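To make the "Local+global" column concrete, here is a tiny sketch of the interleaving (the alternation itself is from the Gemma 2 report; placing sliding-window layers on even indices matches the Hugging Face implementation, but treat the exact ordering as an assumption):

```python
SLIDING_WINDOW = 4096  # tokens each local layer can attend back over

def attention_kind(layer_idx: int) -> str:
    # Even layers local (sliding window), odd layers global -- ordering assumed.
    return f"sliding_window({SLIDING_WINDOW})" if layer_idx % 2 == 0 else "global"

print([attention_kind(i) for i in range(4)])
# ['sliding_window(4096)', 'global', 'sliding_window(4096)', 'global']
```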
Head-to-head, the 9B beats Llama 3 8B by roughly 3 points on MMLU and is the strongest sub-10B English model at the time of writing.
VRAM footprint
All three fit on the 5060 Ti at some quantisation level, but only the 2B and 9B fit comfortably at FP8.
| Model | FP16 | FP8 | AWQ INT4 | Fits 16 GB? |
|---|---|---|---|---|
| Gemma 2 2B | 5.2 GB | 2.6 GB | 1.8 GB | Yes, 8k context, large batch |
| Gemma 2 9B | 18.4 GB | 9.2 GB | 6.1 GB | FP8: yes, 8k context. FP16: no. |
| Gemma 2 27B | 54 GB | 27 GB | 16.0 GB | AWQ: tight, 2k context. FP8: no. |
27B at AWQ leaves roughly zero margin on a 16 GB card, so in practice the 9B is the right target for the 5060 Ti. To run the 27B comfortably you want at least an RTX 5090 32GB.
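The FP16 and FP8 columns follow directly from parameter count × bytes per weight; a minimal sketch of the arithmetic (weights only, KV cache and activations come on top):

```python
# Back-of-envelope weight memory: params (billions) x bytes per weight.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_b: float, precision: str) -> float:
    return params_b * BYTES_PER_WEIGHT[precision]

for name, params_b in [("Gemma 2 2B", 2.6), ("Gemma 2 9B", 9.2), ("Gemma 2 27B", 27.2)]:
    print(name, {p: round(weight_gb(params_b, p), 1) for p in BYTES_PER_WEIGHT})
```

The FP16 and FP8 estimates match the table; real AWQ checkpoints land above the naive 0.5 bytes/weight because scales, zero-points and (usually) embeddings stay at higher precision.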
Throughput numbers
Measured on vLLM 0.6, 2k output tokens, Blackwell native FP8:
| Model | Precision | Tokens/s (bs=1) | Tokens/s (bs=8) | Concurrent chat users* |
|---|---|---|---|---|
| Gemma 2 2B | FP16 | ~210 | ~1,400 | 50+ |
| Gemma 2 9B | FP8 | ~98 | ~520 | ~20 |
| Gemma 2 9B | AWQ INT4 | ~130 | ~600 | ~25 |
| Gemma 2 27B | AWQ INT4 | ~32 | ~95 | ~6 |
*assuming 4 tokens/s per user as the target interactive rate.
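One way to sanity-check the final column (an assumed derivation, not something the benchmark states): divide the single-stream decode rate by the 4 tokens/s target. The table's figures sit a little below that ceiling, presumably to leave headroom for prefill and scheduling:

```python
# Hypothetical sizing helper (assumption): conservative ceiling on
# concurrent chat users from single-stream decode throughput.
def max_chat_users(tokens_per_s: float, per_user_rate: float = 4.0) -> int:
    return int(tokens_per_s // per_user_rate)

for model, tps in [("2B FP16", 210), ("9B FP8", 98), ("9B AWQ4", 130), ("27B AWQ4", 32)]:
    print(model, max_chat_users(tps))  # 52, 24, 32, 8 -- table sits below each
```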
For a direct model-to-model comparison see the Gemma 9B benchmark and the Llama 3 8B benchmark.
Sliding-window attention caveat
Gemma 2 interleaves 4,096-token sliding-window layers with full-attention layers. In practice this means the model’s usable retrieval over long context drops sharply past about 4k tokens, even though the nominal context length is 8k. For retrieval-augmented workloads with 8k chunks, Qwen 2.5 7B or Llama 3.1 8B generally retrieves better. Check our context budget article before picking a model for long-context use.
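If you must run RAG on Gemma 2, one mitigation is to keep each retrieved chunk inside the 4,096-token local window. A minimal sketch using the Hugging Face tokenizer (the checkpoint is gated, and the helper is ours, not part of any library):

```python
from transformers import AutoTokenizer

WINDOW = 4096  # Gemma 2's sliding-window size in tokens

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

def chunk_for_gemma2(text: str, max_tokens: int = WINDOW) -> list[str]:
    """Split text so every chunk fits inside the local-attention window."""
    ids = tok.encode(text, add_special_tokens=False)
    return [tok.decode(ids[i : i + max_tokens]) for i in range(0, len(ids), max_tokens)]
```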
Use cases and sizing
- Gemma 2 2B: routing models, classification at scale, cheap fallback for RAG, summarisation of short docs.
- Gemma 2 9B: the sweet spot on 16 GB. Chat, summarisation, and structured extraction within the ~4k-token effective window.
- Gemma 2 27B: quality-sensitive work, but on the 5060 Ti only viable for research use. Production 27B deployments should use the RTX 5090 32GB or the RTX 6000 Pro 96GB.
Deployment recipe
Serve with vLLM 0.6+ (required for correct Gemma 2 sliding-window handling) or TGI 2.2+. Use --quantization fp8 (FP8 is a quantisation mode in vLLM, not a --dtype value), --max-model-len 8192, --max-num-seqs 32. For maximum-model sizing see our max model size reference.
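The same limits via the vLLM Python API, as a minimal offline sketch (the instruct checkpoint name is the public Hugging Face one; the prompt is illustrative):

```python
from vllm import LLM, SamplingParams

# quantization="fp8" quantises weights on the fly; Blackwell runs FP8 natively.
llm = LLM(
    model="google/gemma-2-9b-it",
    quantization="fp8",
    max_model_len=8192,
    max_num_seqs=32,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarise Gemma 2's attention pattern in two sentences."], params)
print(out[0].outputs[0].text)
```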
Run Gemma 2 9B at 98 tokens/s
Blackwell FP8, 16 GB GDDR7, native vLLM support. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Gemma 9B benchmark, Llama 3 8B benchmark, 8B VRAM requirements, context budget, max model size.