Gemma 2 9B is Google’s open-weights mid-size model. On the RTX 5060 Ti 16GB in our hosting, it fits comfortably at FP8 with good production performance.
Fit
- FP16 / BF16: ~18 GB – exceeds the 16 GB card even before KV cache
- FP8: ~9 GB – comfortable
- AWQ INT4: ~5.5 GB – very comfortable
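The figures above follow from a back-of-envelope calculation: parameter count times bits per parameter. A minimal sketch, assuming a nominal 9B parameters (the real Gemma 2 9B count is slightly higher):

```python
def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (decimal) for a given precision."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 9e9  # nominal count; the actual Gemma 2 9B figure is slightly higher

for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("AWQ INT4", 4.5)]:
    # 4.5 bits/param approximates INT4 weights plus group scales;
    # quantization metadata pushes the real AWQ figure toward ~5.5 GB
    print(f"{name}: ~{weight_gb(PARAMS, bits):.1f} GB")
```

Weights are only part of the footprint: KV cache, activations, and CUDA graphs add several more GB, which is why FP16 fails on 16 GB while FP8 leaves comfortable headroom.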
Deployment
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-2-9b-it \
--dtype bfloat16 \
--quantization fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
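The server above exposes an OpenAI-compatible endpoint. A minimal request sketch, assuming the default port 8000 and localhost access (send the payload with any HTTP client, e.g. `requests.post(url, json=payload)`):

```python
# OpenAI-format chat request for the vLLM server started above.
# Model name must match the --model flag from the launch command.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "google/gemma-2-9b-it",
    "messages": [
        {"role": "user", "content": "Summarise the water cycle in two sentences."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
```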
Gemma 2’s native context is 8192 tokens. Do not push --max-model-len higher – quality degrades beyond the trained window. For long-context workloads pick Mistral Nemo 12B.
Performance
- FP8 batch 1 decode: ~78 t/s
- AWQ batch 1 decode: ~95 t/s
- FP8 batch 16 aggregate: ~480 t/s
- TTFT 1k prompt (FP8): ~220 ms
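The TTFT and decode figures combine into a simple end-to-end latency estimate. A sketch using the FP8 batch-1 numbers above:

```python
# Rough response-time estimate from the FP8 batch-1 figures above.
TTFT_S = 0.22      # time to first token, 1k-token prompt
DECODE_TPS = 78    # decode throughput, tokens/second

def response_time(output_tokens: int) -> float:
    """Approximate seconds to stream a full response of the given length."""
    return TTFT_S + output_tokens / DECODE_TPS

print(f"{response_time(256):.1f} s for a 256-token reply")  # ~3.5 s
```

For chat-style workloads with short replies, the AWQ build’s higher decode rate (~95 t/s) shaves roughly 20% off the streaming time at some quality cost.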
Chat Template
Gemma uses a specific template with <start_of_turn>/<end_of_turn> role markers. vLLM auto-detects it from the tokeniser config when you send OpenAI-format messages – no manual template needed.
When to Pick Gemma
Gemma 2 9B is strong on:
- Factual Q&A
- Summarisation
- Following strict safety constraints (aligned conservatively)
Weaker on:
- Creative generation (more restricted)
- Edge-case topics (refuses more aggressively than Mistral/Llama)
- Long context
For less-restrictive responses consider Mistral 7B.
Gemma 2 9B Hosting
Google’s open model on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Gemma 9B benchmark, monthly cost.