Gemma 2 27B is Google’s mid-size open-weights model, well regarded for factual reasoning and instruction following. Quantized to FP8 or AWQ INT4, it fits on a single 32 GB RTX 5090 from our dedicated GPU hosting with real serving headroom.
Memory Fit
| Precision | Weights | Notes |
|---|---|---|
| FP16 | ~54 GB | Does not fit 32 GB |
| FP8 | ~27 GB | Tight but works |
| AWQ INT4 | ~16 GB | Comfortable with batching |
AWQ INT4 leaves ~14–15 GB for KV cache on a 32 GB card – enough for 20+ concurrent 8k-window requests, because vLLM’s paged attention only allocates cache blocks as sequences actually grow rather than reserving the full window up front.
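For a sanity check on that headroom figure, here is the back-of-envelope KV-cache arithmetic. The architecture numbers (46 layers, 16 KV heads, head dim 128) are taken from the public Gemma 2 27B config and should be verified against your checkout; the cache dtype is assumed FP16.

```python
# Rough KV-cache sizing for Gemma 2 27B (assumed config values).
LAYERS = 46
KV_HEADS = 16
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # FP16 KV cache

# Both K and V are stored per token, per layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
kv_gb_per_8k_seq = kv_bytes_per_token * 8192 / 1024**3

print(f"{kv_bytes_per_token / 1024:.0f} KiB of cache per token")
print(f"{kv_gb_per_8k_seq:.2f} GB for a completely full 8k sequence")
```

A completely full 8k sequence costs just under 3 GB of cache, so ~14–15 GB holds only about five *maxed-out* sequences – high concurrency works in practice because paged attention allocates on demand and most requests never fill the window.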
Launch
```bash
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-2-27b-it-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
Gemma 2’s native context window is 8192 tokens. Do not push --max-model-len higher – quality degrades beyond that window. If you need longer context, pick Mistral Nemo 12B instead.
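Once the server is up it speaks the OpenAI chat-completions protocol, so any standard client works. A minimal stdlib-only sketch, assuming vLLM’s default bind address of localhost:8000:

```python
# Minimal client sketch for the vLLM OpenAI-compatible server launched
# above. localhost:8000 is an assumption (vLLM's default bind address).
import json
import urllib.request

def build_payload(messages):
    """Request body for /v1/chat/completions; sampling params are examples."""
    return {
        "model": "google/gemma-2-27b-it-AWQ",
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.7,
    }

def chat(messages, base_url="http://localhost:8000/v1"):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example: chat([{"role": "user", "content": "Name three UK rivers."}])
```

vLLM applies Gemma’s chat template from the tokenizer automatically when you use the messages format, so no manual turn markers are needed here.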
Performance
| Workload | Throughput |
|---|---|
| AWQ INT4, batch 1 | ~55 t/s |
| AWQ INT4, batch 8 | ~320 t/s aggregate |
| AWQ INT4, batch 24 | ~560 t/s aggregate |
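The table is worth reading per user as well as in aggregate: batching multiplies total throughput roughly tenfold, but each individual stream slows down. A quick check on the numbers above:

```python
# Per-stream rates implied by the aggregate throughput table above.
benchmarks = {1: 55, 8: 320, 24: 560}  # batch size -> aggregate tokens/s

for batch, aggregate in benchmarks.items():
    per_stream = aggregate / batch
    print(f"batch {batch:>2}: {aggregate} t/s aggregate, "
          f"{per_stream:.1f} t/s per user")
```

At batch 24 each user still sees ~23 t/s – comfortably faster than reading speed – which is why running high concurrency is usually the right call.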
Prompting
Gemma has a specific chat template with <start_of_turn> markers. vLLM uses the tokenizer’s built-in template when you send OpenAI-format messages. For manual completion calls, wrap prompts as:
```text
<start_of_turn>user
Your prompt here<end_of_turn>
<start_of_turn>model
```
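For raw /v1/completions calls you have to apply those markers yourself. A minimal helper matching the template shown above (the function name is ours, for illustration):

```python
# Manual Gemma 2 turn formatting for raw completion calls.
# Only needed for /v1/completions; the chat endpoint applies the
# tokenizer's template automatically.
def gemma_prompt(user_message: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(gemma_prompt("Summarise this paragraph in one sentence."))
```

The prompt deliberately ends after `<start_of_turn>model` with no closing marker, so the model continues from there.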
Gemma refuses more aggressively than Llama on some topics. If you need less restrictive behaviour, consider Mistral Small 3.
Gemma 2 27B on a Single 5090
UK dedicated hosting with vLLM and Gemma preconfigured.
Browse GPU Servers. See Gemma 2 9B for the smaller variant benchmark.