One hundred and twelve tokens per second. That is faster than most people can read, and it is what happens when you pair Google’s Gemma 2 9B at full FP16 with NVIDIA’s flagship RTX 5090. We tested the ceiling on a GigaGPU dedicated server and the results speak for themselves.
Peak Numbers
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 112.3 tok/s |
| Tokens/sec (batched, bs=8) | 179.7 tok/s |
| Per-token latency | 8.9 ms |
| Precision | FP16 |
| Quantisation | None (full precision) |
| Max context length | 16K |
| Performance rating | Excellent |
Measured with a 512-token prompt and a 256-token completion, single stream, via llama.cpp or vLLM. At sub-9 ms per-token latency, responses feel essentially instantaneous: there is no perceptible delay between query and streaming output.
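The headline figures reduce to simple arithmetic, which is worth having on hand when comparing runs. A minimal sketch (function names are ours, not from any benchmark harness):

```python
import time

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Sustained decode throughput: completion tokens over wall-clock time."""
    return completion_tokens / elapsed_s

def per_token_latency_ms(tok_s: float) -> float:
    """Average inter-token latency implied by a throughput figure."""
    return 1000.0 / tok_s

def measure(generate, prompt: str, completion_tokens: int) -> float:
    """Time any generate(prompt) callable that emits `completion_tokens`
    tokens and return its tok/s."""
    start = time.perf_counter()
    generate(prompt)
    return tokens_per_second(completion_tokens, time.perf_counter() - start)
```

The table is self-consistent: 112.3 tok/s implies 1000 / 112.3 ≈ 8.9 ms per token.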
Memory Headroom
| Component | VRAM |
|---|---|
| Model weights (FP16) | 18.9 GB |
| KV cache + runtime | ~2.8 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom | ~10.3 GB |
Even at full precision, more than 10 GB remains unused. That opens up meaningful multi-model deployments: run Gemma alongside a Coqui TTS instance for a complete text-to-speech pipeline, or pair it with a PaddleOCR model for document processing. Alternatively, push context to 16K while serving several concurrent users.
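The budget above is easy to sanity-check yourself: FP16 stores two bytes per parameter, and headroom is whatever is left after weights and runtime overhead. A rough sketch (helper names are ours):

```python
def fp16_weights_gb(params_billion: float) -> float:
    """FP16 stores 2 bytes per parameter; a rough lower bound that
    ignores loader and per-tensor overhead."""
    return params_billion * 2

def vram_headroom_gb(total_gb: float, weights_gb: float,
                     kv_runtime_gb: float) -> float:
    """VRAM left after model weights plus KV cache and runtime buffers."""
    return total_gb - weights_gb - kv_runtime_gb

# Gemma 2 9B is ~9.2B parameters, so fp16_weights_gb(9.2) ~ 18.4 GB;
# the measured 18.9 GB includes loader/runtime overhead.
# Headroom from the table: vram_headroom_gb(32.0, 18.9, 2.8) ~ 10.3 GB.
```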
Cost Analysis
| Cost Metric | Value |
|---|---|
| Server cost | £1.50/hr (£299/mo) |
| Cost per 1M tokens | £3.71 |
| Tokens per £1 | 269,542 |
| Break-even vs API | ~1 req/day |
Despite the £299/mo sticker, the 5090’s sheer throughput drives per-token cost to £3.71/M single-stream and about £2.32/M batched. That is competitive with the RTX 3090 (£4.01/M), and you get more than double the speed plus 8 GB of additional VRAM. For high-volume deployments, the 5090’s per-token economics actually win. Model your scenario with the cost calculator.
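The per-token economics follow directly from the hourly rate and sustained throughput. A quick sanity check (our helper, not the linked calculator):

```python
def cost_per_million_tokens(cost_per_hour: float, tok_s: float) -> float:
    """Hourly server cost divided by millions of tokens generated per hour."""
    tokens_per_hour = tok_s * 3600
    return cost_per_hour / (tokens_per_hour / 1_000_000)

# Single stream: cost_per_million_tokens(1.50, 112.3) ~ £3.71/M
# Batched bs=8:  cost_per_million_tokens(1.50, 179.7) ~ £2.32/M
```

Batching nearly halves the per-token cost because the hourly rate is flat while throughput rises.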
The Bottom Line
If Gemma 2 9B is central to your production stack and you value both quality (FP16) and speed (112+ tok/s), the RTX 5090 is the best card in the GigaGPU lineup for this model. Teams that do not need this level of throughput should look at the RTX 3090, which delivers excellent FP16 performance at a lower monthly cost.
One command to go:
```shell
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-9b.f16.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
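Once the container is up, you can hit it from any HTTP client. A minimal sketch using the llama.cpp server's /completion endpoint, assuming the default host and port from the command above:

```python
import json
from urllib import request

def build_completion_request(prompt: str, n_predict: int = 256) -> dict:
    """JSON payload for llama.cpp server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict, "stream": False}

def complete(prompt: str, host: str = "http://localhost:8080") -> str:
    """POST a prompt and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode()
    req = request.Request(f"{host}/completion", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

if __name__ == "__main__":
    print(complete("Explain KV caching in one sentence."))
```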
Configuration guide: Gemma hosting. Further reading: best GPU for LLM inference, benchmark archive, tok/s tool.
112 tok/s Gemma 2 9B — RTX 5090 Servers
Flagship speed, FP16 quality, flat monthly rate. UK datacentre with full root access.
Build Your 5090 Server