A counterintuitive result: the RTX 5080 pushes Gemma 2 9B to 48.8 tok/s in 4-bit mode, nearly matching what the RTX 3090 achieves at full FP16. How? The 5080's newer Blackwell architecture delivers significantly higher memory bandwidth, which compensates for the quantisation overhead. We measured it all on GigaGPU dedicated hardware.
Performance at a Glance
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 48.8 tok/s |
| Tokens/sec (batched, bs=8) | 63.4 tok/s |
| Per-token latency | 20.5 ms |
| Precision | INT4 |
| Quantisation | 4-bit GGUF Q4_K_M |
| Max context length | 8K |
| Performance rating | Very Good |
Benchmark conditions: 512-token prompt, 256-token completion, single stream, llama.cpp Q4_K_M. While the 5080 has only 16 GB of VRAM (too tight for Gemma 2 9B at FP16), it runs the 4-bit version blazingly fast.
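The throughput and latency figures in the table are two views of the same measurement; a quick sanity check (plain shell, using awk for the floating-point arithmetic):

```shell
# Per-token latency (ms) implied by single-stream throughput of 48.8 tok/s:
awk 'BEGIN { printf "%.1f ms\n", 1000 / 48.8 }'   # prints 20.5 ms
```

This matches the 20.5 ms per-token latency reported above.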
VRAM Distribution
| Component | VRAM |
|---|---|
| Model weights (4-bit GGUF Q4_K_M) | 6.4 GB |
| KV cache + runtime | ~1.0 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~8.6 GB |
At 4-bit, Gemma 2 9B's weights occupy only about 6.4 GB; after the ~1 GB KV cache and runtime, roughly 8.6 GB remains free. That is more headroom than the 3090 has when running FP16. You can extend context to 8K, handle a few concurrent users, or pair Gemma with a secondary lightweight model.
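The headroom follows directly from the table's own components (16 GB card, 6.4 GB weights, ~1.0 GB KV cache and runtime):

```shell
# VRAM remaining after loading the 4-bit weights:
awk 'BEGIN { printf "%.1f GB after weights\n", 16.0 - 6.4 }'
# Subtract the ~1.0 GB KV cache + runtime for the true free headroom:
awk 'BEGIN { printf "%.1f GB free\n", 16.0 - 6.4 - 1.0 }'
```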
Cost Perspective
| Cost Metric | Value |
|---|---|
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £5.41 |
| Tokens per £1 | 184,911 |
| Break-even vs API | ~1 req/day |
At £5.41/M single-stream (£3.38/M batched), the 5080 costs more per token than the RTX 3090 (£4.01/M), because the 3090 runs FP16 at higher throughput for a lower monthly rate. The 5080's advantages are its newer architecture and extra VRAM headroom relative to model size. If you need 4-bit quantisation anyway, perhaps for faster prefill or tighter latency guarantees, the 5080 delivers. Compare everything in the tok/s benchmark.
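The single-stream cost figure follows directly from the hourly rate and measured throughput; a minimal check:

```shell
# Tokens generated per hour at 48.8 tok/s:
awk 'BEGIN { printf "%d\n", 48.8 * 3600 }'                 # prints 175680
# Cost per 1M tokens at £0.95/hr:
awk 'BEGIN { printf "£%.2f/M\n", 0.95 / (48.8 * 3600 / 1e6) }'
```

The second command lands on £5.41/M; the table's tokens-per-£1 figure differs very slightly because it is derived from the unrounded measurement.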
Recommendation
Pick the RTX 5080 for Gemma 2 9B when you want modern hardware, fast 4-bit inference, and generous memory margins. For raw FP16 quality, the RTX 3090 is the better value. For absolute top-end speed, the RTX 5090 leaves both behind.
Quick launch (mount your model directory into the container; adjust the host path to wherever your GGUF file lives):

```shell
docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-9b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
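Once the container is up, you can smoke-test it against llama.cpp's `/completion` endpoint; the prompt text here is just an example:

```shell
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Explain quantisation in one sentence.", "n_predict": 64}'
```

The server replies with JSON containing the generated `content` plus timing fields, which is a quick way to confirm your own tok/s against the numbers above.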
More in the Gemma hosting guide. See also: best GPU for LLM inference, all benchmarks, cost calculator.
Gemma 2 9B on RTX 5080 — Blackwell Speed
48.8 tok/s with over 8 GB of free headroom. UK datacentre, dedicated server, flat pricing.
Configure RTX 5080