Benchmarks

Gemma 2 9B on RTX 5080: Performance Benchmark & Cost


A counterintuitive result: the RTX 5080 pushes Gemma 2 9B to 48.8 tok/s in 4-bit mode — nearly matching what the RTX 3090 achieves at full FP16. How? The 5080’s newer Blackwell architecture delivers significantly higher memory bandwidth, which compensates for the quantisation overhead. We measured it all on GigaGPU dedicated hardware.

Performance at a Glance

| Metric                      | Value             |
|-----------------------------|-------------------|
| Tokens/sec (single stream)  | 48.8 tok/s        |
| Tokens/sec (batched, bs=8)  | 63.4 tok/s        |
| Per-token latency           | 20.5 ms           |
| Precision                   | INT4              |
| Quantisation                | 4-bit GGUF Q4_K_M |
| Max context length          | 8K                |
| Performance rating          | Very Good         |

512-token prompt, 256-token completion, single-stream, llama.cpp Q4_K_M. While the 5080 has only 16 GB of VRAM (too tight for Gemma 2 9B at FP16), it runs the 4-bit version blazingly fast.
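The per-token latency in the table is simply the reciprocal of single-stream throughput; a quick sanity check of that relationship:

```python
# Sanity check: per-token latency is the reciprocal of single-stream throughput.
single_stream_tps = 48.8  # tok/s, measured (table above)

latency_ms = 1000 / single_stream_tps
print(f"{latency_ms:.1f} ms per token")  # ≈ 20.5 ms, matching the table
```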

VRAM Distribution

| Component                         | VRAM    |
|-----------------------------------|---------|
| Model weights (4-bit GGUF Q4_K_M) | 6.4 GB  |
| KV cache + runtime                | ~1.0 GB |
| Total RTX 5080 VRAM               | 16 GB   |
| Free headroom                     | ~8.6 GB |

At 4-bit, Gemma 2 9B occupies only about 6.4 GB of weights plus roughly 1.0 GB of KV cache and runtime overhead, leaving around 8.6 GB free. That is more headroom than the 3090 has when running FP16. You can extend context to 8K, handle a few concurrent users, or pair Gemma with a secondary lightweight model.
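The headroom figure is straight subtraction; a minimal sketch using the weight and runtime figures from the table (all sizes in GB):

```python
# VRAM budget for Gemma 2 9B Q4_K_M on a 16 GB RTX 5080 (figures from the table above).
total_vram = 16.0     # RTX 5080
weights = 6.4         # 4-bit GGUF Q4_K_M model weights
kv_and_runtime = 1.0  # KV cache at 8K context + runtime buffers (approximate)

headroom = total_vram - weights - kv_and_runtime
print(f"Free headroom: ~{headroom:.1f} GB")
```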

Cost Perspective

| Cost Metric        | Value              |
|--------------------|--------------------|
| Server cost        | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £5.408             |
| Tokens per £1      | 184,911            |
| Break-even vs API  | ~1 req/day         |

At £5.41/M single-stream (£3.38/M batched), the 5080 costs more per token than the RTX 3090 (£4.01/M) because the 3090 runs FP16 at higher throughput for a lower monthly rate. The 5080’s advantage is its newer architecture and extra VRAM headroom relative to model size. If you need 4-bit quantisation anyway — perhaps for faster prefill or tighter latency guarantees — the 5080 delivers very well. Compare everything in the tok/s benchmark.
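The cost figures fall out of the hourly rate and sustained throughput; a sketch of the arithmetic (rates from the table above; expect small rounding differences versus the measured table values):

```python
# Cost per 1M tokens = hourly rate / tokens generated per hour, scaled to 1M.
rate_per_hr = 0.95        # £/hr, RTX 5080 (table above)
single_stream_tps = 48.8  # tok/s, single-stream

tokens_per_hr = single_stream_tps * 3600
cost_per_million = rate_per_hr / tokens_per_hr * 1_000_000
tokens_per_pound = tokens_per_hr / rate_per_hr

print(f"£{cost_per_million:.2f} per 1M tokens")  # ≈ £5.41
print(f"{tokens_per_pound:,.0f} tokens per £1")
```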

Recommendation

Pick the RTX 5080 for Gemma 2 9B when you want modern hardware, fast 4-bit inference, and generous memory margins. For raw FP16 quality, the RTX 3090 is the better value. For absolute top-end speed, the RTX 5090 leaves both behind.

Quick launch:

```shell
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-9b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
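Once the container is up, the llama.cpp server exposes a `/completion` endpoint; a minimal client sketch using only the standard library (the prompt text and `n_predict` value here are illustrative):

```python
import json
import urllib.request

def build_request(prompt: str, n_predict: int = 256) -> urllib.request.Request:
    """Build a POST to llama.cpp's /completion endpoint (server started above)."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.7}
    return urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Explain the KV cache in one sentence.")
# resp = json.load(urllib.request.urlopen(req))  # uncomment with the server running
# print(resp["content"])
```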

More in the Gemma hosting guide. See also: best GPU for LLM inference, all benchmarks, cost calculator.

Gemma 2 9B on RTX 5080 — Blackwell Speed

48.8 tok/s with over 8 GB of headroom. UK datacentre, dedicated server, flat pricing.

Configure RTX 5080

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1 Gbps networking, UK datacentre.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
