Benchmarks

Gemma 2 9B on RTX 5080: Performance Benchmark & Cost


A counterintuitive result: the RTX 5080 pushes Gemma 2 9B to 48.8 tok/s in 4-bit mode — nearly matching what the RTX 3090 achieves at full FP16. How? The 5080’s newer Blackwell architecture delivers significantly higher memory bandwidth, which compensates for the quantisation overhead. We measured it all on GigaGPU dedicated hardware.

Performance at a Glance

| Metric                      | Value             |
|-----------------------------|-------------------|
| Tokens/sec (single stream)  | 48.8 tok/s        |
| Tokens/sec (batched, bs=8)  | 63.4 tok/s        |
| Per-token latency           | 20.5 ms           |
| Precision                   | INT4              |
| Quantisation                | 4-bit GGUF Q4_K_M |
| Max context length          | 8K                |
| Performance rating          | Very Good         |

512-token prompt, 256-token completion, single-stream, llama.cpp Q4_K_M. While the 5080 has only 16 GB of VRAM (too tight for Gemma 2 9B at FP16), it runs the 4-bit version blazingly fast.
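The per-token latency in the table is simply the reciprocal of single-stream throughput; a quick sanity check of that relationship:

```python
# Sanity check: per-token latency is the reciprocal of single-stream throughput.
single_stream_tps = 48.8  # tok/s, measured (table above)

latency_ms = 1000 / single_stream_tps
print(f"{latency_ms:.1f} ms per token")  # ≈ 20.5 ms, matching the table
```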

VRAM Distribution

| Component                         | VRAM    |
|-----------------------------------|---------|
| Model weights (4-bit GGUF Q4_K_M) | 6.4 GB  |
| KV cache + runtime                | ~1.0 GB |
| Total RTX 5080 VRAM               | 16 GB   |
| Free headroom                     | ~8.6 GB |

At 4-bit, Gemma 2 9B occupies only about 6.4 GB of weights plus roughly 1.0 GB of KV cache and runtime overhead, leaving around 8.6 GB free. That is more headroom than the 3090 has when running FP16. You can extend context to 8K, handle a few concurrent users, or pair Gemma with a secondary lightweight model.
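The headroom figure is straight subtraction; a minimal sketch using the weight and runtime figures from the table (all sizes in GB):

```python
# VRAM budget for Gemma 2 9B Q4_K_M on a 16 GB RTX 5080 (figures from the table above).
total_vram = 16.0     # RTX 5080
weights = 6.4         # 4-bit GGUF Q4_K_M model weights
kv_and_runtime = 1.0  # KV cache at 8K context + runtime buffers (approximate)

headroom = total_vram - weights - kv_and_runtime
print(f"Free headroom: ~{headroom:.1f} GB")
```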

Cost Perspective

| Cost Metric        | Value              |
|--------------------|--------------------|
| Server cost        | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £5.408             |
| Tokens per £1      | 184,911            |
| Break-even vs API  | ~1 req/day         |

At £5.41/M single-stream (£3.38/M batched), the 5080 costs more per token than the RTX 3090 (£4.01/M) because the 3090 runs FP16 at higher throughput for a lower monthly rate. The 5080’s advantage is its newer architecture and extra VRAM headroom relative to model size. If you need 4-bit quantisation anyway — perhaps for faster prefill or tighter latency guarantees — the 5080 delivers very well. Compare everything in the tok/s benchmark.
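The cost figures fall out of the hourly rate and sustained throughput; a sketch of the arithmetic (rates from the table above; expect small rounding differences versus the measured table values):

```python
# Cost per 1M tokens = hourly rate / tokens generated per hour, scaled to 1M.
rate_per_hr = 0.95        # £/hr, RTX 5080 (table above)
single_stream_tps = 48.8  # tok/s, single-stream

tokens_per_hr = single_stream_tps * 3600
cost_per_million = rate_per_hr / tokens_per_hr * 1_000_000
tokens_per_pound = tokens_per_hr / rate_per_hr

print(f"£{cost_per_million:.2f} per 1M tokens")  # ≈ £5.41
print(f"{tokens_per_pound:,.0f} tokens per £1")
```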

Recommendation

Pick the RTX 5080 for Gemma 2 9B when you want modern hardware, fast 4-bit inference, and generous memory margins. For raw FP16 quality, the RTX 3090 is the better value. For absolute top-end speed, the RTX 5090 leaves both behind.

Quick launch:

```shell
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-9b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
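Once the container is up, the llama.cpp server exposes a `/completion` endpoint; a minimal client sketch using only the standard library (the prompt text and `n_predict` value here are illustrative):

```python
import json
import urllib.request

def build_request(prompt: str, n_predict: int = 256) -> urllib.request.Request:
    """Build a POST to llama.cpp's /completion endpoint (server started above)."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.7}
    return urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Explain the KV cache in one sentence.")
# resp = json.load(urllib.request.urlopen(req))  # uncomment with the server running
# print(resp["content"])
```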

More in the Gemma hosting guide. See also: best GPU for LLM inference, all benchmarks, cost calculator.

Gemma 2 9B on RTX 5080 — Blackwell Speed

48.8 tok/s with over 8 GB of headroom. UK datacentre, dedicated server, flat pricing.

Configure RTX 5080

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1 Gbps networking, UK datacentre.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
