One hundred and twelve tokens per second. That is faster than most people can read, and it is what happens when you pair Google’s Gemma 2 9B at full FP16 with NVIDIA’s flagship RTX 5090. We tested the ceiling on a GigaGPU dedicated server and the results speak for themselves.
Peak Numbers
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 112.3 tok/s |
| Tokens/sec (batched, bs=8) | 179.7 tok/s |
| Per-token latency | 8.9 ms |
| Precision | FP16 |
| Quantisation | None (full precision) |
| Max context length | 16K |
| Performance rating | Excellent |
Measured with a 512-token prompt and a 256-token completion, single stream, via llama.cpp or vLLM. At sub-9 ms per-token latency, responses feel essentially instantaneous: there is no perceptible delay between query and streaming output.
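The headline figures reduce to simple arithmetic, which is worth having on hand when comparing runs. A minimal sketch (function names are ours, not from any benchmark harness):

```python
import time

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Sustained decode throughput: completion tokens over wall-clock time."""
    return completion_tokens / elapsed_s

def per_token_latency_ms(tok_s: float) -> float:
    """Average inter-token latency implied by a throughput figure."""
    return 1000.0 / tok_s

def measure(generate, prompt: str, completion_tokens: int) -> float:
    """Time any generate(prompt) callable that emits `completion_tokens`
    tokens and return its tok/s."""
    start = time.perf_counter()
    generate(prompt)
    return tokens_per_second(completion_tokens, time.perf_counter() - start)
```

The table is self-consistent: 112.3 tok/s implies 1000 / 112.3 ≈ 8.9 ms per token.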
Memory Headroom
| Component | VRAM |
|---|---|
| Model weights (FP16) | 18.9 GB |
| KV cache + runtime | ~2.8 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom | ~10.3 GB |
Even at full precision, more than 10 GB remains unused. That opens up meaningful multi-model deployments: run Gemma alongside a Coqui TTS instance for a complete text-to-speech pipeline, or pair it with a PaddleOCR model for document processing. Alternatively, push context to 16K while serving several concurrent users.
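The budget above is easy to sanity-check yourself: FP16 stores two bytes per parameter, and headroom is whatever is left after weights and runtime overhead. A rough sketch (helper names are ours):

```python
def fp16_weights_gb(params_billion: float) -> float:
    """FP16 stores 2 bytes per parameter; a rough lower bound that
    ignores loader and per-tensor overhead."""
    return params_billion * 2

def vram_headroom_gb(total_gb: float, weights_gb: float,
                     kv_runtime_gb: float) -> float:
    """VRAM left after model weights plus KV cache and runtime buffers."""
    return total_gb - weights_gb - kv_runtime_gb

# Gemma 2 9B is ~9.2B parameters, so fp16_weights_gb(9.2) ~ 18.4 GB;
# the measured 18.9 GB includes loader/runtime overhead.
# Headroom from the table: vram_headroom_gb(32.0, 18.9, 2.8) ~ 10.3 GB.
```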
Cost Analysis
| Cost Metric | Value |
|---|---|
| Server cost | £1.50/hr (£299/mo) |
| Cost per 1M tokens | £3.71 |
| Tokens per £1 | 269,542 |
| Break-even vs API | ~1 req/day |
Despite the £299/mo sticker, the 5090’s sheer throughput drives per-token cost to £3.71/M single-stream and about £2.32/M batched. That is competitive with the RTX 3090 (£4.01/M), and you get more than double the speed plus 8 GB of additional VRAM. For high-volume deployments, the 5090’s per-token economics actually win. Model your scenario with the cost calculator.
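The per-token economics follow directly from the hourly rate and sustained throughput. A quick sanity check (our helper, not the linked calculator):

```python
def cost_per_million_tokens(cost_per_hour: float, tok_s: float) -> float:
    """Hourly server cost divided by millions of tokens generated per hour."""
    tokens_per_hour = tok_s * 3600
    return cost_per_hour / (tokens_per_hour / 1_000_000)

# Single stream: cost_per_million_tokens(1.50, 112.3) ~ £3.71/M
# Batched bs=8:  cost_per_million_tokens(1.50, 179.7) ~ £2.32/M
```

Batching nearly halves the per-token cost because the hourly rate is flat while throughput rises.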
The Bottom Line
If Gemma 2 9B is central to your production stack and you value both quality (FP16) and speed (112+ tok/s), the RTX 5090 is the best card in the GigaGPU lineup for this model. Teams that do not need this level of throughput should look at the RTX 3090, which delivers excellent FP16 performance at a lower monthly cost.
One command to go:
```shell
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-9b.f16.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
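Once the container is up, you can hit it from any HTTP client. A minimal sketch using the llama.cpp server's /completion endpoint, assuming the default host and port from the command above:

```python
import json
from urllib import request

def build_completion_request(prompt: str, n_predict: int = 256) -> dict:
    """JSON payload for llama.cpp server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict, "stream": False}

def complete(prompt: str, host: str = "http://localhost:8080") -> str:
    """POST a prompt and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode()
    req = request.Request(f"{host}/completion", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

if __name__ == "__main__":
    print(complete("Explain KV caching in one sentence."))
```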
Configuration guide: Gemma hosting. Further reading: best GPU for LLM inference, benchmark archive, tok/s tool.
112 tok/s Gemma 2 9B — RTX 5090 Servers
Flagship speed, FP16 quality, flat monthly rate. UK datacentre with full root access.
Build Your 5090 Server