Gemma 2 9B Tokens/sec by GPU

Benchmark results for Google Gemma 2 9B inference speed across six GPUs at FP16, INT8, and INT4 precision, with cost-efficiency analysis for dedicated server hosting.

Gemma 2 9B Benchmark Overview

Google’s Gemma 2 9B is a 9-billion-parameter open model that delivers strong performance on reasoning, summarisation, and multilingual tasks. It needs roughly 18 GB of VRAM at FP16, so it fits on a single 24 GB card at full precision and on smaller cards once quantised. We benchmark inference speed across six GPUs to help you choose the right hardware.
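The 18 GB figure follows from simple arithmetic: 9 billion parameters at 2 bytes each. The sketch below works this out for all three precisions. It is a weights-only estimate under stated assumptions (a nominal 9e9 parameters, 1 GB = 1e9 bytes, and no allowance for KV cache, activations, or runtime overhead), so real usage will be somewhat higher and will vary with the quantisation format.

```python
# Rough weights-only VRAM estimate for a 9-billion-parameter model.
# Assumptions: nominal parameter count of 9e9, 1 GB = 1e9 bytes,
# and no KV cache, activation, or runtime overhead.
PARAMS = 9e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in GB."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_gb(PARAMS, precision):.1f} GB")
```

This prints about 18.0 GB for FP16, 9.0 GB for INT8, and 4.5 GB for INT4, which is why quantisation brings the model within reach of smaller cards.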

Testing was conducted on GigaGPU dedicated servers using vLLM with a 512-token input prompt and a 256-token output. Because FP16 requires approximately 18 GB of VRAM, cards with 16 GB or less need quantisation. For full methodology details, see our tokens per second benchmark hub.
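To make the metric concrete, here is a minimal sketch of how a tokens-per-second figure can be measured. It assumes throughput is defined as generated tokens divided by end-to-end wall-clock time; the methodology hub may define it differently (for example, excluding prompt processing). The `generate` callable and `fake_generate` stand-in are hypothetical placeholders, not part of vLLM.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str, int], List[int]],
                      prompt: str,
                      max_new_tokens: int = 256) -> float:
    """Time one generation call and return output tokens per second."""
    start = time.perf_counter()
    output_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(output_tokens) / elapsed


def fake_generate(prompt: str, max_new_tokens: int) -> List[int]:
    """Hypothetical stand-in for a real model: sleeps, then returns dummy token IDs."""
    time.sleep(0.05)
    return list(range(max_new_tokens))


rate = tokens_per_second(fake_generate, "example prompt", 256)
print(f"{rate:.0f} tok/s")
```

In a real run, `generate` would wrap a call into the inference engine and return the generated token IDs, and the measurement would be repeated and averaged to smooth out warm-up effects.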

Tokens/sec Results by GPU

| GPU | VRAM | Gemma 2 9B FP16 (tok/s) | Notes |
| --- | --- | --- | --- |
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM for FP16 |
| RTX 4060 Ti | 16 GB | N/A | Tight fit; unstable at FP16 |
| RTX 3090 | 24 GB | 36 | Comfortable fit |
| RTX 5080 | 16 GB | N/A | Needs INT8 or INT4 |
| RTX 5090 | 32 GB | 78 | Plenty of headroom |

At FP16, only the RTX 3090 and RTX 5090 have enough VRAM to run Gemma 2 9B reliably. For cards with 16 GB or less, quantisation is essential. Check whether your target GPU is viable with our Can RTX 5080 run Gemma? guide.

Quantisation Comparison

Quantisation opens Gemma 2 9B to a wider range of GPUs. Below we compare FP16, INT8, and INT4 performance. For a detailed analysis of how quantisation affects quality and speed, see our FP16 vs INT8 vs INT4 comparison.

| GPU | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
| --- | --- | --- | --- |
| RTX 3050 (6 GB) | N/A | N/A | 7 |
| RTX 4060 (8 GB) | N/A | 14 | 18 |
| RTX 4060 Ti (16 GB) | N/A | 26 | 34 |
| RTX 3090 (24 GB) | 36 | 44 | 52 |
| RTX 5080 (16 GB) | N/A | 52 | 65 |
| RTX 5090 (32 GB) | 78 | 92 | 108 |

INT4 quantisation makes the RTX 4060 Ti a viable option at 34 tok/s, and even the RTX 4060 manages 18 tok/s. The RTX 5080 at INT4 delivers 65 tok/s, ahead of the RTX 3090 at every precision tested (36 tok/s at FP16, 52 tok/s at INT4).
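For the two cards that run all three precisions, the gain from quantisation can be expressed as a speedup over FP16. The short sketch below derives those ratios from the table figures; the `RESULTS` dictionary and `speedup` helper are names introduced here for illustration only.

```python
# Measured tok/s from the quantisation table, for GPUs with an FP16 baseline.
RESULTS = {
    "RTX 3090": {"FP16": 36, "INT8": 44, "INT4": 52},
    "RTX 5090": {"FP16": 78, "INT8": 92, "INT4": 108},
}


def speedup(gpu: str, precision: str) -> float:
    """Return throughput at the given precision relative to FP16 on the same GPU."""
    row = RESULTS[gpu]
    return row[precision] / row["FP16"]


for gpu in RESULTS:
    for precision in ("INT8", "INT4"):
        print(f"{gpu} {precision}: {speedup(gpu, precision):.2f}x FP16")
```

On these numbers INT4 is about 1.44x FP16 on the RTX 3090 and about 1.38x on the RTX 5090, so the throughput gain is real but well short of the 4x reduction in weight size.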

Cost Efficiency Analysis

We compare cost efficiency using INT4 performance, the practical configuration for most GPUs in this range. Prices are based on GigaGPU dedicated hosting rates, and efficiency is INT4 tok/s divided by the approximate monthly cost in pounds.

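| GPU | INT4 (tok/s) | Approx. Monthly Cost | tok/s per £ per month |
| --- | --- | --- | --- |
| RTX 4060 | 18 | ~£60 | 0.30 |
| RTX 4060 Ti | 34 | ~£75 | 0.45 |
| RTX 3090 | 52 | ~£110 | 0.47 |
| RTX 5080 | 65 | ~£160 | 0.41 |
| RTX 5090 | 108 | ~£250 | 0.43 |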

The RTX 3090 offers the best value per pound at INT4, followed closely by the RTX 4060 Ti. For a thorough comparison, see our best GPU for Gemma guide.
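The efficiency column can be reproduced by dividing INT4 throughput by monthly cost. The sketch below does that and ranks the cards; the prices are the approximate figures from the table, and the variable and function names are illustrative.

```python
# (GPU, INT4 tok/s, approximate monthly cost in GBP) from the table above.
INT4_RESULTS = [
    ("RTX 4060", 18, 60),
    ("RTX 4060 Ti", 34, 75),
    ("RTX 3090", 52, 110),
    ("RTX 5080", 65, 160),
    ("RTX 5090", 108, 250),
]


def tok_per_pound(tok_s: float, monthly_gbp: float) -> float:
    """Return throughput per pound of monthly hosting cost."""
    return tok_s / monthly_gbp


# Sort from most to least cost-efficient.
ranked = sorted(INT4_RESULTS,
                key=lambda row: tok_per_pound(row[1], row[2]),
                reverse=True)

for name, tok_s, cost in ranked:
    print(f"{name}: {tok_per_pound(tok_s, cost):.2f} tok/s per pound")
```

The ranking comes out as RTX 3090 (0.47), RTX 4060 Ti (0.45), RTX 5090 (0.43), RTX 5080 (0.41), and RTX 4060 (0.30). Because the prices are approximate, the gap between the top two is small enough that a modest price change could reverse it.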

GPU Recommendations

  • Best budget: RTX 4060 Ti at INT4 — 34 tok/s for ~£75/month is solid for development APIs.
  • Best value: RTX 3090 — strong FP16 performance and top cost efficiency at INT4.
  • Best performance: RTX 5090 — 78 tok/s FP16 or 108 tok/s INT4 for production workloads.
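If you are working to a fixed monthly budget, the same data can be turned into a simple selection rule: pick the fastest card whose price fits. This is a hypothetical helper built on the approximate INT4 figures above, and it ignores other factors such as FP16 support or concurrency requirements.

```python
from typing import Optional, Tuple

# (GPU, INT4 tok/s, approximate monthly cost in GBP) from the cost table.
INT4_OPTIONS = [
    ("RTX 4060", 18, 60),
    ("RTX 4060 Ti", 34, 75),
    ("RTX 3090", 52, 110),
    ("RTX 5080", 65, 160),
    ("RTX 5090", 108, 250),
]


def fastest_within_budget(budget_gbp: float) -> Optional[Tuple[str, int, int]]:
    """Return the highest-throughput option at or under the budget, or None."""
    affordable = [opt for opt in INT4_OPTIONS if opt[2] <= budget_gbp]
    if not affordable:
        return None
    return max(affordable, key=lambda opt: opt[1])


for budget in (50, 75, 120, 250):
    print(budget, fastest_within_budget(budget))
```

With a £75 budget this returns the RTX 4060 Ti, and with £120 it returns the RTX 3090, matching the budget and value picks in the list above.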

Compare these results with the Gemma 2 27B benchmark for the larger variant, or check the Phi-3 Mini results for a smaller alternative. All data is available in the Benchmarks category.

Conclusion

Gemma 2 9B offers a compelling balance of capability and size. With INT4 quantisation, it runs well on mid-range GPUs, while the RTX 3090 and RTX 5090 can handle it at full FP16 precision. Whether you are building a summarisation pipeline or a multilingual chatbot, Gemma 2 9B is a practical choice for dedicated GPU deployments.

Host Gemma 2 9B on Dedicated Hardware

Get bare-metal GPU servers optimised for LLM inference with full root access and fast NVMe storage.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
