Gemma 2 9B Benchmark Overview
Google’s Gemma 2 9B is a 9-billion-parameter open model that delivers strong performance on reasoning, summarisation, and multilingual tasks. With roughly 18 GB of VRAM required at FP16, it sits at a sweet spot: high-end cards run it at full precision, while quantisation brings it within reach of mid-range dedicated GPU servers. We benchmark inference speed across six GPUs to help you choose the right hardware.
Testing was conducted on GigaGPU dedicated servers using vLLM with a 512-token input prompt and 256-token output. Gemma 2 9B at FP16 requires approximately 18 GB of VRAM, so cards with 16 GB need quantisation. For full methodology details, see our tokens per second benchmark hub.
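The 18 GB figure follows from a standard rule of thumb: weight memory ≈ parameter count × bytes per weight. Here is a minimal sketch of that estimate (weight memory only; the KV cache and activations add several more GB at runtime):

```python
def weight_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate VRAM for model weights alone, in GB.

    params_b: parameter count in billions (9 for Gemma 2 9B).
    bits_per_weight: 16 (FP16), 8 (INT8), or 4 (INT4).
    Real usage is higher once the KV cache and activations are counted.
    """
    return params_b * bits_per_weight / 8

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_vram_gb(9, bits):.1f} GB")
# FP16: ~18.0 GB  -> too large for 16 GB cards
# INT8: ~9.0 GB
# INT4: ~4.5 GB
```

This is why 16 GB cards cannot hold the model at FP16 but become candidates under INT8 or INT4.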
Tokens/sec Results by GPU
| GPU | VRAM | Gemma 2 9B FP16 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM for FP16 |
| RTX 4060 Ti | 16 GB | N/A | Tight fit; unstable at FP16 |
| RTX 3090 | 24 GB | 36 | Comfortable fit |
| RTX 5080 | 16 GB | N/A | Needs INT8 or INT4 |
| RTX 5090 | 32 GB | 78 | Plenty of headroom |
At FP16, only the RTX 3090 and RTX 5090 have enough VRAM to run Gemma 2 9B reliably. For 16 GB cards, quantisation is essential. Check whether your target GPU is viable with our Can RTX 5080 run Gemma? guide.
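That viability rule can be sketched as a small helper. This is an illustrative function of ours, not part of any benchmark tooling, and it counts weight memory only, so treat a near-exact fit as unstable in practice:

```python
def smallest_quantisation_that_fits(vram_gb: float, params_b: float = 9.0):
    """Return the highest precision whose weights fit in vram_gb.

    Weight-only estimate: the KV cache and runtime buffers need
    additional headroom on top of these figures.
    """
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        if params_b * bits / 8 <= vram_gb:
            return name
    return None

print(smallest_quantisation_that_fits(24))  # RTX 3090 -> FP16
print(smallest_quantisation_that_fits(16))  # RTX 5080 -> INT8
print(smallest_quantisation_that_fits(6))   # RTX 3050 -> INT4
```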
Quantisation Comparison
Quantisation opens Gemma 2 9B to a wider range of GPUs. Below we compare FP16, INT8, and INT4 performance. For a detailed analysis of how quantisation affects quality and speed, see our FP16 vs INT8 vs INT4 comparison.
| GPU | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|---|
| RTX 3050 (6 GB) | N/A | N/A | 7 |
| RTX 4060 (8 GB) | N/A | 14 | 18 |
| RTX 4060 Ti (16 GB) | N/A | 26 | 34 |
| RTX 3090 (24 GB) | 36 | 44 | 52 |
| RTX 5080 (16 GB) | N/A | 52 | 65 |
| RTX 5090 (32 GB) | 78 | 92 | 108 |
INT4 quantisation makes the RTX 4060 Ti a viable option at 34 tok/s, and even the RTX 4060 can manage 18 tok/s. The RTX 5080 at INT4 delivers 65 tok/s, making it competitive with the RTX 3090 at FP16.
Cost Efficiency Analysis
We compare cost efficiency using INT4 throughput, the practical configuration for most GPUs in this range. Prices are based on GigaGPU dedicated hosting rates.
| GPU | INT4 tok/s | Approx. Monthly Cost | tok/s per £/month |
|---|---|---|---|
| RTX 4060 | 18 | ~£60 | 0.30 |
| RTX 4060 Ti | 34 | ~£75 | 0.45 |
| RTX 3090 | 52 | ~£110 | 0.47 |
| RTX 5080 | 65 | ~£160 | 0.41 |
| RTX 5090 | 108 | ~£250 | 0.43 |
The RTX 3090 offers the best value per pound at INT4, followed closely by the RTX 4060 Ti. For a thorough comparison, see our best GPU for Gemma guide.
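The value ranking falls out of a one-line division. A quick sketch that recomputes the table's last column, with figures copied from the table above:

```python
# (GPU, INT4 tok/s, approx. monthly cost in £) -- from the table above
gpus = [
    ("RTX 4060", 18, 60),
    ("RTX 4060 Ti", 34, 75),
    ("RTX 3090", 52, 110),
    ("RTX 5080", 65, 160),
    ("RTX 5090", 108, 250),
]

# Sort by tokens/sec per pound of monthly cost, best value first
ranked = sorted(gpus, key=lambda g: g[1] / g[2], reverse=True)
for name, tps, cost in ranked:
    print(f"{name}: {tps / cost:.2f} tok/s per £")
```

Running this puts the RTX 3090 first (0.47) and the RTX 4060 Ti second (0.45), matching the table.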
GPU Recommendations
- Best budget: RTX 4060 Ti at INT4 — 34 tok/s for ~£75/month is solid for development APIs.
- Best value: RTX 3090 — strong FP16 performance and top cost efficiency at INT4.
- Best performance: RTX 5090 — 78 tok/s FP16 or 108 tok/s INT4 for production workloads.
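To translate those throughputs into user-visible latency, divide the response length by tok/s. A rough sketch for this benchmark's 256-token output (decode time only; prompt prefill adds a little on top):

```python
def response_seconds(output_tokens: int, tok_per_sec: float) -> float:
    """Wall-clock decode time for a response, ignoring prefill."""
    return output_tokens / tok_per_sec

for name, tps in [("RTX 4060 Ti @ INT4", 34),
                  ("RTX 3090 @ FP16", 36),
                  ("RTX 5090 @ INT4", 108)]:
    print(f"{name}: {response_seconds(256, tps):.1f} s")
# RTX 4060 Ti @ INT4: 7.5 s
# RTX 3090 @ FP16: 7.1 s
# RTX 5090 @ INT4: 2.4 s
```

With streaming enabled, the first tokens arrive much sooner than these totals, so even the budget options feel responsive for chat-style use.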
Compare these results with the Gemma 2 27B benchmark for the larger variant, or check the Phi-3 Mini results for a smaller alternative. All data is available in the Benchmarks category.
Conclusion
Gemma 2 9B offers a compelling balance of capability and size. With INT4 quantisation, it runs well on mid-range GPUs, while the RTX 3090 and RTX 5090 can handle it at full FP16 precision. Whether you are building a summarisation pipeline or a multilingual chatbot, Gemma 2 9B is a practical choice for dedicated GPU deployments.
Host Gemma 2 9B on Dedicated Hardware
Get bare-metal GPU servers optimised for LLM inference with full root access and fast NVMe storage.
Browse GPU Servers