Here is a number that should get your attention: 52 tokens per second at full FP16 precision, with no quantisation compromises at all. The RTX 3090 has enough VRAM to load Google’s Gemma 2 9B at native precision and enough bandwidth to keep tokens flowing at production speed. We verified it on GigaGPU dedicated servers.
Benchmark Data
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 52.0 tok/s |
| Tokens/sec (batched, bs=8) | 83.2 tok/s |
| Per-token latency | 19.2 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 16K |
| Performance rating | Very Good |
512-token prompt, 256-token completion, single-stream via llama.cpp or vLLM. Running at FP16 means every nuance of Gemma 2’s training carries through to output — no quantisation artefacts, no degraded instruction following on edge cases.
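To reproduce the single-stream figure yourself, a minimal timing sketch follows. It assumes an OpenAI-compatible `/v1/completions` route, which both llama.cpp's server and vLLM expose; the endpoint URL and model id here are placeholders for your own deployment, not values from our benchmark harness.

```python
import time
import json
import urllib.request

# Placeholder endpoint; point this at your llama.cpp or vLLM server.
ENDPOINT = "http://localhost:8080/v1/completions"

def throughput(n_tokens: int, elapsed_s: float) -> tuple[float, float]:
    """Return (tokens/sec, ms per token) for a timed completion."""
    tok_s = n_tokens / elapsed_s
    return tok_s, 1000.0 / tok_s

def benchmark(prompt: str, max_tokens: int = 256) -> tuple[float, float]:
    """Time one single-stream completion and report its throughput."""
    body = json.dumps({
        "model": "gemma-2-9b",   # assumed model id; match your server config
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    elapsed = time.perf_counter() - start
    return throughput(reply["usage"]["completion_tokens"], elapsed)
```

As a sanity check on the table: a 256-token completion finishing in about 4.92 seconds works out to roughly 52 tok/s and 19.2 ms per token.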
Memory Layout
| Component | VRAM |
|---|---|
| Model weights (FP16) | 18.9 GB |
| KV cache + runtime | ~2.8 GB |
| Total RTX 3090 VRAM | 24 GB |
| Headroom after weights | ~5.1 GB |
Gemma 2 9B at FP16 is a substantial model: nearly 19 GB for weights alone. The 3090's 24 GB leaves roughly 5.1 GB once weights are loaded; KV cache and runtime buffers at 16K context consume about 2.8 GB of that, so around 2.3 GB remains genuinely free. This is a tight but well-balanced fit, with room for full 16K contexts and modest concurrent serving.
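The budget above is simple arithmetic, sketched here from the table's own figures. The `kv_bytes_per_token` helper is a generic full-attention estimate; the layer and head counts in the usage note below are assumptions about the model config, not measured values.

```python
# Back-of-envelope check on the VRAM table above (figures in GB).
TOTAL_VRAM = 24.0       # RTX 3090
WEIGHTS_FP16 = 18.9     # measured weight footprint
KV_AND_RUNTIME = 2.8    # measured KV cache + runtime at 16K context

after_weights = TOTAL_VRAM - WEIGHTS_FP16   # ~5.1 GB left for cache + runtime
free = after_weights - KV_AND_RUNTIME       # ~2.3 GB truly free

def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_el: int = 2) -> int:
    """Naive full-attention KV size per token: K and V for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el
```

With illustrative Gemma-2-9B-like values (42 layers, 8 KV heads, head dim 256, all of which you should verify against the model config), the naive formula gives roughly 0.33 MB per token. The measured 2.8 GB at 16K comes in well under the naive full-context estimate, consistent with Gemma 2 interleaving sliding-window layers whose cache is capped below the full context length.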
Economics
| Cost Metric | Value |
|---|---|
| Server cost | £0.75/hr (£149/mo) |
| Cost per 1M tokens | £4.006 |
| Tokens per £1 | 249,600 |
| Break-even vs API | ~1 req/day |
£4.01 per million tokens at full precision is a strong proposition. Batched at bs=8, you drop to approximately £2.50/M. Compared to hosted Gemma API endpoints, self-hosting on the 3090 saves substantially from the first day of moderate use. Our cost calculator lets you plug in your own volume numbers.
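The cost figures follow directly from the hourly rate and sustained throughput; a short sketch of the arithmetic, using the numbers from the tables above:

```python
def cost_per_million(price_per_hour: float, tok_s: float) -> float:
    """Cost (in £) per 1M tokens at a given sustained throughput."""
    tokens_per_hour = tok_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

single = cost_per_million(0.75, 52.0)    # single-stream: ~£4.006/M
batched = cost_per_million(0.75, 83.2)   # batched bs=8: ~£2.50/M
per_pound = 52.0 * 3600 / 0.75           # 249,600 tokens per £1
```

Substitute your own utilisation: a server idling half the day effectively doubles the per-token cost, which is what a volume calculator accounts for.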
Who This Is For
The RTX 3090 with Gemma 2 9B at FP16 is a production-ready configuration. At 52 tok/s single-stream (83 tok/s batched), it handles real-time chat, document summarisation, and RAG pipelines without perceptible lag. If you need the quality that only unquantised weights deliver — particularly for multilingual or nuanced reasoning tasks — this is the tier to target. Compare against other models in our tok/s benchmark.
Deploy now:
docker run --gpus all -v /path/to/models:/models -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/gemma-2-9b.f16.gguf --host 0.0.0.0 --port 8080 -ngl 99
See the Gemma hosting guide for vLLM configuration. Also: best GPU for LLM inference, full benchmarks, cheapest GPU for AI.
Gemma 2 9B at Full Precision on the RTX 3090
No quantisation, no compromises. UK datacentre, dedicated hardware, flat rate.
Provision an RTX 3090