Gemma 9B (INT4) on RTX 3090: Monthly Cost & Token Output
Dedicated RTX 3090 hosting for Gemma 9B (INT4) inference: fixed monthly pricing with unlimited tokens.
Monthly Cost Summary
With 19 GB of free VRAM and ~110 tok/s, quantising Gemma 9B to INT4 makes the RTX 3090 an incredibly versatile inference server. You get roughly 285 million tokens monthly for £89, and the massive VRAM headroom supports aggressive batching or even co-hosting additional models.
| Metric | Value |
|---|---|
| GPU | RTX 3090 (24 GB VRAM) |
| Model | Gemma 9B (9B parameters, INT4 quantised) |
| Monthly Server Cost | £89/mo |
| Tokens/Second | ~110 tok/s |
| Tokens/Day (24h) | ~9,504,000 |
| Tokens/Month | ~285,120,000 |
| Effective Cost per 1M Tokens | £0.3121 |
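The figures in the table follow from simple arithmetic. A minimal sketch, assuming the ~110 tok/s single-stream rate running 24/7 over a 30-day month:

```python
# Estimate monthly token output and effective cost for a fixed-price server.
# Assumes ~110 tok/s sustained single-stream throughput and a 30-day month.
TOKENS_PER_SECOND = 110
MONTHLY_COST_GBP = 89.0

tokens_per_day = TOKENS_PER_SECOND * 60 * 60 * 24   # seconds in a day
tokens_per_month = tokens_per_day * 30              # 30-day month
cost_per_million = MONTHLY_COST_GBP / (tokens_per_month / 1_000_000)

print(f"Tokens/day:         {tokens_per_day:,}")
print(f"Tokens/month:       {tokens_per_month:,}")
print(f"Cost per 1M tokens: £{cost_per_million:.4f}")
```

Real-world output depends on utilisation: the monthly figure assumes the GPU is saturated around the clock, so treat it as a ceiling rather than a guarantee.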
Maximum Flexibility with Quantised Inference
INT4 quantisation frees up VRAM that translates directly into higher concurrent capacity:
Note that provider pricing is quoted in USD while GigaGPU pricing is in GBP; at current exchange rates the effective rates are in the same ballpark.

| Provider | Cost per 1M Tokens | vs GigaGPU |
|---|---|---|
| GigaGPU (RTX 3090) | £0.3121 | — |
| Together.ai | $0.20 | Comparable |
| Fireworks | $0.20 | Comparable |
| Google Vertex | $0.30 | Comparable |
Break-Even Analysis
Against Together.ai at $0.20/1M tokens, break-even sits at roughly 445M tokens/month (taking the £89 and $0.20 figures at face value, before currency conversion). With 19 GB of free VRAM, the 3090 can batch requests at a scale that pushes practical throughput significantly beyond the single-stream figure.
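The break-even volume is just the fixed monthly cost divided by the per-token API rate. A sketch, ignoring GBP/USD conversion as the rough figure above does:

```python
# Break-even volume: the monthly token count (in millions) at which a
# fixed-price server costs the same as a pay-per-token API.
# Currency conversion is ignored, matching the rough figure in the text.
def break_even_million_tokens(monthly_cost, api_price_per_million):
    """Return monthly volume in millions of tokens where costs are equal."""
    return monthly_cost / api_price_per_million

volume = break_even_million_tokens(89.0, 0.20)
print(f"Break-even: {volume:.0f}M tokens/month")
```

Below that volume the API is cheaper; above it, the dedicated server wins, and every additional token is effectively free.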
Hardware & Configuration Notes
19 GB of spare VRAM is remarkable for a 9B-parameter model. Consider running Gemma 9B alongside an embedding model for RAG, or a second inference model for different query types.
- VRAM usage: Gemma 9B (INT4) requires approximately 5 GB VRAM. The RTX 3090 provides 24 GB, leaving 19 GB headroom for KV cache and batching.
- Quantisation: INT4 quantisation reduces Gemma 9B's weight footprint from ~9 GB (8-bit) to ~5 GB VRAM, leaving 19 GB free on the 3090 for maximum batching capacity.
- Batching: With continuous batching enabled (e.g., vLLM or TGI), you can serve multiple concurrent users from a single GPU, increasing effective throughput significantly.
- Scaling: Need more throughput? Add additional RTX 3090 nodes behind a load balancer. GigaGPU supports multi-server deployments with simple configuration.
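To get a feel for what the 19 GB of headroom buys in concurrency, here is a rough KV-cache sizing sketch. The per-token formula is a generic transformer estimate, and the layer count, KV-head count, head dimension, and FP16 cache dtype are illustrative assumptions for a 9B-class model, not measured Gemma figures:

```python
# Rough estimate of how many full-context sequences fit in spare VRAM.
# All model dimensions below are illustrative assumptions, not measured
# Gemma values; real capacity depends on the serving stack's allocator.
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 9B-class config: 42 layers, 8 KV heads, head_dim 256, FP16 cache.
per_token = kv_cache_bytes_per_token(42, 8, 256)
free_vram = 19 * 1024**3            # 19 GB headroom from the table above
context_len = 8192                  # tokens per sequence (assumption)

concurrent_seqs = free_vram // (per_token * context_len)
print(f"KV cache per token:            {per_token / 1024:.0f} KiB")
print(f"Full-context sequences in 19 GB: {concurrent_seqs}")
```

In practice, most requests use far less than the full context, so a continuous-batching server like vLLM will typically sustain many more concurrent users than this worst-case figure suggests.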
Best Use Cases for Gemma 9B (INT4) on RTX 3090
- High-concurrency production deployments
- Multi-model GPU setups for diverse AI workloads
- Large-context document processing and analysis
- Cost-efficient 9B-class inference at scale
- Flexible development and research environments
285M Tokens/Month, 19 GB Spare VRAM
Deploy quantised Gemma 9B on an RTX 3090 for maximum flexibility at £89/month.