Mixtral 8x7B (INT4) on RTX 4060 Ti: Monthly Cost & Token Output
Dedicated RTX 4060 Ti hosting for Mixtral 8x7B (46.7B parameters, INT4-quantised) inference, with fixed monthly pricing and unlimited tokens.
Monthly Cost Summary
Mixtral 8x7B on a £69/month GPU — INT4 quantisation makes it possible. By compressing the 46.7B-parameter mixture-of-experts model to ~14 GB, it fits on the RTX 4060 Ti with 2 GB to spare. At 33.8 tok/s, throughput is moderate but sufficient for production use cases where GPT-3.5-class quality matters more than raw speed.
| Metric | Value |
|---|---|
| GPU | RTX 4060 Ti (16 GB VRAM) |
| Model | Mixtral 8x7B (46.7B parameters, INT4-quantised) |
| Monthly Server Cost | £69/mo |
| Tokens/Second | ~33.8 tok/s |
| Tokens/Day (24h) | ~2,920,320 |
| Tokens/Month (30 days) | ~87,609,600 |
| Effective Cost per 1M Tokens | £0.7876 |
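The derived figures in the table are simple arithmetic; a minimal sketch, assuming 24/7 uptime and a 30-day billing month:

```python
# Derivation of the summary table: 24/7 uptime, 30-day billing month.
tok_per_s = 33.8
monthly_cost_gbp = 69.0

tokens_per_day = tok_per_s * 60 * 60 * 24        # 2,920,320
tokens_per_month = tokens_per_day * 30           # 87,609,600
cost_per_million = monthly_cost_gbp / (tokens_per_month / 1e6)

print(f"{tokens_per_day:,.0f} tokens/day")
print(f"{tokens_per_month:,.0f} tokens/month")
print(f"£{cost_per_million:.4f} per 1M tokens")  # £0.7876
```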
Mixture-of-Experts Quality at Entry-Level Pricing
Mixtral’s MoE architecture activates only ~13B of its 46.7B parameters per forward pass, keeping inference efficient even under quantisation. Here is how the effective per-token cost compares with hosted API providers (API rates are quoted in USD):
| Provider | Cost per 1M Tokens | vs GigaGPU |
|---|---|---|
| GigaGPU (RTX 4060 Ti) | £0.7876 | — |
| Together.ai | $0.60 | Comparable |
| Fireworks | $0.50 | Comparable |
| Groq | $0.24 | Comparable |
Break-Even Analysis
Compared with Groq at $0.24 per 1M tokens, break-even falls at roughly 287.5M tokens/month (69 / 0.24; see the sketch below). That is well beyond a single card's ~87.6M-token monthly output, so the flat fee's case rests on predictable cost, dedicated capacity, and no API dependency rather than on undercutting the cheapest hosted rates. Mixtral’s efficient expert routing also means INT4 quantisation has a smaller impact on output quality than you might expect for a model of this size.
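A quick sketch of that break-even arithmetic; note it treats pounds and dollars at rough parity, as the comparison above does:

```python
# Break-even vs per-token API pricing. Caveat: this treats £ and $
# at rough parity; apply a real exchange rate for a precise figure.
flat_monthly_cost = 69.0      # £/month for the dedicated GPU
api_rate_per_million = 0.24   # $ per 1M tokens (Groq)

breakeven_million_tokens = flat_monthly_cost / api_rate_per_million
print(f"Break-even at ~{breakeven_million_tokens:.1f}M tokens/month")  # ~287.5M
```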
Hardware & Configuration Notes
INT4 compression brings Mixtral 8x7B from 26 GB down to ~14 GB, leaving roughly 2 GB free on the 4060 Ti's 16 GB. VRAM is tight, so this setup works best with shorter context lengths and moderate batch sizes.
- VRAM usage: Mixtral 8x7B (INT4) requires approximately 14 GB of VRAM. The RTX 4060 Ti provides 16 GB, leaving ~2 GB of headroom for the KV cache and batching (see the back-of-envelope sketch after this list).
- Quantisation: INT4 cuts Mixtral's weights from 26 GB to ~14 GB, and its expert routing preserves output quality well under compression.
- Batching: With continuous batching enabled (e.g., vLLM or TGI), a single GPU can serve multiple concurrent users, raising effective throughput significantly; see the vLLM sketch below.
- Scaling: Need more throughput? Add additional RTX 4060 Ti nodes behind a load balancer. GigaGPU supports multi-server deployments with simple configuration.
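For a sense of what the ~2 GB of headroom buys, here is a back-of-envelope KV-cache estimate using Mixtral's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an FP16 cache. Serving stacks such as vLLM pre-allocate this memory in pages, so treat the result as a ceiling rather than a guarantee:

```python
# Back-of-envelope KV-cache budget for ~2 GiB of leftover VRAM.
layers, kv_heads, head_dim = 32, 8, 128   # Mixtral 8x7B config values
bytes_per_elem = 2                        # FP16 cache entries

# K and V tensors per token, summed across all layers:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
headroom_bytes = 2 * 1024**3              # ~2 GiB left after weights

max_cached_tokens = headroom_bytes // kv_bytes_per_token
print(kv_bytes_per_token)   # 131072 bytes, i.e. 128 KiB per token
print(max_cached_tokens)    # 16384 tokens total, shared across the batch
```

Roughly 16k tokens of cache shared across all concurrent sequences is why shorter contexts and moderate batch sizes are the sweet spot on this card.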
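A minimal vLLM sketch for this setup follows. The AWQ checkpoint name and the flag values are illustrative assumptions, not a tuned production recipe:

```python
# Minimal sketch: INT4 (AWQ) Mixtral on a single 16 GB GPU with vLLM.
# The model ID is an assumption; any INT4/AWQ Mixtral build would do.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # assumed checkpoint
    quantization="awq",
    max_model_len=4096,               # keep contexts short: VRAM is tight
    gpu_memory_utilization=0.95,      # leave slack for CUDA overheads
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
print(outputs[0].outputs[0].text)
```

The same model and quantisation flags apply to vLLM's OpenAI-compatible server, which continuously batches requests from concurrent users.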
Best Use Cases for Mixtral 8x7B (INT4) on RTX 4060 Ti
- Budget-friendly access to GPT-3.5-class quality
- Production chatbots that prioritise reasoning quality
- Code generation with strong instruction-following on a budget
- Small-team AI assistants needing advanced capabilities
- A/B testing Mixtral quality against smaller models
Mixtral 8x7B for Just £69/Month
Run the full mixture-of-experts model on a dedicated RTX 4060 Ti. INT4 quantised, flat pricing, no API dependency.