Cost Per 1M Tokens for Mistral 7B Self-Hosted: Every GPU, Every Precision

Exactly how much you pay per million Mistral 7B tokens on each GPU we host, at FP16 and FP8. Compared to OpenAI, Together AI and the rest of the hosted-API market.

The cleanest way to compare GPU servers for LLM inference is cost per million tokens at full utilisation. It rolls hardware price, throughput, precision support, and the inevitable inefficiencies of real workloads into one number you can put on a spreadsheet. This page is the consolidated cost-per-token table for Mistral 7B Instruct, the most-deployed open-weight LLM in our customer base.

TL;DR

RTX 5080 at FP8 is the cost leader at £0.09/1M tokens at 60% utilisation, with the 5060 Ti at effectively the same figure and the 5090 at £0.13. The 3090 at FP16 is £0.14. Compared to hosted APIs (OpenAI gpt-4o-mini at £0.60/1M output, Together Mistral 7B at £0.16/1M), self-hosting is between roughly 2× (vs Together) and 7× (vs OpenAI) cheaper at meaningful volumes.

Methodology

The formula:

cost_per_1M = (monthly_GBP / (aggregate_tok_per_sec * 86400 * 30 * utilisation)) * 1_000_000

Where:

  • monthly_GBP = the GigaGPU monthly rental price for that card
  • aggregate_tok_per_sec = vLLM 0.6.3 aggregate throughput at 50 concurrent users (from our Mistral benchmarks)
  • utilisation = the fraction of the month the GPU is actually serving traffic at full throughput

I use 60% utilisation as the headline number — that is a realistic steady-state for a busy production deployment with traffic spread across the day. At 100% utilisation (batch jobs running 24/7) the numbers below scale down by 0.6×.
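The formula above as a runnable sketch. The £200/month price and 1,000 tok/s throughput here are placeholder inputs for illustration, not a card we host; substitute your own rental price and benchmarked throughput:

```python
def cost_per_1m_tokens(monthly_gbp: float, tok_per_sec: float,
                       utilisation: float = 0.6) -> float:
    """Cost in GBP per 1M tokens for a dedicated GPU at a given utilisation."""
    seconds_per_month = 86400 * 30
    tokens_per_month = tok_per_sec * seconds_per_month * utilisation
    return monthly_gbp / tokens_per_month * 1_000_000

# Placeholder example: a £200/month card sustaining 1,000 aggregate tok/s
print(round(cost_per_1m_tokens(200, 1000), 2))  # 0.13 at 60% utilisation
```

Note that cost scales inversely with utilisation: passing `utilisation=1.0` multiplies the 60% figure by 0.6, exactly as described above.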

Cost per 1M tokens — Mistral 7B FP16

| GPU | Monthly | Aggregate tok/s (FP16) | Tokens/month at 60% util | Cost per 1M |
|---|---|---|---|---|
| RTX 5060 Ti 16 GB | £119 | 580 | 900M | £0.13 |
| RTX 5080 | £189 | 820 | 1.27B | £0.15 |
| RTX 3090 | £159 | 720 | 1.12B | £0.14 |
| RTX 4090 | £289 | 950 | 1.48B | £0.20 |
| RTX 5090 | £399 | 1,180 | 1.83B | £0.22 |
| RTX 6000 Pro 96 GB | £899 | 1,140 | 1.77B | £0.51 |
| A100 80 GB | POA | 1,310 | 2.04B | POA |

FP16 baseline. Lower is better. The 5060 Ti edges ahead on paper, but FP16 weights for a 7B model take roughly 14 GB, leaving its 16 GB card almost no KV-cache room; the 3090 at £159, with the most favourable price/throughput ratio and 24 GB of VRAM, is the practical cost leader.

Cost per 1M tokens — Mistral 7B FP8 (Blackwell native)

FP8 lifts throughput by 50–60% on Blackwell hardware with negligible quality impact. Cost-per-token drops accordingly:

| GPU | Monthly | Aggregate tok/s (FP8) | Tokens/month at 60% util | Cost per 1M |
|---|---|---|---|---|
| RTX 5060 Ti 16 GB | £119 | 880 | 1.37B | £0.09 |
| RTX 5080 | £189 | 1,290 | 2.01B | £0.09 |
| RTX 5090 | £399 | 1,920 | 2.99B | £0.13 |
| RTX 6000 Pro 96 GB | £899 | 1,860 | 2.89B | £0.31 |

FP8 native. The 5080 and 5060 Ti lead on cost per token; the 5090 wins on absolute throughput in the same league.

Self-hosted vs hosted APIs

Comparable hosted-API options for Mistral 7B (or equivalent-class small models), September 2026 pricing:

| Provider | Model | Output cost per 1M | Notes |
|---|---|---|---|
| OpenAI | gpt-4o-mini | £0.60 | Closed, USD-billed |
| Anthropic | Claude 3.5 Haiku | £3.20 | Closed |
| Together AI | Mistral 7B Instruct v0.3 | £0.16 | Open weights, hosted |
| Fireworks AI | Mistral 7B Instruct | £0.16 | Open weights, hosted |
| Groq | Mistral 7B (when available) | £0.20 | LPU hardware |
| GigaGPU dedicated 5080 (FP8) | Mistral 7B (yours) | £0.09 | Self-hosted, fixed cost |

Self-hosting is dramatically cheaper at non-trivial volumes. The break-even point against Together is roughly £230/mo of API spend.
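The raw break-even arithmetic is easy to reproduce (£189/month is the 5080 rental price and £0.16/1M is Together's rate from the table above; equating the two ignores ops overhead, which is why the practical threshold lands somewhat above the raw figure):

```python
def breakeven_tokens_per_month(monthly_gbp: float,
                               api_price_per_1m_gbp: float) -> float:
    """Monthly token volume at which hosted-API spend equals the fixed GPU rental."""
    return monthly_gbp / api_price_per_1m_gbp * 1_000_000

# A £189/month 5080 vs Together at £0.16 per 1M tokens
tokens = breakeven_tokens_per_month(189, 0.16)
print(round(tokens / 1e9, 2))  # 1.18 billion tokens/month, i.e. £189 of API spend
```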

Why utilisation rate dominates the math

The single biggest variable is the utilisation rate. At 100% utilisation the 5080 at FP8 hits £0.06/1M; at 30% it is £0.19/1M. Real-world rates we see:

  • Internal tooling / B2B SaaS — 15–25% utilisation. Self-hosting still wins above ~£300/mo of equivalent API spend.
  • Customer-facing chatbot — 30–50% utilisation. Self-hosting wins decisively above ~£150/mo of API spend.
  • RAG-heavy backend with embeddings + LLM — 60–80% utilisation. Self-hosting always wins.
  • Batch / nightly jobs — Spiky 100%-then-idle. Hosted APIs may actually be cheaper if total minutes are low.

The math does not help you decide unless you measure your actual traffic. Run a 7-day token-volume report against your hosted-API bill, convert the total into hours of GPU time at your card's benchmarked throughput, and divide by the 168 hours in the week: that is your real utilisation rate.
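That measurement reduces to a two-line calculation. The weekly token count below is a placeholder; 1,290 tok/s is the 5080 FP8 aggregate figure from our benchmarks:

```python
def real_utilisation(tokens_per_week: float, aggregate_tok_per_sec: float) -> float:
    """Fraction of a 168-hour week a GPU would spend at full throughput
    to serve the measured token volume."""
    gpu_seconds_needed = tokens_per_week / aggregate_tok_per_sec
    return gpu_seconds_needed / (168 * 3600)

# Placeholder traffic: 500M tokens/week on a 5080 at FP8 (1,290 tok/s)
print(round(real_utilisation(500e6, 1290), 2))  # 0.64
```

Feed the result back into the cost formula above to see which row of the tables you actually occupy.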

Which GPU is the cost leader?

  • Lowest absolute cost per million tokens (FP8): RTX 5080 at £0.09/1M, with the 5060 Ti at the same figure.
  • Lowest absolute cost per million (FP16): RTX 5060 Ti at £0.13/1M on paper; the RTX 3090 at £0.14/1M is the practical pick, with 24 GB of VRAM headroom for KV cache.
  • Best for genuinely high concurrency: RTX 5090, close to the 5080 on cost per token but with roughly 1.5× the absolute capacity.
  • When the 6000 Pro / A100 wins: never on cost-per-token for 7B-class models. Pick those for ECC, larger models, or compliance.

Bottom line

For Mistral 7B specifically, the cost story is clear: RTX 5080 if your traffic fits, RTX 5090 when you need more concurrency, RTX 3090 if you cannot use FP8. Self-hosting beats every hosted-API price point we benchmarked at any meaningful volume.

For Llama 3 cost across the same GPUs, see cost per 1M tokens — Llama 3; for the deployment-side details, our API hosting page.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
