Most “should I self-host” debates end with a vibe rather than a number. They shouldn’t. The break-even between an RTX 4090 24GB dedicated server and any hosted API is a single division: monthly fixed cost divided by the API’s blended rate. This article gives you the formula, the inputs, worked examples for every popular API tier, capacity tables for the open-weight models the 4090 actually runs, monthly active user (MAU) thresholds for typical product shapes, and the situations where the headline formula misleads. A wider hardware menu is available on dedicated GPU hosting.
Contents
- The one formula
- Inputs you need
- Worked examples by API
- 4090 capacity by model
- MAU thresholds by product shape
- Sanity checks before you commit
- When the formula lies
- Decision matrix and verdict
The one formula
break_even_tokens_per_month = monthly_fixed_cost / api_blended_$_per_M_tokens
Worked: $700 / $5.00/M = 140 M tokens/month vs GPT-4o
$700 / $0.30/M = 2,333 M tokens/month vs GPT-4o-mini
If forecast volume exceeds break_even_tokens_per_month, self-host. If not, stay on the API. Then sanity-check that the 4090 can physically deliver that volume at acceptable latency and quality, and that you have not picked a workload where the formula misleads (long context, sub-30 ms TTFT, hard reasoning).
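A minimal sketch of the same arithmetic in Python; the $700 figure and blended rates are the article's inputs, not live price quotes:

```python
def break_even_tokens_per_month(monthly_fixed_cost: float,
                                api_blended_per_m: float) -> float:
    """Millions of tokens/month at which the flat GPU bill equals the API bill."""
    return monthly_fixed_cost / api_blended_per_m

# $700/month 4090 vs GPT-4o at a $5.00/M blended rate
print(break_even_tokens_per_month(700, 5.00))  # 140.0 M tokens/month
# vs GPT-4o-mini at $0.30/M blended
print(break_even_tokens_per_month(700, 0.30))  # ~2333.3 M tokens/month
```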
Inputs you need
| Input | Typical value | Notes |
|---|---|---|
| 4090 monthly cost | $700 (~£550 midpoint) | Flat dedicated; no metering |
| API input price | $0.15 – $15.00 / M | Varies wildly |
| API output price | 3-5x input | Output dominates blended for chat |
| Input:output ratio | 2:1 typical, 4:1 RAG, 1:2 agent loops | Measure your actual ratio |
| Forecast tokens/month | your number | Annualise from a 7-day measurement |
| Self-host model | Llama 8B / 70B, Qwen 14B / 32B | Pick the cheapest that meets quality |
| Quality bar | your eval suite | Build before you switch |
Compute the API blended rate as (2 × input + output) / 3 for a 2:1 in:out ratio, or weight it to your measured ratio. For agent backends, output usually dominates because tool-call composition is mostly model-generated text.
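A small helper for the blending step, using list prices from the table below as example inputs:

```python
def blended_rate(input_per_m: float, output_per_m: float,
                 in_share: float = 2.0, out_share: float = 1.0) -> float:
    """Blended $/M tokens, weighted by the input:output token ratio."""
    return (in_share * input_per_m + out_share * output_per_m) / (in_share + out_share)

print(blended_rate(2.50, 10.00))        # 5.0  -> GPT-4o at the default 2:1
print(blended_rate(3.00, 15.00, 1, 2))  # 11.0 -> Sonnet at a 1:2 agent-loop mix
```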
Worked examples by API
The 4090 dedicated UK server costs ~$700/month flat. Break-even tokens for each major hosted API at a 2:1 input:output blend:
| API tier | Input $/M | Output $/M | Blended $/M (2:1) | Break-even tokens/mo | Daily tokens to break-even |
|---|---|---|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 | $5.00 | 140 M | ~4.7 M |
| OpenAI GPT-4o mini | $0.15 | $0.60 | $0.30 | 2,333 M | ~78 M |
| OpenAI GPT-4 Turbo | $10.00 | $30.00 | $16.67 | 42 M | ~1.4 M |
| OpenAI GPT-3.5 Turbo | $0.50 | $1.50 | $0.83 | 843 M | ~28 M |
| Anthropic Claude Sonnet | $3.00 | $15.00 | $7.00 | 100 M | ~3.3 M |
| Anthropic Claude Haiku | $0.25 | $1.25 | $0.58 | 1,207 M | ~40 M |
| Anthropic Claude Opus | $15.00 | $75.00 | $35.00 | 20 M | ~0.67 M |
| Together AI Llama 70B | $0.88 | $0.88 | $0.88 | 795 M | ~26 M |
Worked example: support agent migration
A support team running 1,200 chats/day, 8 turns each, 350 tokens average per turn generates ~3.4 M tokens/day, or ~100 M tokens/month. On Sonnet that is $700/month: roughly the price of a dedicated 4090, so they break even today. If traffic doubles inside a year (200 M tokens/month), the Sonnet bill reaches $1,400/month while the card stays at $700, a $700/month saving that keeps growing with volume. Quality match is the gating concern: build a 100-prompt eval and run Llama 70B AWQ vs Sonnet before pulling the trigger.
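The same worked example, reproducible in a few lines (a 30-day month is assumed):

```python
chats_per_day, turns_per_chat, tokens_per_turn = 1_200, 8, 350
daily_tokens = chats_per_day * turns_per_chat * tokens_per_turn  # 3,360,000 (~3.4 M)
monthly_m = daily_tokens * 30 / 1e6                              # ~100.8 M tokens/month
sonnet_bill = monthly_m * 7.00                                   # ~$706 at $7/M blended
print(f"{monthly_m:.1f} M tokens/mo -> ${sonnet_bill:,.0f} on Sonnet")
```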
4090 capacity by model
Sustainable monthly token output assumes 90% utilisation; bursty workloads need bigger headroom and will see lower effective throughput. Aggregate t/s figures are the saturated batch numbers from the underlying benchmark suite.
| Self-host model | Aggregate t/s | Tokens/month at 90% util | Break-evens it covers |
|---|---|---|---|
| Llama 3 8B FP8 + FP8 KV | 1,140 (sat. batch 64) | ~2.66 B | GPT-4o-mini, GPT-3.5, Haiku, all higher |
| Mistral 7B FP8 | ~1,200 | ~2.80 B | GPT-4o-mini, GPT-3.5, Haiku, all higher |
| Phi-3 mini FP8 | ~2,000 | ~4.66 B | Even GPT-4o-mini at peak volume |
| Mistral Nemo 12B FP8 | ~750 | ~1.75 B | GPT-3.5, Haiku, all higher |
| Qwen 2.5 14B AWQ | ~720 | ~1.68 B | GPT-3.5, all higher |
| Qwen 2.5 32B AWQ | ~280 | ~654 M | Sonnet, GPT-4o, Mistral Large, all higher |
| Mixtral 8x7B AWQ | ~340 | ~793 M | Sonnet, GPT-4o, all higher |
| Llama 3 70B AWQ | ~80 | ~187 M | GPT-4o, Sonnet, GPT-4 Turbo, Opus |
At 100% utilisation the Llama 3 8B FP8 card sustains ~2.95 B tokens/month (~2.66 B at the 90% figure used in the table), well above any realistic break-even against GPT-4o-mini at 2.33 B/month. Qwen 32B's ~654 M/month capacity clears the GPT-4o break-even of 140 M more than four times over. See the underlying benchmarks: 8B, Qwen 14B, Qwen 32B, 70B INT4.
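The tokens/month column follows from a single conversion; a 30-day month is assumed here to match the table:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month

def monthly_capacity_b(aggregate_tps: float, utilisation: float = 0.90) -> float:
    """Billions of tokens/month sustained at the given utilisation."""
    return aggregate_tps * SECONDS_PER_MONTH * utilisation / 1e9

print(monthly_capacity_b(1_140))  # ~2.66 -> Llama 3 8B FP8 row
print(monthly_capacity_b(80))     # ~0.19 -> Llama 3 70B AWQ row
```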
MAU thresholds by product shape
Tokens-per-month is hard to forecast in the abstract; MAU is easier. For typical product shapes, here is the MAU at which the 4090 starts beating each major API at 2:1 in:out blend.
| Product shape | Tokens/MAU/mo | MAU to break-even GPT-4o | MAU to break-even Sonnet | MAU to break-even Haiku | MAU 4090 cap (8B FP8) |
|---|---|---|---|---|---|
| Casual chatbot | ~50,000 | 2,800 | 2,000 | 24,000 | ~57,000 |
| Support assistant | ~200,000 | 700 | 500 | 6,000 | ~14,000 |
| RAG knowledge worker | ~500,000 | 280 | 200 | 2,400 | ~5,700 |
| Agent power-user | ~1,500,000 | 95 | 67 | 800 | ~1,900 |
| Coding assistant | ~2,500,000 | 56 | 40 | 484 | ~1,140 |
Two takeaways. First, MAU thresholds are smaller than most teams expect: a coding-assistant product with 100 paying MAU on GPT-4o is already losing money against the dedicated alternative. Second, the 4090’s MAU cap depends on which model you self-host; for a coding assistant on 8B FP8 the cap is ~1,140 MAU per card before you need a second box. See concurrent users for derivation and coding assistant for that vertical.
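The MAU thresholds are simply the break-even volumes divided by per-user consumption; a sketch reproducing two cells of the table:

```python
def mau_threshold(break_even_m_tokens: float, tokens_per_mau_month: float) -> float:
    """MAU at which forecast volume crosses an API's break-even point."""
    return break_even_m_tokens * 1e6 / tokens_per_mau_month

print(mau_threshold(140, 50_000))     # 2800.0 -> casual chatbot vs GPT-4o
print(mau_threshold(100, 2_500_000))  # 40.0   -> coding assistant vs Sonnet
```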
Sanity checks before you commit
Three checks before you sign the order:
- Quality match: does the open-weight model do your task at acceptable quality? Build a 100-prompt eval (real production prompts, not synthetic) and run it through both options before committing. Score by your domain metric, not generic benchmarks.
- Concurrency: does your peak request rate fit inside the 4090’s batch window? Aggregate t/s assumes good batching; bursty workloads need bigger headroom. p95 traffic should be at most 70% of nominal capacity (a headroom check is sketched after this list).
- Latency floor: 70B AWQ on the 4090 has ~80 ms TTFT and 22-24 t/s decode. If your UX needs sub-30 ms TTFT or sub-200 ms full responses, switch to a smaller model (8B FP8 has ~30 ms TTFT and 198 t/s) or evaluate the 5090.
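A minimal form of the concurrency check; the 70% headroom figure is the rule of thumb above, and the demand numbers are made-up illustrations:

```python
def fits_headroom(p95_demand_tps: float, nominal_tps: float,
                  headroom: float = 0.70) -> bool:
    """True if p95 token demand stays within the headroom fraction of nominal."""
    return p95_demand_tps <= headroom * nominal_tps

# Hypothetical peaks against the 8B FP8 card (1,140 t/s nominal)
print(fits_headroom(720, 1_140))  # True:  720 <= 798
print(fits_headroom(900, 1_140))  # False: shed load or add a card
```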
When the formula lies
| Situation | Why pure $/M is misleading | What to do instead |
|---|---|---|
| Strict data residency / GDPR | Self-host wins regardless of volume; API may be non-starter | Self-host at any volume; pick the smallest viable open weight |
| Spiky traffic, low average | API better; you pay only for what you use | Stay on API until baseline volume rises; revisit quarterly |
| Long-context heavy (>32k) | 4090 can do 64k on 8B FP8 but 70B caps at 16k | If you need 70B at 64k, use a larger-VRAM deployment or the API |
| Agentic loops with retries | Token counts balloon 3-10x; recompute break-even on real traffic | Measure 7 days of real production tokens, not theoretical |
| Need GPT-4-level reasoning | Open weights still trail on hardest math/logic tasks | Hybrid: cheap open-weight + API fallback for hard cases |
| Sub-second UX with first-token latency | API often faster TTFT than self-host on small models | Streaming + smaller open weight, or stay on API |
| One-off experiments < 10 M tokens | API convenience dominates; setup cost wasted | Use API; don’t capitalise infrastructure for prototypes |
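For the hybrid row, the routing layer can be almost trivially thin. A sketch, where `local_llm`, `api_llm`, and the `is_hard` signal are hypothetical stand-ins for your own clients and difficulty heuristic:

```python
from typing import Callable

def hybrid_complete(prompt: str,
                    local_llm: Callable[[str], str],
                    api_llm: Callable[[str], str],
                    is_hard: Callable[[str], bool]) -> str:
    """Serve the cheap open weight by default; escalate hard cases to the API."""
    return api_llm(prompt) if is_hard(prompt) else local_llm(prompt)
```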
Decision matrix and verdict
| Monthly volume | Quality bar | Best option |
|---|---|---|
| < 50 M tokens | any | Hosted API |
| 50-150 M tokens | match Llama 70B | Close call; self-host wins on cost, API on convenience |
| 150-500 M tokens | match Qwen 32B | 4090 self-host clear win |
| 500 M-1.5 B tokens | match Qwen 14B | 4090 with 8B/14B comfortably wins |
| 1.5-2.5 B tokens | match 8B | Single 4090 near cap; provision a second early |
| > 3 B tokens | any | Multiple 4090s or upgrade to 5090 |
Verdict
For most production workloads above 100-200 M tokens/month, dedicated 4090 plus an open-weight model is the cheapest credible option in 2026, particularly when you can replace GPT-4o or Sonnet with Qwen 2.5 32B or Llama 3 70B AWQ at acceptable quality. Below 50 M tokens/month a hosted API wins on convenience. Between 50 and 150 M, run the formula plus a quality eval; the answer is rarely close once both are honest. For the full TCO including engineer time, see 12-month ROI analysis.
Crunch the numbers, then pull the trigger
Predictable monthly billing on a dedicated 4090, no token meter. UK dedicated hosting.
Order the RTX 4090 24GB

See also: vs OpenAI API cost, vs Anthropic API cost, vs Together AI, 12-month ROI, monthly cost, tokens per watt, 5060 Ti calculator.