
RTX 4090 24GB Break-Even Calculator: Self-Host vs API with Worked Examples and MAU Thresholds

The one formula, the inputs, comprehensive break-even tables for every popular API, MAU thresholds, capacity ceilings and the situations where the formula lies. Crunch the numbers before you commit.

Most “should I self-host” debates end with a vibe rather than a number. They shouldn’t. The break-even between an RTX 4090 24GB dedicated server and any hosted API is a single division: monthly fixed cost divided by the API’s blended rate. This article gives you the formula, the inputs, worked examples for every popular API tier, capacity tables for the open-weight models the 4090 actually runs, monthly active user (MAU) thresholds for typical product shapes, and the situations where the headline formula misleads. For the wider hardware menu, see dedicated GPU hosting.


The one formula

break_even_tokens_per_month = monthly_fixed_cost / api_blended_$_per_M_tokens

worked: 700 / 5.00 = 140 M tokens vs GPT-4o
        700 / 0.30 = 2,333 M tokens vs GPT-4o-mini

If forecast volume exceeds break_even_tokens_per_month, self-host. If not, stay on the API. Then sanity-check that the 4090 can physically deliver that volume at acceptable latency and quality, and that you have not picked a workload where the formula misleads (long context, sub-30 ms TTFT, hard reasoning).

Inputs you need

| Input | Typical value | Notes |
|---|---|---|
| 4090 monthly cost | $700 (~£550 midpoint) | Flat dedicated; no metering |
| API input price | $0.15–$15.00 / M | Varies wildly |
| API output price | 3–5× input | Output dominates the blended rate for chat |
| Input:output ratio | 2:1 typical, 4:1 RAG, 1:2 agent loops | Measure your actual ratio |
| Forecast tokens/month | Your number | Extrapolate a month from a 7-day measurement |
| Self-host model | Llama 3 8B / 70B, Qwen 2.5 14B / 32B | Pick the cheapest that meets quality |
| Quality bar | Your eval suite | Build it before you switch |

Compute the API blended rate as (2 × input + output) / 3 for a 2:1 in:out ratio, or weight it to your measured ratio. For agent backends, output usually dominates because tool-call composition is mostly model-generated text.
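
As a sketch, here is that arithmetic in Python; the prices are the worked values from this article, not live rates, so substitute your own:

```python
def blended_rate(input_per_m: float, output_per_m: float,
                 in_parts: int = 2, out_parts: int = 1) -> float:
    """Blended $/M tokens, weighted by the input:output token ratio."""
    return (in_parts * input_per_m + out_parts * output_per_m) / (in_parts + out_parts)

def break_even_m_tokens(monthly_fixed_cost: float, blended: float) -> float:
    """Monthly volume (millions of tokens) above which self-hosting wins."""
    return monthly_fixed_cost / blended

# Worked values from this article (not live prices):
print(break_even_m_tokens(700, blended_rate(2.50, 10.00)))  # GPT-4o      -> 140.0
print(break_even_m_tokens(700, blended_rate(0.15, 0.60)))   # GPT-4o mini -> ~2333.3
```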

Worked examples by API

The 4090 dedicated UK server costs ~$700/month flat. Break-even tokens for each major hosted API at a 2:1 input:output blend:

| API tier | Input $/M | Output $/M | Blended $/M (2:1) | Break-even tokens/mo | Daily tokens to break even |
|---|---|---|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 | $5.00 | 140 M | ~4.7 M |
| OpenAI GPT-4o mini | $0.15 | $0.60 | $0.30 | 2,333 M | ~78 M |
| OpenAI GPT-4 Turbo | $10.00 | $30.00 | $16.67 | 42 M | ~1.4 M |
| OpenAI GPT-3.5 Turbo | $0.50 | $1.50 | $0.83 | 843 M | ~28 M |
| Anthropic Claude Sonnet | $3.00 | $15.00 | $7.00 | 100 M | ~3.3 M |
| Anthropic Claude Haiku | $0.25 | $1.25 | $0.58 | 1,207 M | ~40 M |
| Anthropic Claude Opus | $15.00 | $75.00 | $35.00 | 20 M | ~0.67 M |
| Together AI Llama 70B | $0.88 | $0.88 | $0.88 | 795 M | ~26 M |

Worked example: support agent migration

A support team running 1,200 chats/day, 8 turns each, 350 tokens average per turn generates ~3.4 M tokens/day, or ~100 M tokens/month. On Sonnet that is $700/month: roughly the price of a dedicated 4090, so they break even today. If traffic doubles inside a year, the Sonnet bill grows to $1,400/month while the card stays flat at $700, a $700/month saving at full year-2 volume. Quality match is the gating concern: build a 100-prompt eval and run Llama 3 70B AWQ against Sonnet before pulling the trigger.
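
The same forecast in a few lines, using the numbers from the example above; swap in your own traffic:

```python
# Support-agent token forecast from the worked example above.
chats_per_day, turns_per_chat, tokens_per_turn = 1_200, 8, 350

tokens_per_day = chats_per_day * turns_per_chat * tokens_per_turn  # 3.36 M
tokens_per_month = tokens_per_day * 30                             # ~100.8 M

sonnet_blended = 7.00  # $/M at 2:1, from the table above
api_bill = tokens_per_month / 1e6 * sonnet_blended                 # ~$706/month
print(f"{tokens_per_month / 1e6:.0f} M tokens/mo: API ${api_bill:.0f} vs 4090 $700 flat")
```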

4090 capacity by model

Sustainable monthly token output assumes 90% utilisation; bursty workloads need bigger headroom and will see lower effective throughput. Aggregate t/s figures are the saturated batch numbers from the underlying benchmark suite.

| Self-host model | Aggregate t/s | Tokens/month at 90% util | Break-evens it covers |
|---|---|---|---|
| Llama 3 8B FP8 + FP8 KV | 1,140 (sat. batch 64) | ~2.66 B | GPT-4o mini, GPT-3.5, Haiku, all higher |
| Mistral 7B FP8 | ~1,200 | ~2.80 B | GPT-4o mini, GPT-3.5, Haiku, all higher |
| Phi-3 mini FP8 | ~2,000 | ~4.66 B | Even GPT-4o mini at peak volume |
| Mistral Nemo 12B FP8 | ~750 | ~1.75 B | GPT-3.5, Haiku, all higher |
| Qwen 2.5 14B AWQ | ~720 | ~1.68 B | GPT-3.5, all higher |
| Qwen 2.5 32B AWQ | ~280 | ~654 M | Sonnet, GPT-4o, Mistral Large, all higher |
| Mixtral 8x7B AWQ | ~340 | ~793 M | Sonnet, GPT-4o, all higher |
| Llama 3 70B AWQ | ~80 | ~187 M | GPT-4o, Sonnet, GPT-4 Turbo, Opus |

Sustained Llama 3 8B FP8 capacity is ~2.85 B tokens/month at 100% utilisation, comfortably above the 2.33 B/month break-even against GPT-4o mini. Qwen 32B's ~654 M/month ceiling is more than four times the GPT-4o break-even of 140 M, leaving real headroom before the card saturates. See the underlying benchmarks: 8B, Qwen 14B, Qwen 32B, 70B INT4.
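
The capacity column is just throughput × seconds × utilisation. A minimal sketch, assuming a 30-day month:

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assume a 30-day month

def tokens_per_month_b(aggregate_tps: float, utilisation: float = 0.90) -> float:
    """Billions of tokens/month a card can emit at the given utilisation."""
    return aggregate_tps * SECONDS_PER_MONTH * utilisation / 1e9

print(tokens_per_month_b(1_140))  # Llama 3 8B FP8   -> ~2.66 B
print(tokens_per_month_b(280))    # Qwen 2.5 32B AWQ -> ~0.65 B
```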

MAU thresholds by product shape

Tokens-per-month is hard to forecast in the abstract; MAU is easier. For typical product shapes, here is the MAU at which the 4090 starts beating each major API at 2:1 in:out blend.

| Product shape | Tokens/MAU/mo | MAU to break even vs GPT-4o | vs Sonnet | vs Haiku | 4090 MAU cap (8B FP8) |
|---|---|---|---|---|---|
| Casual chatbot | ~50,000 | 2,800 | 2,000 | 24,000 | ~57,000 |
| Support assistant | ~200,000 | 700 | 500 | 6,000 | ~14,000 |
| RAG knowledge worker | ~500,000 | 280 | 200 | 2,400 | ~5,700 |
| Agent power-user | ~1,500,000 | 95 | 67 | 800 | ~1,900 |
| Coding assistant | ~2,500,000 | 56 | 40 | 484 | ~1,140 |

Two takeaways. First, MAU thresholds are smaller than most teams expect: a coding-assistant product with 100 paying MAU on GPT-4o is already losing money against the dedicated alternative. Second, the 4090’s MAU cap depends on which model you self-host; for a coding assistant on 8B FP8 the cap is ~1,140 MAU per card before you need a second box. See concurrent users for derivation and coding assistant for that vertical.
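
Each threshold in the table is simply the break-even volume divided by tokens-per-MAU; a quick sketch for checking your own product shape:

```python
def mau_threshold(break_even_m_tokens: float, tokens_per_mau: float) -> float:
    """MAU count at which self-hosting beats the given API tier."""
    return break_even_m_tokens * 1e6 / tokens_per_mau

print(mau_threshold(140, 2_500_000))  # coding assistant vs GPT-4o  -> 56.0
print(mau_threshold(100, 200_000))    # support assistant vs Sonnet -> 500.0
```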

Sanity checks before you commit

Three checks before you sign the order:

  1. Quality match: does the open-weight do your task at acceptable quality? Build a 100-prompt eval (real production prompts, not synthetic) and run it through both options before committing. Score by your domain metric, not generic benchmarks.
  2. Concurrency: does your peak request rate fit inside the 4090’s batch window? Aggregate t/s assumes good batching; bursty workloads need bigger headroom. p95 traffic should be at most 70% of nominal capacity.
  3. Latency floor: 70B AWQ on the 4090 has ~80 ms TTFT and 22-24 t/s decode. If your UX needs sub-30 ms TTFT or sub-200 ms full responses, switch to a smaller model (8B FP8 has ~30 ms TTFT and 198 t/s) or evaluate the 5090.

When the formula lies

| Situation | Why pure $/M is misleading | What to do instead |
|---|---|---|
| Strict data residency / GDPR | Self-host wins regardless of volume; API may be a non-starter | Self-host at any volume; pick the smallest viable open weight |
| Spiky traffic, low average | API is better; you pay only for what you use | Stay on the API until baseline volume rises; revisit quarterly |
| Long-context heavy (>32k) | 4090 can do 64k on 8B FP8 but 70B caps at 16k | If you need 70B at 64k, use a denser deployment or the API |
| Agentic loops with retries | Token counts balloon 3–10×; recompute break-even on real traffic | Measure 7 days of real production tokens, not theoretical |
| Need GPT-4-level reasoning | Open weights still trail on the hardest math/logic tasks | Hybrid: cheap open weight + API fallback for hard cases |
| Sub-second UX, first-token latency | API often has faster TTFT than self-host on small models | Streaming + a smaller open weight, or stay on the API |
| One-off experiments < 10 M tokens | API convenience dominates; setup cost is wasted | Use the API; don’t capitalise infrastructure for prototypes |

Decision matrix and verdict

| Monthly volume | Quality bar | Best option |
|---|---|---|
| < 50 M tokens | any | Hosted API |
| 50–150 M tokens | match Llama 70B | Close call; self-host wins on cost, API on convenience |
| 150–500 M tokens | match Qwen 32B | 4090 self-host clear win |
| 500 M–1.5 B tokens | match Qwen 14B | 4090 with 8B/14B comfortably wins |
| 1.5–2.5 B tokens | match 8B | Single 4090 near cap; provision a second early |
| > 3 B tokens | any | Multiple 4090s or upgrade to 5090 |

Verdict

For most production workloads above 100-200 M tokens/month, dedicated 4090 plus an open-weight model is the cheapest credible option in 2026, particularly when you can replace GPT-4o or Sonnet with Qwen 2.5 32B or Llama 3 70B AWQ at acceptable quality. Below 50 M tokens/month a hosted API wins on convenience. Between 50 and 150 M, run the formula plus a quality eval; the answer is rarely close once both are honest. For the full TCO including engineer time, see 12-month ROI analysis.

Crunch the numbers, then pull the trigger

Predictable monthly billing on a dedicated 4090, no token meter. UK dedicated hosting.

Order the RTX 4090 24GB

See also: vs OpenAI API cost, vs Anthropic API cost, vs Together AI, 12-month ROI, monthly cost, tokens per watt, 5060 Ti calculator.
