At 100,000 queries per day, OpenAI GPT-4o at roughly $0.005 per call adds up to around $15,000 per month. Self-hosted inference on a dedicated GPU can slash that to around $0.0003 per query. Understanding your true cost per query, and how it scales with volume, is the first step toward building a profitable AI product.
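The arithmetic behind those headline figures is simple enough to check yourself; the per-query rates in this sketch are illustrative assumptions, not quoted prices:

```python
# Rough monthly spend at a fixed daily query volume.
QUERIES_PER_DAY = 100_000
DAYS_PER_MONTH = 30

api_cost_per_query = 0.005           # assumed GPT-4o rate for a ~500-token request
self_hosted_cost_per_query = 0.0003  # assumed rate on a well-utilised dedicated GPU

monthly_queries = QUERIES_PER_DAY * DAYS_PER_MONTH
print(f"API:         ${api_cost_per_query * monthly_queries:,.0f} / month")          # ~$15,000
print(f"Self-hosted: ${self_hosted_cost_per_query * monthly_queries:,.0f} / month")  # ~$900
```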
What Determines Inference Cost per Query?
Three variables dominate your per-query cost: model size (parameter count), average tokens per request (input plus output), and the GPU you run it on. A 7B-parameter model serving 500-token queries on an RTX 5090 costs a fraction of what a 70B model on an RTX 6000 Pro demands. Throughput and utilisation matter just as much: a dedicated GPU is a fixed cost, so a card kept busy at 80% utilisation via vLLM hosting spreads that cost over far more queries than one sitting idle between requests.
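A back-of-the-envelope formula ties these together: divide the GPU's hourly cost by the number of queries it actually completes in an hour. Every number in the snippet below is a placeholder assumption rather than a GigaGPU rate, but the structure is what matters:

```python
def cost_per_query(gpu_cost_per_hour: float, throughput_qps: float, utilisation: float) -> float:
    """Hourly GPU rental spread over the queries actually served in that hour."""
    queries_served_per_hour = throughput_qps * 3600 * utilisation
    return gpu_cost_per_hour / queries_served_per_hour

# Placeholder figures: a $2.00/hour GPU sustaining 10 queries/s at 70% utilisation.
print(f"${cost_per_query(2.00, 10, 0.70):.5f} per query")  # ~$0.00008
```

Lower utilisation or a pricier GPU pushes the figure up linearly, which is why keeping the card busy matters as much as choosing the right one.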
Cost per Query by Model and GPU
| Model | GPU | Avg Tokens/Query | Throughput (queries/s) | Cost per Query | Monthly Cost (100K queries/day) |
|---|---|---|---|---|---|
| Llama 3.1 8B | RTX 5090 | 512 | 38 | $0.00008 | $240 |
| Llama 3.1 8B | RTX 6000 Pro 96 GB | 512 | 72 | $0.00012 | $360 |
| Mistral 7B | RTX 6000 Pro 96 GB | 512 | 45 | $0.00010 | $300 |
| Llama 3.1 70B | RTX 6000 Pro 96 GB | 512 | 12 | $0.00058 | $1,740 |
| Llama 3.1 70B | 2x RTX 6000 Pro 96 GB | 512 | 22 | $0.00063 | $1,890 |
| Mixtral 8x7B | RTX 6000 Pro 96 GB | 512 | 28 | $0.00025 | $750 |
| DeepSeek V3 | 2x RTX 6000 Pro 96 GB | 512 | 18 | $0.00097 | $2,910 |
| Qwen 2.5 72B | 2x RTX 6000 Pro 96 GB | 512 | 20 | $0.00069 | $2,070 |
Costs assume dedicated GPU hosting at GigaGPU UK rates with sustained utilisation above 60%.
Self-Hosted vs API: The Break-Even Point
API providers such as OpenAI and Anthropic charge per token, so your bill scales linearly with volume and never stops climbing. Self-hosting inverts this: the GPU is a fixed monthly cost, so the effective price per query falls as volume grows. For a Llama 3.1 8B model on a dedicated RTX 5090, the break-even point is approximately 8,000 queries per day; beyond that, each additional query costs almost nothing until you saturate the card. Use our LLM cost calculator to model your exact scenario.
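One way to estimate your own break-even point is to compare a fixed monthly GPU rental against metered API spend at the same volume; the rental and API rates here are assumed for illustration, not list prices:

```python
def break_even_queries_per_day(gpu_monthly_cost: float, api_cost_per_query: float,
                               days_per_month: int = 30) -> float:
    """Daily volume at which a fixed GPU rental costs the same as metered API calls."""
    return gpu_monthly_cost / (api_cost_per_query * days_per_month)

# Assumed figures: ~$1,200/month for a dedicated RTX 5090 server, ~$0.005 per API query.
print(f"{break_even_queries_per_day(1_200, 0.005):,.0f} queries/day")  # 8,000
```

Above that volume the GPU is already paid for, so each extra query adds only a negligible marginal cost until the card runs out of throughput.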
Optimising Throughput to Lower Per-Query Cost
Batching requests is the single fastest way to reduce cost. Running vLLM with continuous batching can increase throughput by 3-5x compared to naive sequential inference. Quantisation (AWQ or GPTQ at 4-bit) reduces VRAM requirements and boosts tokens per second by 40-60%, letting you serve a 70B model on a single RTX 6000 Pro instead of two. KV-cache optimisation with PagedAttention further reduces memory waste during concurrent requests.
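As a minimal sketch of what this looks like in practice, the vLLM offline API below loads a 4-bit AWQ checkpoint and hands the engine a whole batch of prompts to schedule itself; continuous batching and PagedAttention are on by default in vLLM, and the model ID and parameters here are illustrative rather than a tuned production configuration:

```python
from vllm import LLM, SamplingParams

# 4-bit AWQ build of Llama 3.1 8B (illustrative checkpoint); quantisation cuts
# VRAM needs, and vLLM's continuous batching and PagedAttention handle the rest.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    gpu_memory_utilization=0.90,  # let vLLM pre-allocate 90% of VRAM for weights + KV cache
    max_model_len=4096,           # cap context length to leave more KV-cache room for batching
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Submitting prompts as one list lets the scheduler batch them on the GPU
# instead of running them sequentially.
prompts = [f"Summarise support ticket {i} in one sentence." for i in range(64)]
for output in llm.generate(prompts, params)[:3]:
    print(output.outputs[0].text.strip()[:80])
```

The same options (quantization, gpu-memory-utilization, max-model-len) carry over to vLLM's OpenAI-compatible server when you move from offline batching to online serving.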
Choosing the Right GPU for Your Query Volume
For low-volume use cases under 10,000 queries per day, an RTX 5090 or RTX 6000 Pro provides the cheapest inference for 7B-13B models. Mid-volume workloads (10K-100K queries/day) benefit from RTX 6000 Pro 96 GB GPUs running optimised serving stacks. High-volume production systems exceeding 100K queries per day should evaluate multi-GPU clusters to maximise throughput while maintaining latency SLAs under 200ms.
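To turn a daily volume into a rough GPU count, size for peak-hour traffic rather than the daily average; the peak factor, target utilisation, and per-GPU throughput below are assumptions to replace with your own measurements:

```python
import math

def gpus_needed(queries_per_day: float, per_gpu_qps: float,
                peak_factor: float = 3.0, target_utilisation: float = 0.7) -> int:
    """Rough GPU count sized for peak-hour load rather than the daily average."""
    average_qps = queries_per_day / 86_400
    peak_qps = average_qps * peak_factor  # assume the peak hour runs ~3x the daily average
    return math.ceil(peak_qps / (per_gpu_qps * target_utilisation))

# Illustrative: 1M queries/day against a 70B model measured at ~12 queries/s per GPU.
print(gpus_needed(1_000_000, per_gpu_qps=12))  # 5
```

Keeping a little headroom above the raw count also protects your latency SLA when traffic bursts arrive.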
Compare the full cost breakdown across GPUs with our GPU vs API cost comparison tool.
Reduce Your Inference Costs Today
Every query you serve through an API provider at scale is money left on the table. Moving to dedicated GPU hosting with GigaGPU gives you predictable monthly costs, full control over your inference stack, and per-query costs that can be 10-20x lower than API alternatives. Browse our open-source LLM hosting plans or estimate your savings with the LLM cost calculator.
Need a custom configuration for high-volume inference? Private AI hosting gives you isolated infrastructure tailored to your throughput requirements. Check our latest pricing analysis on the cost blog.