Cost per 1M Tokens
Self-Hosted GPU vs API Pricing — How Much Can You Save?
How Much Does LLM Inference Really Cost?
Every API call to OpenAI, Anthropic, or Google incurs a per-token charge that scales linearly with usage. For production workloads — customer-facing chatbots, document processing pipelines, code assistants — those costs compound fast. A single GPT-4o conversation averaging 4,000 tokens costs roughly $0.05 at current rates. Run 10,000 conversations a day and you’re looking at $500/day in API fees alone.
Self-hosting an open source model on a dedicated GPU eliminates per-token billing entirely. You pay a fixed monthly rate for the server, then generate as many tokens as you want. The more you use it, the cheaper each token becomes — and you get full control over your data, latency, and model choice.
This page compares the real-world cost of generating one million tokens across major API providers against the effective cost on GigaGPU’s dedicated GPU servers.
API Provider Pricing — Cost per 1M Tokens
Current per-million-token rates for popular models from OpenAI, Anthropic, and Google. Prices in USD as published by each provider.
| Provider | Model | Input / 1M Tokens | Output / 1M Tokens | Tier |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | Budget |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | Mid |
| OpenAI | GPT-4.1 mini | $0.40 | $1.60 | Budget |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Flagship |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | Mid |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | Budget |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | Flagship |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Mid |
| Google | Gemini 2.5 Flash | $0.15 | $0.60 | Budget |
Prices sourced from official provider pricing pages as of April 2026. Standard rates shown — batch and caching discounts may reduce costs for some workloads. All prices in USD. Long-context surcharges (>200K tokens) excluded for clarity.
Self-Hosted Cost per 1M Tokens — GigaGPU Servers
Effective cost per million tokens when you self-host on a dedicated GPU server. Based on estimated throughput running LLaMA 3 8B (Q4_K_M) 24/7 via Ollama on a single GPU.
| GPU | VRAM | ~tok/s | ~Tokens/Month | Server Cost/Mo | Effective $/1M Tokens |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | ~18 | ~46.7M | £69.00 | ~$1.86 |
| RTX 4060 | 8 GB | ~52 | ~134.8M | £79.00 | ~$0.74 |
| RTX 4060 Ti 16GB | 16 GB | ~68 | ~176.3M | £99.00 | ~$0.71 |
| RTX 3090 | 24 GB | ~85 | ~220.3M | £139.00 | ~$0.79 |
| RX 9070 XT | 16 GB | ~95 | ~246.2M | £129.00 | ~$0.66 |
| Radeon AI Pro R9700 | 32 GB | ~110 | ~285.1M | £199.00 | ~$0.88 |
| RTX 5080 | 16 GB | ~140 | ~362.9M | £249.00 | ~$0.86 |
| RTX 5090 | 32 GB | ~220 | ~570.2M | £399.00 | ~$0.88 |
| RTX 6000 PRO | 96 GB | ~250 | ~648.0M | £999.00 | ~$1.94 |
Effective cost calculated as: (monthly server price in USD) ÷ (tokens generated per month at 24/7 single-user throughput). GBP to USD conversion at approximate £1 = $1.26. Running larger or quantised models will change throughput. Real-world utilisation below 100% will increase effective cost per token.
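The calculation above can be sketched in a few lines of Python, using figures straight from the table and the same approximate £1 = $1.26 rate:

```python
GBP_TO_USD = 1.26                       # approximate rate used throughout this page
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 2,592,000 seconds in a 30-day month

def effective_cost_per_1m(tok_per_s: float, price_gbp: float) -> float:
    """Effective USD cost per 1M tokens at 24/7 single-user utilisation."""
    tokens_per_month = tok_per_s * SECONDS_PER_MONTH
    monthly_usd = price_gbp * GBP_TO_USD
    return monthly_usd / (tokens_per_month / 1_000_000)

# RTX 3090 from the table: ~85 tok/s at £139/month
print(round(effective_cost_per_1m(85, 139.00), 2))  # → 0.79 ($/1M tokens)
```

Multiplying the utilisation factor into `tokens_per_month` (e.g. × 0.5 for 50% duty cycle) shows how quickly idle time raises the effective rate.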
Cost per 1M Output Tokens — Visual Comparison
Side-by-side view of API output pricing versus effective self-hosted cost on GigaGPU hardware. Lower is better.
GPU costs assume 24/7 utilisation running LLaMA 3 8B Q4_K_M. Effective cost improves with higher utilisation. API prices are standard output rates in USD.
Example: 100M Tokens per Month
What a typical production workload of 100 million output tokens per month costs on an API versus a GigaGPU dedicated server.
GPT-4o API
100M output tokens × $10.00 per 1M tokens = $1,000/month. This scales linearly — 200M tokens = $2,000, 500M tokens = $5,000.
RTX 3090 · GigaGPU
A fixed £139.00/month bill. The RTX 3090 generates up to ~220M tokens/month at 24/7 utilisation. No per-token charges, no API limits, no surprises.
At production volumes, self-hosting on a dedicated GPU server is dramatically cheaper than API billing — with the added benefit of full data privacy, zero rate limits, and a predictable monthly bill.
Cost Calculator
Estimate your monthly spend on an API versus a GigaGPU server.
Estimate Your Savings
Why Self-Host Instead of Using an API?
Beyond cost, there are compelling operational reasons to run your own inference on dedicated hardware.
Predictable Monthly Costs
A fixed server bill every month with no surprises. No per-token billing, no usage spikes, and no overage charges. Budget with confidence.
Full Data Privacy
Your prompts and outputs never leave your server. No third-party logging, no data used for training, and full UK data residency for compliance.
No Rate Limits
API providers impose tokens-per-minute and requests-per-minute caps. Self-hosting means you’re limited only by your GPU’s throughput — which you control.
Any Model, Any Version
Run LLaMA, Mistral, Qwen, DeepSeek, or any open source model. Pin exact versions, fine-tune with your data, and switch models whenever you want.
Lower Latency
Dedicated hardware means no multi-tenant queueing. Your requests go straight to the GPU for consistent, low-latency inference every time.
Scales to Zero Cost Per Token
The more tokens you generate, the cheaper each one becomes. At 100% utilisation, mid-range GPUs deliver tokens at under $1 per million.
No Vendor Lock-In
API providers can change pricing, deprecate models, or alter terms of service at any time. With your own server, you own the stack — switch models, frameworks, or inference engines whenever you want without rewriting a single integration.
Fine-Tuning & Customisation
Train LoRA adapters, merge custom weights, or run fully fine-tuned models tailored to your domain. API providers limit you to their model catalogue — dedicated hardware lets you build and deploy models that are uniquely yours.
Benchmark Methodology
How We Calculated Self-Hosted Costs
Model: LLaMA 3 8B quantised to Q4_K_M, running via Ollama on a single GPU with default settings.
Throughput: Single-user, single-GPU token generation speed (tok/s) measured under sustained load. These figures match our Tokens per Second benchmark page.
Monthly tokens: tok/s × 60 × 60 × 24 × 30 = tokens per 30-day month at 100% utilisation.
Effective cost: Server price (converted GBP → USD at ~£1 = $1.26) ÷ monthly token output = cost per million tokens.
Important caveats: Real-world utilisation will be below 100%, which raises effective per-token cost. Larger models (13B, 33B, 70B) produce fewer tokens per second. Concurrent users reduce per-user throughput. The figures above represent a best-case baseline for comparison — your actual cost will depend on model size, quantisation level, and utilisation rate.
API prices: Sourced from official provider pricing pages (OpenAI, Anthropic, Google) as of April 2026. Standard output token rates shown. Batch API and prompt caching discounts are available from most providers but are excluded here for a like-for-like comparison.
Frequently Asked Questions
How is the effective cost per million tokens calculated?
We take the GPU’s estimated throughput in tokens per second, multiply by the number of seconds in a 30-day month (2,592,000), and divide the server’s monthly price (converted to USD) by that total. This gives the cost to generate one million tokens at maximum utilisation. Real-world costs will be higher if your server isn’t running inference around the clock.
What model are these figures based on?
The throughput and cost figures are benchmarked on LLaMA 3 8B at Q4_K_M quantisation. Larger models like 13B, 33B, or 70B will produce fewer tokens per second and therefore have a higher effective cost per million tokens. Smaller or more aggressively quantised models may be faster. Use the calculator above to estimate costs for your specific workload.
Is self-hosting always cheaper than an API?
Not always. If your usage is very low — say a few thousand tokens per day — an API with pay-per-token billing may be cheaper because you’re not paying for idle server time. Self-hosting becomes more cost-effective as utilisation increases. As a rough guide, if you’re generating more than about 10–20 million tokens per month consistently, a dedicated GPU will typically save you money compared to flagship API models.
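The break-even point quoted above can be derived directly: divide the server’s monthly cost (in USD) by the API’s per-million-token rate. A quick sketch, using the RTX 3090 and GPT-4o output pricing from this page:

```python
GBP_TO_USD = 1.26  # approximate rate used throughout this page

def breakeven_tokens_per_month(server_gbp: float, api_usd_per_1m: float) -> float:
    """Monthly token volume above which a fixed server bill beats API billing."""
    monthly_usd = server_gbp * GBP_TO_USD
    return monthly_usd / api_usd_per_1m * 1_000_000

# RTX 3090 at £139/month vs GPT-4o output at $10.00 per 1M tokens
print(breakeven_tokens_per_month(139.00, 10.00))  # → 17514000.0 (≈17.5M tokens/month)
```

Against cheaper API tiers the break-even volume rises — versus Claude Haiku 4.5 at $5.00/1M output, the same server needs roughly twice the volume to pay for itself.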
Are open source models as good as GPT-4o or Claude?
Proprietary frontier models like GPT-4o and Claude Opus are generally more capable on complex reasoning and creative tasks. However, open source models have closed the gap significantly — models like LLaMA 3, Mistral, Qwen, and DeepSeek perform extremely well for production use cases including chatbots, document summarisation, code generation, and RAG pipelines. For many workloads, the quality difference is negligible while the cost difference is enormous.
Can I run more than one model on a single server?
Yes. Tools like Ollama and vLLM support loading multiple models and switching between them. The constraint is VRAM — you need enough GPU memory to hold each active model. For example, a 24GB RTX 3090 can hold a 7B model and a smaller embedding model simultaneously. For multi-model production stacks, GPUs with 32GB+ VRAM like the RTX 5090 or RTX 6000 PRO are recommended.
Don’t API providers offer discounts?
Yes — most providers offer batch processing (typically 50% off) and prompt caching (up to 90% off repeated inputs). These can significantly reduce costs for eligible workloads. However, batch processing introduces latency (up to 24 hours for results), and caching only helps with repeated prompts. For real-time, high-volume inference with varied inputs, self-hosting remains substantially cheaper.
How quickly can I get started?
Most servers are provisioned within a few hours. Once you have SSH access, installing Ollama or vLLM takes a few minutes, and downloading a model is limited only by network speed. You can typically go from order to running inference within the same day.
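A minimal setup sketch with Ollama, assuming a fresh Linux server with SSH access (the model tag shown is one of Ollama’s published LLaMA 3 quantisations — check the Ollama library for current tags):

```shell
# Install Ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaMA 3 8B at Q4_K_M — the quantisation benchmarked on this page
ollama pull llama3:8b-instruct-q4_K_M

# Quick interactive test
ollama run llama3:8b-instruct-q4_K_M "Say hello in five words."

# Or query the local REST API (Ollama listens on port 11434 by default)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b-instruct-q4_K_M", "prompt": "Say hello in five words."}'
```

From there, any OpenAI-client-style integration can be pointed at your own server instead of a third-party API.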
Stop Paying Per Token
Deploy a dedicated GPU server and generate unlimited tokens at a fixed monthly cost. No contracts, cancel any time.
Browse GPU Servers
LLM Hosting Guide →