Cost per 1M Tokens
Self-Hosted GPU vs API Pricing — How Much Can You Save?
How Much Does LLM Inference Really Cost?
Every API call to OpenAI, Anthropic, or Google incurs a per-token charge that scales linearly with usage. For production workloads — customer-facing chatbots, document processing pipelines, code assistants — those costs compound fast. A single GPT-4o conversation averaging 4,000 tokens costs roughly $0.05 at current rates. Run 10,000 conversations a day and you’re looking at $500/day in API fees alone.
Self-hosting an open source model on a dedicated GPU eliminates per-token billing entirely. You pay a fixed monthly rate for the server, then generate as many tokens as you want. The more you use it, the cheaper each token becomes — and you get full control over your data, latency, and model choice.
This page compares the real-world cost of generating one million tokens across major API providers against the effective cost on GigaGPU’s dedicated GPU servers.
API Provider Pricing — Cost per 1M Tokens
Current per-million-token rates for popular models from OpenAI, Anthropic, and Google. Prices in USD as published by each provider.
| Provider | Model | Input / 1M Tokens | Output / 1M Tokens | Tier |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | Flagship |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | Budget |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | Mid |
| OpenAI | GPT-4.1 mini | $0.40 | $1.60 | Budget |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Flagship |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | Mid |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | Budget |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | Flagship |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Mid |
| Google | Gemini 2.5 Flash | $0.15 | $0.60 | Budget |
Prices sourced from official provider pricing pages as of April 2026. Standard rates shown — batch and caching discounts may reduce costs for some workloads. All prices in USD. Long-context surcharges (>200K tokens) excluded for clarity.
Self-Hosted Cost per 1M Tokens — GigaGPU Servers
Effective cost per million tokens when you self-host on a dedicated GPU server. Based on estimated throughput running LLaMA 3 8B (Q4_K_M) 24/7 via Ollama on a single GPU.
| GPU | VRAM | ~tok/s | ~Tokens/Month | Server Cost/Mo | Effective $/1M Tokens |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | ~18 | ~46.7M | £69.00 | ~$1.86 |
| RTX 4060 | 8 GB | ~52 | ~134.8M | £79.00 | ~$0.74 |
| RTX 4060 Ti 16GB | 16 GB | ~68 | ~176.3M | £99.00 | ~$0.71 |
| RTX 3090 | 24 GB | ~85 | ~220.3M | £139.00 | ~$0.79 |
| RX 9070 XT | 16 GB | ~95 | ~246.2M | £129.00 | ~$0.66 |
| Radeon AI Pro R9700 | 32 GB | ~110 | ~285.1M | £199.00 | ~$0.88 |
| RTX 5080 | 16 GB | ~140 | ~362.9M | £249.00 | ~$0.86 |
| RTX 5090 | 32 GB | ~220 | ~570.2M | £399.00 | ~$0.88 |
| RTX 6000 PRO | 96 GB | ~250 | ~648.0M | £999.00 | ~$1.94 |
Effective cost calculated as: (monthly server price in USD) ÷ (tokens generated per month at 24/7 single-user throughput). GBP to USD conversion at approximate £1 = $1.26. Running larger or quantised models will change throughput. Real-world utilisation below 100% will increase effective cost per token.
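The calculation above can be sketched in a few lines of Python, using figures straight from the table and the same approximate £1 = $1.26 rate:

```python
GBP_TO_USD = 1.26                       # approximate rate used throughout this page
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 2,592,000 seconds in a 30-day month

def effective_cost_per_1m(tok_per_s: float, price_gbp: float) -> float:
    """Effective USD cost per 1M tokens at 24/7 single-user utilisation."""
    tokens_per_month = tok_per_s * SECONDS_PER_MONTH
    monthly_usd = price_gbp * GBP_TO_USD
    return monthly_usd / (tokens_per_month / 1_000_000)

# RTX 3090 from the table: ~85 tok/s at £139/month
print(round(effective_cost_per_1m(85, 139.00), 2))  # → 0.79 ($/1M tokens)
```

Multiplying the utilisation factor into `tokens_per_month` (e.g. × 0.5 for 50% duty cycle) shows how quickly idle time raises the effective rate.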
Cost per 1M Output Tokens — Visual Comparison
Side-by-side view of API output pricing versus effective self-hosted cost on GigaGPU hardware. Lower is better.
GPU costs assume 24/7 utilisation running LLaMA 3 8B Q4_K_M. Effective cost improves with higher utilisation. API prices are standard output rates in USD.
Example: 100M Tokens per Month
What a typical production workload of 100 million output tokens per month costs on an API versus a GigaGPU dedicated server.
GPT-4o API
100M output tokens × $10.00 per 1M tokens = $1,000/month. This scales linearly — 200M tokens = $2,000, 500M tokens = $5,000.
RTX 3090 · GigaGPU
A fixed £139.00/month bill. The RTX 3090 generates up to ~220M tokens/month at 24/7 utilisation. No per-token charges, no API limits, no surprises.
At production volumes, self-hosting on a dedicated GPU server is dramatically cheaper than API billing — with the added benefit of full data privacy, zero rate limits, and a predictable monthly bill.
Cost Calculator
Estimate your monthly spend on an API versus a GigaGPU server.
Estimate Your Savings
Why Self-Host Instead of Using an API?
Beyond cost, there are compelling operational reasons to run your own inference on dedicated hardware.
Predictable Monthly Costs
A fixed server bill every month with no surprises. No per-token billing, no usage spikes, and no overage charges. Budget with confidence.
Full Data Privacy
Your prompts and outputs never leave your server. No third-party logging, no data used for training, and full UK data residency for compliance.
No Rate Limits
API providers impose tokens-per-minute and requests-per-minute caps. Self-hosting means you’re limited only by your GPU’s throughput — which you control.
Any Model, Any Version
Run LLaMA, Mistral, Qwen, DeepSeek, or any open source model. Pin exact versions, fine-tune with your data, and switch models whenever you want.
Lower Latency
Dedicated hardware means no multi-tenant queueing. Your requests go straight to the GPU for consistent, low-latency inference every time.
Scales to Zero Cost Per Token
The more tokens you generate, the cheaper each one becomes. At 100% utilisation, mid-range GPUs deliver tokens at under $1 per million.
No Vendor Lock-In
API providers can change pricing, deprecate models, or alter terms of service at any time. With your own server, you own the stack — switch models, frameworks, or inference engines whenever you want without rewriting a single integration.
Fine-Tuning & Customisation
Train LoRA adapters, merge custom weights, or run fully fine-tuned models tailored to your domain. API providers limit you to their model catalogue — dedicated hardware lets you build and deploy models that are uniquely yours.
Benchmark Methodology
How We Calculated Self-Hosted Costs
Model: LLaMA 3 8B quantised to Q4_K_M, running via Ollama on a single GPU with default settings.
Throughput: Single-user, single-GPU token generation speed (tok/s) measured under sustained load. These figures match our Tokens per Second benchmark page.
Monthly tokens: tok/s × 60 × 60 × 24 × 30 = tokens per 30-day month at 100% utilisation.
Effective cost: Server price (converted GBP → USD at ~£1 = $1.26) ÷ monthly token output = cost per million tokens.
Important caveats: Real-world utilisation will be below 100%, which raises effective per-token cost. Larger models (13B, 33B, 70B) produce fewer tokens per second. Concurrent users reduce per-user throughput. The figures above represent a best-case baseline for comparison — your actual cost will depend on model size, quantisation level, and utilisation rate.
API prices: Sourced from official provider pricing pages (OpenAI, Anthropic, Google) as of April 2026. Standard output token rates shown. Batch API and prompt caching discounts are available from most providers but are excluded here for a like-for-like comparison.
Frequently Asked Questions
How is the effective cost per million tokens calculated?
We take the GPU’s estimated throughput in tokens per second, multiply by the number of seconds in a 30-day month (2,592,000), and divide the server’s monthly price (converted to USD) by that total. This gives the cost to generate one million tokens at maximum utilisation. Real-world costs will be higher if your server isn’t running inference around the clock.
What model are these figures based on?
The throughput and cost figures are benchmarked on LLaMA 3 8B at Q4_K_M quantisation. Larger models like 13B, 33B, or 70B will produce fewer tokens per second and therefore have a higher effective cost per million tokens. Smaller or more aggressively quantised models may be faster. Use the calculator above to estimate costs for your specific workload.
Is self-hosting always cheaper than an API?
Not always. If your usage is very low — say a few thousand tokens per day — an API with pay-per-token billing may be cheaper because you’re not paying for idle server time. Self-hosting becomes more cost-effective as utilisation increases. As a rough guide, if you’re generating more than about 10–20 million tokens per month consistently, a dedicated GPU will typically save you money compared to flagship API models.
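The break-even point quoted above can be derived directly: divide the server’s monthly cost (in USD) by the API’s per-million-token rate. A quick sketch, using the RTX 3090 and GPT-4o output pricing from this page:

```python
GBP_TO_USD = 1.26  # approximate rate used throughout this page

def breakeven_tokens_per_month(server_gbp: float, api_usd_per_1m: float) -> float:
    """Monthly token volume above which a fixed server bill beats API billing."""
    monthly_usd = server_gbp * GBP_TO_USD
    return monthly_usd / api_usd_per_1m * 1_000_000

# RTX 3090 at £139/month vs GPT-4o output at $10.00 per 1M tokens
print(breakeven_tokens_per_month(139.00, 10.00))  # → 17514000.0 (≈17.5M tokens/month)
```

Against cheaper API tiers the break-even volume rises — versus Claude Haiku 4.5 at $5.00/1M output, the same server needs roughly twice the volume to pay for itself.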
Are open source models as good as GPT-4o or Claude?
Proprietary frontier models like GPT-4o and Claude Opus are generally more capable on complex reasoning and creative tasks. However, open source models have closed the gap significantly — models like LLaMA 3, Mistral, Qwen, and DeepSeek perform extremely well for production use cases including chatbots, document summarisation, code generation, and RAG pipelines. For many workloads, the quality difference is negligible while the cost difference is enormous.
Can I run more than one model on a single server?
Yes. Tools like Ollama and vLLM support loading multiple models and switching between them. The constraint is VRAM — you need enough GPU memory to hold each active model. For example, a 24GB RTX 3090 can hold a 7B model and a smaller embedding model simultaneously. For multi-model production stacks, GPUs with 32GB+ VRAM like the RTX 5090 or RTX 6000 PRO are recommended.
Don’t API providers offer discounts?
Yes — most providers offer batch processing (typically 50% off) and prompt caching (up to 90% off repeated inputs). These can significantly reduce costs for eligible workloads. However, batch processing introduces latency (up to 24 hours for results), and caching only helps with repeated prompts. For real-time, high-volume inference with varied inputs, self-hosting remains substantially cheaper.
How quickly can I get started?
Most servers are provisioned within a few hours. Once you have SSH access, installing Ollama or vLLM takes a few minutes, and downloading a model is limited only by network speed. You can typically go from order to running inference within the same day.
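A minimal setup sketch with Ollama, assuming a fresh Linux server with SSH access (the model tag shown is one of Ollama’s published LLaMA 3 quantisations — check the Ollama library for current tags):

```shell
# Install Ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaMA 3 8B at Q4_K_M — the quantisation benchmarked on this page
ollama pull llama3:8b-instruct-q4_K_M

# Quick interactive test
ollama run llama3:8b-instruct-q4_K_M "Say hello in five words."

# Or query the local REST API (Ollama listens on port 11434 by default)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b-instruct-q4_K_M", "prompt": "Say hello in five words."}'
```

From there, any OpenAI-client-style integration can be pointed at your own server instead of a third-party API.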
Stop Paying Per Token
Deploy a dedicated GPU server and generate unlimited tokens at a fixed monthly cost. No contracts, cancel any time.
Browse GPU Servers
LLM Hosting Guide →