At 100,000 queries per day, OpenAI GPT-4o at roughly $0.005 per call adds up to around $15,000 per month. Self-hosted inference on a dedicated GPU can slash that to around $0.0003 per query. Understanding your true cost per query, and how it scales with volume, is the first step toward building a profitable AI product.
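The arithmetic behind those headline figures is simple enough to check yourself; the per-query rates in this sketch are illustrative assumptions, not quoted prices:

```python
# Rough monthly spend at a fixed daily query volume.
QUERIES_PER_DAY = 100_000
DAYS_PER_MONTH = 30

api_cost_per_query = 0.005           # assumed GPT-4o rate for a ~500-token request
self_hosted_cost_per_query = 0.0003  # assumed rate on a well-utilised dedicated GPU

monthly_queries = QUERIES_PER_DAY * DAYS_PER_MONTH
print(f"API:         ${api_cost_per_query * monthly_queries:,.0f} / month")          # ~$15,000
print(f"Self-hosted: ${self_hosted_cost_per_query * monthly_queries:,.0f} / month")  # ~$900
```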
What Determines Inference Cost per Query?
Three variables dominate your per-query cost: model size (parameter count), average tokens per request (input plus output), and the GPU you run it on. A 7B-parameter model serving 500-token queries on an RTX 5090 costs a fraction of what a 70B model on an RTX 6000 Pro demands. Throughput and utilisation matter just as much: a dedicated GPU is a fixed cost, so a card kept busy at 80% utilisation via vLLM hosting spreads that cost over far more queries than one sitting idle between requests.
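A back-of-the-envelope formula ties these together: divide the GPU's hourly cost by the number of queries it actually completes in an hour. Every number in the snippet below is a placeholder assumption rather than a GigaGPU rate, but the structure is what matters:

```python
def cost_per_query(gpu_cost_per_hour: float, throughput_qps: float, utilisation: float) -> float:
    """Hourly GPU rental spread over the queries actually served in that hour."""
    queries_served_per_hour = throughput_qps * 3600 * utilisation
    return gpu_cost_per_hour / queries_served_per_hour

# Placeholder figures: a $2.00/hour GPU sustaining 10 queries/s at 70% utilisation.
print(f"${cost_per_query(2.00, 10, 0.70):.5f} per query")  # ~$0.00008
```

Lower utilisation or a pricier GPU pushes the figure up linearly, which is why keeping the card busy matters as much as choosing the right one.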
Cost per Query by Model and GPU
| Model | GPU | Avg Tokens/Query | Throughput (queries/s) | Cost per Query | Monthly Cost (100K queries/day) |
|---|---|---|---|---|---|
| Llama 3.1 8B | RTX 5090 | 512 | 38 | $0.00008 | $240 |
| Llama 3.1 8B | RTX 6000 Pro 96 GB | 512 | 72 | $0.00012 | $360 |
| Mistral 7B | RTX 6000 Pro 96 GB | 512 | 45 | $0.00010 | $300 |
| Llama 3.1 70B | RTX 6000 Pro 96 GB | 512 | 12 | $0.00058 | $1,740 |
| Llama 3.1 70B | 2x RTX 6000 Pro 96 GB | 512 | 22 | $0.00063 | $1,890 |
| Mixtral 8x7B | RTX 6000 Pro 96 GB | 512 | 28 | $0.00025 | $750 |
| DeepSeek V3 | 2x RTX 6000 Pro 96 GB | 512 | 18 | $0.00097 | $2,910 |
| Qwen 2.5 72B | 2x RTX 6000 Pro 96 GB | 512 | 20 | $0.00069 | $2,070 |
Costs assume dedicated GPU hosting at GigaGPU UK rates with sustained utilisation above 60%.
Self-Hosted vs API: The Break-Even Point
API providers such as OpenAI and Anthropic charge per token, so your bill scales linearly with volume and never stops climbing. Self-hosting inverts this: the GPU is a fixed monthly cost, so the effective price per query falls as volume grows. For a Llama 3.1 8B model on a dedicated RTX 5090, the break-even point is approximately 8,000 queries per day; beyond that, each additional query costs almost nothing until you saturate the card. Use our LLM cost calculator to model your exact scenario.
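One way to estimate your own break-even point is to compare a fixed monthly GPU rental against metered API spend at the same volume; the rental and API rates here are assumed for illustration, not list prices:

```python
def break_even_queries_per_day(gpu_monthly_cost: float, api_cost_per_query: float,
                               days_per_month: int = 30) -> float:
    """Daily volume at which a fixed GPU rental costs the same as metered API calls."""
    return gpu_monthly_cost / (api_cost_per_query * days_per_month)

# Assumed figures: ~$1,200/month for a dedicated RTX 5090 server, ~$0.005 per API query.
print(f"{break_even_queries_per_day(1_200, 0.005):,.0f} queries/day")  # 8,000
```

Above that volume the GPU is already paid for, so each extra query adds only a negligible marginal cost until the card runs out of throughput.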
Optimising Throughput to Lower Per-Query Cost
Batching requests is the single fastest way to reduce cost. Running vLLM with continuous batching can increase throughput by 3-5x compared to naive sequential inference. Quantisation (AWQ or GPTQ at 4-bit) reduces VRAM requirements and boosts tokens per second by 40-60%, letting you serve a 70B model on a single RTX 6000 Pro instead of two. KV-cache optimisation with PagedAttention further reduces memory waste during concurrent requests.
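As a minimal sketch of what this looks like in practice, the vLLM offline API below loads a 4-bit AWQ checkpoint and hands the engine a whole batch of prompts to schedule itself; continuous batching and PagedAttention are on by default in vLLM, and the model ID and parameters here are illustrative rather than a tuned production configuration:

```python
from vllm import LLM, SamplingParams

# 4-bit AWQ build of Llama 3.1 8B (illustrative checkpoint); quantisation cuts
# VRAM needs, and vLLM's continuous batching and PagedAttention handle the rest.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    gpu_memory_utilization=0.90,  # let vLLM pre-allocate 90% of VRAM for weights + KV cache
    max_model_len=4096,           # cap context length to leave more KV-cache room for batching
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Submitting prompts as one list lets the scheduler batch them on the GPU
# instead of running them sequentially.
prompts = [f"Summarise support ticket {i} in one sentence." for i in range(64)]
for output in llm.generate(prompts, params)[:3]:
    print(output.outputs[0].text.strip()[:80])
```

The same options (quantization, gpu-memory-utilization, max-model-len) carry over to vLLM's OpenAI-compatible server when you move from offline batching to online serving.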
Choosing the Right GPU for Your Query Volume
For low-volume use cases under 10,000 queries per day, an RTX 5090 or RTX 6000 Pro provides the cheapest inference for 7B-13B models. Mid-volume workloads (10K-100K queries/day) benefit from RTX 6000 Pro 96 GB GPUs running optimised serving stacks. High-volume production systems exceeding 100K queries per day should evaluate multi-GPU clusters to maximise throughput while maintaining latency SLAs under 200ms.
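To turn a daily volume into a rough GPU count, size for peak-hour traffic rather than the daily average; the peak factor, target utilisation, and per-GPU throughput below are assumptions to replace with your own measurements:

```python
import math

def gpus_needed(queries_per_day: float, per_gpu_qps: float,
                peak_factor: float = 3.0, target_utilisation: float = 0.7) -> int:
    """Rough GPU count sized for peak-hour load rather than the daily average."""
    average_qps = queries_per_day / 86_400
    peak_qps = average_qps * peak_factor  # assume the peak hour runs ~3x the daily average
    return math.ceil(peak_qps / (per_gpu_qps * target_utilisation))

# Illustrative: 1M queries/day against a 70B model measured at ~12 queries/s per GPU.
print(gpus_needed(1_000_000, per_gpu_qps=12))  # 5
```

Keeping a little headroom above the raw count also protects your latency SLA when traffic bursts arrive.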
Compare the full cost breakdown across GPUs with our GPU vs API cost comparison tool.
Reduce Your Inference Costs Today
Every query you serve through an API provider at scale is money left on the table. Moving to dedicated GPU hosting with GigaGPU gives you predictable monthly costs, full control over your inference stack, and per-query costs that can be 10-20x lower than API alternatives. Browse our open-source LLM hosting plans or estimate your savings with the LLM cost calculator.
Need a custom configuration for high-volume inference? Private AI hosting gives you isolated infrastructure tailored to your throughput requirements. Check our latest pricing analysis on the cost blog.