GPU Token Throughput Benchmarks for LLM Inference
Estimated tokens per second for every GPU in our hosting lineup. Use these benchmarks to find the right GPU for your AI inference workload and budget.
How Fast Is Each GPU for LLM Inference?
Token throughput — measured in tokens per second (tok/s) — is the single most important metric when choosing a GPU for large language model inference. It directly determines how fast your chatbot responds, how many concurrent users your API can handle, and how quickly batch jobs complete.
We benchmark every GPU in our dedicated server lineup under identical conditions so you can make apples-to-apples comparisons. Whether you’re running a small 3B model for internal tooling or a full 70B parameter model for a production API, the numbers below will help you pick the right card.
Tokens Per Second by GPU — Visual Chart
Estimated throughput running LLaMA 3 8B at Q4_K_M via Ollama. Single user, single GPU. Higher is faster.
Full Benchmark Table — All GPUs
Detailed throughput, VRAM, and the largest model each card can run, for every GPU we offer.
| GPU | VRAM | LLaMA 3 8B tok/s | Max Model (Q4 unless noted) | Best For |
|---|---|---|---|---|
| RTX 3050 6GB | 6 GB | ~18 tok/s | ~5B | Testing & prototyping |
| RTX 4060 8GB | 8 GB | ~52 tok/s | ~7B | Small model inference |
| RTX 5060 8GB | 8 GB | ~60 tok/s | ~7B | Budget Blackwell inference |
| RTX 4060 Ti 16GB | 16 GB | ~68 tok/s | 13B | Mid-range with extra VRAM |
| Arc Pro B70 32GB | 32 GB | ~70 tok/s | 33B | Large VRAM on a budget |
| Ryzen AI MAX+ 395 | 128 GB shared | ~80 tok/s | 70B+ | Huge models via unified memory |
| RTX 3090 24GB | 24 GB | ~85 tok/s | 33B | Best value all-rounder |
| RX 9070 XT 16GB | 16 GB | ~95 tok/s | 13B | Fast AMD for 7B–13B |
| Radeon AI Pro R9700 | 32 GB | ~110 tok/s | 70B Q2 | Pro AMD — speed + VRAM |
| RTX 5080 16GB | 16 GB | ~140 tok/s | 13B | Fastest for 7B–13B models |
| RTX 5090 32GB | 32 GB | ~220 tok/s | 70B Q2 | Fastest for production APIs |
| RTX 6000 PRO 96GB | 96 GB | ~245 tok/s | 405B Q2 | Enterprise — largest models + fastest 8B |
What Affects Token Throughput?
Token speed isn’t determined by the GPU alone. Understanding these factors will help you get the most from your hardware.
VRAM Capacity
The model must fit entirely in GPU memory for maximum speed. When a model exceeds VRAM, layers spill to system RAM and throughput drops dramatically — often by 5–10×.
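The spillover effect can be sketched in a few lines. This is a toy model, not a measurement: the 8× penalty is a hypothetical midpoint of the 5–10× range quoted above, and the weight sizes are rough Q4 estimates.

```python
def effective_tps(weights_gb: float, vram_gb: float, gpu_tps: float,
                  offload_penalty: float = 8.0) -> float:
    """Toy model: full speed if the weights fit in VRAM, otherwise a
    rough penalty for layers offloaded to system RAM (8x is an assumed
    midpoint of the 5-10x range above)."""
    if weights_gb <= vram_gb:
        return gpu_tps
    return gpu_tps / offload_penalty

# 8B at Q4 (~4.6 GB) on an RTX 3090 (24 GB): fits, full speed
print(effective_tps(4.6, 24, 85))   # 85
# 70B at Q4 (~40 GB) on the same card: spills, ~10 tok/s
print(effective_tps(40, 24, 85))    # 10.625
```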
Quantisation Level
Q4_K_M (4-bit) is the most common quantisation for inference. Lower bit depths (Q2, Q3) reduce quality but let larger models fit in the same VRAM. FP16 uses roughly four times the memory of Q4 but gives the best output quality.
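The VRAM impact of each level is easy to estimate from bytes per parameter. The figures below are rough GGUF-style averages (quantised formats carry per-block scaling overhead, so Q4_K_M averages closer to 0.57 than 0.5 bytes/param); treat them as illustrative.

```python
# Approximate bytes per parameter for common quantisation levels
# (rough GGUF-style averages, for illustration only).
BYTES_PER_PARAM = {"Q2_K": 0.35, "Q4_K_M": 0.57, "Q8_0": 1.06, "FP16": 2.0}

def weights_gb(params_b: float, quant: str) -> float:
    """Approximate weight-file size in GB for a model of params_b billion
    parameters at the given quantisation level."""
    return round(params_b * BYTES_PER_PARAM[quant], 1)

# Why a 70B model needs Q2 to squeeze into a 32 GB card:
for q in BYTES_PER_PARAM:
    print(q, weights_gb(70, q), "GB")
```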
Memory Bandwidth
LLM inference is memory-bandwidth bound, not compute bound. GPUs with higher GB/s bandwidth (like RTX 5090’s 1,792 GB/s) generate tokens faster regardless of raw TFLOPS.
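This gives a simple back-of-envelope ceiling: each generated token must stream (roughly) the full weight set from VRAM once, so maximum decode speed is bandwidth divided by model size. A sketch, using approximate figures:

```python
def max_decode_tps(bandwidth_gbps: float, weights_gb: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed:
    every token reads ~all weights from VRAM once."""
    return bandwidth_gbps / weights_gb

# RTX 5090 (~1792 GB/s) on LLaMA 3 8B at Q4 (~4.6 GB of weights):
print(round(max_decode_tps(1792, 4.6)))  # ~390 tok/s theoretical ceiling
```

Real throughput (the ~220 tok/s measured above) sits well below this ceiling because of kernel launch overhead, KV cache reads, and sampling, but the ranking between GPUs tracks bandwidth closely.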
Concurrent Users
Benchmarks show single-user throughput. With multiple concurrent requests, total throughput increases but per-user speed decreases. Frameworks like vLLM optimise for concurrency.
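The trade-off can be illustrated with a toy batching model: aggregate throughput scales with concurrency until it hits a hardware cap, after which each user's share shrinks. The 140 and 600 tok/s figures here are assumptions for illustration, not measurements.

```python
def throughput(users: int, single_tps: float = 140.0,
               cap_tps: float = 600.0) -> tuple[float, float]:
    """Toy model of batched serving: returns (aggregate, per-user) tok/s.
    Aggregate scales linearly until an assumed hardware cap; per-user
    speed then degrades as concurrency rises. Numbers are illustrative."""
    aggregate = min(users * single_tps, cap_tps)
    return aggregate, aggregate / users

print(throughput(1))   # one user: full single-user speed
print(throughput(10))  # ten users: more total tokens, slower per user
```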
Context Length
Longer context windows consume more VRAM and slow down inference. A 32K context request will be noticeably slower than a 2K context request on the same GPU and model.
Inference Runtime
Different inference engines (Ollama, vLLM, llama.cpp, TGI) have different performance characteristics. vLLM typically excels at high-concurrency serving; Ollama is easiest for single-user setups.
Model Architecture
Not all models of the same parameter count perform equally. Architecture choices like grouped-query attention (GQA), mixture-of-experts (MoE), and sliding-window attention all affect inference speed. A 7B MoE model can be faster than a dense 7B model at the same quality.
Thermal Throttling
GPUs reduce clock speeds when temperatures exceed safe limits. Sustained inference workloads generate continuous heat — proper server cooling and airflow prevent thermal throttling and maintain consistent throughput over long periods.
Prompt Length vs Generation
Processing the input prompt (prefill) and generating output tokens (decode) use the GPU differently. Long prompts create a larger initial delay before the first token appears, while decode speed determines how fast the response streams. A 10K token prompt takes significantly longer to prefill than a 100 token prompt on the same GPU.
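Splitting a request into the two phases makes the effect concrete. The prefill and decode rates below are hypothetical round numbers, not benchmarked values:

```python
def response_timing(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Returns (time_to_first_token, total_time) in seconds, splitting a
    request into prefill and decode phases. Rates are hypothetical."""
    ttft = prompt_tokens / prefill_tps
    stream = output_tokens / decode_tps
    return ttft, ttft + stream

# Same GPU and model, assumed 2000 tok/s prefill and 100 tok/s decode:
print(response_timing(10_000, 500, prefill_tps=2000, decode_tps=100))
print(response_timing(100, 500, prefill_tps=2000, decode_tps=100))
```

With these assumed rates, the 10K-token prompt waits 5 seconds before the first token appears, versus 0.05 seconds for the 100-token prompt, even though both stream the response at the same speed afterwards.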
System RAM & NVMe
While VRAM handles the model weights, system RAM and NVMe speed affect model loading times, KV cache overflow, and context management. 128GB DDR5 ensures headroom for large context windows and prevents swap-to-disk bottlenecks during inference.
KV Cache Size
The key-value cache stores attention state for each token in the context. Larger context windows and higher batch sizes rapidly increase KV cache memory usage — often consuming as much VRAM as the model weights themselves at 32K+ context lengths.
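The cache size follows directly from the model's attention shape: two tensors (K and V) per layer, per cached token. A sketch with defaults that approximate LLaMA 3 8B (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache):

```python
def kv_cache_gb(context: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2,
                batch: int = 1) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x kv_heads x head_dim
    x bytes, per cached token. Defaults approximate LLaMA 3 8B."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context * batch / 1024**3

print(kv_cache_gb(2_048))   # 0.25 GiB at a 2K context
print(kv_cache_gb(32_768))  # 4.0 GiB at 32K -- comparable to the Q4 weights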
Speculative Decoding
Speculative decoding uses a small “draft” model to predict multiple tokens ahead, then verifies them on the full model in a single pass. When predictions are accurate, this can boost effective tok/s by 2–3× without any quality loss — but it requires extra VRAM for the draft model.
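The expected gain has a neat closed form under a simplifying i.i.d. assumption: with a k-token draft and per-token acceptance probability α, each full-model pass emits (1 − α^(k+1)) / (1 − α) tokens on average. The α = 0.7 value below is an assumed acceptance rate, for illustration:

```python
def expected_speedup(alpha: float, k: int) -> float:
    """Expected tokens emitted per full-model pass with a k-token draft
    and per-token acceptance probability alpha (i.i.d. simplification)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Assumed 70% acceptance with a 4-token draft:
print(round(expected_speedup(0.7, 4), 2))  # ~2.77 tokens per pass
```

That lands in the 2–3× range quoted above; higher acceptance (a draft model well matched to the target) pushes it further.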
Top GPU Picks by Use Case
Not sure which GPU you need? Here are our top recommendations based on common inference workloads.
Token throughput figures are rough estimates under single-user, single-GPU conditions at Q4_K_M quantisation. Real-world performance varies significantly with concurrent requests, context length, cooling, and configuration.
How to Choose a GPU Based on Benchmarks
Matching your workload to the right hardware doesn’t have to be complicated. Follow these guidelines.
Prototyping & Dev
Running 3B–7B models for testing? An RTX 4060 or RTX 5060 gives you 50–60 tok/s at under £90/mo. Enough for local development without breaking the bank.
Production API (7B–13B)
For customer-facing APIs running models like Mistral 7B or LLaMA 3 8B, the RTX 5080 at 140 tok/s gives excellent per-user response times with room for concurrency.
Large Models (33B–70B)
Models above 13B need 24GB+ VRAM. The RTX 3090 (24GB) handles 33B at Q4. For 70B, look at the RTX 5090 (32GB) or Radeon AI Pro R9700 (32GB).
Enterprise (70B–405B)
Running LLaMA 3.1 405B or similar? The RTX 6000 PRO with 96GB of VRAM is the only single-GPU option that fits these models without multi-GPU setups.
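The guidelines above reduce to a smallest-card-that-fits check. This hypothetical helper uses a subset of the GPUs and throughput figures from the table in this article, with an assumed ~2 GB overhead allowance for runtime and KV cache:

```python
# Subset of the benchmark table: (name, vram_gb, LLaMA 3 8B tok/s)
GPUS = [
    ("RTX 4060 8GB", 8, 52),
    ("RTX 5080 16GB", 16, 140),
    ("RTX 3090 24GB", 24, 85),
    ("RTX 5090 32GB", 32, 220),
    ("RTX 6000 PRO 96GB", 96, 245),
]

def pick_gpu(weights_gb_needed: float, overhead_gb: float = 2.0):
    """Hypothetical helper: return the smallest card whose VRAM holds
    the weights plus an assumed overhead allowance, or None."""
    for name, vram, _ in sorted(GPUS, key=lambda g: g[1]):
        if vram >= weights_gb_needed + overhead_gb:
            return name
    return None

print(pick_gpu(4.6))   # 8B at Q4 -> smallest card that fits
print(pick_gpu(35.0))  # 70B at Q4 -> needs the 96 GB card
```

Throughput versus cost is a separate trade-off once you have the set of cards that fit; the table above covers that dimension.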
Frequently Asked Questions
Available on all servers
- 1Gbps Port
- NVMe Storage
- 128GB DDR4/DDR5
- Any OS
- 99.9% Uptime
- Root/Admin Access
Our dedicated GPU servers provide full hardware resources and a dedicated GPU card, ensuring unmatched performance and privacy. Perfect for LLM inference, AI model serving, fine-tuning, and any deep learning workload — with no shared resources and no token fees.
Get in Touch
Not sure which GPU is right for your inference workload? Our team can help you match your model size and throughput requirements to the ideal server configuration.
Contact Sales →
Or browse the knowledgebase for setup guides on Ollama, vLLM, and more.
Ready to Deploy? Choose Your GPU
Flat monthly pricing. Full GPU resources. UK data centre. Pick the GPU that matches your throughput needs and deploy in under an hour.