
Tokens/sec Benchmark

GPU Token Throughput Benchmarks for LLM Inference

Estimated tokens per second for every GPU in our hosting lineup. Use these benchmarks to find the right GPU for your AI inference workload and budget.

How Fast Is Each GPU for LLM Inference?

Token throughput — measured in tokens per second (tok/s) — is the single most important metric when choosing a GPU for large language model inference. It directly determines how fast your chatbot responds, how many concurrent users your API can handle, and how quickly batch jobs complete.

We benchmark every GPU in our dedicated server lineup under identical conditions so you can make apples-to-apples comparisons. Whether you’re running a small 3B model for internal tooling or a full 70B parameter model for a production API, the numbers below will help you pick the right card.

  • GPUs Tested: 12
  • Peak tok/s: 245
  • Quantisation: Q4_K_M
  • Runtime: Ollama
  • Reference Model: 8B
  • VRAM Range: 6–96 GB
  • Data Centre: UK
  • Starting Price: £69+

Tokens Per Second by GPU — Visual Chart

Estimated throughput running LLaMA 3 8B at Q4_K_M via Ollama. Single user, single GPU. Higher is faster.

  • RTX 6000 PRO · ~245 tok/s
  • RTX 5090 · ~220 tok/s
  • RTX 5080 · ~140 tok/s
  • R9700 · ~110 tok/s
  • RX 9070 XT · ~95 tok/s
  • RTX 3090 · ~85 tok/s
  • Ryzen AI MAX+ · ~80 tok/s
  • Arc Pro B70 · ~70 tok/s
  • RTX 4060 Ti · ~68 tok/s
  • RTX 5060 · ~60 tok/s
  • RTX 4060 · ~52 tok/s
  • RTX 3050 · ~18 tok/s


Full Benchmark Table — All GPUs

Detailed throughput, VRAM, maximum model size at Q4 quantisation, and relative performance for every GPU we offer.

GPU | VRAM | LLaMA 3 8B tok/s | Max Model (Q4) | Best For | Relative Performance
RTX 3050 6GB | 6 GB | ~18 tok/s | ~5B | Testing & prototyping | 7%
RTX 4060 8GB | 8 GB | ~52 tok/s | ~7B | Small model inference | 21%
RTX 5060 8GB | 8 GB | ~60 tok/s | ~7B | Budget Blackwell inference | 24%
RTX 4060 Ti 16GB | 16 GB | ~68 tok/s | 13B | Mid-range with extra VRAM | 28%
Arc Pro B70 32GB | 32 GB | ~70 tok/s | 33B | Large VRAM on a budget | 29%
Ryzen AI MAX+ 395 | 128 GB shared | ~80 tok/s | 70B+ | Huge models via unified memory | 33%
RTX 3090 24GB | 24 GB | ~85 tok/s | 33B | Best value all-rounder | 35%
RX 9070 XT 16GB | 16 GB | ~95 tok/s | 13B | Fast AMD for 7B–13B | 39%
Radeon AI Pro R9700 | 32 GB | ~110 tok/s | 70B Q2 | Pro AMD — speed + VRAM | 45%
RTX 5080 16GB | 16 GB | ~140 tok/s | 13B | Fastest for 7B–13B models | 57%
RTX 5090 32GB | 32 GB | ~220 tok/s | 70B Q2 | Fastest for production APIs | 90%
RTX 6000 PRO 96GB | 96 GB | ~245 tok/s | 405B Q2 | Enterprise — largest models + fastest 8B | 100%
Benchmark Methodology: All figures are estimates based on single-GPU, single-user inference at Q4_K_M quantisation using Ollama. The reference model is LLaMA 3 8B for all GPUs. Real-world throughput varies with concurrent users, context length, system RAM, cooling, and model architecture. Relative performance is normalised to the fastest GPU (RTX 6000 PRO = 100%). The “Max Model (Q4)” column shows the largest parameter model that fits entirely in each GPU’s VRAM at 4-bit quantisation.

What Affects Token Throughput?

Token speed isn’t determined by the GPU alone. Understanding these factors will help you get the most from your hardware.

VRAM Capacity

The model must fit entirely in GPU memory for maximum speed. When a model exceeds VRAM, layers spill to system RAM and throughput drops dramatically — often by 5–10×.
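As a rough sanity check before choosing a VRAM tier, you can estimate whether quantised weights fit. This sketch assumes Q4_K_M averages about 4.8 bits per weight and reserves ~20% headroom for KV cache and activations; both constants are our assumptions, not exact figures.

```python
def fits_in_vram(params_b: float, vram_gb: float,
                 bits_per_weight: float = 4.8, overhead: float = 1.2) -> bool:
    """Rough check: do the quantised weights, plus ~20% headroom for
    KV cache and activations, fit in VRAM?  4.8 bits/weight is an
    assumed average for Q4_K_M; both constants are estimates."""
    weight_gb = params_b * bits_per_weight / 8
    return weight_gb * overhead <= vram_gb

# An 8B model at ~4.8 GB of weights squeezes into 6 GB; a 13B does not,
# which is why the RTX 3050 tops out around 5B-class models.
print(fits_in_vram(8, 6), fits_in_vram(13, 6), fits_in_vram(33, 24))
```

Under these assumptions the result lines up with the table above: 33B at Q4 just fits a 24 GB RTX 3090, while anything that fails the check will spill to system RAM and slow down drastically.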

Quantisation Level

Q4_K_M (4-bit) is the most common quantisation for inference. Lower bit depths (Q2, Q3) reduce quality but fit larger models. FP16 doubles VRAM usage but gives best output quality.
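The VRAM impact of each quantisation level is easy to approximate. The bits-per-weight values below are assumed averages (k-quants mix several bit widths internally, so real files vary slightly):

```python
# Assumed average bits per weight for common llama.cpp quantisations.
QUANT_BITS = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5, "FP16": 16.0}

def weight_gb(params_b: float, quant: str) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8."""
    return params_b * QUANT_BITS[quant] / 8

for q in QUANT_BITS:
    print(f"8B @ {q}: {weight_gb(8, q):.1f} GB")
```

For an 8B model this spans roughly 2.6 GB at Q2_K up to 16 GB at FP16, which is why Q4_K_M is the usual compromise.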

Memory Bandwidth

LLM inference is memory-bandwidth bound, not compute bound. GPUs with higher GB/s bandwidth (like RTX 5090’s 1,792 GB/s) generate tokens faster regardless of raw TFLOPS.
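Because each generated token must stream the full weight set through the memory bus, bandwidth divided by weight size gives a theoretical decode ceiling. A minimal sketch (real throughput lands well below this bound due to KV cache reads, kernel launch overhead, and imperfect bandwidth utilisation):

```python
def bandwidth_ceiling(bandwidth_gbs: float, weight_gb: float) -> float:
    """Upper bound on decode tok/s: every token reads all weights once,
    so tok/s cannot exceed bandwidth / weight size."""
    return bandwidth_gbs / weight_gb

# RTX 5090 (1,792 GB/s) over ~4.8 GB of Q4 weights for an 8B model:
print(round(bandwidth_ceiling(1792, 4.8)))  # ceiling ~373 tok/s
```

The measured ~220 tok/s sits comfortably under that ~373 tok/s ceiling, which is typical: the bound tells you which GPU *can* be faster, not exactly how fast it will be.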

Concurrent Users

Benchmarks show single-user throughput. With multiple concurrent requests, total throughput increases but per-user speed decreases. Frameworks like vLLM optimise for concurrency.

Context Length

Longer context windows consume more VRAM and slow down inference. A 32K context request will be noticeably slower than a 2K context request on the same GPU and model.

Inference Runtime

Different inference engines (Ollama, vLLM, llama.cpp, TGI) have different performance characteristics. vLLM typically excels at high-concurrency serving; Ollama is easiest for single-user setups.
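You can measure your own tok/s against a running Ollama server: the `/api/generate` endpoint returns `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its non-streaming response. A minimal sketch using only the standard library (model name and prompt are placeholders; assumes Ollama on its default port):

```python
import json
import urllib.request

def tokens_per_second(resp: dict) -> float:
    """Decode throughput from an Ollama /api/generate response:
    eval_count tokens over eval_duration nanoseconds."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

def bench(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and report tok/s.
    Requires a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))

# Example (uncomment with a server running):
# print(bench("llama3:8b", "Explain KV caching in one paragraph."))
```

Run the same prompt a few times and discard the first (cold-load) result for a fair comparison against the figures above.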

Model Architecture

Not all models of the same parameter count perform equally. Architecture choices like grouped-query attention (GQA), mixture-of-experts (MoE), and sliding-window attention all affect inference speed. A 7B MoE model can be faster than a dense 7B model at the same quality.

Thermal Throttling

GPUs reduce clock speeds when temperatures exceed safe limits. Sustained inference workloads generate continuous heat — proper server cooling and airflow prevent thermal throttling and maintain consistent throughput over long periods.

Prompt Length vs Generation

Processing the input prompt (prefill) and generating output tokens (decode) use the GPU differently. Long prompts create a larger initial delay before the first token appears, while decode speed determines how fast the response streams. A 10K token prompt takes significantly longer to prefill than a 100 token prompt on the same GPU.
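The split between prefill and decode can be modelled with two rates. The figures below are hypothetical illustrations (prefill is usually far faster per token than decode because it is batched across the whole prompt):

```python
def response_time(prompt_toks: int, out_toks: int,
                  prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time to first token, total response time) in seconds.
    TTFT = prompt processing; the rest is streaming at the decode rate."""
    ttft = prompt_toks / prefill_tps
    return ttft, ttft + out_toks / decode_tps

# Hypothetical rates: 2,000 tok/s prefill, 85 tok/s decode.
ttft, total = response_time(10_000, 500, 2_000, 85)
print(f"TTFT {ttft:.1f}s, total {total:.1f}s")  # 5.0s before the first word
```

The same request with a 100-token prompt would have a TTFT of 0.05 s, which is why trimming context pays off for interactive use.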

System RAM & NVMe

While VRAM handles the model weights, system RAM and NVMe speed affect model loading times, KV cache overflow, and context management. 128GB DDR5 ensures headroom for large context windows and prevents swap-to-disk bottlenecks during inference.

KV Cache Size

The key-value cache stores attention state for each token in the context. Larger context windows and higher batch sizes rapidly increase KV cache memory usage — often consuming as much VRAM as the model weights themselves at 32K+ context lengths.
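The KV cache footprint follows directly from the model shape: two tensors (K and V) per layer per token. A sketch using a LLaMA 3 8B-like configuration (32 layers, 8 KV heads via GQA, head dimension 128, FP16 cache); treat the shape as an assumption and substitute your model's actual values:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, batch: int = 1, bytes_per: int = 2) -> float:
    """KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
    x context length x batch x bytes per element."""
    return 2 * layers * kv_heads * head_dim * ctx_len * batch * bytes_per / 1e9

# LLaMA 3 8B-like shape at a 32K context, FP16 cache:
print(f"{kv_cache_gb(32, 8, 128, 32_768):.1f} GB")  # ~4.3 GB
```

At 32K context that ~4.3 GB is close to the ~4.8 GB the Q4 weights themselves occupy, confirming the point above; without GQA (32 KV heads instead of 8) it would be four times larger.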

Speculative Decoding

Speculative decoding uses a small “draft” model to predict multiple tokens ahead, then verifies them on the full model in a single pass. When predictions are accurate, this can boost effective tok/s by 2–3× without any quality loss — but it requires extra VRAM for the draft model.

Top GPU Picks by Use Case

Not sure which GPU you need? Here are our top recommendations based on common inference workloads.

RTX 3090 · 24GB · Best Value
  • Architecture: Ampere
  • VRAM: 24 GB GDDR6X
  • Bandwidth: 936 GB/s
  • FP32: 35.6 TFLOPS
  • ~85 tok/s (LLaMA 3 8B Q4) · runs 33B models at Q4
  • From £139.00/mo · Configure

RTX 6000 PRO · 96GB · Enterprise
  • Architecture: Blackwell 2.0
  • VRAM: 96 GB GDDR7
  • Bandwidth: 1,536 GB/s
  • FP32: 126.0 TFLOPS
  • ~245 tok/s (LLaMA 3 8B Q4) · fits 405B at Q2, fastest on 8B
  • From £899.00/mo · Configure

Radeon AI Pro R9700 · 32GB · Best AMD
  • Architecture: RDNA 4
  • VRAM: 32 GB GDDR6
  • Bandwidth: 645 GB/s
  • FP32: 47.8 TFLOPS
  • ~110 tok/s (LLaMA 3 8B Q4) · 32GB VRAM fits 70B at Q2
  • From £199.00/mo · Configure

RTX 4060 Ti · 16GB · Budget 13B
  • Architecture: Ada Lovelace
  • VRAM: 16 GB GDDR6
  • Bandwidth: 288 GB/s
  • FP32: 22.1 TFLOPS
  • ~68 tok/s (LLaMA 3 8B Q4) · 16GB fits 13B models at Q4
  • From £99.00/mo · Configure

Token throughput figures are rough estimates under single-user, single-GPU conditions at Q4_K_M quantisation. Real-world performance varies significantly with concurrent requests, context length, cooling, and configuration.

How to Choose a GPU Based on Benchmarks

Matching your workload to the right hardware doesn’t have to be complicated. Follow these guidelines.

Prototyping & Dev

Running 3B–7B models for testing? An RTX 4060 or RTX 5060 gives you 50–60 tok/s at under £90/mo. Enough for local development without breaking the bank.

Production API (7B–13B)

For customer-facing APIs running models like Mistral 7B or LLaMA 3 8B, the RTX 5080 at 140 tok/s gives excellent per-user response times with room for concurrency.

Large Models (33B–70B)

Models above 13B need 24GB+ VRAM. The RTX 3090 (24GB) handles 33B at Q4. For 70B, look at the RTX 5090 (32GB) or Radeon AI Pro R9700 (32GB).

Enterprise (70B–405B)

Running LLaMA 3.1 405B or similar? The RTX 6000 PRO with 96GB of VRAM is the only single-GPU option that fits these models without multi-GPU setups.

Frequently Asked Questions

What is a token, and what does tok/s measure?

A token is roughly ¾ of a word. Tokens per second (tok/s) measures how many tokens a GPU can generate in one second during inference. At 85 tok/s a model produces roughly 64 words per second — fast enough for real-time chat. At 220 tok/s the same model produces about 165 words per second.

What is Q4_K_M quantisation?

Q4_K_M is the most widely used quantisation level for production LLM inference. It offers the best balance of output quality and VRAM efficiency — you get very close to FP16 quality at roughly half the memory usage. It’s the default for most Ollama and llama.cpp deployments.

How accurate are these benchmarks?

These are single-user, single-GPU estimates under controlled conditions. Real-world throughput varies depending on the model architecture, context window length, number of concurrent users, inference runtime (Ollama vs vLLM vs TGI), system RAM, and cooling. Use these numbers as a reliable comparative baseline between GPUs rather than absolute production figures.

Should I use Ollama or vLLM?

Ollama is optimised for simplicity and single-user inference — great for development and personal use. vLLM is designed for high-throughput serving with features like continuous batching, PagedAttention, and speculative decoding. For production APIs with multiple concurrent users, vLLM typically delivers higher aggregate throughput.

Can I run a model that doesn’t fit in my GPU’s VRAM?

Technically yes — Ollama and llama.cpp support partial offloading to system RAM. However, any layers running on CPU/RAM will be dramatically slower (often 5–10× slower). For best performance, choose a GPU whose VRAM can fit your model entirely. Check our GPU hosting plans to find the right VRAM tier.

How do AMD GPUs compare for LLM inference?

AMD GPUs using ROCm have made significant strides. The RX 9070 XT (95 tok/s) and Radeon AI Pro R9700 (110 tok/s) are competitive with their NVIDIA counterparts and often offer more VRAM per pound. Software support is slightly narrower — Ollama, vLLM, and llama.cpp all work well, but some niche frameworks may still have NVIDIA-only dependencies.

What makes the Ryzen AI MAX+ 395 different?

The Ryzen AI MAX+ 395 is unique — it’s an APU with 128GB of unified shared memory. While its per-token speed (~80 tok/s) is moderate, it can run models up to 70B+ parameters without a discrete GPU. It’s ideal if you need to fit very large models on a single system where VRAM would otherwise be the bottleneck.

Do you offer multi-GPU configurations?

Yes. For workloads requiring more throughput or the ability to run 70B+ models at full Q4 quality, we offer multi-GPU configurations. Contact our sales team for custom quotes on dual-GPU and quad-GPU setups.

How many tok/s do I need for a chat application?

For a responsive chat experience, aim for at least 30–40 tok/s per concurrent user. At that speed, text appears on screen as fast as a user can comfortably read it. If you’re serving multiple concurrent users, multiply accordingly — for example, 10 concurrent users at 40 tok/s each requires a GPU capable of ~400 aggregate tok/s, which is where vLLM’s batching or a multi-GPU setup becomes important.

What’s the difference between tok/s and time to first token?

Tokens per second measures generation speed — how quickly the GPU produces output tokens once it starts generating. Time to first token (TTFT) measures the initial latency — how long the user waits before the first word appears. TTFT depends on prompt length (longer prompts take longer to process) and is generally more noticeable to end users. Both matter for a good experience, but tok/s determines overall response completion time.

Is self-hosting cheaper than cloud token APIs?

At moderate to high usage levels, yes — significantly. Cloud APIs charge per token (typically £1–£15 per million tokens for output). A dedicated GPU server at a fixed monthly price gives you unlimited tokens. The break-even point depends on your usage volume, but most customers processing more than a few hundred thousand tokens per day find self-hosting cheaper. Use our LLM cost calculator to compare for your specific workload.

Do fine-tuned models run at a different speed?

Inference speed is determined by model architecture and parameter count, not by whether the model has been fine-tuned. A fine-tuned LLaMA 3 8B runs at the same tok/s as the base LLaMA 3 8B — the weights are different but the computation is identical. However, LoRA adapters add a small overhead (typically 1–5%), and merged models with additional layers may differ slightly in size and therefore speed.

How does batch size affect throughput?

Batch size has a major impact on aggregate throughput. Our benchmarks use a batch size of 1 (single user). Increasing the batch size — serving multiple requests simultaneously — uses more VRAM but improves total tok/s because the GPU can process multiple sequences in parallel. Frameworks like vLLM handle this automatically with continuous batching. A GPU producing 140 tok/s for one user might produce 400+ aggregate tok/s across 8 concurrent requests.

Which quantisation level should I choose?

It depends on your priorities. Q4_K_M is the most popular choice — it delivers near-FP16 quality at roughly half the VRAM usage and fastest inference speed. Q5_K_M offers slightly better quality with marginally more VRAM. Q8 and FP16 are noticeably higher quality but require 2–4× the VRAM, meaning you may need to drop to a smaller model. For most production use cases, Q4_K_M is the sweet spot.

Available on all servers

  • 1Gbps Port
  • NVMe Storage
  • 128GB DDR4/DDR5
  • Any OS
  • 99.9% Uptime
  • Root/Admin Access

Our dedicated GPU servers provide full hardware resources and a dedicated GPU card, ensuring unmatched performance and privacy. Perfect for LLM inference, AI model serving, fine-tuning, and any deep learning workload — with no shared resources and no token fees.

Get in Touch

Not sure which GPU is right for your inference workload? Our team can help you match your model size and throughput requirements to the ideal server configuration.

Contact Sales →

Or browse the knowledgebase for setup guides on Ollama, vLLM, and more.

Ready to Deploy? Choose Your GPU

Flat monthly pricing. Full GPU resources. UK data centre. Pick the GPU that matches your throughput needs and deploy in under an hour.
