NVIDIA’s Blackwell architecture brings a genuine generational leap to consumer GPUs, and the numbers back it up. The RTX 5080 pushes LLaMA 3 8B to 82 tokens per second at FP16 — a 32% improvement over the RTX 3090 despite having 8 GB less VRAM. But there is a catch, and it is worth understanding before you commit to this card for inference on GigaGPU dedicated servers.
Blackwell in Action
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 82 tok/s |
| Tokens/sec (batched, bs=8) | 131.2 tok/s |
| Per-token latency | 12.2 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 8K |
| Performance rating | Excellent |
Benchmark conditions: single-stream generation, 512-token prompt, 256-token completion, FP16 weights served via a vLLM or llama.cpp backend.
At 12.2 ms per token, responses feel instantaneous. The 5080’s improved memory subsystem and tensor core efficiency squeeze out 82 tok/s from FP16 weights, and batched inference reaches 131.2 tok/s — comfortably past the threshold for serving multiple concurrent users with imperceptible delay.
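The latency and throughput figures above are two views of the same measurement, which is easy to verify: per-token latency is simply the inverse of single-stream throughput. A quick sketch using the table's numbers:

```python
# Sanity-check the relationship between the table's latency and
# throughput figures for the RTX 5080 running LLaMA 3 8B at FP16.
single_stream_tps = 82    # tokens/sec, single stream (from the table)
batched_tps = 131.2       # tokens/sec, batch size 8 (from the table)

# Per-token latency in ms is the inverse of single-stream throughput.
latency_ms = 1000 / single_stream_tps
print(f"per-token latency: {latency_ms:.1f} ms")   # 12.2 ms

# Batching trades per-request latency for aggregate throughput.
speedup = batched_tps / single_stream_tps
print(f"batched speedup: {speedup:.2f}x")          # 1.60x
```

The 1.6x batched speedup is the gap between serving one user and serving eight concurrently on the same card.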
The VRAM Trade-off
| Component | VRAM |
|---|---|
| Model weights (FP16) | 16.8 GB |
| KV cache + runtime | ~2.5 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~0.0 GB |
Here is the compromise. The 5080 has only 16 GB of VRAM, so FP16 LLaMA 3 8B fills it completely: you are capped at 8K context, and there is no room for concurrent-request KV caches. Compared with the 3090's comfortable 7.2 GB of headroom on the same model, this is a tight squeeze. If your workload needs longer contexts, either quantise to 4-bit (still very fast on this hardware) or move up to the RTX 5090.
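You can reproduce this sizing from first principles. The sketch below uses the published LLaMA 3 8B architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and counts raw tensor bytes only; a running framework adds activation buffers and allocator overhead on top, which is why the table's weight figure is higher than the raw number here.

```python
# Back-of-envelope VRAM sizing for LLaMA 3 8B at FP16.
params = 8.0e9          # parameter count
bytes_fp16 = 2          # bytes per FP16 value

weights_gb = params * bytes_fp16 / 1024**3
print(f"raw weights: {weights_gb:.1f} GB")   # ~14.9 GB before runtime overhead

# KV cache per token: K and V tensors, per layer, per KV head,
# per head-dimension element, at FP16.
layers, kv_heads, head_dim = 32, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
context = 8192
kv_cache_gb = kv_bytes_per_token * context / 1024**3
print(f"KV cache @ 8K context: {kv_cache_gb:.2f} GB")   # 1.00 GB
```

At 128 KB of KV cache per token, every additional 8K-token stream costs another gigabyte, which is exactly the headroom this card does not have.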
Cost Breakdown
| Cost Metric | Value |
|---|---|
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £3.218 |
| Tokens per £1 | 310,752 |
| Break-even vs API | ~1 req/day |
The 5080 edges out even the RTX 3090 on per-token cost at £3.22 per million tokens. Batched, it drops to roughly £2.01 — the lowest in the non-flagship range. At £189/month it is a premium over the 3090, but you are paying for raw speed and modern architecture efficiency. Check our tokens-per-second benchmark and cost calculator for the full picture.
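Both per-token cost figures follow directly from the hourly rate and the throughput numbers, so you can recompute them for your own workload mix:

```python
# Reproduce the cost-per-million-token figures from the table above.
cost_per_hour = 0.95      # GBP, from the cost table
single_stream_tps = 82
batched_tps = 131.2

def cost_per_million(tps, gbp_per_hour=cost_per_hour):
    """Cost in GBP to generate one million tokens at a given tok/s rate."""
    tokens_per_hour = tps * 3600
    return gbp_per_hour / tokens_per_hour * 1e6

print(f"single-stream: £{cost_per_million(single_stream_tps):.3f}/M tokens")  # £3.218
print(f"batched (bs=8): £{cost_per_million(batched_tps):.3f}/M tokens")       # £2.011
```

Swap in your own hourly rate and measured throughput to compare against per-token API pricing.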
Speed vs. Flexibility
The RTX 5080 is the right choice when throughput matters more than context length. For chat applications, code completion, and short-form generation where 8K context is sufficient, it is the fastest option under £200/month. If you need 32K context for document analysis or RAG workflows, the RTX 3090 actually serves you better despite being slower per-token.
Quick deploy:
```
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-8b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
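Once the container is up, you can hit it from any HTTP client. A minimal stdlib-only Python sketch against llama.cpp's `/completion` endpoint, assuming the host and port from the docker command above:

```python
# Minimal client for a llama.cpp server, using only the standard library.
# The /completion endpoint with "prompt" and "n_predict" fields is the
# llama.cpp server HTTP API; the URL assumes the docker command above.
import json
import urllib.request

def build_payload(prompt, n_predict=256):
    """Encode a llama.cpp /completion request body as JSON bytes."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt, n_predict=256, url="http://localhost:8080/completion"):
    """Send a completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt, n_predict),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]

# Usage (requires the server to be running):
# print(complete("Explain KV caching in one sentence."))
```

For production traffic you would add timeouts and retries, but this is enough to smoke-test the deployment.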
See our LLaMA hosting guide and best GPU for LLaMA roundup. Compare with DeepSeek 7B on RTX 5080, or browse all benchmarks.
Maximum LLaMA 3 Speed
82 tok/s on Blackwell architecture. Purpose-built for low-latency inference.
Order RTX 5080 Server