One hundred tokens per second. That is the point in LLM inference where the GPU stops being the bottleneck and network latency starts to matter more. The RTX 5090 hits that mark running LLaMA 3 8B at full FP16 precision, making it the first consumer-class GPU where an 8B parameter model genuinely feels like a cloud API in terms of responsiveness. We tested it on GigaGPU dedicated servers to see what 32 GB of Blackwell VRAM can really do.
The 100 tok/s Milestone
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 100 tok/s |
| Tokens/sec (batched, bs=8) | 160 tok/s |
| Per-token latency | 10 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 32K |
| Performance rating | Excellent |
Benchmark conditions: single-stream generation, 512-token prompt, 256-token completion, vLLM backend at FP16. (GGUF Q4_K_M via llama.cpp is the quantised alternative if you want a smaller footprint.)
Ten milliseconds per token means responses stream in about as fast as most clients can render them. At batch size 8, the 5090 sustains 160 tok/s in aggregate, so eight concurrent users each still see roughly 20 tok/s with no noticeable lag. Blackwell's memory bandwidth improvements and enlarged tensor core count are doing exactly what they were designed for here.
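The relationship between the table's numbers is worth making explicit. A quick sketch, using only the benchmark figures quoted above (not a live measurement):

```python
# Derive per-token latency and per-stream throughput from the
# article's benchmark figures.

single_stream_tps = 100   # tokens/sec, single stream
batched_tps = 160         # aggregate tokens/sec at batch size 8
batch_size = 8

per_token_latency_ms = 1000 / single_stream_tps   # 10.0 ms
per_stream_tps = batched_tps / batch_size         # 20.0 tok/s per user

print(f"Per-token latency: {per_token_latency_ms:.1f} ms")
print(f"Per-stream throughput at bs=8: {per_stream_tps:.1f} tok/s")
```

Note that batching trades per-stream speed for aggregate throughput: each of the eight streams runs slower than a single dedicated stream, but total output per second goes up 60%.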
Room to Spare
| Component | VRAM |
|---|---|
| Model weights (FP16) | 16.8 GB |
| KV cache + runtime | ~2.5 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom (after weights) | ~15.2 GB |
With ~15.2 GB of VRAM still free after loading the weights (around 12.7 GB once the KV cache and runtime overhead are resident), the 5090 is almost comically over-provisioned for LLaMA 3 8B. You get full 32K context support and enough space to run multiple concurrent conversations with large KV caches. That spare capacity also means you could load a second, smaller model alongside LLaMA 3, or use the headroom for speculative decoding to push throughput even higher.
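The budget above can be sanity-checked from first principles. The parameter count below is the published LLaMA 3 8B figure; the 16.8 GB weights number and 2.5 GB overhead are the measurements from the table:

```python
# Rough VRAM budget for LLaMA 3 8B at FP16 on a 32 GB card.

params_billion = 8.03              # published LLaMA 3 8B parameter count
bytes_per_param = 2                # FP16
theoretical_weights_gb = params_billion * bytes_per_param  # ~16.1 GB

measured_weights_gb = 16.8         # measured; includes buffers
kv_runtime_gb = 2.5                # KV cache + runtime overhead
total_vram_gb = 32

after_weights = total_vram_gb - measured_weights_gb     # ~15.2 GB
after_runtime = after_weights - kv_runtime_gb           # ~12.7 GB
print(f"Headroom after weights: {after_weights:.1f} GB")
print(f"Headroom with KV cache resident: {after_runtime:.1f} GB")
```

The ~0.7 GB gap between the theoretical weight size and the measured footprint is typical allocator and buffer overhead.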
Premium Pricing, Premium Throughput
| Cost Metric | Value |
|---|---|
| Server cost | £1.50/hr (£299/mo) |
| Cost per 1M tokens | £4.17 |
| Tokens per £1 | ~240,000 |
| Break-even vs API | ~1 req/day |
At £4.17 per million tokens, the 5090 is actually less cost-efficient on a per-token basis than the RTX 3090 (£3.36) or the RTX 5080 (£3.22). The higher £299/month price reflects the premium for flagship performance. With batching, costs drop to about £2.60 per million tokens. You justify this card not on token economics alone but on throughput and headroom — when you need guaranteed low latency at scale. See our tokens-per-second benchmark for detailed GPU comparisons.
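The per-token economics fall straight out of the hourly rate and the measured throughput. A minimal sketch reproducing the article's figures:

```python
# Cost-per-token arithmetic from hourly rate and throughput.

rate_gbp_per_hr = 1.50
single_stream_tps = 100   # tok/s, single stream
batched_tps = 160         # tok/s, aggregate at bs=8

def cost_per_million(tps):
    tokens_per_hour = tps * 3600
    return rate_gbp_per_hr / tokens_per_hour * 1_000_000

print(f"Single stream: £{cost_per_million(single_stream_tps):.2f}/M tokens")  # £4.17
print(f"Batched (bs=8): £{cost_per_million(batched_tps):.2f}/M tokens")       # £2.60

tokens_per_pound = single_stream_tps * 3600 / rate_gbp_per_hr
print(f"Tokens per £1: {tokens_per_pound:,.0f}")                              # 240,000
```

This is also why batching dominates the economics: the hourly rate is fixed, so every extra token per second drops the per-token cost proportionally.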
When Overkill Is the Right Call
For LLaMA 3 8B specifically, the RTX 5090 is more GPU than most deployments need. However, it makes strategic sense if you plan to scale up to larger models later (the 32 GB handles 13B+ models at FP16) or if you need the absolute lowest latency possible for customer-facing applications. It is also the natural choice for teams running multiple LLMs on a single server.
Quick deploy:
```bash
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-8b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Full details in our LLaMA hosting guide. Compare GPUs in our best GPU for LLaMA article, or see the DeepSeek 7B on RTX 5090 for an alternative model. Browse all benchmarks.
Flagship LLaMA 3 Performance
100 tok/s, 32K context, 15 GB headroom. The RTX 5090 leaves nothing on the table.
Order RTX 5090 Server