The RTX 5090’s 32 GB of VRAM is the largest frame buffer available on a single consumer GPU. For a model as demanding as Meta’s LLaMA 3 70B, that extra memory over the 3090 makes a tangible difference — though this pairing still operates near the hardware’s limits. We benchmarked it on a GigaGPU dedicated server.
Performance Overview
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 12.8 tok/s |
| Tokens/sec (batched, bs=8) | 20.5 tok/s |
| Per-token latency | 78.1 ms |
| Precision | INT4 |
| Quantisation | 4-bit GGUF Q4_K_M |
| Max context length | 8K |
| Performance rating | Acceptable |
Benchmark conditions: 512-token prompt, 256-token completion, single stream, llama.cpp with the Q4_K_M quant. The 5090 delivers roughly 2.5x the throughput of the RTX 3090 (5.2 tok/s) on the same model, thanks to both the larger VRAM envelope and faster memory bandwidth.
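The headline numbers are internally consistent, and the relationship is worth making explicit: single-stream throughput is just the reciprocal of per-token latency. A quick sanity check:

```python
# Single-stream throughput is the reciprocal of per-token latency.
latency_ms = 78.1                    # measured per-token latency from the table
tok_per_s = 1000 / latency_ms
print(round(tok_per_s, 1))           # ~12.8 tok/s, matching the table

# Speed-up over the RTX 3090's 5.2 tok/s on the same quant
print(round(tok_per_s / 5.2, 1))     # ~2.5x
```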
VRAM: Still Tight, But Manageable
| Component | VRAM |
|---|---|
| Model weights (4-bit GGUF Q4_K_M) | 31 GB |
| KV cache + runtime | ~4.6 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom | ~1.0 GB |
LLaMA 3 70B at 4-bit still occupies 31 GB — 97% of the 5090's capacity — which means the ~4.6 GB of KV cache and runtime overhead cannot fit entirely on-GPU, and part of it spills into system RAM. The critical difference versus the 3090 is that far more of the model fits on-GPU, reducing CPU offloading and cutting per-token latency from 192 ms to 78 ms. Context extends to 8K (double what the 3090 manages), making multi-turn conversations more practical. But do not expect to run anything else alongside this model.
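Most of the "KV cache + runtime" line item can be estimated from the model architecture. A minimal sketch, assuming the published LLaMA 3 70B shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache; runtime buffers and CUDA context are extra on top of this:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the K and V caches: two tensors per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed LLaMA 3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128
gb = kv_cache_bytes(80, 8, 128, 8192) / 1024**3
print(f"{gb:.1f} GiB")   # ~2.5 GiB at the 8K context above
```

That ~2.5 GiB for the cache itself, plus scratch buffers and runtime overhead, is broadly consistent with the ~4.6 GB figure in the table.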
Cost Considerations
| Cost Metric | Value |
|---|---|
| Server cost | £1.50/hr (£299/mo) |
| Cost per 1M tokens | £32.55 |
| Tokens per £1 | 30,720 |
| Break-even vs API | ~1 req/day |
£32.55 per million tokens is high compared to smaller models, but running a 70B model on any hardware is expensive. Batching at bs=8 brings the effective rate down to roughly £20.33/M. Compare that to commercial LLaMA 70B API endpoints and the self-hosting economics become more favourable at scale. Use the cost calculator to model your specific volume.
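The cost figures above follow directly from the hourly rate and sustained throughput. A small helper reproduces them, so you can substitute your own volume and pricing:

```python
def cost_per_million(gbp_per_hour, tokens_per_second):
    """£ per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gbp_per_hour / tokens_per_hour * 1_000_000

print(round(cost_per_million(1.50, 12.8), 2))   # 32.55 (single stream)
print(round(cost_per_million(1.50, 20.5), 2))   # 20.33 (batched, bs=8)
print(int(12.8 * 3600 / 1.50))                  # 30720 tokens per £1
```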
Practical Advice
At 12.8 tok/s, LLaMA 3 70B on the 5090 is usable for internal tools, development evaluation, and low-concurrency applications. It is not fast enough for consumer-facing chat at scale. If 70B quality is non-negotiable for your use case, this is the best single-GPU option available. Otherwise, consider whether LLaMA 3 8B or Mistral 7B at higher throughput achieves what you need.
Deploy command:
```shell
docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-70b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```

Replace `/path/to/models` with the host directory containing the GGUF file; without the `-v` mount, the container cannot see the model.
Full guidance in the LLaMA hosting guide. Also see: best GPU for LLM inference, all benchmarks, tok/s tool.
LLaMA 3 70B on a Single RTX 5090
The most VRAM you can get on one consumer card. UK datacentre, flat pricing, root access.
Order RTX 5090 Server