
LLaMA 3 70B on RTX 5090: Performance Benchmark & Cost

LLaMA 3 70B benchmarked on RTX 5090: 12.8 tok/s at 4-bit GGUF Q4_K_M, VRAM usage, cost per 1M tokens, and deployment configuration.

The RTX 5090’s 32 GB of VRAM is the largest frame buffer available on a single consumer GPU. For a model as demanding as Meta’s LLaMA 3 70B, the 8 GB of headroom over the RTX 3090’s 24 GB makes a tangible difference, though this pairing still operates near the hardware’s limits. We benchmarked it on a GigaGPU dedicated server.

Performance Overview

| Metric | Value |
| --- | --- |
| Tokens/sec (single stream) | 12.8 tok/s |
| Tokens/sec (batched, bs=8) | 20.5 tok/s |
| Per-token latency | 78.1 ms |
| Precision | INT4 |
| Quantisation | 4-bit GGUF Q4_K_M |
| Max context length | 8K |
| Performance rating | Acceptable |

Conditions: 512-token prompt, 256-token completion, single stream, llama.cpp with Q4_K_M quantisation. The 5090 delivers roughly 2.5x the throughput of the RTX 3090 (5.2 tok/s) on the same model, thanks to both the larger VRAM envelope and faster memory bandwidth.
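The throughput and latency rows in the table are two views of the same single-stream measurement; a quick sanity check in plain Python, using only the numbers above:

```python
def tokens_per_second(per_token_latency_ms: float) -> float:
    # Single-stream decoding produces one token per latency interval.
    return 1000.0 / per_token_latency_ms

# 78.1 ms per token works out to ~12.8 tok/s, matching the table.
print(round(tokens_per_second(78.1), 1))  # 12.8
```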

VRAM: Still Tight, But Manageable

| Component | VRAM |
| --- | --- |
| Model weights (4-bit GGUF Q4_K_M) | 31 GB |
| KV cache + runtime | ~4.6 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom | ~1.0 GB |

LLaMA 3 70B at 4-bit still occupies 31 GB — 97% of the 5090’s capacity. The critical difference versus the 3090 is that more model layers fit on-GPU, reducing the amount of CPU offloading and cutting latency from 192 ms to 78 ms per token. Context extends to 8K (double the 3090), making multi-turn conversations more practical. But do not expect to run anything else alongside this model.
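The article reports only the combined ~4.6 GB figure for KV cache plus runtime. As a back-of-envelope sketch (not the measurement methodology used here), the KV cache component can be estimated from LLaMA 3 70B's published architecture: 80 layers, 8 KV heads via grouped-query attention, head dimension 128, with an fp16 cache assumed:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # Two tensors (K and V), one vector per layer per KV head per position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# LLaMA 3 70B at the 8K context used in this benchmark, fp16 cache:
print(kv_cache_bytes(80, 8, 128, 8192) / 2**30)  # 2.5 (GiB)
```

That leaves roughly 2 GB of the ~4.6 GB for the runtime itself, which is plausible for llama.cpp's buffers but not something the article breaks out.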

Cost Considerations

| Cost Metric | Value |
| --- | --- |
| Server cost | £1.50/hr (£299/mo) |
| Cost per 1M tokens | £32.55 |
| Tokens per £1 | 30,720 |
| Break-even vs API | ~1 req/day |

£32.55 per million tokens is high compared to smaller models, but running a 70B model on any hardware is expensive. Batching at bs=8 brings the effective rate down to roughly £20.33/M. Compare that to commercial LLaMA 70B API endpoints and the self-hosting economics become more favourable at scale. Use the cost calculator to model your specific volume.
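The per-token economics fall straight out of the hourly rate and the measured throughput. A small helper (hypothetical, reproducing the arithmetic behind the table):

```python
def cost_per_million_tokens(hourly_rate_gbp: float, tok_per_s: float) -> float:
    # Tokens generated in one billable hour, then scale the rate to 1M tokens.
    tokens_per_hour = tok_per_s * 3600
    return hourly_rate_gbp / tokens_per_hour * 1_000_000

print(round(cost_per_million_tokens(1.50, 12.8), 2))  # 32.55 (single stream)
print(round(cost_per_million_tokens(1.50, 20.5), 2))  # 20.33 (batched, bs=8)
```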

Practical Advice

At 12.8 tok/s, LLaMA 3 70B on the 5090 is usable for internal tools, development evaluation, and low-concurrency applications. It is not fast enough for consumer-facing chat at scale. If 70B quality is non-negotiable for your use case, this is the best single-GPU option available. Otherwise, consider whether LLaMA 3 8B or Mistral 7B at higher throughput achieves what you need.

Deploy command:

```
# Mount the host directory holding the GGUF so the container can read it
docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-70b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
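Once the container is up, the server exposes llama.cpp's HTTP API. A minimal smoke test using only the Python standard library, assuming the host and port from the command above (the `/completion` endpoint and its `prompt`/`n_predict` fields follow llama.cpp's server API):

```python
import json
from urllib import request

def build_payload(prompt: str, n_predict: int = 256) -> bytes:
    # llama.cpp's /completion endpoint takes a prompt and a token budget.
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = request.Request(base_url + "/completion",
                          data=build_payload(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Usage, with the server from the docker command running:
#   print(complete("Explain the KV cache in one sentence."))
```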

Full guidance in the LLaMA hosting guide. Also see: best GPU for LLM inference, all benchmarks, tok/s tool.

LLaMA 3 70B on a Single RTX 5090

The most VRAM you can get on one consumer card. UK datacentre, flat pricing, root access.

Order RTX 5090 Server
