Sixty-two tokens per second. That is faster than most people can read, and it is what the RTX 3090 delivers running LLaMA 3 8B at full FP16 precision. The 3090 remains one of the most compelling GPUs for self-hosted LLM inference: its 24 GB of VRAM and 936 GB/s memory bandwidth give it headroom that newer mid-range cards simply cannot match. We ran the numbers on GigaGPU dedicated servers.
Raw Performance Numbers
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 62 tok/s |
| Tokens/sec (batched, bs=8) | 99.2 tok/s |
| Per-token latency | 16.1 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 32K |
| Performance rating | Excellent |
Benchmark conditions: single-stream generation, 512-token prompt, 256-token completion, FP16 weights served via vLLM. (A GGUF Q4_K_M build served via llama.cpp is the lower-memory alternative.)
The 3090’s massive 936 GB/s memory bandwidth is the key factor here. LLM inference is almost entirely memory-bandwidth-bound for single-stream generation, and the 3090’s 384-bit memory bus feeds tokens at a rate that puts even the newer RTX 4060 Ti to shame. Batched at bs=8, it pushes past 99 tok/s, approaching the 100 tok/s mark that makes it viable for multi-user API serving.
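The bandwidth-bound claim can be sanity-checked with a naive roofline: single-stream decode throughput is roughly bounded by memory bandwidth divided by the bytes of weights streamed per token. A rough sketch — the 288 GB/s figure for the RTX 4060 Ti is our assumption, and measured numbers shift with kernel efficiency, effective bandwidth, and exact weight size:

```python
def decode_ceiling_toks(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Naive roofline: each generated token streams all weights from VRAM once."""
    return bandwidth_gb_s / weights_gb

# RTX 3090: 936 GB/s bus, 16.8 GB of FP16 weights
print(round(decode_ceiling_toks(936, 16.8), 1))   # ~55.7 tok/s ceiling
# RTX 4060 Ti (assumed ~288 GB/s) for comparison
print(round(decode_ceiling_toks(288, 16.8), 1))   # ~17.1 tok/s ceiling
```

Even this first-order estimate shows the 3090's bandwidth advantage translating directly into a roughly 3x higher decode ceiling than the 4060 Ti.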
Generous VRAM Headroom
| Component | VRAM |
|---|---|
| Model weights (FP16) | 16.8 GB |
| KV cache + runtime | ~2.5 GB |
| Total RTX 3090 VRAM | 24 GB |
| Free headroom (after weights) | ~7.2 GB |
This is where the 3090 really separates itself. After loading the full FP16 model, you still have 7.2 GB free. That translates to 32K context length support and room for multiple concurrent KV caches. Unlike the 4060 Ti where FP16 runs at the absolute memory limit, the 3090 lets you run full precision comfortably with plenty of breathing room for production workloads.
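To see where that headroom goes, the KV cache can be sized per token from the model architecture. The figures below (32 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache) are the standard LLaMA 3 8B shape, but worth verifying against your build:

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # factor of 2 for K and V tensors, per layer, per KV head, per head dim
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

print(kv_cache_bytes(1) / 1024)             # 128.0 KiB per token
print(kv_cache_bytes(32 * 1024) / 1024**3)  # 4.0 GiB for a full 32K context
```

A full 32K context at FP16 costs about 4 GiB, which fits comfortably in the ~7.2 GB left after the weights; the benchmark's 768-token sequences need under 0.1 GB of KV cache, suggesting most of the ~2.5 GB in the table is runtime overhead.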
Cost Analysis
| Cost Metric | Value |
|---|---|
| Server cost | £0.75/hr (£149/mo) |
| Cost per 1M tokens | £3.36 |
| Tokens per £1 | ~297,600 |
| Break-even vs API | ~1 req/day |
At £3.36 per million tokens, the 3090 offers the best per-token economics of any card in the mid-range bracket for LLaMA 3 8B. Batching drops that to about £2.10 — firmly below even the cheapest commercial API pricing. The £149/month cost is higher in absolute terms, but the throughput makes every pound work harder. See how this stacks up on our full benchmark comparison and cost-per-million-tokens calculator.
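Both per-token figures fall straight out of the hourly rate and throughput; a minimal sketch:

```python
def cost_per_million(hourly_rate_gbp: float, toks_per_sec: float) -> float:
    # tokens generated in one billed hour, scaled to a cost per million
    tokens_per_hour = toks_per_sec * 3600
    return hourly_rate_gbp / tokens_per_hour * 1_000_000

print(f"{cost_per_million(0.75, 62.0):.2f}")  # 3.36 (single stream)
print(f"{cost_per_million(0.75, 99.2):.2f}")  # 2.10 (batched, bs=8)
```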
Production-Ready Performance
The RTX 3090 is the point where LLaMA 3 8B stops being a development toy and becomes a production-grade inference engine. The combination of 62 tok/s throughput, 32K context, and ample memory headroom means you can build real products on this hardware — chatbots, document analysis tools, code assistants — without worrying about hitting walls.
Quick deploy (mount the directory holding your GGUF file so the container can see it):

```bash
docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-8b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
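Once the container is up, the llama.cpp server exposes an HTTP API on the mapped port. A minimal Python client against its `/completion` endpoint — the localhost address and generation parameters here are assumptions to adapt to your setup:

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # assumed: server running on the same host

def completion_payload(prompt: str, n_predict: int = 256) -> dict:
    # fields accepted by the llama.cpp server's /completion endpoint
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.7}

def complete(prompt: str) -> str:
    req = urllib.request.Request(
        SERVER + "/completion",
        data=json.dumps(completion_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

if __name__ == "__main__":
    print(complete("Explain memory bandwidth in one sentence."))
```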
Read the full LLaMA hosting guide or our GPU comparison for LLaMA. Compare against the DeepSeek 7B on RTX 3090, or see all benchmark results.