Benchmarks

LLaMA 3 8B on RTX 5080: Performance Benchmark & Cost

LLaMA 3 8B benchmarked on RTX 5080: 82 tok/s at FP16, VRAM usage, cost per 1M tokens, and deployment configuration.

NVIDIA’s Blackwell architecture brings a genuine generational leap to consumer GPUs, and the numbers back it up. The RTX 5080 pushes LLaMA 3 8B to 82 tokens per second at FP16 — a 32% improvement over the RTX 3090 despite having 8 GB less VRAM. But there is a catch, and it is worth understanding before you commit to this card for inference on GigaGPU dedicated servers.

Blackwell in Action

| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 82 tok/s |
| Tokens/sec (batched, bs=8) | 131.2 tok/s |
| Per-token latency | 12.2 ms |
| Precision | FP16 |
| Quantisation | None (full FP16) |
| Max context length | 8K |
| Performance rating | Excellent |

Benchmark conditions: single-stream generation, 512-token prompt, 256-token completion, served either as FP16 via vLLM or as GGUF Q4_K_M via llama.cpp.

At 12.2 ms per token, responses feel instantaneous. The 5080’s improved memory subsystem and tensor core efficiency squeeze out 82 tok/s from FP16 weights, and batched inference reaches 131.2 tok/s — comfortably past the threshold for serving multiple concurrent users with imperceptible delay.
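The headline figures are related by simple arithmetic; a quick sketch using the values from the table above (the per-user batched figure is an illustrative derivation, not a separate measurement):

```python
# Per-token latency is the reciprocal of single-stream throughput.
single_stream_tps = 82.0   # tokens/sec, single stream (from the table)
batched_tps = 131.2        # tokens/sec, aggregate across a batch of 8

latency_ms = 1000.0 / single_stream_tps
per_user_tps = batched_tps / 8  # what each of 8 concurrent users sees

print(f"per-token latency: {latency_ms:.1f} ms")           # ~12.2 ms
print(f"per-user throughput at bs=8: {per_user_tps:.1f} tok/s")  # 16.4 tok/s
```

Note that batching trades per-stream speed for aggregate throughput: each of the eight users sees roughly 16 tok/s, still faster than most people read.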

The VRAM Trade-off

| Component | VRAM |
|---|---|
| Model weights (FP16) | 16.8 GB |
| KV cache + runtime | ~2.5 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~0.0 GB |

Here is the compromise. The 5080 has only 16 GB of VRAM, and FP16 LLaMA 3 8B consumes all of it; on paper the weights plus KV cache even overshoot the card, which is why the quick-deploy command further down serves a Q4_K_M quant. You are limited to 8K context, with no headroom for concurrent-request KV caches. Compared with the 3090's comfortable 7.2 GB of free VRAM on the same model, this is a tight squeeze. If your workload needs longer contexts, either quantise to 4-bit (still very fast on this hardware) or move up to the RTX 5090.
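A back-of-the-envelope estimate shows why the budget is so tight. The sketch below uses LLaMA 3 8B's published architecture (8.03B parameters, 32 layers, 8 KV heads via GQA, head dim 128); it is purely illustrative, and the 16.8 GB weights figure in the table includes loader and framework overhead beyond the raw tensors:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Raw weight footprint: FP16 stores 2 bytes per parameter."""
    return params_billion * 1e9 * 2 / 2**30

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors, per layer, per token, at FP16 (2 bytes)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context / 2**30

weights = fp16_weight_gb(8.03)  # ~15.0 GiB of raw weights alone
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context=8192)  # ~1.0 GiB
print(f"weights: {weights:.1f} GiB, 8K-context KV cache: {kv:.2f} GiB")
```

Even before runtime buffers, the raw FP16 weights come within a gigabyte of the card's capacity, leaving essentially nothing for the KV cache at full context.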

Cost Breakdown

| Cost Metric | Value |
|---|---|
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £3.218 |
| Tokens per £1 | 310,752 |
| Break-even vs API | ~1 req/day |

The 5080 edges out even the RTX 3090 on per-token cost at £3.22 per million tokens. Batched, it drops to roughly £2.01 — the lowest in the non-flagship range. At £189/month it is a premium over the 3090, but you are paying for raw speed and modern architecture efficiency. Check our tokens-per-second benchmark and cost calculator for the full picture.
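The per-token figures fall straight out of the hourly rate and the benchmark throughput; a minimal sketch, assuming the £0.95/hr rate and continuous utilisation:

```python
def cost_per_million(gbp_per_hour: float, tokens_per_sec: float) -> float:
    """Cost in GBP to generate one million tokens at a given sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return gbp_per_hour / tokens_per_hour * 1_000_000

single = cost_per_million(0.95, 82.0)    # single-stream: ~£3.22 per 1M tokens
batched = cost_per_million(0.95, 131.2)  # batched bs=8:  ~£2.01 per 1M tokens
print(f"single-stream: £{single:.2f}/1M tok, batched: £{batched:.2f}/1M tok")
```

The same function makes it easy to compare cards: plug in any GPU's hourly rate and measured tok/s. Real-world cost per token will be higher at partial utilisation, since the server bills by the hour regardless of load.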

Speed vs. Flexibility

The RTX 5080 is the right choice when throughput matters more than context length. For chat applications, code completion, and short-form generation where 8K context is sufficient, it is the fastest option under £200/month. If you need 32K context for document analysis or RAG workflows, the RTX 3090 actually serves you better despite being slower per-token.

Quick deploy:

```bash
# Mount the host directory containing your GGUF file (adjust the host path)
docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3-8b.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
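Once the container is up, you can hit llama.cpp's HTTP server from any client. A minimal sketch using only the standard library, assuming the server is reachable on localhost at the port mapped above (the `/completion` endpoint and its `prompt`/`n_predict` fields are part of llama.cpp's server API):

```python
import json
import urllib.request

API = "http://localhost:8080/completion"  # port mapped in the docker run above

def build_payload(prompt: str, n_predict: int = 64) -> bytes:
    # llama.cpp's server accepts a JSON body with "prompt" and "n_predict"
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt: str) -> str:
    """Send a completion request and return the generated text."""
    req = urllib.request.Request(
        API,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

For production use you would add timeouts and retries, but this is enough to smoke-test a fresh deployment.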

See our LLaMA hosting guide and best GPU for LLaMA roundup. Compare with DeepSeek 7B on RTX 5080, or browse all benchmarks.

Maximum LLaMA 3 Speed

82 tok/s on Blackwell architecture. Purpose-built for low-latency inference.

Order RTX 5080 Server

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
