RTX 5080 Throughput Overview
The RTX 5080 brings Blackwell-generation memory bandwidth to dedicated GPU hosting at a mid-range price. For API-style workloads where maximum requests per second matters more than single-request latency, we benchmarked the 5080’s throughput ceiling across popular 7B-8B models at varying batch sizes.
Tests ran on GigaGPU bare-metal hardware using vLLM continuous batching. Each request used a 128-token prompt with 256-token output. Throughput was measured as sustained completed requests per second over 60-second windows. For single-user speed data, see the tokens per second benchmark.
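The windowed measurement described above can be reproduced from raw completion timestamps with a short helper. This is an illustrative sketch, not our actual harness (`sustained_rps` is a hypothetical name, and the real runs also discard warm-up windows):

```python
# Compute sustained requests/sec from a list of request completion
# timestamps (seconds), using fixed-length measurement windows.
def sustained_rps(completion_times, window=60.0):
    """Return one requests/sec figure per full `window`-second window."""
    if not completion_times:
        return []
    start, end = min(completion_times), max(completion_times)
    n_windows = int((end - start) // window)
    rates = []
    for i in range(n_windows):
        lo = start + i * window
        hi = lo + window
        # Count requests that completed inside this window.
        count = sum(1 for t in completion_times if lo <= t < hi)
        rates.append(count / window)
    return rates
```

Reporting per-window rates rather than a single average makes throughput dips (for example from preemption or KV-cache eviction) visible in the data.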
Requests/sec by Batch Size
| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
|---|---|---|---|---|---|---|
| LLaMA 3 8B (INT4) | 0.36 | 1.30 | 2.35 | 3.90 | 5.40 | 6.20 |
| LLaMA 3 8B (FP16) | 0.26 | 0.85 | 1.42 | 2.15 | 2.80 | 3.10 |
| Mistral 7B (INT4) | 0.38 | 1.38 | 2.50 | 4.15 | 5.70 | 6.60 |
| Mistral 7B (FP16) | 0.28 | 0.92 | 1.52 | 2.30 | 3.00 | 3.30 |
| DeepSeek R1 Distill 7B (INT4) | 0.32 | 1.15 | 2.10 | 3.50 | 4.85 | 5.60 |
| Qwen 2.5 7B (INT4) | 0.35 | 1.25 | 2.28 | 3.80 | 5.25 | 6.10 |
The RTX 5080 peaks at 6.2-6.6 requests/sec with INT4 7B models at batch 64 — roughly 40 percent higher than the RTX 3090’s peak throughput. That translates to approximately 400 requests per minute or over 17 million requests per month at continuous saturation.
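The monthly figure is straightforward arithmetic at continuous saturation, an idealised upper bound; real deployments need headroom for traffic spikes, restarts, and maintenance:

```python
# Idealised monthly capacity at continuous saturation.
def monthly_capacity(req_per_sec, days=30):
    return req_per_sec * 60 * 60 * 24 * days

print(f"{monthly_capacity(6.6):,.0f}")  # 17,107,200 requests/month at 6.6 req/s
```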
RTX 5080 vs RTX 3090 Throughput
The 5080 outperforms the RTX 3090 at every batch size despite having 8 GB less VRAM. The Blackwell architecture’s higher memory bandwidth is the primary driver: 960 GB/s of GDDR7 on the 5080 versus 936 GB/s theoretical on the 3090’s GDDR6X, with noticeably better sustained throughput in practice. At batch 16, the 5080 delivers 3.90 req/s to the 3090’s 2.60, a 50 percent advantage.
The VRAM gap only becomes visible at batch 64 with FP16 models, where the 3090’s 24 GB allows slightly more KV cache headroom. With INT4 models, the 5080’s 16 GB is sufficient for batch 64 without memory pressure. For the full cost-adjusted comparison, see RTX 3090 vs RTX 5080 throughput per dollar.
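The KV-cache arithmetic behind that headroom claim can be sketched. The constants below are LLaMA 3 8B’s published shape (32 layers, 8 grouped-query KV heads, head dimension 128) with an FP16 cache; the sketch ignores vLLM’s paged-attention block overhead and the weights themselves, so treat it as a lower bound:

```python
# Rough KV-cache footprint, assuming LLaMA 3 8B's architecture.
def kv_cache_gb(batch, tokens_per_req, layers=32, kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    # Each token stores a K and a V vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return batch * tokens_per_req * per_token / 1024**3

# Batch 64, 128-token prompt + 256-token output = 384 tokens/request:
print(kv_cache_gb(64, 384))  # 3.0 GB of cache
```

At roughly 3 GB of KV cache for batch 64, an INT4 quantised 8B model (about 5 GB of weights) fits comfortably in 16 GB, while FP16 weights (about 16 GB) do not leave the same margin.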
Latency at Peak Throughput
At batch 64, individual requests take 9-10 seconds end-to-end on the 5080 (compared to 12-14 seconds on the 3090). The 5080’s faster per-token generation means even at high batch sizes, per-request latency remains more manageable. At batch 8, latency stays under 3 seconds — a practical operating point for near-real-time APIs that need both throughput and responsiveness.
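The batch-64 latency figure follows from Little’s law: in steady state, requests in flight equal throughput times per-request latency, so latency is roughly batch size divided by completion rate. A minimal sanity check:

```python
# Little's law: N = throughput * latency, so latency ~= N / throughput
# at steady-state saturation (real latency is lower when batches
# are not always full).
def expected_latency(batch, req_per_sec):
    return batch / req_per_sec

print(round(expected_latency(64, 6.2), 1))  # ~10.3 s, matching the observed 9-10 s
```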
For interactive chatbot use cases, batch 4-8 is the sweet spot on the 5080, delivering 1.30-2.35 req/s with sub-3-second response times. For user-facing concurrency numbers, see the RTX 5080 concurrent users benchmark. Our batch size impact analysis dives deeper into this relationship.
Production Capacity Planning
For an API serving 100 requests per minute with a 3-second latency SLA, a single RTX 5080 at batch 8 (2.35 req/s = 141 req/min) provides comfortable headroom. For 500 requests per minute, you would need four cards at batch 8 behind a load balancer (roughly 564 req/min combined), or you could use two RTX 5090 cards. Use the LLM cost calculator to model your specific throughput requirements.
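The sizing logic above reduces to a ceiling division. A minimal sketch, assuming sustained per-card throughput at your chosen batch size (`cards_needed` and its `headroom` multiplier are illustrative, not part of the calculator):

```python
import math

# Cards required to meet a target request rate, given per-card
# sustained throughput; headroom > 1.0 adds margin for traffic spikes.
def cards_needed(target_req_per_min, per_card_req_per_sec, headroom=1.0):
    per_card_per_min = per_card_req_per_sec * 60
    return math.ceil(target_req_per_min * headroom / per_card_per_min)

print(cards_needed(100, 2.35))  # 1 card covers 100 req/min
print(cards_needed(500, 2.35))  # 4 cards at batch 8
```

Sizing against sustained rather than peak throughput, plus an explicit headroom factor, avoids running cards at saturation where latency SLAs are first to break.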
For detailed guidance on capacity planning across different application types, see our GPU capacity planning for AI SaaS guide. You can also explore the full Benchmarks category for model-specific throughput data.
Conclusion
The RTX 5080 delivers 6.2-6.6 requests per second peak throughput with INT4 7B models — enough for over 17 million requests per month on a single card. Its Blackwell architecture provides a 40 percent throughput advantage over the RTX 3090 while maintaining lower per-request latency at every batch size. For mid-range dedicated GPU hosting, it is the current throughput-per-pound leader.