LLaMA 3 8B Scaling Overview
LLaMA 3 8B is one of the most widely deployed open-weight models for self-hosted inference. Understanding how its throughput scales with concurrent requests is essential for capacity planning: you need to know the point at which adding more users starts degrading individual response times. We tested LLaMA 3 8B (INT4, GPTQ) at concurrency levels from 1 to 64 across four GPUs using vLLM continuous batching.
All tests ran on GigaGPU bare-metal servers with 128-token prompts and 256-token outputs. We measured aggregate requests per second and per-request end-to-end latency. For single-user token speed, see the tokens per second benchmark.
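The exact harness isn't shown here, but a minimal sketch of the test shape using vLLM's offline `LLM` API looks like the following. The model checkpoint, `max_num_seqs` value, and prompt stand-in are our assumptions for illustration, not the benchmark's actual configuration:

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical GPTQ checkpoint; substitute your own INT4 GPTQ quant of LLaMA 3 8B.
llm = LLM(
    model="example-org/Meta-Llama-3-8B-GPTQ",
    quantization="gptq",
    max_num_seqs=32,  # upper bound on sequences batched together at once
)
# Force full 256-token outputs so every request does the same amount of work.
params = SamplingParams(max_tokens=256, ignore_eos=True)

def measure(prompts: list[str]) -> float:
    """Return aggregate requests per second for one batch of prompts."""
    start = time.perf_counter()
    llm.generate(prompts, params)  # continuous batching schedules all prompts together
    return len(prompts) / (time.perf_counter() - start)

for concurrency in (1, 4, 8, 16, 32):
    batch = ["word " * 128] * concurrency  # rough 128-token prompt stand-in
    print(f"concurrency {concurrency}: {measure(batch):.2f} req/s")
```

This simplification submits all requests at once rather than modelling a steady arrival process, so treat it as an approximation of saturated throughput at each concurrency level.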
Throughput by Concurrency Level
| Concurrency | RTX 4060 (req/s) | RTX 3090 (req/s) | RTX 5080 (req/s) | RTX 5090 (req/s) |
|---|---|---|---|---|
| 1 | 0.09 | 0.24 | 0.36 | 0.52 |
| 4 | 0.30 | 0.85 | 1.30 | 1.90 |
| 8 | 0.48 | 1.52 | 2.35 | 3.45 |
| 16 | 0.65 | 2.60 | 3.90 | 5.80 |
| 32 | 0.72 | 3.80 | 5.40 | 8.20 |
| 64 | OOM | 4.50 | 6.20 | 9.80 |
Throughput scales near-linearly up to concurrency 16 on most GPUs, then flattens as memory bandwidth saturates. The RTX 4060 runs out of memory before reaching concurrency 64, while the RTX 5090 sustains 9.8 req/s. For maximum throughput numbers, see the RTX 5090 throughput benchmark.
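One way to see the flattening is to compute batching efficiency, the speedup over single-request throughput divided by the concurrency level, straight from the table above (RTX 3090 column used as the example):

```python
# Throughput (req/s) from the table above, RTX 3090 column.
rtx3090 = {1: 0.24, 4: 0.85, 8: 1.52, 16: 2.60, 32: 3.80, 64: 4.50}

base = rtx3090[1]
for concurrency, rps in rtx3090.items():
    speedup = rps / base
    efficiency = speedup / concurrency  # 1.0 would be perfectly linear scaling
    print(f"concurrency {concurrency:>2}: {speedup:5.1f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency falls from roughly 89% at concurrency 4 to about 29% at 64, which is the flattening visible in the table.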
Latency at Each Concurrency Level
As concurrency increases, per-request end-to-end latency rises because each request shares GPU compute with others.
| Concurrency | RTX 4060 (e2e p50) | RTX 3090 (e2e p50) | RTX 5080 (e2e p50) | RTX 5090 (e2e p50) |
|---|---|---|---|---|
| 1 | 11.2 s | 4.2 s | 2.8 s | 1.9 s |
| 4 | 13.5 s | 4.7 s | 3.1 s | 2.1 s |
| 8 | 16.8 s | 5.3 s | 3.4 s | 2.3 s |
| 16 | 24.5 s | 6.2 s | 4.1 s | 2.8 s |
| 32 | 44.0 s | 8.4 s | 5.9 s | 3.9 s |
| 64 | OOM | 14.2 s | 10.3 s | 6.5 s |
On the RTX 3090, per-request latency grows by about half from concurrency 1 to 16 (4.2 s to 6.2 s), which is manageable. From 16 to 64 it more than doubles again (6.2 s to 14.2 s); this is the region where user experience noticeably degrades. The RTX 3090 concurrent users guide translates these numbers into practical user capacity.
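The throughput and latency tables are consistent with each other under Little's law: in-flight requests ≈ throughput × per-request latency. A quick sanity check against the RTX 3090 columns, using p50 as a stand-in for mean latency:

```python
# Little's law: in-flight requests ≈ throughput (req/s) x latency (s).
# (concurrency, req/s, p50 e2e latency) from the two tables, RTX 3090 column.
rtx3090 = [(1, 0.24, 4.2), (4, 0.85, 4.7), (8, 1.52, 5.3),
           (16, 2.60, 6.2), (32, 3.80, 8.4), (64, 4.50, 14.2)]

for concurrency, rps, p50 in rtx3090:
    print(f"concurrency {concurrency:>2}: throughput x latency = {rps * p50:5.1f}")
```

Each product lands close to the nominal concurrency (e.g. 3.80 × 8.4 ≈ 32), which is why, once throughput saturates, every additional concurrent request is paid for almost entirely in latency.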
GPU Comparison
The RTX 5090 maintains the lowest per-request latency at every concurrency level, with 6.5 s at concurrency 64 compared to the RTX 3090’s 14.2 s. The RTX 3090 remains the best value for moderate concurrency (up to 16 users), while the RTX 5080 occupies a middle ground with strong performance and a lower price than the 5090.
For cost-adjusted comparisons, see the RTX 3090 vs RTX 5090 throughput per dollar and RTX 4060 vs RTX 3090 throughput per dollar analyses. Our batch size impact guide explains the underlying dynamics of how batch size affects tokens per second.
Finding the Optimal Operating Point
The optimal concurrency depends on your latency SLA. For chatbot applications targeting 5-second end-to-end response times, the measured sweet spots are: RTX 3090 at concurrency 4-8 (4.7-5.3 s), RTX 5080 at concurrency 16-20 (4.1 s at 16), and RTX 5090 somewhere between 32 and 64 (3.9 s at 32, 6.5 s at 64). The RTX 4060 misses a 5-second target even at a single request (11.2 s at concurrency 1). Operating below these levels wastes GPU capacity; operating above them degrades user experience.
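Picking the operating point for a different SLA is a mechanical lookup against the latency table. A minimal sketch; the helper name and dictionary are ours, not part of any benchmark tooling:

```python
def max_concurrency_under_sla(p50_by_concurrency: dict[int, float], sla_s: float) -> int:
    """Highest measured concurrency whose p50 e2e latency stays within the SLA."""
    within = [c for c, latency in p50_by_concurrency.items() if latency <= sla_s]
    return max(within, default=0)

# p50 end-to-end latencies (s) from the table above, RTX 5080 column.
rtx5080 = {1: 2.8, 4: 3.1, 8: 3.4, 16: 4.1, 32: 5.9, 64: 10.3}
print(max_concurrency_under_sla(rtx5080, sla_s=5.0))  # -> 16
```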
For production deployment, use the vLLM production setup guide to configure continuous batching, and monitor actual p99 latency to adjust your concurrency limits. The LLM cost calculator can help you model the cost implications of different GPU choices at your target concurrency.
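A sketch of what that looks like on the client side, assuming a vLLM OpenAI-compatible server at `localhost:8000`; the concurrency cap, window size, and model alias are illustrative, not values from this benchmark:

```python
import asyncio
import time

import httpx
import numpy as np

MAX_IN_FLIGHT = 16           # concurrency cap; tune against your measured p99
gate = asyncio.Semaphore(MAX_IN_FLIGHT)
latencies: list[float] = []  # sliding window of recent end-to-end latencies

async def complete(client: httpx.AsyncClient, prompt: str) -> str:
    async with gate:  # never admit more than MAX_IN_FLIGHT requests to the GPU
        start = time.perf_counter()
        resp = await client.post(
            "http://localhost:8000/v1/completions",  # assumed vLLM server address
            json={"model": "llama-3-8b", "prompt": prompt, "max_tokens": 256},
            timeout=60.0,
        )
        latencies.append(time.perf_counter() - start)
        del latencies[:-1000]  # keep only the most recent 1000 samples
        return resp.json()["choices"][0]["text"]

def current_p99() -> float:
    """p99 of recent latencies; lower MAX_IN_FLIGHT if this creeps past your SLA."""
    return float(np.percentile(latencies, 99)) if latencies else 0.0
```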
Conclusion
LLaMA 3 8B INT4 scales well from 1 to 64 concurrent requests on GPUs with sufficient VRAM. The throughput-latency trade-off is favourable up to concurrency 16-32 on mid-range and high-end cards. For model-specific comparisons, also see the DeepSeek concurrent throughput and Mistral 7B concurrent throughput benchmarks, or browse the full Benchmarks category.