
RTX 5060 Ti 16GB Llama 3 8B Benchmark

Full Llama 3.1 8B benchmark on Blackwell 16GB: FP16, FP8, and INT4 quants across batch sizes, prompt lengths, and concurrency.

Llama 3.1 8B is the most common model to serve on a 16 GB card. These numbers were measured on an RTX 5060 Ti 16GB from our dedicated GPU hosting range, running vLLM 0.6.4, CUDA 12.6, and driver 560.


Setup

  • GPU: RTX 5060 Ti 16GB (Blackwell, 448 GB/s, 180 W)
  • Host: Ryzen 9 7950X, 64 GB DDR5-5600, Gen4 NVMe
  • vLLM 0.6.4, CUDA 12.6, PyTorch 2.5, FlashAttention 2.6
  • Model: meta-llama/Llama-3.1-8B-Instruct
  • Benchmark tool: vLLM’s benchmark_throughput.py + custom latency script

Decode Throughput

Steady-state output tokens/sec at 128 input + 512 output, batch 1:

Precision                   Weights    t/s (batch 1)   VRAM used
FP16                        16.0 GB    Does not fit    —
FP8 E4M3                    8.0 GB     108             11.4 GB
FP8 + FP8 KV                8.0 GB     112             9.8 GB
AWQ INT4 (Marlin)           5.5 GB     135             8.2 GB
GPTQ INT4 (Marlin)          5.6 GB     132             8.3 GB
GGUF Q4_K_M (llama.cpp)     4.9 GB     95              7.6 GB
EXL2 4.0 bpw (TabbyAPI)     4.8 GB     145             7.2 GB

AWQ and EXL2 beat FP8 at batch 1 because INT4 matmul via Marlin kernels is memory-bandwidth-bound, and halving weight size halves the bandwidth hit. At higher batch sizes FP8 catches up and passes INT4.
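The bandwidth argument can be sanity-checked with a naive roofline model: at batch 1, every generated token streams the full weight set from VRAM, so decode throughput should scale inversely with weight bytes. The model itself is an assumption (it ignores KV-cache and activation traffic), not anything vLLM computes internally; the numbers come from the decode table above.

```python
# Naive weight-streaming roofline: decode t/s is limited by how many
# times per second the GPU can read the whole weight set from VRAM,
# so speedup from quantization ~= ratio of weight sizes.

def predicted_speedup(baseline_weights_gb: float, quant_weights_gb: float) -> float:
    """Speedup predicted by a pure weight-streaming model."""
    return baseline_weights_gb / quant_weights_gb

pred = predicted_speedup(8.0, 5.5)   # FP8 E4M3 -> AWQ INT4 weight sizes
meas = 135 / 108                     # measured t/s ratio from the decode table

print(f"predicted {pred:.2f}x, measured {meas:.2f}x")
```

The simple model overpredicts (1.45x vs 1.25x measured) because weight reads are not the only bandwidth consumer; KV-cache reads and kernel overheads don't shrink with INT4 weights.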

Prefill Throughput

Input tokens processed per second, single sequence:

Precision          Prefill t/s
FP8                6,800
FP8 + FP8 KV       6,900
AWQ INT4           4,200
GGUF Q4_K_M        3,100
EXL2 4.0 bpw       5,200

FP8 dominates prefill because prefill is compute-bound and Blackwell’s FP8 tensor cores hit peak on large GEMMs.
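A quick back-of-envelope confirms the compute-bound claim. Using the common approximation that a dense transformer forward pass costs about 2 × n_params FLOPs per token, prefill t/s times 2P gives an effective FLOP rate; 8e9 is an approximate parameter count for Llama 3.1 8B.

```python
# Effective compute rate during prefill, using the ~2 * n_params
# FLOPs-per-token rule of thumb for a dense transformer forward pass.

PARAMS = 8e9          # approximate parameter count, Llama 3.1 8B
prefill_tps = 6_800   # FP8 prefill t/s from the table above

achieved_tflops = prefill_tps * 2 * PARAMS / 1e12
print(f"~{achieved_tflops:.0f} TFLOPS effective during FP8 prefill")
```

Roughly 109 TFLOPS sustained is well into tensor-core territory, which is consistent with prefill being GEMM-bound rather than bandwidth-bound.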

Concurrency Scaling

FP8 + FP8 KV, 256 in / 512 out, --max-num-seqs auto:

Concurrent users   Total t/s   Per-user t/s   p99 TTFT
1                  112         112            180 ms
2                  205         103            220 ms
4                  355         89             310 ms
8                  510         64             480 ms
16                 640         40             780 ms
32                 720         22             1,450 ms

Aggregate throughput plateaus around 700-720 t/s at batch 32, as batched decode runs into the card's 448 GB/s memory-bandwidth ceiling.
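The per-user column above is just total throughput divided by user count; recomputing it makes the plateau explicit and lets you quantify how far batch-32 scaling falls short of linear.

```python
# Per-user throughput and scaling efficiency from the concurrency table.
totals = {1: 112, 2: 205, 4: 355, 8: 510, 16: 640, 32: 720}

for users, total in totals.items():
    print(f"{users:2d} users: {total / users:6.1f} t/s each")

# Efficiency at 32 users vs. perfect linear scaling from batch 1:
efficiency = totals[32] / (32 * totals[1])
print(f"scaling efficiency at 32 users: {efficiency:.0%}")
```

At 32 users the card delivers about 20% of linear scaling, which is the plateau behaviour you'd expect once decode saturates memory bandwidth.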

Time to First Token

Batch 1, FP8:

  • 256-token prompt: 120 ms TTFT
  • 2,048-token prompt: 380 ms TTFT
  • 8,192-token prompt: 1,550 ms TTFT
  • 32,768-token prompt: 7,100 ms TTFT (use chunked prefill)
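A useful lower bound on TTFT is simply prompt tokens divided by prefill throughput. Measured TTFT sits above this floor because of scheduling overhead and attention's quadratic cost at long context; comparing the two (using the FP8 numbers from the tables above) shows the gap widening with prompt length.

```python
# TTFT floor estimate: prompt_tokens / prefill_tps, vs. measured values.
PREFILL_TPS = 6_800  # FP8 prefill t/s from the prefill table

measured_ms = {256: 120, 2_048: 380, 8_192: 1_550, 32_768: 7_100}

for tokens, ttft in measured_ms.items():
    floor_ms = tokens / PREFILL_TPS * 1_000
    print(f"{tokens:6d} tokens: floor {floor_ms:6.0f} ms, measured {ttft} ms")
```

At 32k the floor is roughly 4.8 s against 7.1 s measured, which is why chunked prefill matters there: it doesn't shrink total prefill work, but it stops one long prompt from stalling every other request's decode.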

Verdict

Llama 3.1 8B with FP8 weights + FP8 KV cache at 32k context is the default serving config for this card: ~112 t/s single-user and ~700 t/s aggregate at 32 concurrent users. For maximum single-user speed, use AWQ or EXL2 4.0 bpw; for long context, stick with FP8 + FP8 KV cache. Add speculative decoding for interactive chat and prefix caching for fixed system prompts.
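The 32k-context recommendation checks out on paper. Using Llama 3.1 8B's published geometry (32 layers, GQA with 8 KV heads, head dim 128), each cached token stores one K and one V vector per layer, so the KV budget is easy to compute:

```python
# KV-cache sizing for Llama 3.1 8B: 32 layers, 8 KV heads (GQA),
# head dim 128. Each token caches one K and one V vector per layer.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * dtype_bytes  # K and V

fp8_gib = kv_bytes_per_token(1) * 32_768 / 2**30
fp16_gib = kv_bytes_per_token(2) * 32_768 / 2**30
print(f"32k-token KV cache: {fp8_gib:.1f} GiB FP8 vs {fp16_gib:.1f} GiB FP16")
```

FP8 KV halves a 32k cache from 4 GiB to 2 GiB per sequence; on top of 8 GB of FP8 weights, that is what leaves headroom for full-context serving inside 16 GB.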

See also: deployment guide, vs RTX 3090.
