
RTX 5060 Ti 16GB Qwen 2.5 14B Benchmark

Qwen 2.5 14B AWQ on Blackwell 16GB: measured decode, prefill, and concurrency numbers for the tightest fit of the 14B class.

Qwen 2.5 14B quantized to INT4 via AWQ or GPTQ is the biggest model that serves cleanly on 16 GB. Here are the full benchmarks on the RTX 5060 Ti 16GB at our hosting.

Setup

  • Model: Qwen/Qwen2.5-14B-Instruct-AWQ
  • 48 layers, 8 KV heads (GQA), 128 head dim
  • vLLM 0.6.4, Marlin AWQ kernels, FlashAttention 2.6
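A launch for this setup might look like the sketch below. Flag names follow vLLM 0.6.x; the exact context length and memory fraction are illustrative values to tune, not the precise harness we benchmarked with:

```shell
# Sketch of a vLLM 0.6.x launch for the AWQ + FP8 KV configuration
# (values are illustrative, not the exact test harness)
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

Drop `--kv-cache-dtype fp8` and halve `--max-model-len` to reproduce the FP16 KV rows below.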

Decode Throughput

128 in / 512 out, batch 1:

Precision         | VRAM (weights) | t/s | Max context
------------------|----------------|-----|---------------------
FP16              | 28 GB          | n/a | Does not fit
FP8               | 14 GB          | n/a | Does not fit with KV
AWQ INT4 (Marlin) | 9.0 GB         | 68  | 16,384
AWQ INT4 + FP8 KV | 9.0 GB         | 70  | 32,768
GPTQ INT4         | 9.2 GB         | 65  | 16,384
GGUF Q4_K_M       | 8.8 GB         | 55  | 16,384
EXL2 4.0 bpw      | 8.2 GB         | 75  | 16,384

14B decode on 16 GB is memory-bandwidth bound at around 70 t/s. That is slower than the 112 t/s an 8B manages on the same card, but MMLU is meaningfully higher (~74 vs ~68).
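The bandwidth-bound claim can be sanity-checked with a back-of-envelope roofline. The 448 GB/s figure is the card's published memory bandwidth spec and the parameter count is approximate; both are assumptions here, not numbers measured above:

```python
# Back-of-envelope decode roofline for the RTX 5060 Ti 16GB.
# Assumption: ~448 GB/s GDDR7 bandwidth (published spec, not measured here).
BANDWIDTH_BYTES_S = 448e9

# Qwen 2.5 14B has roughly 14.7B parameters; AWQ INT4 stores them at
# ~0.5 bytes each (ignoring the small group-scale overhead).
params = 14.7e9
weight_bytes = params * 0.5  # ~7.35 GB streamed per decoded token

# If decode is purely bandwidth-bound, tokens/s ~ bandwidth / weight bytes.
est_tps = BANDWIDTH_BYTES_S / weight_bytes
print(f"~{est_tps:.0f} t/s first-order estimate")
```

The estimate lands around 61 t/s, the same ballpark as the measured 68-70 t/s; kernel efficiency and cache effects move the real number around this first-order figure.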

Prefill Throughput

  • AWQ INT4: 2,100 input t/s
  • GPTQ INT4: 2,000 input t/s
  • GGUF Q4_K_M: 1,600 input t/s
  • EXL2 4.0 bpw: 2,600 input t/s

Concurrency

Users | Total t/s (AWQ + FP8 KV) | Per user | p99 TTFT
------|--------------------------|----------|---------
1     | 70                       | 70       | 340 ms
2     | 125                      | 62       | 420 ms
4     | 205                      | 51       | 560 ms
8     | 280                      | 35       | 850 ms
16    | 320                      | 20       | 1,600 ms

Scaling flattens out by batch 16 because the KV cache is tight. For production concurrency beyond 8 users, move up to the RTX 5080 16 GB or the RTX 3090 24 GB.
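The flattening is easy to see by comparing total throughput at each batch size against perfect linear scaling from the batch-1 number (table values above):

```python
# Scaling efficiency from the concurrency table: total t/s at each
# batch size vs perfect linear scaling from the batch-1 figure.
total_tps = {1: 70, 2: 125, 4: 205, 8: 280, 16: 320}

for users, total in total_tps.items():
    efficiency = total / (total_tps[1] * users)
    print(f"{users:>2} users: {total} t/s total, {efficiency:.0%} of linear")
```

Efficiency falls from 89% at 2 users to about 29% at 16, which is the KV-cache ceiling showing up in the numbers.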

Context Length

Qwen 2.5 14B supports 128k context (YaRN-extended from a 32k native window). Practical budgets on 16 GB:

  • AWQ + FP16 KV: 16k max
  • AWQ + FP8 KV: 32k max
  • AWQ + FP8 KV + max-num-seqs 1: 64k possible
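These budgets follow from the model dims in the Setup section. A quick calculator, with the free-VRAM figure as a loud assumption (16 GB minus ~9 GB AWQ weights minus ~1.5 GB activations and CUDA overhead; tune for your stack):

```python
# KV-cache budget check for Qwen 2.5 14B on a 16 GB card.
# Model dims from the Setup section: 48 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

def kv_bytes_per_token(elem_bytes):
    # K and V entries across all layers for one token
    return LAYERS * KV_HEADS * HEAD_DIM * 2 * elem_bytes

fp16 = kv_bytes_per_token(2)  # 196,608 B ~ 192 KiB/token
fp8  = kv_bytes_per_token(1)  #  98,304 B ~  96 KiB/token

# Assumed free VRAM for KV: 16 GB minus ~9 GB weights minus ~1.5 GB overhead.
free_bytes = (16 - 9.0 - 1.5) * 1e9

print(f"FP16 KV: {free_bytes / fp16:,.0f} tokens of cache")
print(f"FP8  KV: {free_bytes / fp8:,.0f} tokens of cache")
```

FP16 KV gives roughly 28k tokens of cache (16k context plus batching headroom) and FP8 roughly 56k (32k plus headroom). The 64k single-sequence case is borderline and depends on how much of the overhead you can trim, which is why it needs max-num-seqs 1.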

Verdict

Qwen 2.5 14B AWQ + FP8 KV at 32k is the sweet spot. Strong model (beats Llama 3 8B on reasoning, code, multilingual), solid 70 t/s decode, reasonable concurrency up to 8 users.

Qwen 2.5 14B on Blackwell 16GB

Biggest model that fits, 70 t/s decode. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: monthly cost, max model size, AWQ guide, FP8 KV cache, Llama 3 8B comparison.
