The 14-billion-parameter tier is where consumer GPUs separate the winners from the also-rans. FP16 weights alone are 28 GB – too large for any 24 GB card by a wide margin. The RTX 4090 24GB rewrites the calculus thanks to native FP8 tensor cores on Ada Lovelace: weights drop to 14 GB, FP8 KV cache fits 32k context comfortably, and Qwen 2.5 14B-Instruct becomes the strongest model that runs cleanly on a single consumer GPU – beating Llama 3.1 8B by 10+ MMLU points and coming within five points of Llama 3.1 70B on GSM8K. This guide to running it on our UK GPU hosting walks through the architecture, the FP8 and AWQ deployment paths, throughput across batch sizes, fine-tuning economics, and the production gotchas that bite teams in their first week.
Contents
- Why 14B is the 4090’s sweet spot
- Architecture and KV math
- VRAM budgets across precisions
- Throughput, latency and batched concurrency
- Quality benchmarks
- vLLM deployment and code
- Fine-tuning notes
- Production gotchas
- When to pick this over alternatives
- Verdict
Why 14B is the 4090’s sweet spot
14B parameters at FP8 come to roughly 14 GB – 58% of the 24 GB envelope – leaving around 10 GB for KV cache, paged-attention blocks, activations and CUDA workspace. That is enough for 32k context at FP8 KV with no compromise, and it still leaves a meaningful concurrency budget. AWQ INT4 brings weights down to 9 GB, freeing 15 GB. No other open dense model in this weight class combines this much capability with a memory footprint a single 4090 can serve at production-grade throughput.
The deeper reason 14B is the sweet spot is that the 4090’s 1008 GB/s memory bandwidth is the binding constraint during decode. At FP8, every generated token re-reads roughly 14 GB of weights, so the naive single-stream ceiling is 1008 / 14 ≈ 72 t/s before KV traffic and overhead. (Our measured 110 t/s at batch 1, in the throughput table below, outruns that simple estimate – treat the arithmetic as a scaling guide rather than a hard prediction.) The scaling is what matters: a 28 GB FP16 footprint halves the ceiling, while batching claws bandwidth back because one weight read then serves every sequence in the batch.
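As a sanity check on that arithmetic, here is a minimal back-of-the-envelope sketch; the 1008 GB/s figure and the weight footprints come from this article, and real decode rates also depend on KV traffic, kernel efficiency and batch size.

```python
# Naive decode ceiling: each generated token re-reads the full weight set,
# so single-stream decode is bounded by bandwidth / bytes_of_weights.
BANDWIDTH_GBPS = 1008  # RTX 4090 memory bandwidth in GB/s

weight_footprints_gb = {"FP16": 28.0, "FP8": 14.0, "AWQ INT4": 9.0}

for precision, gb in weight_footprints_gb.items():
    ceiling = BANDWIDTH_GBPS / gb
    print(f"{precision:9s}: {gb:4.1f} GB weights -> ~{ceiling:3.0f} t/s single-stream ceiling")

# Batching lifts aggregate throughput past these numbers because one weight
# read serves every sequence in the batch; KV reads and launch overhead pull
# single-stream numbers back down.
```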
Architecture and KV math
Qwen 2.5 14B-Instruct is a 48-layer transformer with 40 query heads, 8 key/value heads (a 5:1 GQA ratio), 128-dimensional heads, a SwiGLU MLP with a 13,824 intermediate dimension, and a 152,064-token vocabulary. Native context is 32,768; YaRN extends it to 131,072, but quality degrades faster than on the 7B – we recommend keeping production at 32k. The model ships under Apache 2.0, so commercial use carries no MAU cap or royalty.
KV cache per token at FP16: 8 KV heads x 128 dim x 48 layers x 2 (K+V) x 2 bytes = 196 KB. At FP8: 98 KB. So 32k context = 3.06 GB FP8. That is a small fraction of the 10 GB we have free after FP8 weights, leaving room for batch concurrency.
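The same arithmetic as a minimal Python sketch, using the head and layer counts from the architecture paragraph above:

```python
# KV cache footprint for Qwen 2.5 14B: 8 KV heads x 128 dims x 48 layers,
# stored for both K and V.
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 48

def kv_bytes_per_token(bytes_per_value: int) -> int:
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * bytes_per_value  # x2 for K and V

fp16 = kv_bytes_per_token(2)  # 196,608 bytes (~196 KB)
fp8 = kv_bytes_per_token(1)   #  98,304 bytes (~98 KB)

print(f"FP16 KV/token: {fp16 / 1e3:.1f} KB, FP8 KV/token: {fp8 / 1e3:.1f} KB")
print(f"32k context at FP8 KV: {fp8 * 32_768 / 1024**3:.2f} GiB")  # ~3 GiB
```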
VRAM budgets across precisions
| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights | 28.0 GB | 14.0 GB | 9.0 GB |
| KV @ 32k FP8, 1 seq | 3.06 GB | 3.06 GB | 3.06 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.5 GB | 1.5 GB | 1.5 GB |
| Total at batch 1, 32k | 34.2 GB (does not fit) | 20.2 GB | 15.2 GB |
| Headroom for paged KV | n/a | 3.8 GB | 8.8 GB |
Concurrency is then a KV budgeting exercise – the paged-KV pool is roughly 6.9 GB on the FP8 path and 11.9 GB on AWQ once weights and overheads are subtracted:

| Concurrent users | Avg context | FP8 fits? | AWQ fits? |
|---|---|---|---|
| 16 | 2k | Yes (3.0 GB KV) | Yes |
| 16 | 8k | Borderline (12.3 GB KV vs ~6.9 GB pool – expect preemption) | Yes – tight |
| 32 | 4k | Borderline (12.3 GB KV – expect preemption) | Yes – tight |
| 4 | 32k | No (12.3 GB KV – too tight) | Yes – tight |
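The KV demand column is easy to reproduce. A minimal sketch of that calculation – the pool sizes in the comment come from the VRAM budget table above, not from vLLM’s exact block accounting:

```python
# KV demand for a given concurrency profile, with FP8 KV at 98,304 bytes/token.
# For reference, the budget table leaves ~6.9 GB of paged KV with FP8 weights
# and ~11.9 GB with AWQ weights, so the ~12 GiB rows below are borderline even
# on AWQ and lean on vLLM's preemption to stay up.
KV_BYTES_PER_TOKEN = 8 * 128 * 48 * 2  # 1 byte per value at FP8, K and V

def kv_demand_gib(users: int, avg_context: int) -> float:
    return users * avg_context * KV_BYTES_PER_TOKEN / 1024**3

for users, ctx in [(16, 2_048), (16, 8_192), (32, 4_096), (4, 32_768)]:
    print(f"{users:2d} users @ {ctx:>6,} avg tokens -> {kv_demand_gib(users, ctx):4.1f} GiB of KV")
```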
Throughput, latency and batched concurrency
Single 4090, vLLM 0.6.3, FP8 KV, prompt 512, generate 256:
| Quant | Batch 1 (t/s) | Batch 8 | Batch 16 | Batch 32 | TTFT b1 (ms) |
|---|---|---|---|---|---|
| FP8 W8A8 | 110 | 540 | 720 | 820 | 140 |
| AWQ INT4 (Marlin) | 135 | 620 | 840 | 950 | 120 |
| GPTQ INT4 | 95 | 440 | 590 | 680 | 165 |
Compared with Qwen 2.5 7B FP8 (210 t/s at batch 1), doubling the parameters roughly halves throughput – the expected memory-bandwidth scaling. Beyond batch 16 the FP8 curve flattens (720 t/s at 16, 820 t/s at 32) because we run out of KV headroom before we run out of compute; AWQ pushes the knee higher because it leaves more VRAM for paged KV. See the dedicated Qwen 14B benchmark for full curves.
Prefill at 8k prompt, batch 1: FP8 measures 3,100 t/s, AWQ 2,700 t/s. For RAG with long retrieved contexts, FP8 is the better default; for short-prompt chat, AWQ wins on decode.
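To check TTFT and decode speed on your own card, here is a rough timing sketch against vLLM’s OpenAI-compatible completions endpoint. It assumes the FP8 server from the deployment section below is already listening on localhost:8000, and it approximates one streamed chunk as one token.

```python
import json
import time

import requests

URL = "http://localhost:8000/v1/completions"  # vLLM OpenAI-compatible server

payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "prompt": "Explain grouped-query attention in two sentences.",
    "max_tokens": 256,
    "temperature": 0,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        if json.loads(data)["choices"][0].get("text"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # with default settings, roughly one token per chunk

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Decode: {chunks / (elapsed - ttft):.0f} t/s (approximate)")
```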
Quality benchmarks
| Benchmark | Qwen 2.5 14B | Llama 3.1 8B | Mistral Nemo 12B | Llama 3.1 70B (ref) |
|---|---|---|---|---|
| MMLU | 79.7 | 69.4 | 68.0 | 83.6 |
| GSM8K | 90.2 | 84.5 | 74.9 | 95.1 |
| HumanEval | 86.6 | 72.6 | 50.0 | 80.5 |
| MATH | 55.6 | 30.0 | 27.5 | 68.0 |
| IFEval | 79.7 | 78.6 | 67.6 | 87.5 |
The GSM8K and HumanEval lines are the headline: Qwen 2.5 14B beats Llama 3.1 70B on HumanEval (86.6 vs 80.5) at one fifth the parameters and comes within five points of it on GSM8K (the gap on MATH is wider). FP8 quantisation loss measured on these benchmarks is under 0.5 points.
vLLM deployment and code
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92
```
For the AWQ path with more concurrency:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.93
```
The Qwen chat template, applied programmatically:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise the GQA mechanism."},
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
```
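In production you rarely need to build that string yourself: the vLLM server applies the same template on its /v1/chat/completions route. A minimal client sketch using the openai package – the base_url assumes one of the server commands above is running locally:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the GQA mechanism."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```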
Refer to our vLLM setup notes for CUDA toolchain prerequisites and the AWQ quantisation guide for INT4 calibration. Alternatives include TGI 2.x (within 8% of vLLM at this size), SGLang for tool-heavy agents, and llama.cpp Q4_K_M for laptop development.
Fine-tuning notes
QLoRA fine-tuning of Qwen 2.5 14B on a single 4090 is feasible at sequence length 2048, batch 2, gradient accumulation 16, rank 32. Expect 14 hours for a 50,000-sample SFT pass. For longer sequences (4k+), drop the rank to 16 or move to a multi-GPU node. See our QLoRA guide. Full fine-tuning needs on the order of 120 GB of VRAM once optimiser states are counted – that is multi-GPU territory (FSDP across several cards, or an A100/H100 node), not something a single 4090 can do.
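A starting-point sketch for that QLoRA recipe (rank 32, NF4 base weights); the target module names follow Qwen2’s layer naming, the alpha and dropout values are typical defaults rather than figures from this article, and the training loop itself is left to whichever SFT trainer you already use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen2.5-14B-Instruct"

# 4-bit NF4 base weights keep the frozen model at roughly the AWQ footprint (~9 GB).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()  # needed to stay inside 24 GB at seq len 2048

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a small fraction of the 14B total
# From here: batch 2, gradient accumulation 16, seq len 2048 with your SFT trainer.
```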
Production gotchas
- 32k context budget is tight at concurrency: 16 users at 8k average uses 12.3 GB of KV alone. If you see `NoFreeBlocks` errors, either drop `--max-num-seqs` to 12 or move to AWQ.
- FP8 KV scale on long context: at 32k with the default e4m3 KV dtype, attention sinks can lose precision. Switch the KV dtype to `fp8_e5m2` if you see degradation past 24k.
- YaRN beyond 32k is risky: Qwen 2.5 14B’s 4x YaRN extension to 131k loses 6+ points on long-context retrieval benchmarks. Use 32k native and chunk-summarise instead.
- Tokeniser stop tokens: same as the 7B – set `stop_token_ids=[151645, 151643]` to catch both `<|im_end|>` and `<|endoftext|>`.
- Cold start time: 14B FP8 weights take ~25 seconds to load from NVMe and another 12 seconds to JIT-compile CUDA graphs. Pre-warm before exposing the node to traffic – see the sketch after this list.
- Power-cap discipline: at sustained batch 16 the 4090 will draw the full 450 W. Cap it to 400 W with `nvidia-smi -pl 400` for thermal stability – throughput loss is under 5%.
- AWQ Marlin requires SM 8.0+: the 4090 is SM 8.9, so Marlin works, but mixed-fleet deployments need branch logic.
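For the cold-start item, the pre-warm sketch referenced above: wait for vLLM’s /health route to go green, then push one throwaway generation through before the node joins the load balancer. The endpoint paths match vLLM’s OpenAI-compatible server; the polling budget is an arbitrary choice.

```python
import time

import requests

BASE = "http://localhost:8000"

# 1. Wait for vLLM to finish loading weights and capturing CUDA graphs.
for _ in range(60):  # poll for up to ~5 minutes
    try:
        if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise SystemExit("vLLM never reported healthy")

# 2. One throwaway generation warms the decode path before real traffic arrives.
requests.post(
    f"{BASE}/v1/completions",
    json={"model": "Qwen/Qwen2.5-14B-Instruct", "prompt": "warm-up", "max_tokens": 8},
    timeout=60,
).raise_for_status()

print("Node warm - safe to register with the load balancer")
```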
When to pick this over alternatives
Pick Qwen 2.5 14B over Qwen 2.5 7B when 7B’s MMLU 74.8 is not enough or when math (GSM8K 85.4 vs 90.2) is a hard requirement. Pick it over Mistral Nemo 12B when code matters (HumanEval 86.6 vs 50.0). Pick it over Qwen 2.5 32B when you need more than 8 concurrent users – 14B AWQ comfortably handles 32, 32B AWQ tops out at ~8.
Step up to 32B only when MMLU 83+ is a hard quality floor. Step laterally to Qwen 2.5 Coder 14B for pure coding workloads – same architecture, code-tuned weights with FIM support and HumanEval 83.5/MBPP 78.4. Step down to 7B if you are running a high-concurrency multi-tenant chat service where 14B’s latency budget is too tight.
Verdict
Qwen 2.5 14B-Instruct on the RTX 4090 24GB is the best single-card 14B deployment in the open-weight space, full stop. FP8 gives 110 t/s decode at 32k context with room for 16 concurrent users; AWQ pushes 135 t/s with capacity for 32. Quality matches or beats models five times larger on code and math. For a 12-engineer team running a self-hosted coding assistant, a 50-tenant SaaS RAG stack, or a research lab that needs strong reasoning without API costs, this is the default 4090 choice. The only reasons to look elsewhere are extreme concurrency (drop to 7B), absolute peak quality (move to 32B AWQ or rent an H100), or pure coding (Qwen Coder 14B).
Run Qwen 2.5 14B on a UK RTX 4090
FP8 weights, 32k context, 110+ tokens per second. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: 4090 vs 5090 32GB, 5060 Ti Qwen 14B benchmark, Qwen 2.5 32B VRAM, FP8 deployment, concurrent users, monthly cost, multi-tenant SaaS.