ExLlamaV2 (EXL2) delivers fast quantised LLM inference with tight VRAM control via flexible bits-per-weight. On the RTX 5060 Ti 16GB via our hosting, EXL2 is a legitimate alternative to vLLM for specific workloads.
What EXL2 Is
EXL2 is the quantisation format used by ExLlamaV2, the successor to the original ExLlama. It targets roughly 4-bit quantisation with:
- Configurable bits-per-weight (3.0, 4.0, 5.0, 6.0, 8.0)
- Per-layer precision tuning
- Very fast single-user decode
- Flash Attention support
bpw Variants
| bpw | Quality | VRAM (for 14B) |
|---|---|---|
| 3.0 | Noticeable loss | ~6 GB |
| 4.0 | Comparable to AWQ | ~8 GB |
| 5.0 | Near-FP16 | ~10 GB |
| 6.0 | Near-FP16 | ~11 GB |
| 8.0 | Essentially FP16 | ~14 GB |
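The table's figures follow from simple arithmetic: parameters × bpw / 8 bytes for the weights, plus a few GB for KV cache, activations and CUDA context. A rough sketch (the fixed 1.5 GB overhead is an illustrative assumption, not an EXL2 constant):

```python
def model_vram_gb(params_b: float, bpw: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for an EXL2-quantised model.

    params_b:    parameter count in billions
    bpw:         bits per weight (EXL2 supports fractional values too)
    overhead_gb: illustrative allowance for KV cache, activations, CUDA context
    """
    weights_gb = params_b * 1e9 * bpw / 8 / 1024**3
    return round(weights_gb + overhead_gb, 1)

# 14B model at the bpw settings from the table above
for bpw in (3.0, 4.0, 5.0, 6.0, 8.0):
    print(f"{bpw} bpw -> ~{model_vram_gb(14, bpw)} GB")
```

Longer contexts grow the KV cache well beyond this fixed allowance, so treat the output as a floor, not a guarantee.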
TabbyAPI
TabbyAPI is the production wrapper for EXL2 serving:
```shell
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
python start.py
```
Configure the model path, max context length, and cache mode in config.yml. TabbyAPI exposes OpenAI-compatible endpoints on port 5000 by default.
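A minimal config.yml sketch covering those three settings. Key names follow the config_sample.yml shipped with TabbyAPI; the model name is a placeholder, and your version's sample file is the authoritative reference:

```yaml
# Sketch only -- copy config_sample.yml and adjust; keys may differ by release.
network:
  host: 0.0.0.0
  port: 5000
model:
  model_dir: models
  model_name: your-model-exl2-5.0bpw   # placeholder, not a real repo
  max_seq_len: 32768
  cache_mode: Q8                       # quantised KV cache trims VRAM use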
vs vLLM + AWQ
- EXL2 is typically 10-20% faster than AWQ on the same model at single-user batch 1
- vLLM with AWQ wins at high concurrency via PagedAttention
- EXL2 has lower memory overhead per sequence
- vLLM has broader ecosystem (OpenAI-compatible features, better tooling)
Pick EXL2 for solo developer or small team setups with single-user workflows. Pick vLLM for production APIs with concurrent users.
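Whichever backend you pick, clients see the same OpenAI-compatible chat completions API, so switching is mostly a base-URL change (TabbyAPI defaults to port 5000, vLLM to 8000). A sketch of the request body; actually sending it requires a running server, and the model name is a placeholder:

```python
import json

# Body for POST http://localhost:5000/v1/chat/completions (TabbyAPI default).
# Point the same payload at a vLLM instance by changing only the URL.
payload = {
    "model": "your-model-exl2-5.0bpw",  # placeholder: whatever the server loaded
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
```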
EXL2 on Blackwell
Fast single-user inference via TabbyAPI. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: format comparison, AWQ guide.