
ExLlamaV2 Hosting on RTX 5060 Ti 16GB

EXL2 on Blackwell 16GB via TabbyAPI - one of the fastest single-user LLM runtimes outside vLLM, flexible bits-per-weight, and when to pick it over AWQ.

ExLlamaV2's EXL2 format delivers fast quantised LLM inference with tight VRAM control via flexible bits-per-weight. On our hosted RTX 5060 Ti 16GB servers, EXL2 is a legitimate alternative to vLLM for specific workloads.


What EXL2 Is

EXL2 is the quantisation format used by ExLlamaV2, the successor to the original ExLlama. It provides variable-precision quantisation with:

  • Configurable bits-per-weight (3.0, 4.0, 5.0, 6.0, 8.0)
  • Per-layer precision tuning
  • Very fast single-user decode
  • Flash Attention support

bpw Variants

bpw   Quality             VRAM (14B model)
3.0   Noticeable loss     ~6 GB
4.0   Comparable to AWQ   ~8 GB
5.0   Near-FP16           ~10 GB
6.0   Near-FP16           ~11 GB
8.0   Essentially FP16    ~14 GB
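The VRAM figures above are roughly weight size plus KV-cache and runtime overhead. A quick back-of-envelope sketch (the flat 1 GB overhead allowance is an assumption that happens to line up with the table, not a measured value):

```python
def exl2_vram_estimate_gb(params_billion: float, bpw: float,
                          overhead_gb: float = 1.0) -> float:
    """Approximate VRAM for an EXL2 model: parameters * bits-per-weight / 8
    bits per byte, plus a flat allowance for KV cache and CUDA context
    (assumed, not measured)."""
    weight_gb = params_billion * bpw / 8  # e.g. 14B at 4.0 bpw -> 7 GB of weights
    return weight_gb + overhead_gb

# 14B model at 4.0 bpw: 7 GB weights + 1 GB overhead = ~8 GB, matching the table
print(exl2_vram_estimate_gb(14, 4.0))
```

Real usage varies with context length and cache mode, so treat this as a sizing sanity check, not a guarantee.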

TabbyAPI

TabbyAPI is the production wrapper for EXL2 serving:

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
python start.py
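start.py reads its settings from config.yml. A minimal sketch of what that file might look like - the key names follow the structure of TabbyAPI's shipped sample config, but verify them against config_sample.yml in the repo, and the model folder name here is a placeholder:

```yaml
# Hypothetical minimal config.yml - check key names against config_sample.yml
network:
  host: 0.0.0.0
  port: 5000
model:
  model_dir: /models
  model_name: My-14B-exl2-4.0bpw   # placeholder EXL2 model folder
  max_seq_len: 16384
  cache_mode: Q4                   # quantised KV cache to save VRAM
```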

Configure the model path, maximum context length, and cache mode in config.yml. TabbyAPI exposes OpenAI-compatible endpoints on port 5000 by default.
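Because the endpoints are OpenAI-compatible, any standard client works. A hedged sketch that builds a chat-completions request for a local TabbyAPI instance - the model name is a placeholder, and the API key must match whatever key your TabbyAPI install is configured with:

```python
import json
from urllib import request

TABBY_URL = "http://localhost:5000/v1/chat/completions"  # default port

def build_chat_request(prompt: str, model: str = "My-14B-exl2-4.0bpw") -> dict:
    """Standard OpenAI-style chat-completions payload; model name is a placeholder."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def send(payload: dict, api_key: str) -> dict:
    """POST the payload to TabbyAPI and return the parsed JSON response."""
    req = request.Request(
        TABBY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Explain bits-per-weight in one sentence.")
# send(payload, api_key="...")  # requires a running TabbyAPI instance
```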

vs vLLM + AWQ

  • EXL2 is typically 10-20% faster than AWQ on the same model at batch size 1 (single user)
  • vLLM with AWQ wins at high concurrency via PagedAttention
  • EXL2 has lower memory overhead per sequence
  • vLLM has broader ecosystem (OpenAI-compatible features, better tooling)

Pick EXL2 for solo developer or small team setups with single-user workflows. Pick vLLM for production APIs with concurrent users.

EXL2 on Blackwell

Fast single-user inference via TabbyAPI. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: format comparison, AWQ guide.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
