Quick Verdict: ExLlamaV2 vs vLLM for Quantized Models
ExLlamaV2 generates tokens 20-40% faster than vLLM for single-user quantized inference on consumer GPUs. Running a Llama 3 70B EXL2 4-bit model on dual RTX 5090s, ExLlamaV2 achieves 45 tokens per second while vLLM with AWQ reaches 32 tokens per second. However, vLLM reclaims the lead at higher concurrency levels thanks to its batching architecture. This comparison matters for anyone deploying quantized models on dedicated GPU hosting who needs to choose between raw single-stream speed and scalable throughput.
Architecture and Feature Comparison
ExLlamaV2 is a CUDA-optimised inference library built specifically for quantized transformer models. Its custom kernels are hand-tuned for the EXL2 quantization format, which supports variable bits-per-weight across different layers based on sensitivity analysis. This per-layer calibration preserves quality where it matters most while aggressively compressing less critical layers.
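The idea behind sensitivity-driven bit allocation can be shown with a toy scheme. This is a hypothetical sketch, not EXL2's actual calibration algorithm: given a sensitivity score per layer and a target average bits-per-weight, the most sensitive layers are upgraded first while the average stays on budget.

```python
# Hypothetical illustration only -- not EXL2's actual calibration code.
# Layers with higher sensitivity scores get more bits per weight, while
# the model-wide average stays within a target budget.

def allocate_bits(sensitivities, target_avg_bpw,
                  choices=(2.5, 3.0, 4.0, 5.0, 6.0)):
    """Greedily assign a bits-per-weight value from `choices` to each
    layer, upgrading the most sensitive layers first."""
    n = len(sensitivities)
    bpw = [min(choices)] * n                # start every layer at the floor
    budget = target_avg_bpw * n - sum(bpw)  # extra bits we may spend
    for i in sorted(range(n), key=lambda j: -sensitivities[j]):
        affordable = [c for c in choices
                      if c > bpw[i] and c - bpw[i] <= budget]
        if affordable:
            best = max(affordable)
            budget -= best - bpw[i]
            bpw[i] = best
    return bpw

print(allocate_bits([0.9, 0.1, 0.5, 0.2], target_avg_bpw=4.0))
# -> [6.0, 2.5, 5.0, 2.5]: average of exactly 4.0 bits per weight
```

The real EXL2 quantizer measures layer error against calibration data rather than using a supplied sensitivity score, but the budget trade-off it makes is the same shape.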
vLLM supports AWQ and GPTQ quantization through Marlin kernels, delivering good quantized performance within its broader continuous batching framework. While vLLM prioritises throughput at scale on vLLM hosting, ExLlamaV2 prioritises generation speed for individual requests. Both serve quantized models effectively, but their optimization targets differ fundamentally.
| Feature | ExLlamaV2 | vLLM |
|---|---|---|
| Quantization Format | EXL2 (variable bit-rate) | AWQ, GPTQ, FP8 |
| Single-User Speed (70B, 4-bit) | ~45 tok/s (dual 5090) | ~32 tok/s (dual 5090) |
| Concurrent Request Handling | Basic paged cache | Continuous batching |
| Quality at 4-bit | Higher (variable bit allocation) | Good (uniform quantization) |
| VRAM Efficiency | Excellent with EXL2 paging | Good with PagedAttention |
| API Server | TabbyAPI or custom | Built-in OpenAI-compatible |
| Multi-GPU | Layer splitting | Tensor parallelism |
| Kernel Optimization | Hand-tuned for EXL2 | Marlin kernels for AWQ/GPTQ |
Performance Benchmark Results
On a single RTX 5090 32 GB running Llama 3 8B at 4-bit quantization, ExLlamaV2 delivers 110 tokens per second compared to vLLM at 85 tokens per second. The EXL2 kernels are specifically optimised for the dequantization patterns that occur during autoregressive generation, giving ExLlamaV2 a consistent edge in single-stream scenarios.
Scaling to 16 concurrent users reverses the ranking. vLLM reaches 1,400 tokens per second total throughput while ExLlamaV2 with TabbyAPI manages around 900 tokens per second. vLLM’s continuous batching efficiently combines multiple requests into single GPU operations, an optimization that ExLlamaV2’s architecture does not match. For production workloads on multi-GPU clusters, this throughput advantage matters significantly. See our GPU selection guide for hardware pairing.
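A back-of-envelope calculation with the figures above shows what each user actually sees once aggregate throughput is shared. This assumes throughput divides evenly across users, a simplification:

```python
# Back-of-envelope math using the 16-user benchmark figures quoted above.

def per_user_rate(total_tok_s, concurrent_users):
    """Average tokens/s each user sees when throughput is shared evenly."""
    return total_tok_s / concurrent_users

vllm_per_user = per_user_rate(1400, 16)   # 87.5 tok/s per user
exl2_per_user = per_user_rate(900, 16)    # 56.25 tok/s per user
```

Even shared 16 ways, vLLM's batched throughput leaves each user faster than ExLlamaV2's, the opposite of the single-stream result.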
Cost Analysis
ExLlamaV2’s variable bit-rate quantization can run a 70B model at roughly 4 bits per weight, with quality approaching uniform 5-bit formats, in about 35 GB of VRAM, fitting on a single 48 GB GPU. The equivalent vLLM AWQ deployment uses approximately 40 GB. This roughly 12% VRAM saving can mean the difference between a single-GPU and a dual-GPU setup, a meaningful cost difference on dedicated GPU servers.
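These VRAM figures follow from simple arithmetic: the weights-only footprint is parameter count times bits per weight, divided by eight to convert bits to bytes. KV cache and activations add more on top, so treat the result as a lower bound:

```python
# Weights-only footprint: parameters (in billions) x bits-per-weight / 8
# gives gigabytes. KV cache and activation memory come on top of this,
# so these are lower bounds, not total VRAM requirements.

def weights_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8

print(weights_gb(70, 4.0))   # 35.0 -- matches the EXL2 figure above
```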
For open-source LLM hosting serving many users, vLLM processes more total requests per GPU hour at scale, making it cheaper per token despite lower single-request speed. ExLlamaV2 wins on cost when serving a small number of users who need the fastest possible response times on private AI hosting.
When to Use Each
Choose ExLlamaV2 when: You prioritise single-user generation speed, need the highest quality from quantized models through variable bit-rate allocation, or want to run the largest possible models on limited VRAM. It excels for personal assistants, single-user coding tools, and interactive applications where response speed directly impacts user experience.
Choose vLLM when: You serve multiple concurrent users and need maximum aggregate throughput. Its continuous batching and OpenAI-compatible API make it the better production serving engine for quantized models. Deploy on GigaGPU vLLM hosting for scalable inference.
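As a sketch of what serving through vLLM's OpenAI-compatible API looks like, the snippet below builds a chat-completion request body. The model name is a placeholder for whatever you deploy, and the server is assumed to have been started with `vllm serve <model>`:

```python
import json

# Sketch of a request body for vLLM's built-in OpenAI-compatible server.
# "llama-3-70b-awq" is a placeholder model name, not a real deployment.

def chat_payload(model, prompt, max_tokens=256, temperature=0.7):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = json.dumps(chat_payload("llama-3-70b-awq", "Summarise this ticket."))
```

Any OpenAI-compatible client can point at the server's `/v1` base URL; TabbyAPI exposes the same style of API in front of ExLlamaV2, so switching engines does not require rewriting clients.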
Recommendation
For single-user or low-concurrency applications, ExLlamaV2 with EXL2 quantization offers the best combination of speed and quality per VRAM dollar. For multi-user production APIs, vLLM with AWQ delivers higher total throughput. Consider running ExLlamaV2 behind TabbyAPI for small-scale deployments and vLLM for anything beyond 8 concurrent users. Benchmark both on a GigaGPU dedicated server with your target model. Our self-hosted LLM guide covers quantization deployment, and the LLM hosting hub offers further comparisons including our vLLM vs Ollama analysis for PyTorch-based setups.