
ExLlamaV2 vs vLLM: Quantized Model Speed Comparison

Comparing ExLlamaV2 and vLLM for quantized LLM inference speed. EXL2 format performance versus AWQ/GPTQ on dedicated GPU servers with detailed benchmarks.

Quick Verdict: ExLlamaV2 vs vLLM for Quantized Models

ExLlamaV2 generates tokens 20-40% faster than vLLM for single-user quantized inference on consumer GPUs. On dual RTX 5090s running a Llama 3 70B EXL2 4-bit model, ExLlamaV2 achieves 45 tokens per second while vLLM with AWQ reaches 32. However, vLLM reclaims the lead at higher concurrency levels thanks to its continuous batching architecture. This comparison matters for anyone deploying quantized models on dedicated GPU hosting who needs to choose between raw single-stream speed and scalable throughput.

Architecture and Feature Comparison

ExLlamaV2 is a CUDA-optimised inference library built specifically for quantized transformer models. Its custom kernels are hand-tuned for the EXL2 quantization format, which supports variable bits-per-weight across different layers based on sensitivity analysis. This per-layer calibration preserves quality where it matters most while aggressively compressing less critical layers.
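To make the variable bits-per-weight idea concrete, here is a minimal sketch of how mixed per-layer bit widths average out to an overall bits-per-weight (bpw) budget. This is illustrative arithmetic only: EXL2's actual allocator assigns widths from calibration-based sensitivity measurement, and the layer sizes and bit choices below are hypothetical.

```python
# Illustrative only: EXL2's real allocator uses calibration data to decide
# per-layer widths; this sketch just shows how mixed bit widths average
# out to an overall bits-per-weight (bpw) figure.

def average_bpw(layers):
    """Weighted average bits-per-weight across layers.

    layers: list of (param_count, bits) tuples.
    """
    total_bits = sum(params * bits for params, bits in layers)
    total_params = sum(params for params, _ in layers)
    return total_bits / total_params

# Hypothetical three-block model: keep sensitive layers at 6-bit,
# compress less critical layers to 3-bit.
layers = [
    (100_000_000, 6.0),  # attention projections (sensitive)
    (300_000_000, 3.0),  # MLP weights (less sensitive)
    (100_000_000, 6.0),  # output projection (sensitive)
]
print(average_bpw(layers))  # 4.2 bpw overall
```

The point is that a "4-bit" EXL2 model is a budget, not a uniform width: quality-critical layers keep more precision while the average stays on target.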

vLLM supports AWQ and GPTQ quantization through Marlin kernels, delivering good quantized performance within its broader continuous batching framework. While vLLM prioritises throughput at scale on vLLM hosting, ExLlamaV2 prioritises generation speed for individual requests. Both serve quantized models effectively, but their optimization targets differ fundamentally.

| Feature | ExLlamaV2 | vLLM |
| --- | --- | --- |
| Quantization Format | EXL2 (variable bit-rate) | AWQ, GPTQ, FP8 |
| Single-User Speed (70B, 4-bit) | ~45 tok/s (dual 5090) | ~32 tok/s (dual 5090) |
| Concurrent Request Handling | Basic paged cache | Continuous batching |
| Quality at 4-bit | Higher (variable bit allocation) | Good (uniform quantization) |
| VRAM Efficiency | Excellent with EXL2 paging | Good with PagedAttention |
| API Server | TabbyAPI or custom | Built-in OpenAI-compatible |
| Multi-GPU | Layer splitting | Tensor parallelism |
| Kernel Optimization | Hand-tuned for EXL2 | Marlin kernels for AWQ/GPTQ |

Performance Benchmark Results

On a single RTX 5090 32 GB running Llama 3 8B at 4-bit quantization, ExLlamaV2 delivers 110 tokens per second compared to vLLM at 85 tokens per second. The EXL2 kernels are specifically optimised for the dequantization patterns that occur during autoregressive generation, giving ExLlamaV2 a consistent edge in single-stream scenarios.

Scaling to 16 concurrent users reverses the ranking. vLLM reaches 1,400 tokens per second total throughput while ExLlamaV2 with TabbyAPI manages around 900 tokens per second. vLLM’s continuous batching efficiently combines multiple requests into single GPU operations, an optimization that ExLlamaV2’s architecture does not match. For production workloads on multi-GPU clusters, this throughput advantage matters significantly. See our GPU selection guide for hardware pairing.
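When comparing engines yourself, measure decode throughput the same way for both. A minimal engine-agnostic harness might look like the sketch below; `fake_generate` is a stand-in stub, not a real API, and in practice you would wrap ExLlamaV2's or vLLM's streaming generator behind the same callable interface.

```python
import time

def measure_tokens_per_second(generate, prompt, n_runs=3):
    """Time a streaming generate() callable and return average decode tok/s.

    generate(prompt) must yield tokens one at a time; the same harness
    works whether it wraps an ExLlamaV2 generator or a vLLM stream.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = sum(1 for _ in generate(prompt))
        rates.append(n_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Stub generator standing in for a real engine, for demonstration only.
def fake_generate(prompt):
    for _ in range(256):
        yield "tok"

rate = measure_tokens_per_second(fake_generate, "Hello")
print(f"{rate:.0f} tok/s")
```

Averaging over several runs matters because the first generation often includes warmup and graph-capture overhead that would skew a single measurement.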

Cost Analysis

ExLlamaV2’s variable bit-rate quantization can run a 70B model at near-5-bit average quality in 35 GB of VRAM, fitting on a single 48 GB GPU. The equivalent vLLM AWQ deployment uses approximately 40 GB. This 12% VRAM saving occasionally means the difference between a single-GPU and dual-GPU setup, representing a meaningful cost difference on dedicated GPU servers.
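The back-of-envelope arithmetic behind those figures is simply parameters times bits-per-weight divided by 8 bits per byte. The sketch below reproduces the 35 GB figure at a 4.0 bpw average; the 4.57 bpw effective rate shown for AWQ is an illustrative assumption (4-bit weights plus scale/zero-point overhead), not a measured number, and real deployments add KV cache on top of weight storage.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (decimal): params * bpw / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B model at a 4.0 bpw EXL2 average:
print(weight_footprint_gb(70e9, 4.0))   # 35.0 GB of weights

# The same model at an assumed 4.57 bpw effective AWQ rate
# (4-bit weights plus scales/zeros -- illustrative, not measured):
print(weight_footprint_gb(70e9, 4.57))  # ~40 GB of weights
```

Note that crossing a GPU's VRAM boundary by even a few GB forces a second card, which is why a modest bpw difference can change the hardware bill.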

For open-source LLM hosting serving many users, vLLM processes more total requests per GPU hour at scale, making it cheaper per token despite lower single-request speed. ExLlamaV2 wins on cost when serving a small number of users who need the fastest possible response times on private AI hosting.
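To see why throughput dominates cost at scale, convert each engine's sustained throughput into a cost per million generated tokens. The $2.00/hour server price below is a hypothetical placeholder; the throughput figures are the benchmark numbers from this article.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """USD per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical $2.00/hr dual-GPU server, using this article's figures:
single_user = cost_per_million_tokens(2.00, 45)    # ExLlamaV2, one stream
batched = cost_per_million_tokens(2.00, 1400)      # vLLM, 16-user aggregate
print(f"${single_user:.2f} vs ${batched:.3f} per 1M tokens")
```

At these numbers the batched deployment is roughly 30x cheaper per token, which is the whole economic case for continuous batching in multi-user serving.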

When to Use Each

Choose ExLlamaV2 when: You prioritise single-user generation speed, need the highest quality from quantized models through variable bit-rate allocation, or want to run the largest possible models on limited VRAM. It excels for personal assistants, single-user coding tools, and interactive applications where response speed directly impacts user experience.

Choose vLLM when: You serve multiple concurrent users and need maximum aggregate throughput. Its continuous batching and OpenAI-compatible API make it the better production serving engine for quantized models. Deploy on GigaGPU vLLM hosting for scalable inference.

Recommendation

For single-user or low-concurrency applications, ExLlamaV2 with EXL2 quantization offers the best combination of speed and quality per VRAM dollar. For multi-user production APIs, vLLM with AWQ delivers higher total throughput. Consider running ExLlamaV2 behind TabbyAPI for small-scale deployments and vLLM for anything beyond 8 concurrent users. Benchmark both on a GigaGPU dedicated server with your target model. Our self-hosted LLM guide covers quantization deployment, and the LLM hosting hub offers further comparisons including our vLLM vs Ollama analysis for PyTorch-based setups.
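The recommendation above reduces to a simple rule of thumb, sketched below. The 8-user threshold comes from this article's guidance and is workload-dependent; treat it as a starting point and benchmark with your own model and prompt mix before committing.

```python
def pick_engine(expected_concurrent_users, single_stream_limit=8):
    """Rule of thumb from this comparison: ExLlamaV2 up to ~8 concurrent
    users, vLLM beyond that. The threshold is workload-dependent."""
    if expected_concurrent_users <= single_stream_limit:
        return "exllamav2"  # fastest per-request; serve via TabbyAPI
    return "vllm"           # continuous batching pays off at scale

print(pick_engine(2))   # exllamav2
print(pick_engine(16))  # vllm
```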


