
vLLM Quantized Model Loading Issues: GPTQ/AWQ Fix

Fix vLLM failures when loading GPTQ and AWQ quantized models. Covers missing quantization libraries, config mismatches, unsupported formats, and correct launch parameters.

Quantized Model Loading Errors

You point vLLM at a quantized model and get:

ValueError: Quantization method gptq is not supported. Supported methods: ['awq', 'gptq', 'squeezellm', 'marlin']
ImportError: auto_gptq is required for GPTQ quantization. Install with: pip install auto-gptq
RuntimeError: Error in model loading: shape mismatch for layers.0.self_attn.q_proj.qweight

Quantized models reduce VRAM usage by storing weights in 4-bit or 8-bit formats instead of 16-bit. They are essential for running large models on consumer-grade GPUs. But loading them in vLLM requires the right combination of library versions, launch flags, and model format.
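The savings above can be sketched with back-of-envelope arithmetic. This is a minimal illustration of weight memory only; real VRAM usage is higher because KV cache, activations, and CUDA overhead come on top.

```python
# Rough weight-memory estimate: parameters * bytes per parameter.
# Illustrative only -- actual VRAM usage adds KV cache and runtime overhead.
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    return n_params_billion * 1e9 * (bits / 8) / 1024**3

for bits in (16, 8, 4):
    print(f"13B model weights at {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
```

For a 13B model this works out to roughly 24 GB of weights at FP16 versus about 6 GB at 4-bit, which is why quantization is the difference between fitting and not fitting on a consumer GPU.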

Step 1: Install Required Libraries

vLLM does not include GPTQ or AWQ backends by default. Install them:

# For GPTQ models
pip install auto-gptq

# For AWQ models
pip install autoawq

# Verify installation
python -c "import auto_gptq; print(f'AutoGPTQ: {auto_gptq.__version__}')"
python -c "import awq; print('AWQ available')"

On a dedicated GPU server, ensure these are installed in the same Python environment as vLLM.
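A quick way to confirm everything resolves in the same interpreter is to probe for each module by name. This sketch checks the import names used by the pip packages above; anything not installed in the current environment prints MISSING instead of raising.

```python
# Check that vLLM and both quantization backends are importable from
# the same Python environment; prints MISSING for absent packages.
import importlib.util

for name in ("vllm", "auto_gptq", "awq"):
    status = "OK" if importlib.util.find_spec(name) else "MISSING"
    print(f"{name}: {status}")
```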

Step 2: Specify the Quantization Method

vLLM needs to know which quantization format the model uses:

# For GPTQ models
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq

# For AWQ models
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-AWQ \
  --quantization awq

Without the --quantization flag, older vLLM builds try to load the model as FP16 and fail because the packed quantized weight shapes do not match FP16 expectations. Newer releases can often auto-detect the format from the model's config, but passing the flag explicitly avoids ambiguity.
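If you launch servers from scripts, a small helper can make the flag impossible to forget. This is a hypothetical convenience wrapper (the `vllm_server_cmd` function and `SUPPORTED` set are not part of vLLM); the command it assembles mirrors the CLI invocations shown above.

```python
# Hypothetical helper: assemble the vLLM server command for a quantized
# model so --quantization is always passed and validated up front.
import shlex

SUPPORTED = {"awq", "gptq", "marlin"}

def vllm_server_cmd(model: str, quantization: str) -> str:
    if quantization not in SUPPORTED:
        raise ValueError(f"unsupported quantization method: {quantization}")
    args = ["python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", model, "--quantization", quantization]
    return " ".join(shlex.quote(a) for a in args)

print(vllm_server_cmd("TheBloke/Llama-2-13B-AWQ", "awq"))
```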

Step 3: Fix Format and Config Mismatches

Not all quantized models are compatible with vLLM. Check the model’s quantize_config.json:

python -c "
import json
with open('/models/my-gptq-model/quantize_config.json') as f:
    config = json.load(f)
    print(json.dumps(config, indent=2))
"

vLLM requires:

  • GPTQ: bits: 4, group_size: 128 (or -1), desc_act: false preferred for Marlin kernel compatibility.
  • AWQ: bits: 4, group_size: 128. AWQ with group_size: 64 is supported but less common.

Models quantized with desc_act: true (also called act-order) may load but with reduced performance because the fast Marlin kernel does not support act-order. vLLM falls back to a slower dequantization path.
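The checks above can be automated. The sketch below is an illustrative validator (not part of vLLM) that takes the parsed quantize_config.json dict and flags the values this guide calls out as problematic for GPTQ models.

```python
# Sketch: validate a parsed quantize_config.json against the
# vLLM-friendly GPTQ settings listed above. Returns a list of warnings.
def check_gptq_config(cfg: dict) -> list:
    warnings = []
    if cfg.get("bits") != 4:
        warnings.append(f"bits={cfg.get('bits')}: vLLM expects 4-bit")
    if cfg.get("group_size") not in (128, -1):
        warnings.append(f"group_size={cfg.get('group_size')}: expected 128 or -1")
    if cfg.get("desc_act"):
        warnings.append("desc_act=true: act-order model, falls back to a slower non-Marlin path")
    return warnings

print(check_gptq_config({"bits": 4, "group_size": 128, "desc_act": True}))
```

An empty list means the config matches the preferred settings; each string describes one mismatch.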

Step 4: Enable the Marlin Kernel for GPTQ

Marlin is a faster GPTQ kernel that dramatically improves inference throughput:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-GPTQ \
  --quantization marlin

Requirements for Marlin: bits=4, group_size=128, desc_act=false, and sym=true. If the model does not meet these, vLLM falls back to standard GPTQ. On GPU servers with Ampere or newer GPUs, Marlin provides a significant speed advantage.
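Marlin eligibility combines the four config conditions with the GPU generation. The helper below is a sketch (not a vLLM API): it takes the parsed config plus the GPU's compute-capability major version, which you can obtain from `torch.cuda.get_device_capability()` (Ampere is 8.x).

```python
# Sketch: check whether a GPTQ model qualifies for the Marlin kernel.
# sm_major is the GPU compute-capability major version (Ampere = 8).
def marlin_eligible(cfg: dict, sm_major: int) -> bool:
    return (cfg.get("bits") == 4
            and cfg.get("group_size") == 128
            and not cfg.get("desc_act", False)
            and cfg.get("sym", False)
            and sm_major >= 8)

good = {"bits": 4, "group_size": 128, "desc_act": False, "sym": True}
print(marlin_eligible(good, sm_major=8))  # True on an Ampere GPU
```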

Choosing Between AWQ and GPTQ

For vLLM serving, AWQ generally offers better out-of-the-box compatibility and similar quality to GPTQ. If you have a choice of quantized model format:

  • AWQ: More reliable loading in vLLM, good inference speed, widely available on Hugging Face.
  • GPTQ with Marlin: Fastest inference if the model meets Marlin requirements.
  • GPTQ without Marlin: Slower than AWQ in vLLM; generally not recommended unless Marlin is available.

Verification and Performance Check

# Verify the model loads and serves
curl http://localhost:8000/v1/models

# Test inference quality
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"TheBloke/Llama-2-13B-AWQ","prompt":"The capital of France is","max_tokens":20}'

# Check VRAM savings
nvidia-smi

A 4-bit quantized 13B model should use roughly 7 GB of VRAM instead of 26 GB, leaving substantial room for KV cache. Tune memory and throughput using our vLLM optimization guide. For production deployment, follow our vLLM production guide. For custom PyTorch inference with quantized models outside vLLM, see the tutorials section.

Run Larger Models on Less VRAM

Quantized models let you serve bigger LLMs on GigaGPU’s dedicated GPU servers. Find the right hardware for your needs.

Browse GPU Servers


