Running large language models on a dedicated GPU server often comes down to one critical decision: which quantization format to use. GPTQ, AWQ, and GGUF each offer different trade-offs between inference speed, memory footprint, and output quality. Choosing the right format can mean the difference between fitting a 70B model on a single GPU or needing an expensive multi-GPU cluster. This guide breaks down all three formats with practical benchmarks and installation commands so you can deploy quantized models on your LLM hosting setup today.
What Is LLM Quantization?
Quantization reduces model weight precision from 16-bit floating point (FP16) down to 8-bit, 4-bit, or even 2-bit integers. This dramatically cuts VRAM usage and speeds up inference. A 70B-parameter model at FP16 requires roughly 140 GB of VRAM. At 4-bit quantization, that drops to approximately 35 GB — easily fitting on a single RTX 6000 Pro 96 GB GPU, with headroom left over for the KV cache. If you are evaluating hardware, our best GPU for LLM inference guide covers the ideal cards for quantized workloads.
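The arithmetic above generalizes to any parameter count and bit-width. A quick sketch for weight-only memory (real deployments add KV cache and runtime overhead on top of this lower bound):

```python
# Rough VRAM estimate for model weights alone at a given bit-width.
# Treat the result as a lower bound: KV cache, activations, and
# runtime overhead come on top.
def weight_vram_gb(params_billion: float, bits: int) -> float:
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{weight_vram_gb(70, bits):.0f} GB")
```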
The three dominant quantization formats in 2025 are GPTQ, AWQ, and GGUF. Each uses a different algorithm and targets different runtime environments.
GPTQ: GPU-Optimized Post-Training Quantization
GPTQ (Generative Pre-Trained Transformer Quantization) was one of the first widely adopted methods. It performs one-shot weight quantization using a calibration dataset, producing models that run on CUDA GPUs via libraries like AutoGPTQ or ExLlamaV2.
Install AutoGPTQ and load a GPTQ model:
pip install auto-gptq optimum transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = 'TheBloke/Llama-2-13B-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map='auto',
trust_remote_code=False
)
print('GPTQ model loaded successfully')
"
GPTQ models work natively with vLLM, making them an excellent choice for production inference servers. See our vLLM production setup guide for deployment instructions.
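Because vLLM loads GPTQ checkpoints directly, serving one is a single command. A minimal sketch, assuming the same 13B checkpoint from the install step (flags, port, and context length are examples to adapt):

```shell
# Serve a GPTQ model with vLLM's OpenAI-compatible server
# (model name and settings are examples -- substitute your own)
python3 -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-13B-GPTQ \
    --quantization gptq \
    --dtype auto \
    --max-model-len 4096 \
    --port 8000
```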
AWQ: Activation-Aware Weight Quantization
AWQ improves on GPTQ by identifying the salient weight channels that matter most for model quality. Instead of treating all weights equally, AWQ protects roughly the most important 1% of weights, resulting in lower perplexity at the same bit-width. AWQ models are also typically faster than GPTQ at inference time.
Install and run an AWQ model with vLLM:
# Install vLLM with AWQ support
pip install vllm
# Serve an AWQ model
python3 -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--dtype auto \
--max-model-len 4096 \
--tensor-parallel-size 2 \
--port 8000
AWQ is now the recommended quantization format for most API hosting deployments due to its superior speed-to-quality ratio. You can measure exact throughput using the tokens per second benchmark tool.
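Once the server above is running, any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the server from the previous step is listening on localhost:8000 (the model name and prompt are placeholders):

```python
# Minimal client for a vLLM OpenAI-compatible endpoint.
# Host, port, and model name are assumptions -- match your server.
import json
import urllib.request

def build_payload(prompt: str, model: str = "TheBloke/Llama-2-70B-AWQ") -> dict:
    # Standard OpenAI chat-completions request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
# print(chat("Explain AWQ quantization in one sentence."))
```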
GGUF: CPU/GPU Hybrid Quantization
GGUF (the successor to the older GGML format) is the model file format used by llama.cpp and Ollama. Unlike GPTQ and AWQ, which require a GPU, GGUF supports CPU inference and hybrid CPU/GPU layer splitting, making it versatile across varied hardware setups.
Install llama.cpp and serve a GGUF model:
# Build llama.cpp with CUDA support
sudo apt update && sudo apt install -y build-essential cmake libcurl4-openssl-dev
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Serve a GGUF model with GPU offloading
./build/bin/llama-server \
-m models/llama-2-13b.Q4_K_M.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8080
For a complete Ollama walkthrough, see our guide on setting up Ollama on a dedicated GPU server. To understand how Ollama compares with vLLM for GGUF workloads, read our vLLM vs Ollama comparison.
Head-to-Head Comparison Table
| Feature | GPTQ | AWQ | GGUF |
|---|---|---|---|
| Typical Bit-Width | 4-bit, 8-bit | 4-bit | 2-bit to 8-bit |
| Runtime | AutoGPTQ, vLLM, ExLlamaV2 | vLLM, AutoAWQ | llama.cpp, Ollama |
| GPU Required | Yes | Yes | Optional (hybrid) |
| Inference Speed (GPU) | Fast | Faster | Moderate |
| Quality at 4-bit | Good | Better | Good (Q4_K_M) |
| Quantization Time | Slow (hours) | Fast (minutes) | Fast (minutes) |
| Multi-GPU Support | Yes (tensor parallel) | Yes (tensor parallel) | Limited |
Choosing the Right Format for Your Workload
Choose AWQ if you want the best inference speed and quality on a dedicated GPU server. AWQ models load faster, run faster, and maintain lower perplexity than GPTQ at the same bit-width. This is the format to choose for high-throughput production AI inference.
Choose GPTQ if you need compatibility with older toolchains or specific ExLlamaV2 features. GPTQ has the widest selection of pre-quantized models on Hugging Face.
Choose GGUF if you need CPU/GPU hybrid deployment or are using Ollama for rapid prototyping. GGUF’s flexible layer offloading lets you run larger models than your VRAM would otherwise allow.
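For that hybrid GGUF setup, a rough back-of-envelope helper can suggest a starting -ngl value. The per-layer sizes here are assumptions for illustration; inspect your actual GGUF file's footprint before committing:

```python
# Estimate how many transformer layers fit in a VRAM budget, as a
# starting point for llama.cpp's -ngl flag. Sizes are rough
# assumptions; real layers are not perfectly uniform.
def layers_that_fit(vram_gb: float, model_size_gb: float, n_layers: int,
                    overhead_gb: float = 2.0) -> int:
    per_layer_gb = model_size_gb / n_layers          # assume uniform layers
    usable = max(vram_gb - overhead_gb, 0.0)         # reserve for KV cache etc.
    return min(n_layers, int(usable / per_layer_gb))

# Example: 24 GB card, ~40 GB 4-bit 70B model with 80 layers
print(layers_that_fit(24, 40, 80))  # -> 44
```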
To understand the real-world cost implications, use the cost per million tokens calculator to compare your options.
Practical Setup Commands
Convert a Hugging Face model to AWQ format on your GPU server:
# Install AutoAWQ
pip install autoawq
# Quantize a model to AWQ 4-bit
python3 -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Llama-2-13b-hf'
quant_path = 'llama-2-13b-awq'
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {
'zero_point': True,
'q_group_size': 128,
'w_bit': 4,
'version': 'GEMM'
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print('AWQ quantization complete')
"
Convert a model to GGUF format:
# From the llama.cpp directory
python3 convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf
# Quantize to 4-bit
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
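Before deploying the quantized file, a quick generation with llama-cli works as a sanity check (paths, prompt, and token count are examples):

```shell
# Sanity-check the quantized GGUF with a short generation
./build/bin/llama-cli \
    -m model-Q4_K_M.gguf \
    -ngl 99 \
    -p "The capital of France is" \
    -n 32
```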
For a complete walkthrough on self-hosting any of these formats, follow our self-host LLM guide. If you are looking at running larger models across multiple cards, review multi-GPU server setup for large model inference.
Run Quantized LLMs on Dedicated GPU Servers
Deploy GPTQ, AWQ, or GGUF models on high-performance NVIDIA GPUs with full root access and NVMe storage. GigaGPU servers come pre-configured for AI workloads.
Browse GPU Servers