Why Format Choice Matters
LLaMA 3 8B is one of the most deployed open-weight models, and quantisation format directly impacts both inference speed and output quality on your dedicated GPU server. GPTQ, AWQ, and GGUF each take a different approach to compressing model weights, and their performance varies significantly by GPU architecture. This benchmark compares all three formats at 4-bit precision across six GPUs.
For an overview of how these formats work, see our GPTQ vs AWQ vs GGUF quantisation guide. For baseline VRAM requirements at FP16, check the LLaMA 3 VRAM requirements page.
Speed Benchmarks by GPU and Format
All benchmarks use 4-bit quantisation (INT4) with 512 input tokens and 256 output tokens. GPTQ uses ExLlama v2 kernels via vLLM, AWQ uses the AutoAWQ backend, and GGUF uses llama.cpp with full GPU offload. Measured on GigaGPU servers with identical configurations.
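The tok/s figures are plain decode-rate measurements: output tokens divided by generation wall time. A minimal illustration (the helper and the timing number are ours, not from the actual benchmark harness):

```python
def decode_throughput(n_output_tokens: int, wall_seconds: float) -> float:
    """Decode tokens per second over a timed generation run."""
    return n_output_tokens / wall_seconds

# Illustrative: 256 output tokens in ~4.41 s of wall time corresponds to
# the RTX 3090 GPTQ figure in the table (~58 tok/s).
print(round(decode_throughput(256, 4.41)))  # → 58
```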
| GPU | VRAM | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) | GGUF Q4_K_M (tok/s) |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | N/A | 18 | 17 | 15 |
| RTX 4060 | 8 GB | N/A | 28 | 26 | 22 |
| RTX 4060 Ti | 16 GB | 32 | 45 | 43 | 35 |
| RTX 3090 | 24 GB | 43 | 58 | 55 | 46 |
| RTX 5080 | 16 GB | 68 | 88 | 85 | 70 |
| RTX 5090 | 32 GB | 95 | 125 | 120 | 98 |
GPTQ consistently leads AWQ by roughly 3-8%, thanks to highly optimised ExLlama v2 kernels. GGUF (llama.cpp) trails GPTQ by around 17-22% on pure GPU workloads; its strength lies in CPU/GPU hybrid inference, not full GPU offload. For similar benchmarks on other models, see our Mistral 7B speed comparison.
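Those relative gaps fall straight out of the table; a quick sketch with the figures copied over:

```python
# (GPTQ, AWQ, GGUF) 4-bit tok/s per GPU, copied from the table above.
results = {
    "RTX 3050":    (18, 17, 15),
    "RTX 4060":    (28, 26, 22),
    "RTX 4060 Ti": (45, 43, 35),
    "RTX 3090":    (58, 55, 46),
    "RTX 5080":    (88, 85, 70),
    "RTX 5090":    (125, 120, 98),
}

for gpu, (gptq, awq, gguf) in results.items():
    lead = (gptq - awq) / awq * 100        # GPTQ's lead over AWQ
    deficit = (gptq - gguf) / gptq * 100   # GGUF's deficit vs GPTQ
    print(f"{gpu}: GPTQ +{lead:.1f}% vs AWQ, GGUF -{deficit:.1f}% vs GPTQ")
```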
VRAM Usage Comparison
All three formats target 4-bit precision, but actual memory consumption differs slightly due to format overhead and different quantisation group sizes.
| Format | Model Size on Disk | VRAM (idle) | VRAM (8K context) |
|---|---|---|---|
| FP16 | 16.1 GB | 16.5 GB | 17.5 GB |
| GPTQ 4-bit (g128) | 4.7 GB | 5.2 GB | 6.2 GB |
| AWQ 4-bit (g128) | 4.6 GB | 5.1 GB | 6.1 GB |
| GGUF Q4_K_M | 4.9 GB | 5.4 GB | 6.4 GB |
AWQ has a marginally smaller footprint, while GGUF Q4_K_M is slightly larger due to mixed quantisation groups that preserve quality in sensitive layers. The differences are small enough that speed and ecosystem should drive your choice. For context on how VRAM scales with longer sequences, see our LLaMA 3 8B context length VRAM guide.
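The ~1 GB step from idle to 8K context in the table is the FP16 KV cache, and it can be estimated directly from LLaMA 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128). A back-of-the-envelope sketch, assuming an FP16 KV cache:

```python
def kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                   seq_len=8192, bytes_per_elem=2):
    """FP16 KV-cache size: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = kv_cache_bytes() / 2**30
print(f"{gib:.2f} GiB")  # ~1.0 GiB, matching the idle-to-8K step in the table
```

GQA is what keeps this small: with 32 full KV heads instead of 8, the same 8K context would cost ~4 GiB.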
Quality Impact by Format
We evaluated perplexity on a standardised test set (WikiText-2) to measure quality degradation from each quantisation format.
| Format | Perplexity | Delta vs FP16 |
|---|---|---|
| FP16 (baseline) | 6.14 | – |
| GPTQ 4-bit (g128) | 6.31 | +0.17 |
| AWQ 4-bit (g128) | 6.27 | +0.13 |
| GGUF Q4_K_M | 6.24 | +0.10 |
GGUF Q4_K_M preserves quality best because it uses higher precision for sensitive layers. AWQ edges out GPTQ on quality, while GPTQ wins on speed. The differences are small — under 3% perplexity increase from FP16 for all formats.
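The "under 3%" claim follows directly from the table; checking it explicitly:

```python
baseline = 6.14  # FP16 perplexity from the table above
quantised = {"GPTQ 4-bit": 6.31, "AWQ 4-bit": 6.27, "GGUF Q4_K_M": 6.24}

for fmt, ppl in quantised.items():
    increase = (ppl - baseline) / baseline * 100
    print(f"{fmt}: +{increase:.2f}% perplexity vs FP16")
# Worst case is GPTQ at roughly +2.8%, so all three stay under 3%.
```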
Which Format to Choose
- GPTQ 4-bit: best choice for pure GPU inference via vLLM or text-generation-inference. Fastest kernels, excellent ecosystem support, great for production APIs.
- AWQ 4-bit: slightly better quality than GPTQ with near-identical speed. Preferred if you run vLLM (native AWQ support) and quality is a priority.
- GGUF Q4_K_M: best for CPU/GPU hybrid setups, edge deployments, or when using llama.cpp. Not recommended for pure GPU serving due to lower throughput.
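The recommendations above reduce to a small decision rule. A toy sketch (the function and its labels are ours, not from any library):

```python
def pick_format(backend: str, fits_in_vram: bool, quality_first: bool = False) -> str:
    """Toy decision rule mirroring the recommendations above."""
    if backend == "llama.cpp" or not fits_in_vram:
        return "GGUF Q4_K_M"   # hybrid CPU/GPU, edge, or llama.cpp stack
    if quality_first:
        return "AWQ 4-bit"     # slightly lower perplexity, near-identical speed
    return "GPTQ 4-bit"        # fastest kernels for pure GPU serving

print(pick_format("vllm", fits_in_vram=True))           # → GPTQ 4-bit
print(pick_format("vllm", True, quality_first=True))    # → AWQ 4-bit
print(pick_format("llama.cpp", fits_in_vram=False))     # → GGUF Q4_K_M
```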
For a comparison of how these same formats perform on larger models, see our DeepSeek quantisation guide. For broader format details, the FP16 vs INT8 vs INT4 guide covers when to use each precision level. Browse all benchmarks in the Benchmarks category.
Conclusion
For LLaMA 3 8B on GPU servers, GPTQ delivers the best speed while AWQ offers marginally better quality — both are excellent choices. GGUF is the right pick only for hybrid CPU/GPU or llama.cpp deployments. All three formats cut VRAM from ~17 GB to ~6 GB, enabling deployment on budget GPUs as small as 6-8 GB. Match your format to your serving stack, and you will get the best performance from your LLaMA hosting setup.
Run LLaMA 3 8B at Maximum Speed
Dedicated GPU servers pre-configured for vLLM with GPTQ and AWQ support. From budget RTX 4060 to flagship RTX 5090.
Browse GPU Servers