
AWQ Quantization Guide for RTX 5060 Ti 16GB

AWQ INT4 serving guide for Blackwell 16GB - checkpoint selection, vLLM config, Marlin kernel performance, and when to pick AWQ over FP8.

When a model is too large for FP8 to fit comfortably on the RTX 5060 Ti 16GB, AWQ INT4 is the next step down on our hosting: strong quality retention, Marlin kernels for speed, and wide checkpoint availability.


Why AWQ

AWQ (Activation-aware Weight Quantization) is 4-bit with strong quality retention:

  • Typically within 1-2% of FP16 quality on MMLU, HumanEval
  • Faster than GPTQ on modern NVIDIA GPUs via Marlin kernels
  • Widely supported – vLLM, TGI, SGLang all handle AWQ
  • Pre-quantised checkpoints available for most popular models
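
The memory saving is easy to sanity-check. Here is a rough sketch of the arithmetic; the group size of 128 and the FP16 scale plus zero-point per group are assumptions about a typical AWQ layout, and real checkpoints add embeddings, norms, and packing overhead on top:

```python
# Rough AWQ INT4 weight-memory estimate (sketch, not exact).

def awq_weight_gb(params_b: float, group_size: int = 128) -> float:
    """Approximate VRAM footprint of AWQ INT4 weights for a model
    with params_b billion parameters."""
    params = params_b * 1e9
    int4_bytes = params * 0.5                 # 4 bits per weight
    # one FP16 scale + one FP16 zero-point per quantization group
    meta_bytes = (params / group_size) * 2 * 2
    return (int4_bytes + meta_bytes) / 1e9

print(f"7B  AWQ ~ {awq_weight_gb(7):.1f} GB")   # vs ~14 GB in FP16
print(f"14B AWQ ~ {awq_weight_gb(14):.1f} GB")  # vs ~28 GB in FP16
```

That is why a 14B model, hopeless in FP16 on a 16 GB card, serves comfortably in AWQ with room left for KV cache.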

Checkpoints

Common production AWQ checkpoints:

  • Qwen/Qwen2.5-14B-Instruct-AWQ
  • Qwen/Qwen2.5-Coder-14B-Instruct-AWQ
  • Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
  • TheBloke/Mistral-7B-Instruct-v0.3-AWQ
  • hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
  • TheBloke/Llama-2-13B-Chat-AWQ (legacy)

Serving

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --dtype half

--dtype half explicitly requests FP16 compute alongside the INT4 weights; --dtype auto (the default) is also fine.
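
A back-of-the-envelope KV-cache budget shows why --max-model-len 16384 fits alongside the ~7.5 GB of INT4 weights. The sketch below assumes Qwen2.5-14B's published config values (48 layers, 8 KV heads under GQA, head dimension 128) and an FP16 cache:

```python
# KV-cache memory per token for a GQA model with an FP16 cache (sketch).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # one K and one V tensor per layer, kv_heads x head_dim each
    return layers * 2 * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
ctx = 16384
print(f"{per_tok} bytes/token -> {per_tok * ctx / 1e9:.1f} GB at {ctx} tokens")
```

Roughly 3.2 GB of cache at the full 16K context, which together with the weights sits under the ~14.7 GB that --gpu-memory-utilization 0.92 allows on a 16 GB card.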

Marlin Kernels

Marlin is an optimised CUDA kernel for INT4 quantised matmul. On Blackwell it runs within 5-10% of FP16 tensor core throughput while using 4x less memory bandwidth. vLLM uses Marlin automatically for AWQ on Blackwell – no flag needed.

Throughput with Marlin:

  • Mistral 7B AWQ batch 1: ~130 t/s
  • Qwen 14B AWQ batch 1: ~44 t/s
  • Llama 3 8B AWQ batch 16 aggregate: ~900 t/s

AWQ vs FP8

Criterion               | FP8              | AWQ
Quality                 | Near-FP16        | Within 1-2% of FP16
Weight size             | 50% of FP16      | 25% of FP16
Speed on Blackwell      | Fastest (native) | Marlin, slightly slower
Checkpoint availability | Growing          | Very wide

Use FP8 when a quality-preserving checkpoint exists. Use AWQ when FP8 is unavailable or when maximum VRAM savings is needed for larger models on 16 GB.
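
That rule of thumb can be written down as a small helper. This is an illustrative sketch, not vLLM behaviour: the byte-per-parameter figures, the ~3 GB KV-cache allowance, and the 0.92 headroom factor are rough assumptions taken from this guide:

```python
# Hypothetical quantization chooser for a 16 GB card (illustrative only).

def pick_quant(params_b: float, fp8_checkpoint_exists: bool,
               vram_gb: float = 16.0) -> str:
    fp8_gb = params_b          # FP8 ~ 1 byte per parameter
    awq_gb = params_b * 0.5    # INT4 ~ 0.5 byte per parameter (+ metadata)
    headroom = vram_gb * 0.92  # mirrors --gpu-memory-utilization 0.92
    kv_budget = 3.0            # rough KV-cache allowance in GB
    if fp8_checkpoint_exists and fp8_gb + kv_budget < headroom:
        return "fp8"
    if awq_gb + kv_budget < headroom:
        return "awq"
    return "too big for 16 GB"

print(pick_quant(8, fp8_checkpoint_exists=True))    # fp8
print(pick_quant(14, fp8_checkpoint_exists=False))  # awq
```

An 8B model with a good FP8 checkpoint stays on FP8; a 14B model only fits via AWQ.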

AWQ-Optimised Hosting

Blackwell 16GB with Marlin kernels for AWQ serving. UK dedicated.

Order the RTX 5060 Ti 16GB

See also: quantisation format comparison, GPTQ guide.
