
Quantisation-Aware Fine-Tuning

Quantisation-aware training (QAT) for LLMs: train with simulated low precision so the deployed quantised model retains its quality.

Standard fine-tuning produces FP16 weights, then post-training quantisation (PTQ) compresses to INT4 / FP8. Quantisation-aware training (QAT) simulates the target precision during training so the model adapts to the quantisation noise. Net: smaller quality drop than naive PTQ.

TL;DR

QAT inserts "fake quant" nodes during training; gradients flow through them; weights learn to be robust to quantisation rounding. Quality drop after deployment quantisation: typically 0.2-0.5% vs 1-2% for naive PTQ. Cost: ~1.5× training time. Worth it for: production deployments where the quantised quality matters.

Why QAT

Naive flow: train FP16, post-train quantise to INT4. Resulting INT4 model loses ~1-2% quality on benchmarks. The cause: FP16 weights happen to land at values that round badly under INT4 quantisation.

QAT flow: simulate INT4 rounding in the forward pass; the backward pass uses a straight-through estimator (STE), which treats the non-differentiable rounding op as the identity so gradients still reach the weights. The weights converge to values that round well. Resulting INT4 deployed model: typically 0.2-0.5% drop instead of 1-2%.
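To make the mechanism concrete, here is a minimal PyTorch sketch of fake quantisation with an STE. It is illustrative only: real QAT tooling (TorchAO, Optimum) adds per-channel or per-group scales, zero points, and observer calibration, and the `fake_quant_int4` helper is hypothetical.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Forward: simulate symmetric INT4 rounding (16 levels: -8..7),
        # then dequantise back to float for the rest of the graph.
        q = torch.clamp(torch.round(w / scale), -8, 7)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: pretend round() was the identity, so gradients flow
        # through unchanged. This is the straight-through estimator.
        return grad_output, None

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Hypothetical per-tensor scale; production QAT uses per-channel
    # or per-group scales instead.
    scale = w.abs().max() / 7
    return FakeQuantSTE.apply(w, scale)

# During QAT, the layer computes with fake-quantised weights:
w = torch.randn(16, 16, requires_grad=True)
loss = fake_quant_int4(w).sum()
loss.backward()
print(w.grad.abs().sum())  # non-zero despite the non-differentiable round()
```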

Recipe

For QAT on Llama 3.1 8B with bitsandbytes / Optimum / TorchAO:

  • Standard fine-tuning recipe (TRL + PEFT)
  • Add quantisation simulation via TorchAO or HuggingFace Optimum-Intel (a TorchAO sketch follows this list)
  • Train normally; simulated quantisation is in the forward pass
  • At deployment: actual quantisation produces real INT4 weights ready for vLLM serving
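A rough sketch of the prepare / train / convert flow with TorchAO's QAT quantizer. The class `Int8DynActInt4WeightQATQuantizer` and its `prepare`/`convert` methods follow torchao's published QAT API, but that API has shifted between releases, so check the docs for your version; the model ID and group size below are example values.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)

# 1. Prepare: swap linear layers for fake-quantised versions
#    (INT8 dynamic activations, INT4 grouped weights).
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)
model = qat_quantizer.prepare(model)

# 2. Train as usual (e.g. TRL's SFTTrainer); the forward pass now sees
#    simulated INT4 rounding, the backward pass uses the STE.
# trainer = SFTTrainer(model=model, ...)
# trainer.train()

# 3. Convert: replace fake-quant ops with real INT4 weights for deployment.
model = qat_quantizer.convert(model)
model.save_pretrained("llama-3.1-8b-qat-int4")  # ready for vLLM serving
```

Because only the linear layers are swapped, the rest of the training loop (data, optimiser, schedule) stays unchanged.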

When worth it

  • Production deployment where quantised quality matters: customer-facing, eval-bounded
  • Aggressive quantisation: INT4 or FP4; QAT helps more than for INT8
  • Custom fine-tunes: where you control the training
  • Don't bother for: prototyping, experiments, models you don't deploy in INT4

Verdict

For production custom fine-tunes that will deploy at INT4 / FP4, QAT is worth the ~1.5× training time. Quality preservation is meaningfully better. For prototypes or models that stay at FP8, standard fine-tuning + PTQ is fine.

Bottom line

Use QAT for production INT4 deployment. For the underlying fine-tuning recipe, see our guide to LoRA methods.
