
Quantisation-Aware Fine-Tuning

Quantisation-aware training (QAT) for LLMs: train with simulated low precision so the deployed quantised model retains its quality.

Standard fine-tuning produces FP16 weights, then post-training quantisation (PTQ) compresses to INT4 / FP8. Quantisation-aware training (QAT) simulates the target precision during training so the model adapts to the quantisation noise. Net: smaller quality drop than naive PTQ.

TL;DR

QAT inserts "fake quant" nodes during training; gradients flow through them; weights learn to be robust to quantisation rounding. Quality drop after deployment quantisation: typically 0.2-0.5% vs 1-2% for naive PTQ. Cost: ~1.5× training time. Worth it for: production deployments where the quantised quality matters.

Why QAT

Naive flow: train FP16, post-train quantise to INT4. Resulting INT4 model loses ~1-2% quality on benchmarks. The cause: FP16 weights happen to land at values that round badly under INT4 quantisation.

QAT flow: simulate INT4 rounding in the forward pass; the backward pass uses a straight-through estimator (STE), which treats the non-differentiable rounding op as the identity so gradients still reach the weights. The weights converge to values that round well. Resulting INT4 deployed model: typically 0.2-0.5% drop instead of 1-2%.
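To make the mechanism concrete, here is a minimal PyTorch sketch of fake quantisation with an STE. It is illustrative only: real QAT tooling (TorchAO, Optimum) adds per-channel or per-group scales, zero points, and observer calibration, and the `fake_quant_int4` helper is hypothetical.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Forward: simulate symmetric INT4 rounding (16 levels: -8..7),
        # then dequantise back to float for the rest of the graph.
        q = torch.clamp(torch.round(w / scale), -8, 7)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: pretend round() was the identity, so gradients flow
        # through unchanged. This is the straight-through estimator.
        return grad_output, None

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Hypothetical per-tensor scale; production QAT uses per-channel
    # or per-group scales instead.
    scale = w.abs().max() / 7
    return FakeQuantSTE.apply(w, scale)

# During QAT, the layer computes with fake-quantised weights:
w = torch.randn(16, 16, requires_grad=True)
loss = fake_quant_int4(w).sum()
loss.backward()
print(w.grad.abs().sum())  # non-zero despite the non-differentiable round()
```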

Recipe

For QAT on Llama 3.1 8B with bitsandbytes / Optimum / TorchAO:

  • Standard fine-tuning recipe (TRL + PEFT)
  • Add quantisation simulation via TorchAO or HuggingFace Optimum-Intel (a TorchAO sketch follows this list)
  • Train normally; simulated quantisation is in the forward pass
  • At deployment: actual quantisation produces real INT4 weights ready for vLLM serving
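A rough sketch of the prepare / train / convert flow with TorchAO's QAT quantizer. The class `Int8DynActInt4WeightQATQuantizer` and its `prepare`/`convert` methods follow torchao's published QAT API, but that API has shifted between releases, so check the docs for your version; the model ID and group size below are example values.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)

# 1. Prepare: swap linear layers for fake-quantised versions
#    (INT8 dynamic activations, INT4 grouped weights).
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)
model = qat_quantizer.prepare(model)

# 2. Train as usual (e.g. TRL's SFTTrainer); the forward pass now sees
#    simulated INT4 rounding, the backward pass uses the STE.
# trainer = SFTTrainer(model=model, ...)
# trainer.train()

# 3. Convert: replace fake-quant ops with real INT4 weights for deployment.
model = qat_quantizer.convert(model)
model.save_pretrained("llama-3.1-8b-qat-int4")  # ready for vLLM serving
```

Because only the linear layers are swapped, the rest of the training loop (data, optimiser, schedule) stays unchanged.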

When worth it

  • Production deployment where quantised quality matters: customer-facing, eval-bounded
  • Aggressive quantisation: INT4 or FP4; QAT helps more than for INT8
  • Custom fine-tunes: where you control the training
  • Don't bother for: prototyping, experiments, models you don't deploy in INT4

Verdict

For production custom fine-tunes that will deploy at INT4 / FP4, QAT is worth the ~1.5× training time. Quality preservation is meaningfully better. For prototypes or models that stay at FP8, standard fine-tuning + PTQ is fine.

Bottom line

Use QAT for production INT4 deployment. For the underlying fine-tuning recipe, see our guide to LoRA methods.
