
FP16 vs INT8 vs INT4: When to Use Each for LLM Inference

A practical guide to choosing between FP16, INT8, and INT4 precision for LLM inference, with speed, quality, and VRAM trade-offs across popular models and GPUs.

Precision Levels Explained

Every LLM deployment on a dedicated GPU server involves a fundamental choice: how many bits per weight? FP16 (16-bit floating point) preserves the original model quality but uses the most memory and bandwidth. INT8 (8-bit integer) halves the memory with near-zero quality loss. INT4 (4-bit integer) quarters the memory but introduces measurable quality degradation. This guide helps you pick the right precision for your workload.

These precision levels can be applied through different quantisation formats — see our GPTQ vs AWQ vs GGUF guide for format-specific details. For GPU-specific comparisons, check our model pages for LLaMA 3 8B and Mistral 7B.
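To make the three options concrete, here is a minimal loading sketch using Hugging Face transformers with bitsandbytes. The model ID is illustrative, you would normally load only one variant, and pre-quantised GPTQ/AWQ INT4 checkpoints load the same way from their own repositories:

```python
# Minimal sketch: one model at each precision level via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint

# FP16: original 16-bit weights, highest quality and highest VRAM use
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# INT8: bitsandbytes quantises weights to 8-bit at load time
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# INT4: bitsandbytes 4-bit (NF4) shown here; GPTQ and AWQ are the other common
# INT4 routes and are loaded from pre-quantised checkpoints instead
int4_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4"
    ),
    device_map="auto",
)
```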

Speed Comparison Across GPUs

The benchmarks below use LLaMA 3 8B as a representative small model, with 512 input and 256 output tokens. INT8 runs use bitsandbytes; INT4 runs use GPTQ via vLLM.

| GPU | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) | INT4 Speedup vs FP16 |
|---|---|---|---|---|
| RTX 4060 (8 GB) | N/A | 18 | 28 | N/A |
| RTX 4060 Ti (16 GB) | 32 | 38 | 45 | 1.41x |
| RTX 3090 (24 GB) | 43 | 52 | 61 | 1.42x |
| RTX 5080 (16 GB) | 68 | 79 | 92 | 1.35x |
| RTX 5090 (32 GB) | 95 | 112 | 130 | 1.37x |

INT4 delivers a consistent 35-42% speed improvement over FP16 because the bottleneck for most LLM inference is memory bandwidth, not compute: smaller weights transfer faster from VRAM to the processing cores. INT8 falls between the two, offering roughly a 16-21% speedup with less quality trade-off.
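If you want to reproduce this kind of measurement, a rough throughput check with vLLM looks like the sketch below. The GPTQ checkpoint name is a placeholder; swap in whichever quantised repository you actually deploy, and aim for roughly 512 input tokens:

```python
# Rough tokens-per-second check with vLLM for an INT4 GPTQ checkpoint.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/llama-3-8b-gptq-int4", quantization="gptq")  # placeholder repo
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Aim for a prompt of roughly 512 tokens to mirror the benchmark setup above.
prompt = "Summarise the following report: " + "lorem ipsum dolor sit amet " * 90

start = time.perf_counter()
outputs = llm.generate([prompt], sampling)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} output tok/s")
```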

VRAM Savings

The table below shows VRAM for several popular models at each precision level (8K context, batch size 1).

| Model | FP16 | INT8 | INT4 | Savings (FP16 to INT4) |
|---|---|---|---|---|
| Mistral 7B | 15.5 GB | 9.0 GB | 5.3 GB | 66% |
| LLaMA 3 8B | 17.5 GB | 10.0 GB | 6.2 GB | 65% |
| LLaMA 3 70B | 151 GB | 78 GB | 47 GB | 69% |
| Qwen 2.5 72B | 153 GB | 80 GB | 47 GB | 69% |
| Mixtral 8x7B | 96 GB | 51 GB | 29 GB | 70% |

INT4 consistently frees 65-70% of the VRAM used by FP16. For larger models, this can be the difference between needing eight GPUs and needing two. The freed VRAM can also be spent on longer context windows; see our context length VRAM guide for scaling details.
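The arithmetic behind these figures is straightforward: parameter count multiplied by bytes per weight, plus overhead for the KV cache, activations, and the runtime. The sketch below is a back-of-envelope estimate only; quantised formats also store scale factors and keep some layers at higher precision, so measured usage (as in the table above) lands somewhat above the raw weight maths.

```python
# Back-of-envelope VRAM estimate: parameters x bytes per weight, plus a rough
# fudge factor for KV cache, activations, and runtime overhead. Quantised
# formats also store scales and keep some layers in higher precision, so
# measured INT8/INT4 figures run somewhat higher than this raw calculation.
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.15) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"LLaMA 3 8B @ {label}: ~{estimate_vram_gb(8, bits):.1f} GB")
```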

Quality Trade-Offs

Quality degradation varies by model size and task complexity. Larger models tolerate quantisation better.

| Model Size | INT8 Quality (vs FP16) | INT4 Quality (vs FP16) |
|---|---|---|
| 7B parameters | 98-99% | 94-97% |
| 13-14B parameters | 99% | 95-97% |
| 70B+ parameters | 99%+ | 96-98% |

Key patterns to consider:

  • Simple tasks (summarisation, classification): INT4 quality loss is imperceptible. Use INT4 to maximise speed and minimise cost.
  • Coding and reasoning: INT4 can degrade output on complex multi-step logic. INT8 is the safer choice here.
  • Creative writing: INT4 is generally fine — subtle weight differences rarely affect narrative quality.
  • Mathematical precision: most sensitive to quantisation. Use FP16 or INT8 for scientific/mathematical workloads.
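To check how much a quantised checkpoint actually loses on your own workload, a quick perplexity comparison is a reasonable proxy. The sketch below assumes the fp16_model and int4_model objects from the earlier loading example and uses illustrative sample texts; lower perplexity is better, and a gap of more than a few percent suggests the task is sensitive to INT4.

```python
# Quick quality proxy: compare perplexity of FP16 vs INT4 on your own texts.
# fp16_model / int4_model are assumed from the earlier loading sketch.
import torch
from transformers import AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Token-level perplexity of `text` under `model` (lower is better)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
sample_texts = [
    "def binary_search(items, target):",               # replace with prompts from your workload
    "Q3 revenue rose 12% year on year, driven by...",
]

for text in sample_texts:
    print(f"{text[:32]!r}: FP16 {perplexity(fp16_model, tokenizer, text):.2f} "
          f"vs INT4 {perplexity(int4_model, tokenizer, text):.2f}")
```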

When to Use Each

  • FP16: use when you have ample VRAM, need maximum quality, or are running benchmarks/evaluations. Also required for fine-tuning (though QLoRA uses INT4 base weights).
  • INT8: best balance for production. Near-lossless quality with meaningful VRAM and speed improvements. Ideal for coding assistants and reasoning tasks.
  • INT4: use when VRAM is constrained, speed is the priority, or tasks are straightforward. Perfect for chatbots, summarisation APIs, and cost-optimised deployments.
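As a rough summary of this guidance, the toy helper below maps a task type and VRAM situation to a starting precision. It is purely illustrative; your own quality checks should have the final say.

```python
# Toy encoding of the precision guidance above; illustrative only.
def pick_precision(task: str, fits_fp16: bool = False, fine_tuning: bool = False) -> str:
    if fine_tuning:
        return "FP16 (or QLoRA on an INT4 base)"
    if task in {"maths", "scientific"}:
        return "FP16" if fits_fp16 else "INT8"
    if task in {"coding", "reasoning"}:
        return "INT8"
    # chat, summarisation, classification, creative writing
    return "INT4"

print(pick_precision("summarisation"))           # INT4
print(pick_precision("coding"))                  # INT8
print(pick_precision("maths", fits_fp16=True))   # FP16
```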

For model-specific format recommendations, check our guides for DeepSeek, Qwen 2.5, and Mixtral 8x7B. For detailed benchmarks, see the tokens per second benchmark hub and cost per million tokens calculator.

Conclusion

INT4 is the right default for most production LLM inference — it delivers 35-40% more speed while cutting VRAM by 65-70%, with quality loss under 6% on most tasks. INT8 is the safer pick for quality-sensitive workloads. FP16 is reserved for evaluation, fine-tuning, or when VRAM is plentiful. Match your precision to your task requirements, and you will get the most value from your LLM hosting deployment.

Deploy LLMs at Any Precision Level

Dedicated GPU servers supporting FP16, INT8, and INT4 inference out of the box. Choose the right hardware for your precision needs.

Browse GPU Servers

