Precision Levels Explained
Every LLM deployment on a dedicated GPU server involves a fundamental choice: how many bits per weight? FP16 (16-bit floating point) preserves the original model quality but uses the most memory and bandwidth. INT8 (8-bit integer) halves the memory with near-zero quality loss. INT4 (4-bit integer) quarters the memory but introduces measurable quality degradation. This guide helps you pick the right precision for your workload.
These precision levels can be applied through different quantisation formats — see our GPTQ vs AWQ vs GGUF guide for format-specific details. For GPU-specific comparisons, check our model pages for LLaMA 3 8B and Mistral 7B.
Speed Comparison Across GPUs
Benchmarks below use a representative small model (LLaMA 3 8B) with 512 input tokens and 256 output tokens. INT8 results use bitsandbytes; INT4 results use GPTQ served via vLLM.
| GPU | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) | INT4 Speedup vs FP16 |
|---|---|---|---|---|
| RTX 4060 (8 GB) | N/A | 18 | 28 | – |
| RTX 4060 Ti (16 GB) | 32 | 38 | 45 | 1.41x |
| RTX 3090 (24 GB) | 43 | 52 | 61 | 1.42x |
| RTX 5080 (16 GB) | 68 | 79 | 92 | 1.35x |
| RTX 5090 (32 GB) | 95 | 112 | 130 | 1.37x |
INT4 delivers a consistent 35-42% speed improvement over FP16 because the bottleneck for most LLM inference is memory bandwidth, not compute: smaller weights stream from VRAM to the compute units faster. INT8 falls between the two, offering a 16-21% speedup with a smaller quality trade-off.
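A back-of-the-envelope model makes the bandwidth argument concrete: during single-stream decoding, every generated token must stream all the weights from VRAM once, so memory bandwidth divided by weight size gives a hard throughput ceiling. A rough sketch (936 GB/s is the RTX 3090's published memory bandwidth; real throughput lands below the ceiling because of KV-cache reads, dequantisation work, and kernel overhead):

```python
def decode_ceiling_tok_s(bandwidth_gb_s, n_params_billion, bits_per_weight):
    # Bandwidth-bound ceiling: each decoded token reads every weight once,
    # so max tok/s ~= memory bandwidth / bytes of weights.
    weight_gb = n_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb

# RTX 3090 (~936 GB/s) running an 8B-parameter model:
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name} ceiling: {decode_ceiling_tok_s(936, 8, bits):.0f} tok/s")
```

Note that the measured INT4 speedup (about 1.4x) is far smaller than the 4x the bandwidth ratio alone would suggest: quantised kernels pay a dequantisation cost per token, which eats into the theoretical gain.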
VRAM Savings
The table below shows VRAM for several popular models at each precision level (8K context, batch size 1).
| Model | FP16 | INT8 | INT4 | Savings (FP16 to INT4) |
|---|---|---|---|---|
| Mistral 7B | 15.5 GB | 9.0 GB | 5.3 GB | 66% |
| LLaMA 3 8B | 17.5 GB | 10.0 GB | 6.2 GB | 65% |
| LLaMA 3 70B | 151 GB | 78 GB | 47 GB | 69% |
| Qwen 2.5 72B | 153 GB | 80 GB | 47 GB | 69% |
| Mixtral 8x7B | 96 GB | 51 GB | 29 GB | 70% |
INT4 consistently frees 65-70% of the VRAM used by FP16. For larger models, this can mean the difference between needing two GPUs and needing eight: LLaMA 3 70B takes 151 GB at FP16 but only 47 GB at INT4. The freed VRAM can also be spent on longer context windows; see our context length VRAM guide for scaling details.
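The table values can be approximated from first principles as weights plus KV cache plus a fixed runtime overhead. A rough sketch using LLaMA 3 8B's published architecture (32 layers, hidden size 4096, 32 attention heads, 8 KV heads); the 1.5 GB overhead constant is an assumption covering activations, CUDA context, and allocator slack:

```python
def vram_estimate_gb(params_billion, weight_bits, n_layers, d_model,
                     n_heads, n_kv_heads, context_len,
                     kv_bits=16, overhead_gb=1.5):
    """Rough total VRAM = weights + KV cache + assumed fixed overhead."""
    weight_bytes = params_billion * 1e9 * weight_bits / 8
    head_dim = d_model // n_heads
    # K and V each store (layers x context x kv_heads x head_dim) values.
    kv_bytes = 2 * n_layers * context_len * n_kv_heads * head_dim * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1e9 + overhead_gb

# LLaMA 3 8B, INT4 weights, 8K context:
print(f"{vram_estimate_gb(8, 4, 32, 4096, 32, 8, 8192):.1f} GB")
```

The estimate lands within about half a gigabyte of the 6.2 GB in the table above; the residual gap comes from framework-specific allocation behaviour that a formula this simple cannot capture.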
Quality Trade-Offs
Quality degradation varies by model size and task complexity. Larger models tolerate quantisation better.
| Model Size | INT8 Quality (vs FP16) | INT4 Quality (vs FP16) |
|---|---|---|
| 7B parameters | 98-99% | 94-97% |
| 13-14B parameters | 99% | 95-97% |
| 70B+ parameters | 99%+ | 96-98% |
Key patterns to consider:
- Simple tasks (summarisation, classification): INT4 quality loss is imperceptible. Use INT4 to maximise speed and minimise cost.
- Coding and reasoning: INT4 can degrade output on complex multi-step logic. INT8 is the safer choice here.
- Creative writing: INT4 is generally fine — subtle weight differences rarely affect narrative quality.
- Mathematical precision: most sensitive to quantisation. Use FP16 or INT8 for scientific/mathematical workloads.
When to Use Each
- FP16: use when you have ample VRAM, need maximum quality, or are running benchmarks/evaluations. Full fine-tuning also requires FP16/BF16 weights; QLoRA is the exception, training low-rank adapters on top of a frozen 4-bit base model.
- INT8: best balance for production. Near-lossless quality with meaningful VRAM and speed improvements. Ideal for coding assistants and reasoning tasks.
- INT4: use when VRAM is constrained, speed is the priority, or tasks are straightforward. Perfect for chatbots, summarisation APIs, and cost-optimised deployments.
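These rules of thumb are easy to encode as a default-selection helper. A hypothetical sketch (the task names and the mapping are this guide's recommendations, not any library's API):

```python
def recommended_precision(task: str) -> str:
    """Map a workload type to a default precision, per the guidance above."""
    if task in {"benchmarking", "evaluation", "fine-tuning"}:
        return "FP16"   # maximum fidelity, or a training requirement
    if task in {"coding", "reasoning", "math"}:
        return "INT8"   # near-lossless; safer for multi-step logic
    return "INT4"       # chat, summarisation, classification, creative writing

print(recommended_precision("coding"))         # INT8
print(recommended_precision("summarisation"))  # INT4
```

In a real deployment you would also gate on available VRAM (an INT8 recommendation is moot if the model only fits at INT4 on your card), which is a lookup against the VRAM table above.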
For model-specific format recommendations, check our guides for DeepSeek, Qwen 2.5, and Mixtral 8x7B. For detailed benchmarks, see the tokens per second benchmark hub and cost per million tokens calculator.
Conclusion
INT4 is the right default for most production LLM inference — it delivers 35-40% more speed while cutting VRAM by 65-70%, with quality loss under 6% on most tasks. INT8 is the safer pick for quality-sensitive workloads. FP16 is reserved for evaluation, fine-tuning, or when VRAM is plentiful. Match your precision to your task requirements, and you will get the most value from your LLM hosting deployment.
Deploy LLMs at Any Precision Level
Dedicated GPU servers supporting FP16, INT8, and INT4 inference out of the box. Choose the right hardware for your precision needs.
Browse GPU Servers