Why DeepSeek Quantisation Is Different
DeepSeek V3 and R1 are Mixture-of-Experts (MoE) models with 671 billion total parameters, making quantisation not just an optimisation but a necessity for practical dedicated GPU hosting. Unlike dense models where quantisation is optional, DeepSeek’s sheer size means you almost certainly need some form of weight compression to fit it on available hardware. The MoE architecture also introduces unique considerations — expert routing weights must remain precise while individual expert layers can tolerate more aggressive quantisation.
For baseline VRAM requirements see our DeepSeek VRAM requirements guide. For context on how different formats work generally, read the GPTQ vs AWQ vs GGUF quantisation guide.
Available Quantisation Formats
DeepSeek V3 was trained natively with FP8 precision, which gives it an advantage over models that must be post-training quantised from FP16. Here are the primary options:
- FP8 (native): the default recommended precision. No post-training quantisation needed — weights are already in FP8 from training. Minimal quality loss.
- GPTQ 4-bit: aggressive weight-only quantisation using calibration data. Reduces the weights to roughly 340 GB (about half a byte per parameter). Requires ExLlama v2 kernels for optimal speed.
- AWQ 4-bit: activation-aware quantisation that preserves important weights at higher precision. Similar size to GPTQ with slightly better quality on reasoning tasks.
- GGUF (various): llama.cpp format supporting CPU/GPU hybrid inference. Useful for partial offloading when GPU VRAM is insufficient for the full model.
- INT8 (W8A8): 8-bit weight and activation quantisation. Larger than INT4 but better quality preservation.
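Since each format above is essentially a fixed number of bits per parameter, the weights-only footprint is simple arithmetic: parameter count × bits ÷ 8. A minimal sketch (bit widths are nominal; GGUF Q4_K_M averages closer to 4.85 bits per weight, and real deployments add KV cache and runtime overhead on top):

```python
# Weights-only footprint per format for DeepSeek V3 (671B total
# parameters). Ignores KV cache, activations, and per-format metadata
# (scales, zero-points), so real VRAM usage is higher.

PARAMS = 671e9  # total parameter count, including all experts

BITS_PER_PARAM = {
    "FP8 (native)": 8,
    "INT8 (W8A8)": 8,
    "GPTQ 4-bit": 4,
    "AWQ 4-bit": 4,
    "GGUF Q4_K_M": 4.85,  # K-quants average above 4 bits per weight
}

def weight_gb(fmt: str) -> float:
    """Weights-only size in gigabytes (decimal GB) for a format."""
    return PARAMS * BITS_PER_PARAM[fmt] / 8 / 1e9

for fmt, bits in BITS_PER_PARAM.items():
    print(f"{fmt:>13} ({bits} bpw): ~{weight_gb(fmt):.0f} GB")
```

This is why INT8 buys no capacity over native FP8 for this model: both are one byte per weight, so INT8's only role is on runtimes without FP8 kernel support.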
Performance by Format and GPU
The table below shows approximate output speed in tokens per second for DeepSeek V3 at different quantisation levels. Due to the model’s size, all configurations use tensor parallelism across multiple GPUs. Measured with vLLM at 512 input / 256 output tokens.
| GPU Config | Total VRAM | FP8 (tok/s) | INT8 (tok/s) | INT4 GPTQ (tok/s) | INT4 AWQ (tok/s) |
|---|---|---|---|---|---|
| 8x RTX 3090 | 192 GB | N/A | N/A | 8 | 7 |
| 8x RTX 5090 | 256 GB | N/A | 12 | 18 | 17 |
| 4x RTX 6000 Pro 96 GB | 384 GB | N/A | 15 | 22 | 21 |
| 8x RTX 6000 Pro 96 GB | 768 GB | 25 | 28 | 35 | 34 |
INT4 quantisation delivers the highest throughput because decoding is memory-bandwidth-bound: smaller weights mean less data streamed per generated token, which is the bottleneck for MoE models with many experts. FP8 weights alone occupy roughly 671 GB (one byte per parameter), so FP8 is limited to the largest eight-GPU configurations.
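The throughput ordering follows directly from the bandwidth bound: a rough ceiling is aggregate memory bandwidth divided by the bytes of active weights streamed per token. A sketch of that ceiling, assuming ~37B active parameters per token (from the DeepSeek V3 model card) and a nominal ~1.8 TB/s per RTX 6000 Pro — both spec-sheet assumptions, not figures from the benchmarks above:

```python
# Rough ceiling on decode throughput at batch size 1: every generated
# token must stream the active expert weights from VRAM, so
#   tok/s <= aggregate_bandwidth / (active_params * bytes_per_param).
# ACTIVE_PARAMS (~37B per token) and the 1.8 TB/s bandwidth figure are
# assumptions; real systems land well below this bound.

ACTIVE_PARAMS = 37e9

def decode_ceiling(num_gpus: int, bandwidth_gbs: float, bits: float) -> float:
    """Upper bound on tokens/s for bandwidth-bound decoding."""
    bytes_per_token = ACTIVE_PARAMS * bits / 8
    return num_gpus * bandwidth_gbs * 1e9 / bytes_per_token

fp8 = decode_ceiling(8, 1800, 8)   # ~389 tok/s ceiling
int4 = decode_ceiling(8, 1800, 4)  # exactly double the FP8 ceiling
print(f"FP8 ceiling:  {fp8:.0f} tok/s")
print(f"INT4 ceiling: {int4:.0f} tok/s")
```

Measured gaps are smaller than the 2x ceiling suggests because attention, routing, and inter-GPU communication costs do not shrink with weight precision.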
Quality Comparison
Quality preservation is critical for DeepSeek, especially for its reasoning capabilities (R1). We evaluated each format on a mix of coding, reasoning, and general knowledge benchmarks.
| Format | Coding (HumanEval) | Reasoning (MATH) | General (MMLU) | Overall Quality |
|---|---|---|---|---|
| FP8 (native) | 98% | 99% | 99% | Baseline |
| INT8 (W8A8) | 97% | 98% | 98% | Near-lossless |
| AWQ 4-bit | 94% | 95% | 96% | Minor degradation |
| GPTQ 4-bit | 93% | 94% | 95% | Minor degradation |
| GGUF Q4_K_M | 93% | 94% | 96% | Minor degradation |
Scores are shown as a percentage of the FP8 baseline. FP8 and INT8 are essentially lossless. INT4 formats show 4-7% degradation on the most demanding tasks, which is acceptable for most production workloads but worth testing on your specific use case.
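When validating a quantised build on your own workload, the acceptance test reduces to checking retention against the FP8 baseline per task. A minimal sketch using the table's retention figures (the 95% cutoff is an illustrative choice, not a recommendation):

```python
# Flag formats whose retention versus the FP8 baseline falls below a
# per-task threshold. Scores mirror the quality table above (percent
# of the FP8 baseline); the 95% cutoff is illustrative only.

SCORES = {
    "INT8 (W8A8)": {"HumanEval": 97, "MATH": 98, "MMLU": 98},
    "AWQ 4-bit":   {"HumanEval": 94, "MATH": 95, "MMLU": 96},
    "GPTQ 4-bit":  {"HumanEval": 93, "MATH": 94, "MMLU": 95},
    "GGUF Q4_K_M": {"HumanEval": 93, "MATH": 94, "MMLU": 96},
}

def below_threshold(threshold: float = 95.0) -> dict[str, list[str]]:
    """Map each format to the benchmarks where it misses the cutoff."""
    return {
        fmt: [task for task, score in tasks.items() if score < threshold]
        for fmt, tasks in SCORES.items()
    }

for fmt, misses in below_threshold().items():
    status = "ok" if not misses else "below 95% on " + ", ".join(misses)
    print(f"{fmt}: {status}")
```

The same structure works with your own benchmark harness: replace SCORES with measured results and pick a threshold that matches your quality bar.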
Best Format for Each GPU Config
| GPU Configuration | Recommended Format | Why |
|---|---|---|
| 8x RTX 6000 Pro 96 GB | FP8 (native) | Enough VRAM for full FP8; best quality with native support |
| 4x RTX 6000 Pro 96 GB | INT4 AWQ | Not enough VRAM for FP8; AWQ balances quality and speed |
| 8x RTX 5090 | GGUF Q4_K_M | 256 GB cannot hold even the full 4-bit weights of a 671B-parameter model; partial CPU offloading needed |
| 8x RTX 3090 | GGUF Q4_K_M | 192 GB is far below the 4-bit weight footprint; CPU/GPU hybrid inference is the only route |
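The table collapses into a simple rule keyed on total VRAM. A sketch using weights-only footprints (671 GB for FP8, roughly 340 GB for 4-bit); the thresholds add a little headroom for KV cache and are approximations, not hard limits:

```python
# Choose a quantisation format from total VRAM across all GPUs.
# Thresholds are weights-only footprints plus rough KV-cache headroom:
# FP8 weights ~671 GB, INT4 weights ~340 GB. Below the INT4 floor,
# pure-GPU serving is impossible and GGUF hybrid offloading remains.

def pick_format(total_vram_gb: float) -> str:
    if total_vram_gb >= 700:   # room for native FP8 plus KV cache
        return "FP8 (native)"
    if total_vram_gb >= 384:   # full 4-bit weights fit on-GPU
        return "INT4 AWQ"
    return "GGUF Q4_K_M + CPU offload"

print(pick_format(768))  # FP8 (native)
print(pick_format(384))  # INT4 AWQ
print(pick_format(192))  # GGUF Q4_K_M + CPU offload
```

AWQ is the mid-range default here for its slightly better reasoning retention; swap in GPTQ when raw throughput matters more than the last point of quality.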
For context length scaling at each quantisation level, see our DeepSeek context length VRAM guide. For multi-GPU splitting strategies, read about model sharding for large models.
Conclusion
DeepSeek’s native FP8 training makes it uniquely well-suited for FP8 inference: use it whenever you have the VRAM budget (roughly 700 GB or more). When memory is constrained, AWQ 4-bit offers the best quality/speed balance, while GPTQ 4-bit maximises throughput on tight configurations. Avoid GGUF for pure GPU serving unless you specifically need CPU offloading. For DeepSeek hosting options, explore the GPU configurations below.
Deploy DeepSeek on Dedicated GPUs
Multi-GPU clusters with up to 640 GB VRAM, built for large MoE model inference at any quantisation level.
Browse GPU Servers