
DeepSeek Quantization: Best Format for Each GPU

A guide to choosing the best quantisation format for DeepSeek V3 and R1 across different GPU configurations, comparing FP8, INT8, 4-bit GPTQ and AWQ, and GGUF.

Why DeepSeek Quantisation Is Different

DeepSeek V3 and R1 are Mixture-of-Experts (MoE) models with 671 billion total parameters, making quantisation not just an optimisation but a necessity for practical dedicated GPU hosting. Unlike dense models where quantisation is optional, DeepSeek’s sheer size means you almost certainly need some form of weight compression to fit it on available hardware. The MoE architecture also introduces unique considerations — expert routing weights must remain precise while individual expert layers can tolerate more aggressive quantisation.
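The routing-versus-expert trade-off can be expressed as a per-module precision plan. The sketch below is purely illustrative; the module name patterns are hypothetical, not DeepSeek's actual parameter naming:

```python
def choose_bits(module_name: str) -> int:
    """Assign a bit width per module: keep routing/gating weights at high
    precision, quantise expert weights aggressively, and use a middle
    ground for everything else. Name patterns here are hypothetical."""
    if "router" in module_name or "gate" in module_name:
        return 16  # expert routing must stay precise
    if "expert" in module_name:
        return 4   # individual expert layers tolerate aggressive quantisation
    return 8       # attention and shared layers

plan = {name: choose_bits(name) for name in [
    "layers.0.attn.q_proj",      # -> 8
    "layers.0.moe.router",       # -> 16
    "layers.0.moe.expert_7.w1",  # -> 4
]}
```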

For baseline VRAM requirements see our DeepSeek VRAM requirements guide. For context on how different formats work generally, read the GPTQ vs AWQ vs GGUF quantisation guide.

Available Quantisation Formats

DeepSeek V3 was trained natively with FP8 precision, which gives it an advantage over models that must be post-training quantised from FP16. Here are the primary options:

  • FP8 (native): the default recommended precision. No post-training quantisation needed — weights are already in FP8 from training. Minimal quality loss.
  • GPTQ 4-bit: aggressive weight-only quantisation using calibration data. Reduces model to ~175 GB. Requires ExLlama v2 kernels for optimal speed.
  • AWQ 4-bit: activation-aware quantisation that preserves important weights at higher precision. Similar size to GPTQ with slightly better quality on reasoning tasks.
  • GGUF (various): llama.cpp format supporting CPU/GPU hybrid inference. Useful for partial offloading when GPU VRAM is insufficient for the full model.
  • INT8 (W8A8): 8-bit weight and activation quantisation. Larger than INT4 but better quality preservation.
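As a rough rule of thumb, the weight-only footprint of any of these formats is parameter count times bits per weight. The sketch below ignores activation memory, KV cache, and per-format metadata such as scales and zero points, so real checkpoint sizes run somewhat larger:

```python
def weight_gb(n_params: float, bits: float) -> float:
    """Approximate weight-only footprint in GB: params * bits / 8 bytes.
    Ignores KV cache, activations, and quantisation metadata."""
    return n_params * bits / 8 / 1e9

# Dense-model examples of the rule of thumb (illustrative figures):
weight_gb(70e9, 16)  # 140.0 GB at FP16
weight_gb(70e9, 8)   # 70.0 GB at FP8/INT8
weight_gb(70e9, 4)   # 35.0 GB at 4-bit
```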

Performance by Format and GPU

The table below shows approximate output speed in tokens per second for DeepSeek V3 at different quantisation levels. Due to the model’s size, all configurations use tensor parallelism across multiple GPUs. Measured with vLLM at 512 input / 256 output tokens.

GPU Config               Total VRAM   FP8 (tok/s)   INT8 (tok/s)   INT4 GPTQ (tok/s)   INT4 AWQ (tok/s)
8x RTX 3090              192 GB       N/A           N/A            8                   7
8x RTX 5090              256 GB       N/A           12             18                  17
4x RTX 6000 Pro 96 GB    320 GB       N/A           15             22                  21
8x RTX 6000 Pro 96 GB    640 GB       25            28             35                  34
8x RTX 6000 Pro 96 GB    640 GB       42            46             55                  53

INT4 quantisation delivers the highest throughput because smaller weights mean faster memory transfers, and memory bandwidth is the bottleneck for MoE models with many experts. FP8 needs at least 340 GB of VRAM for the weights alone, which restricts it to the eight-GPU configurations with 640 GB of total VRAM.
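The memory-bandwidth argument can be sanity-checked with a roofline-style bound: at batch size 1, every generated token must stream the active weights from VRAM once. The bandwidth figure below is a hypothetical round number, and real throughput lands far below this ceiling once expert routing, inter-GPU communication, and kernel overhead are paid:

```python
def decode_ceiling_tok_s(active_params: float, bits: float,
                         num_gpus: int, bw_gb_s: float) -> float:
    """Idealised upper bound on batch-1 decode throughput:
    aggregate memory bandwidth / bytes of active weights per token."""
    bytes_per_token = active_params * bits / 8
    return num_gpus * bw_gb_s * 1e9 / bytes_per_token

# Hypothetical 8-GPU node at ~1000 GB/s per card, 37B active params, 4-bit:
decode_ceiling_tok_s(37e9, 4, 8, 1000)  # theoretical ceiling only
```

Halving the bits per weight doubles this ceiling, which is why the INT4 columns lead every row of the table.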

Quality Comparison

Quality preservation is critical for DeepSeek, especially for its reasoning capabilities (R1). We evaluated each format on a mix of coding, reasoning, and general knowledge benchmarks.

Format         Coding (HumanEval)   Reasoning (MATH)   General (MMLU)   Overall Quality
FP8 (native)   98%                  99%                99%              Baseline
INT8 (W8A8)    97%                  98%                98%              Near-lossless
AWQ 4-bit      94%                  95%                96%              Minor degradation
GPTQ 4-bit     93%                  94%                95%              Minor degradation
GGUF Q4_K_M    93%                  94%                96%              Minor degradation

Scores are shown as a percentage of the FP8 baseline. FP8 and INT8 are essentially lossless. INT4 formats show 4-7% degradation on the most demanding tasks, which is acceptable for most production workloads but worth testing on your specific use case.
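Relative scores like these are simple to compute from raw benchmark results. The raw numbers in the snippet are invented solely to show the arithmetic and are not our measurements:

```python
def relative_scores(raw: dict, baseline: dict) -> dict:
    """Express each format's raw benchmark scores as a rounded percentage
    of the FP8 baseline, as in the table above."""
    return {fmt: {bench: round(100 * score / baseline[bench])
                  for bench, score in scores.items()}
            for fmt, scores in raw.items()}

baseline = {"HumanEval": 82.0, "MATH": 61.0}            # hypothetical FP8 scores
raw = {"AWQ 4-bit": {"HumanEval": 77.1, "MATH": 58.0}}  # hypothetical
relative_scores(raw, baseline)
# {'AWQ 4-bit': {'HumanEval': 94, 'MATH': 95}}
```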

Best Format for Each GPU Config

GPU Configuration        Recommended Format   Why
8x RTX 6000 Pro 96 GB    FP8 (native)         640 GB is enough for full FP8; best quality, native support
4x RTX 6000 Pro 96 GB    INT4 AWQ             320 GB requires quantisation; AWQ balances quality and speed
8x RTX 5090              INT4 AWQ             256 GB is too small for FP8; AWQ preferred for reasoning quality
8x RTX 3090              INT4 GPTQ            192 GB is tight; GPTQ maximises speed on limited VRAM
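The recommendations reduce to a simple threshold rule on total VRAM. The sketch below merely encodes this article's table; the cut-offs are our figures for DeepSeek V3/R1, not universal limits:

```python
def recommend_format(total_vram_gb: int) -> str:
    """Map total VRAM to the recommended DeepSeek quantisation format,
    following the recommendation table above."""
    if total_vram_gb >= 640:
        return "FP8 (native)"   # enough room for native FP8 weights
    if total_vram_gb >= 256:
        return "INT4 AWQ"       # quantisation required; favour quality
    return "INT4 GPTQ"          # tight VRAM; favour speed

recommend_format(640)  # 'FP8 (native)'
recommend_format(320)  # 'INT4 AWQ'
recommend_format(192)  # 'INT4 GPTQ'
```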

For context length scaling at each quantisation level, see our DeepSeek context length VRAM guide. For multi-GPU splitting strategies, read about model sharding for large models.

Conclusion

DeepSeek’s native FP8 training makes it uniquely well-suited for FP8 inference — use it whenever you have the VRAM budget (640+ GB). When memory is constrained, AWQ 4-bit offers the best quality/speed balance, while GPTQ 4-bit maximises throughput on tight configurations. Avoid GGUF for pure GPU serving unless you specifically need CPU offloading. For DeepSeek hosting options, explore the GPU configurations below.

Deploy DeepSeek on Dedicated GPUs

Multi-GPU clusters with up to 640 GB VRAM, built for large MoE model inference at any quantisation level.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
