Why DeepSeek Quantisation Is Different
DeepSeek V3 and R1 are Mixture-of-Experts (MoE) models with 671 billion total parameters, making quantisation not just an optimisation but a necessity for practical dedicated GPU hosting. Unlike dense models where quantisation is optional, DeepSeek’s sheer size means you almost certainly need some form of weight compression to fit it on available hardware. The MoE architecture also introduces unique considerations — expert routing weights must remain precise while individual expert layers can tolerate more aggressive quantisation.
For baseline VRAM requirements see our DeepSeek VRAM requirements guide. For context on how different formats work generally, read the GPTQ vs AWQ vs GGUF quantisation guide.
Available Quantisation Formats
DeepSeek V3 was trained natively with FP8 precision, which gives it an advantage over models that must be post-training quantised from FP16. Here are the primary options:
- FP8 (native): the default recommended precision. No post-training quantisation needed — weights are already in FP8 from training. Minimal quality loss.
- GPTQ 4-bit: aggressive weight-only quantisation using calibration data. Reduces the weights to roughly 340 GB (about half a byte per parameter). Requires ExLlama v2 kernels for optimal speed.
- AWQ 4-bit: activation-aware quantisation that preserves important weights at higher precision. Similar size to GPTQ with slightly better quality on reasoning tasks.
- GGUF (various): llama.cpp format supporting CPU/GPU hybrid inference. Useful for partial offloading when GPU VRAM is insufficient for the full model.
- INT8 (W8A8): 8-bit weight and activation quantisation. Larger than INT4 but better quality preservation.
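Since each format above is essentially a fixed number of bits per parameter, the weights-only footprint is simple arithmetic: parameter count × bits ÷ 8. A minimal sketch (bit widths are nominal; GGUF Q4_K_M averages closer to 4.85 bits per weight, and real deployments add KV cache and runtime overhead on top):

```python
# Weights-only footprint per format for DeepSeek V3 (671B total
# parameters). Ignores KV cache, activations, and per-format metadata
# (scales, zero-points), so real VRAM usage is higher.

PARAMS = 671e9  # total parameter count, including all experts

BITS_PER_PARAM = {
    "FP8 (native)": 8,
    "INT8 (W8A8)": 8,
    "GPTQ 4-bit": 4,
    "AWQ 4-bit": 4,
    "GGUF Q4_K_M": 4.85,  # K-quants average above 4 bits per weight
}

def weight_gb(fmt: str) -> float:
    """Weights-only size in gigabytes (decimal GB) for a format."""
    return PARAMS * BITS_PER_PARAM[fmt] / 8 / 1e9

for fmt, bits in BITS_PER_PARAM.items():
    print(f"{fmt:>13} ({bits} bpw): ~{weight_gb(fmt):.0f} GB")
```

This is why INT8 buys no capacity over native FP8 for this model: both are one byte per weight, so INT8's only role is on runtimes without FP8 kernel support.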
Performance by Format and GPU
The table below shows approximate output speed in tokens per second for DeepSeek V3 at different quantisation levels. Due to the model’s size, all configurations use tensor parallelism across multiple GPUs. Measured with vLLM at 512 input / 256 output tokens.
| GPU Config | Total VRAM | FP8 (tok/s) | INT8 (tok/s) | INT4 GPTQ (tok/s) | INT4 AWQ (tok/s) |
|---|---|---|---|---|---|
| 8x RTX 3090 | 192 GB | N/A | N/A | 8 | 7 |
| 8x RTX 5090 | 256 GB | N/A | 12 | 18 | 17 |
| 4x RTX 6000 Pro 96 GB | 384 GB | N/A | 15 | 22 | 21 |
| 8x RTX 6000 Pro 96 GB | 768 GB | 25 | 28 | 35 | 34 |
INT4 quantisation delivers the highest throughput because decoding is memory-bandwidth-bound: smaller weights mean less data streamed per generated token, which is the bottleneck for MoE models with many experts. FP8 weights alone occupy roughly 671 GB (one byte per parameter), so FP8 is limited to the largest eight-GPU configurations.
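The throughput ordering follows directly from the bandwidth bound: a rough ceiling is aggregate memory bandwidth divided by the bytes of active weights streamed per token. A sketch of that ceiling, assuming ~37B active parameters per token (from the DeepSeek V3 model card) and a nominal ~1.8 TB/s per RTX 6000 Pro — both spec-sheet assumptions, not figures from the benchmarks above:

```python
# Rough ceiling on decode throughput at batch size 1: every generated
# token must stream the active expert weights from VRAM, so
#   tok/s <= aggregate_bandwidth / (active_params * bytes_per_param).
# ACTIVE_PARAMS (~37B per token) and the 1.8 TB/s bandwidth figure are
# assumptions; real systems land well below this bound.

ACTIVE_PARAMS = 37e9

def decode_ceiling(num_gpus: int, bandwidth_gbs: float, bits: float) -> float:
    """Upper bound on tokens/s for bandwidth-bound decoding."""
    bytes_per_token = ACTIVE_PARAMS * bits / 8
    return num_gpus * bandwidth_gbs * 1e9 / bytes_per_token

fp8 = decode_ceiling(8, 1800, 8)   # ~389 tok/s ceiling
int4 = decode_ceiling(8, 1800, 4)  # exactly double the FP8 ceiling
print(f"FP8 ceiling:  {fp8:.0f} tok/s")
print(f"INT4 ceiling: {int4:.0f} tok/s")
```

Measured gaps are smaller than the 2x ceiling suggests because attention, routing, and inter-GPU communication costs do not shrink with weight precision.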
Quality Comparison
Quality preservation is critical for DeepSeek, especially for its reasoning capabilities (R1). We evaluated each format on a mix of coding, reasoning, and general knowledge benchmarks.
| Format | Coding (HumanEval) | Reasoning (MATH) | General (MMLU) | Overall Quality |
|---|---|---|---|---|
| FP8 (native) | 98% | 99% | 99% | Baseline |
| INT8 (W8A8) | 97% | 98% | 98% | Near-lossless |
| AWQ 4-bit | 94% | 95% | 96% | Minor degradation |
| GPTQ 4-bit | 93% | 94% | 95% | Minor degradation |
| GGUF Q4_K_M | 93% | 94% | 96% | Minor degradation |
Scores are shown as a percentage of the FP8 baseline. FP8 and INT8 are essentially lossless. INT4 formats show 4-7% degradation on the most demanding tasks, which is acceptable for most production workloads but worth testing on your specific use case.
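When validating a quantised build on your own workload, the acceptance test reduces to checking retention against the FP8 baseline per task. A minimal sketch using the table's retention figures (the 95% cutoff is an illustrative choice, not a recommendation):

```python
# Flag formats whose retention versus the FP8 baseline falls below a
# per-task threshold. Scores mirror the quality table above (percent
# of the FP8 baseline); the 95% cutoff is illustrative only.

SCORES = {
    "INT8 (W8A8)": {"HumanEval": 97, "MATH": 98, "MMLU": 98},
    "AWQ 4-bit":   {"HumanEval": 94, "MATH": 95, "MMLU": 96},
    "GPTQ 4-bit":  {"HumanEval": 93, "MATH": 94, "MMLU": 95},
    "GGUF Q4_K_M": {"HumanEval": 93, "MATH": 94, "MMLU": 96},
}

def below_threshold(threshold: float = 95.0) -> dict[str, list[str]]:
    """Map each format to the benchmarks where it misses the cutoff."""
    return {
        fmt: [task for task, score in tasks.items() if score < threshold]
        for fmt, tasks in SCORES.items()
    }

for fmt, misses in below_threshold().items():
    status = "ok" if not misses else "below 95% on " + ", ".join(misses)
    print(f"{fmt}: {status}")
```

The same structure works with your own benchmark harness: replace SCORES with measured results and pick a threshold that matches your quality bar.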
Best Format for Each GPU Config
| GPU Configuration | Recommended Format | Why |
|---|---|---|
| 8x RTX 6000 Pro 96 GB | FP8 (native) | Enough VRAM for full FP8; best quality with native support |
| 4x RTX 6000 Pro 96 GB | INT4 AWQ | Not enough VRAM for FP8; AWQ balances quality and speed |
| 8x RTX 5090 | GGUF Q4_K_M | 256 GB cannot hold even the full 4-bit weights of a 671B-parameter model; partial CPU offloading needed |
| 8x RTX 3090 | GGUF Q4_K_M | 192 GB is far below the 4-bit weight footprint; CPU/GPU hybrid inference is the only route |
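The table collapses into a simple rule keyed on total VRAM. A sketch using weights-only footprints (671 GB for FP8, roughly 340 GB for 4-bit); the thresholds add a little headroom for KV cache and are approximations, not hard limits:

```python
# Choose a quantisation format from total VRAM across all GPUs.
# Thresholds are weights-only footprints plus rough KV-cache headroom:
# FP8 weights ~671 GB, INT4 weights ~340 GB. Below the INT4 floor,
# pure-GPU serving is impossible and GGUF hybrid offloading remains.

def pick_format(total_vram_gb: float) -> str:
    if total_vram_gb >= 700:   # room for native FP8 plus KV cache
        return "FP8 (native)"
    if total_vram_gb >= 384:   # full 4-bit weights fit on-GPU
        return "INT4 AWQ"
    return "GGUF Q4_K_M + CPU offload"

print(pick_format(768))  # FP8 (native)
print(pick_format(384))  # INT4 AWQ
print(pick_format(192))  # GGUF Q4_K_M + CPU offload
```

AWQ is the mid-range default here for its slightly better reasoning retention; swap in GPTQ when raw throughput matters more than the last point of quality.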
For context length scaling at each quantisation level, see our DeepSeek context length VRAM guide. For multi-GPU splitting strategies, read about model sharding for large models.
Conclusion
DeepSeek’s native FP8 training makes it uniquely well-suited for FP8 inference: use it whenever you have the VRAM budget (roughly 700 GB or more). When memory is constrained, AWQ 4-bit offers the best quality/speed balance, while GPTQ 4-bit maximises throughput on tight configurations. Avoid GGUF for pure GPU serving unless you specifically need CPU offloading. For DeepSeek hosting options, explore the GPU configurations below.
Deploy DeepSeek on Dedicated GPUs
Multi-GPU clusters with up to 640 GB VRAM, built for large MoE model inference at any quantisation level.
Browse GPU Servers