
DeepSeek R1 Distill Tokens/sec by GPU

Benchmark results for DeepSeek R1 Distill inference speed across six GPUs, comparing FP16, INT8, and INT4 quantisation with cost-per-token analysis.

DeepSeek R1 Distill Benchmark Overview

DeepSeek R1 Distill is a distilled reasoning model that brings chain-of-thought capabilities into a compact 7B-parameter package. It excels at maths, logic, and step-by-step problem solving while remaining deployable on a single dedicated GPU server. We benchmark inference speed across six GPUs to guide your hardware selection.

All tests were conducted on GigaGPU servers running vLLM, with a 512-token input prompt and 256 tokens of generated output. DeepSeek R1 Distill 7B requires approximately 14 GB of VRAM at FP16. For our full testing methodology, refer to the tokens per second benchmark page.
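As a quick sanity check on that footprint: FP16 stores two bytes per parameter, so the weights alone for a 7B model come to roughly 13 GiB, and the ~14 GB figure above includes serving overhead. A minimal sketch (the helper name is ours, not part of any benchmark harness):

```python
def fp16_weight_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; KV cache and activations add more on top."""
    return params_billion * 1e9 * bytes_per_param / 2**30

print(round(fp16_weight_gib(7), 1))  # ~13.0 GiB before serving overhead
```

This is why the 6 GB and 8 GB cards in the table below cannot load the model at FP16 at all.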

Tokens/sec Results by GPU

| GPU | VRAM | R1 Distill 7B FP16 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM for FP16 |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM for FP16 |
| RTX 4060 Ti | 16 GB | 30 | Comfortable fit |
| RTX 3090 | 24 GB | 42 | Plenty of headroom |
| RTX 5080 | 16 GB | 66 | Next-gen advantage |
| RTX 5090 | 32 GB | 92 | Top single-GPU speed |

DeepSeek R1 Distill performs similarly to other 7B models at FP16. The RTX 5090 at 92 tok/s provides excellent latency for interactive reasoning applications.
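Throughput translates directly into interactive latency: at a fixed output length, decode time is simply tokens divided by tok/s. A small illustration (decode-only; prompt prefill is not included):

```python
def decode_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Decode time only; prefill of the 512-token prompt adds a little extra."""
    return output_tokens / tok_per_s

# RTX 5090 at 92 tok/s generating the benchmark's 256-token output:
print(f"{decode_seconds(256, 92):.2f} s")  # ~2.78 s
```

At 30 tok/s the same response takes over 8 seconds, which is the practical difference between a real-time chatbot and a batch-style API.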

FP16 vs INT8 vs INT4 Comparison

Quantisation opens R1 Distill to budget GPUs while also boosting throughput on higher-end cards. See our quantisation speed comparison for a detailed analysis of precision trade-offs.

| GPU | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|---|
| RTX 3050 (6 GB) | N/A | N/A | 9 |
| RTX 4060 (8 GB) | N/A | 17 | 22 |
| RTX 4060 Ti (16 GB) | 30 | 36 | 44 |
| RTX 3090 (24 GB) | 42 | 50 | 59 |
| RTX 5080 (16 GB) | 66 | 76 | 89 |
| RTX 5090 (32 GB) | 92 | 108 | 126 |

INT4 delivers 35-47% more tokens per second than FP16, with reasoning tasks showing only minor quality impacts at 4-bit precision. For maths-heavy workloads where accuracy is paramount, INT8 offers a solid middle ground.
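The 35-47% range comes straight from the table; here is the arithmetic, using only the cards that can run FP16:

```python
# FP16 and INT4 tok/s from the comparison table above.
fp16 = {"RTX 4060 Ti": 30, "RTX 3090": 42, "RTX 5080": 66, "RTX 5090": 92}
int4 = {"RTX 4060 Ti": 44, "RTX 3090": 59, "RTX 5080": 89, "RTX 5090": 126}

# Percentage gain of INT4 over FP16, rounded to the nearest whole percent.
gain = {gpu: round((int4[gpu] / fp16[gpu] - 1) * 100) for gpu in fp16}
print(gain)  # {'RTX 4060 Ti': 47, 'RTX 3090': 40, 'RTX 5080': 35, 'RTX 5090': 37}
```

Interestingly, the older mid-range cards see the biggest relative gains, since they are the most memory-bandwidth-bound at FP16.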

Cost Efficiency Analysis

| GPU | FP16 (tok/s) | Approx. Monthly Cost | tok/s per £ |
|---|---|---|---|
| RTX 4060 Ti | 30 | ~£75 | 0.40 |
| RTX 3090 | 42 | ~£110 | 0.38 |
| RTX 5080 | 66 | ~£160 | 0.41 |
| RTX 5090 | 92 | ~£250 | 0.37 |

The RTX 5080 narrowly leads on cost efficiency at FP16, with the RTX 4060 Ti close behind. Check our best GPU for DeepSeek guide for more detailed recommendations.
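The tok/s-per-pound column is simply FP16 throughput divided by monthly cost; a sketch using the table's approximate prices:

```python
# (FP16 tok/s, approx. monthly cost in GBP) per the tables above.
gpus = {"RTX 4060 Ti": (30, 75), "RTX 3090": (42, 110),
        "RTX 5080": (66, 160), "RTX 5090": (92, 250)}

efficiency = {gpu: round(t / cost, 2) for gpu, (t, cost) in gpus.items()}
best = max(efficiency, key=efficiency.get)
print(best, efficiency[best])  # RTX 5080 0.41
```

The spread is narrow (0.37-0.41), so absolute throughput or VRAM headroom is usually the better tiebreaker than cost efficiency alone.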

GPU Recommendations

  • Best budget: RTX 4060 Ti — 30 tok/s FP16 handles development and low-traffic reasoning APIs well.
  • Best value: RTX 5080 — highest cost efficiency with strong 66 tok/s at FP16.
  • Best performance: RTX 5090 — 92 tok/s for real-time reasoning chatbot deployments.
  • Budget INT4: RTX 4060 — 22 tok/s at INT4 for experimentation and prototyping.

Compare DeepSeek R1 Distill with other 7B models in our Qwen 2.5 7B benchmark or the Gemma 2 9B results. Browse all results in the Benchmarks category.

Conclusion

DeepSeek R1 Distill brings powerful reasoning to a deployable 7B model. With FP16 running comfortably on 16 GB GPUs and INT4 making even 6-8 GB cards viable, it is one of the most accessible reasoning models available for dedicated server deployment.

Deploy DeepSeek R1 Distill on Dedicated Hardware

Bare-metal GPU servers with full root access, optimised for LLM reasoning workloads.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
