DeepSeek V3 in April 2026
DeepSeek V3 stands as the highest-performing open-source LLM available in April 2026, matching GPT-4o across most benchmarks while running on self-hosted hardware. Its Mixture-of-Experts architecture with 671 billion total parameters but only ~37 billion active per token makes it remarkably efficient for its quality level. This performance report covers real-world throughput and deployment data from testing on GigaGPU dedicated servers.
DeepSeek V3 is released under the MIT license, which permits commercial use with essentially no restrictions beyond preserving the license notice. See the licensing guide for details.
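The efficiency claim rests on the active-parameter ratio, a quick check using the counts quoted above:

```python
# Fraction of DeepSeek V3's weights that are active for any given token.
TOTAL_PARAMS = 671e9   # total parameters across all experts
ACTIVE_PARAMS = 37e9   # parameters activated per token

ratio = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active fraction: {ratio:.1%}")  # roughly 5.5% of weights per token
```

Only this active slice participates in each forward pass, which is why per-token compute is closer to a ~37B dense model than a 671B one.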
Throughput Benchmarks by GPU
Tested via vLLM at 10 concurrent users, 512-token prompt, 256-token generation:
| GPU Configuration | Precision | Total tok/s | First Token | Per-User tok/s |
|---|---|---|---|---|
| 2x RTX 5090 | FP16 (active) | 72 | 185 ms | 7.2 |
| 4x RTX 5090 | FP16 (active) | 130 | 105 ms | 13.0 |
| 1x RTX 6000 Pro 96 GB | FP16 (active) | 88 | 145 ms | 8.8 |
| 2x RTX 6000 Pro 96 GB | FP16 (active) | 155 | 78 ms | 15.5 |
| 1x RTX 5090 | Q4 (expert) | 38 | 280 ms | 3.8 |
The MoE architecture allows DeepSeek V3 to run on dual RTX 5090s, which is remarkable for a 671B-parameter model. Throughput on consumer hardware is practical for production deployments serving 5-15 concurrent users.
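The per-user column in the table is simply aggregate throughput divided evenly across concurrent users, a minimal sketch using the dual-5090 row:

```python
def per_user_tps(total_tps: float, concurrent_users: int) -> float:
    """Average generation rate each user sees when decode
    capacity is shared evenly across active requests."""
    return total_tps / concurrent_users

# 2x RTX 5090 row: 72 tok/s aggregate at 10 concurrent users
print(per_user_tps(72, 10))   # 7.2 tok/s per user
```

In practice continuous batching means individual users see some variance around this average, but it is a reasonable capacity-planning figure.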
Memory and VRAM Requirements
| Configuration | VRAM Required | Minimum Hardware |
|---|---|---|
| FP16 (full weights) | ~320 GB | 4x RTX 6000 Pro 96 GB |
| FP16 (active experts only) | ~80 GB | 2x RTX 5090 or 1x RTX 6000 Pro |
| Q4 quantised | ~45 GB | 2x RTX 5090 |
| Q4 (aggressive offload) | ~22 GB | 1x RTX 5090* |
*Single RTX 5090 with CPU offloading incurs significant throughput reduction but is usable for low-concurrency deployments.
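The active-experts row can be sanity-checked from first principles: weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and framework buffers. A sketch (the 10% overhead factor is an assumption, not a measured value):

```python
def estimate_vram_gb(params: float, bytes_per_param: float,
                     overhead: float = 0.1) -> float:
    """Rough VRAM estimate: weights plus a fractional allowance
    for KV cache, activations, and framework buffers."""
    weights_gb = params * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

# ~37B active parameters at FP16 (2 bytes per parameter)
print(round(estimate_vram_gb(37e9, 2), 1))  # ~81 GB, consistent with the ~80 GB row
```

The same arithmetic explains the Q4 rows: quantising to ~0.5 bytes per parameter cuts weight memory by roughly 4x relative to FP16.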
Quality Benchmark Scores
| Benchmark | DeepSeek V3 | GPT-4o | LLaMA 3.1 70B |
|---|---|---|---|
| MMLU | 88.5 | 88.7 | 82.0 |
| HumanEval | 82.6 | 90.2 | 72.5 |
| GSM8K | 92.3 | 95.8 | 85.2 |
| MT-Bench | 9.1 | 9.3 | 8.4 |
DeepSeek V3 trails GPT-4o by less than 1% on MMLU and by roughly 2-8% on the math, coding, and chat benchmarks, while exceeding LLaMA 3.1 70B by 8-14% across the board. For a self-hostable model, this quality level is exceptional. See the full LLM benchmark rankings for broader comparisons.
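The percentage gaps can be reproduced directly from the score table:

```python
def relative_gap(ours: float, theirs: float) -> float:
    """Percent by which `theirs` exceeds `ours` (positive = we trail)."""
    return (theirs - ours) / theirs * 100

# DeepSeek V3 vs GPT-4o on HumanEval: 82.6 vs 90.2
print(f"{relative_gap(82.6, 90.2):.1f}%")      # ~8.4% behind on coding

# DeepSeek V3 vs LLaMA 3.1 70B on HumanEval: lead expressed against LLaMA's score
print(f"{(82.6 - 72.5) / 72.5 * 100:.1f}%")    # ~13.9% ahead
```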
Deployment Configuration
The recommended deployment for DeepSeek V3 in April 2026 uses vLLM with tensor parallelism across 2 or 4 GPUs. On a server with 2x RTX 5090, the model loads in approximately 45 seconds and begins serving immediately.
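A minimal launch sketch, assuming vLLM's standard `vllm serve` entry point and the public `deepseek-ai/DeepSeek-V3` model id (both assumptions; adjust for your checkpoint and vLLM version):

```python
# Build (but do not execute) the vLLM launch command for a
# 2-GPU tensor-parallel deployment; pass the list to subprocess.run to start.
def vllm_serve_cmd(model: str, tp_size: int, port: int = 8000) -> list[str]:
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),  # shard weights across GPUs
        "--port", str(port),                     # OpenAI-compatible HTTP port
    ]

cmd = vllm_serve_cmd("deepseek-ai/DeepSeek-V3", tp_size=2)
print(" ".join(cmd))
```

For the 4-GPU configuration, set `tp_size=4`; any OpenAI-compatible client can then point at the server's `/v1` endpoint.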
For teams that do not need DeepSeek V3's full quality, consider LLaMA 3.1 70B, which runs faster on the same hardware. The quality difference matters most for coding, math, and complex reasoning tasks; for general conversation, the gap is smaller. Compare throughput using the tokens per second benchmark.
Performance Verdict
DeepSeek V3 delivers the closest performance to GPT-4o of any self-hostable model in April 2026. The MoE architecture makes it practical on consumer GPU hardware, and the MIT license removes all commercial use barriers. For teams seeking the best open-source model quality on self-hosted infrastructure, DeepSeek V3 is the top choice.
Cost analysis for running DeepSeek V3 is available in the inference cost per query guide. For budget-constrained deployments, LLaMA 3.1 70B on a single RTX 5090 offers excellent quality at lower cost, covered in the LLaMA 3.1 performance report.