The RTX 6000 Pro 96GB is Blackwell’s flagship workstation GPU: 24,064 CUDA cores, 96 GB of GDDR7 with ECC, roughly 1.4 TB/s bandwidth, and an NVLink-pair option to combine two cards into a 192 GB unified memory pool. At roughly £8,500 in the UK in 2026 it is six to eight times the price of the RTX 4090 24GB. For most AI inference workloads on UK GPU hosting that premium is wasted; for a specific set of large-model and ECC-mandatory workloads, it is the only single-card answer. This post explains exactly where each card belongs.
Contents
- Spec sheet side by side
- 96GB and what it unlocks
- ECC, NVLink and reliability features
- Per-workload throughput comparison
- Power and £/token economics
- Per-workload winner table
- vLLM serving examples
- Production gotchas
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | RTX 6000 Pro (Blackwell) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC 4NP | Refined |
| SM count | 128 | 188 | +47% |
| CUDA cores | 16,384 | 24,064 | +47% |
| Tensor cores | 512 (4th gen, FP8) | 752 (5th gen, FP8 + FP4) | +47% |
| Boost clock | 2.52 GHz | ~2.4 GHz | -5% |
| VRAM | 24 GB GDDR6X (21 Gbps) | 96 GB GDDR7 ECC (28 Gbps) | 4x capacity |
| Memory bandwidth | 1008 GB/s | ~1.4 TB/s | +39% |
| Memory bus | 384-bit | 512-bit | +33% |
| L2 cache | 72 MB | ~128 MB | +78% |
| FP16 dense TFLOPS | 165 | ~232 | +41% |
| FP8 TFLOPS (sparse) | 660 | ~930 | +41% |
| FP4 TFLOPS (sparse) | None | ~1860 | New |
| ECC memory | No | Yes | Workstation grade |
| NVLink | None | Pair option (2x96GB = 192GB) | Multi-card scale |
| TDP | 450W | 300W | -33% |
| Form factor | 3.5-slot consumer | 2-slot workstation | Server-friendly |
Three things stand out: the 6000 Pro pairs 4x the VRAM with 39% more bandwidth and 33% lower TDP. NVIDIA achieved the lower TDP partly through stricter binning and partly through a flatter power curve targeted at sustained workstation duty cycles rather than gaming peaks. The 2-slot form factor matters in dense server deployments where 3.5-slot 4090s eat chassis real estate.
96GB and what it unlocks
| Model / configuration | RTX 4090 24GB | RTX 6000 Pro 96GB |
|---|---|---|
| Llama 3.1 8B FP8 + 64k context | Tight | Trivial |
| Llama 3.1 70B AWQ INT4 + 16k | Tight | Trivial (32k+) |
| Llama 3.1 70B FP8 (~70 GB) | OOM | Comfortable |
| Llama 3.1 70B BF16 (140 GB) | OOM | OOM (single card) |
| Llama 3.1 70B BF16 NVLink pair (192 GB) | n/a | Comfortable |
| Qwen 2.5 72B FP8 (72 GB) | OOM | Comfortable |
| Mixtral 8x22B AWQ (74 GB) | OOM | Comfortable |
| DeepSeek V2 236B AWQ (118 GB) | OOM | OOM (single) |
| FLUX.1-dev FP16 + LoRA training | Tight | Trivial |
| 50 concurrent Llama 8B sessions | OOM at KV | Comfortable |
96GB unlocks: Llama 70B at FP8 (no INT4 quality compromise), Qwen 72B at FP8, Mixtral 8x22B, FLUX with full training rigs, and very high concurrency on smaller models. NVLink pair extends this to 192GB for Llama 70B BF16 or full DeepSeek V2.
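The fit calls in the table follow from simple arithmetic: weights at bytes-per-parameter, plus KV cache scaling with layers, KV heads and context length. A minimal sketch, assuming Llama-3.1-70B-like shapes (80 layers, 8 GQA KV heads, head dim 128) and FP8 for both weights and KV cache; the figures are illustrative estimates, not measurements:

```python
# Rough VRAM-fit estimator for a dense transformer: weights + KV cache.
# Shapes below assume a Llama-3.1-70B-style architecture (illustrative).

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    # params in billions; 1e9 bytes per GB
    return params_b * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    # two tensors (K and V) per layer, per token
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Llama 3.1 70B at FP8: 80 layers, 8 KV heads (GQA), head_dim 128, 32k context
w = weights_gb(70, 1.0)                    # ~70 GB of weights
kv = kv_cache_gb(80, 8, 128, 32768, 1.0)   # ~5.4 GB of FP8 KV cache
print(f"weights {w:.0f} GB + KV {kv:.1f} GB = {w + kv:.1f} GB")
```

At FP8 the weights alone rule out a 24 GB card before a single token of KV cache is allocated, while a 96 GB card still has ~20 GB of headroom for batching.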
ECC, NVLink and reliability features
ECC is the workstation-grade feature most often hand-waved in inference comparisons. Single-bit memory errors do happen on consumer GDDR6X — rarely enough to ignore for a chatbot, but unacceptable for production fine-tuning, where a corrupted gradient can poison a 24-hour training run. The 6000 Pro’s ECC catches and corrects single-bit errors transparently and reports double-bit errors. Combined with NVIDIA’s longer driver support cycle (workstation drivers get 5+ years of LTS) and warranty (3-year ProSupport vs 1-year consumer), the 6000 Pro is the right card for any deployment where uptime and data integrity are contractually required.
NVLink at 900 GB/s between paired 6000 Pros is the other big-ticket feature. The 4090 has no NVLink — multi-card inference goes over PCIe Gen 4 at ~28 GB/s effective, which is fine for the small all-reduce payloads of tensor-parallel inference but becomes a bottleneck for training, where full gradient exchanges cross the link every step. See multi-card pairing for the consumer-card workarounds.
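A back-of-envelope model shows why PCIe is tolerable for tensor-parallel inference but not for training. The assumptions below (hidden size 8192, one FP16 all-reduce per layer, a ring all-reduce moving ~2x the payload) are simplifications for illustration, not measurements:

```python
# Per-token all-reduce cost in 2-way tensor parallel, per interconnect.
# Assumed shapes: hidden 8192, 80 layers, FP16 (2 bytes) activations.

HIDDEN, LAYERS, BYTES = 8192, 80, 2

def allreduce_us_per_token(link_gbs: float) -> float:
    payload = 2 * HIDDEN * BYTES * LAYERS        # bytes moved per token (ring ~2x)
    return payload / (link_gbs * 1e9) * 1e6      # microseconds

pcie = allreduce_us_per_token(28)     # PCIe Gen 4 x16, effective
nvlink = allreduce_us_per_token(900)  # NVLink pair
print(f"PCIe: {pcie:.0f} us/token, NVLink: {nvlink:.1f} us/token")
```

Under these assumptions the PCIe cost is on the order of 0.1 ms per token against a ~26 ms decode step for a 70B model — noise for inference. Training all-reduces move multi-gigabyte gradient tensors every step, where the 30x link gap dominates.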
Per-workload throughput comparison
| Workload | RTX 4090 | RTX 6000 Pro | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 225 t/s | 1.14x |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 1380 t/s | 1.25x |
| Llama 3.1 70B AWQ decode b1 | 22-24 t/s | 38 t/s | 1.65x |
| Llama 3.1 70B FP8 decode b1 | OOM | 32 t/s | 6000 Pro only |
| Qwen 2.5 72B FP8 decode b1 | OOM | 22 t/s | 6000 Pro only |
| Mixtral 8x22B AWQ | OOM | 26 t/s | 6000 Pro only |
| SDXL 1024×1024 | 2.0s | 1.7s | 1.18x |
| FLUX.1-dev FP16 | 6.2s | 4.5s | 1.38x |
| QLoRA Llama 8B (steps/s) | 2.6 | 3.3 | 1.27x |
| 50 concurrent Llama 8B FP8 | OOM | ~3500 t/s aggregate | 6000 Pro only |
For workloads both cards run, the 6000 Pro is 1.14-1.65x faster — the larger die and bandwidth pull ahead, but the gap is smaller than 6x price would suggest. For workloads only the 6000 Pro can run, you are paying for capability, not speed.
Power and £/token economics
| Metric | RTX 4090 | RTX 6000 Pro |
|---|---|---|
| TDP | 450W | 300W |
| Sustained LLM b32 | 360W | 250W |
| Tokens/Joule (Llama 8B FP8 b32) | 3.05 | 5.52 |
| UK price (typical 2026) | £1,300 | £8,500 |
| £/aggregate t/s (b32) | £1.18 | £6.16 |
| £/GB VRAM | £54 | £89 |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £394 |
| £/year capex (3-yr) | £433 | £2,833 |
| Total £/year | £1,001 | £3,227 |
For workloads where both cards work, the 4090 wins on £/token by a factor of 5. The 6000 Pro’s better tokens-per-joule is real but doesn’t close the gap meaningfully — capex dominates. The 6000 Pro pays back only when you genuinely need the VRAM, ECC or NVLink. See the monthly hosting cost and tokens-per-watt analyses.
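The table's totals can be reproduced with straightforward amortisation. A sketch using the same assumptions as the table (3-year capex write-off, 24/7 operation at the sustained draw, £0.18/kWh):

```python
# Annual cost of ownership: capex amortised over 3 years plus 24/7 power.
# Inputs mirror the table above; kwh_price and lifetime are its assumptions.

def annual_cost(price_gbp: float, watts: float,
                kwh_price: float = 0.18, years: int = 3) -> float:
    power = watts / 1000 * 8760 * kwh_price   # 8760 hours in a year
    return price_gbp / years + power

rtx4090 = annual_cost(1300, 360)   # matches the table's ~£1,001
rtx6000 = annual_cost(8500, 250)   # ~£3,227 (components rounded separately)
print(f"4090: £{rtx4090:.0f}/yr, 6000 Pro: £{rtx6000:.0f}/yr")
```

Even with the 6000 Pro's lower power bill, capex dominates: the electricity saving of ~£174/year never offsets a £2,400/year amortisation gap.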
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | 5x cheaper, throughput suffices |
| 12-engineer Qwen Coder 32B AWQ | 4090 | Fits, 65 t/s is sufficient |
| Llama 70B FP8 production endpoint | 6000 Pro | 4090 cannot fit FP8 |
| Qwen 72B coding endpoint | 6000 Pro | 4090 OOM |
| Mixtral 8x22B | 6000 Pro | 4090 OOM |
| 50-100 concurrent 8B sessions | 6000 Pro | 4090 KV cache exhausted |
| Regulated industry (finance, medical) | 6000 Pro | ECC mandatory |
| Production training (24-hr+ runs) | 6000 Pro | ECC + NVLink + warranty |
| FLUX studio at scale | 6000 Pro | FP16 + caching headroom |
| Capex-bounded MVP under £2k | 4090 | Only option |
vLLM serving examples
```bash
# RTX 4090 — Llama 70B AWQ INT4, the biggest model that fits
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
```

```bash
# RTX 6000 Pro — same model at FP8 (no INT4 quality loss), 32k context
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.92
```

```bash
# RTX 6000 Pro NVLink pair — Llama 70B at full BF16 across 2 cards
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 8 \
  --gpu-memory-utilization 0.90
```
Production gotchas
- 6000 Pro is not always faster on smaller models. For Llama 8B, the 4090 is within 15-25% — the 6000 Pro’s extra silicon is wasted. Don’t pay 6x for 1.2x.
- NVLink requires NVLink bridges and chassis support. Not every server can host paired 6000 Pros; budget for the bridges and the chassis upgrade.
- ECC has a real performance cost. Inline ECC costs roughly 5-7% of effective bandwidth versus running the same GDDR7 without it; the headline ~1.4 TB/s figure is raw bandwidth before that overhead.
- Workstation drivers have different release cadence. Production-validated NVIDIA Studio / Enterprise drivers lag Game Ready by 2-4 weeks.
- 4090 has no warranty in datacentre use. Strictly, NVIDIA does not warrant the 4090 for server deployment. The 6000 Pro is the supported choice.
- 96GB VRAM does not guarantee 96GB usable. vLLM’s `--gpu-memory-utilization` still applies; expect 88-92 GB usable for weights and KV cache combined.
- 2-slot form factor is great until you need cooling headroom. Densely packed 6000 Pros in a 4U chassis need aggressive airflow; a single 4090 with three fans often runs cooler in isolation.
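The usable-VRAM and high-concurrency points can be sanity-checked together. A sketch assuming Llama-3.1-8B-like shapes (32 layers, 8 GQA KV heads, head dim 128), FP8 weights (~8 GB) and KV cache, 8k live tokens per session, and 90% memory utilisation — all illustrative assumptions:

```python
# How many concurrent sessions fit? KV budget = usable VRAM - weights.
# Shapes assume a Llama-3.1-8B-style model; all figures are estimates.

def kv_per_session_gb(layers=32, kv_heads=8, head_dim=128,
                      tokens=8192, bytes_per_elem=1) -> float:
    # K and V tensors per layer, FP8, 8k live tokens per session
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

def max_sessions(vram_gb: float, weights_gb: float = 8.0,
                 util: float = 0.90) -> int:
    budget = vram_gb * util - weights_gb      # VRAM left for KV cache
    return int(budget / kv_per_session_gb())

# A 24 GB card tops out near 25 such sessions; 96 GB holds well over 100
print(max_sessions(24), max_sessions(96))
```

This is the arithmetic behind the "50 concurrent Llama 8B sessions" row: the 4090 exhausts its KV budget around half the target, while the 6000 Pro has roughly 3x headroom beyond it.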
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB, you do not need ECC or NVLink, and price matters. This describes the majority of inference workloads in 2026. See the 4090 to 6000 Pro upgrade guide.
- Pick the RTX 6000 Pro 96GB if you serve 70B+ at FP8, need 32GB+ for FLUX or production training, require ECC for regulated workloads, want NVLink for tensor-parallel scaling, or need single-card serve of Mixtral 8x22B / Qwen 72B / DeepSeek-class models.
- Pick neither if you need sub-second 70B inference on 100+ concurrent users — go to H100 80GB with HBM3 bandwidth.
For a 200-MAU SaaS, the 4090 is the right answer. For a regulated fintech building a Llama 70B FP8 endpoint with audit requirements, the 6000 Pro is the only defensible choice.
Start on the 4090, scale to the 6000 Pro when capacity demands it
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB with a clean upgrade path. Run your MVP affordably, then move to a workstation card when 24GB is the bottleneck.
Order the RTX 4090 24GB

See also: vs RTX 5090 32GB, vs H100 80GB, vs A100 80GB, RTX 4090 spec breakdown, multi-card pairing, upgrade to 6000 Pro, 2026 tier positioning.