Blackwell Arrives in Consumer Hardware
NVIDIA’s Blackwell architecture has made the jump from datacentre to desktop. The RTX 5080 and RTX 5090 represent a generational leap for anyone running AI inference on dedicated GPU hosting, bringing architectural features previously reserved for the B100 and B200 down to consumer price points. For teams self-hosting open source models, this changes the cost-performance equation significantly.
The previous generation — Ada Lovelace in the RTX 40-series — already made consumer GPUs viable for production open source LLM hosting. Blackwell pushes that further with higher memory bandwidth, improved tensor core throughput, and native support for FP4 inference. Whether you are running LLaMA, DeepSeek, or Stable Diffusion, these cards merit serious consideration.
Here is what the specs actually mean for real-world AI workloads and how we see them fitting into the AI hosting landscape.
RTX 5080 & 5090 Specs at a Glance
Before diving into architecture, here are the numbers that matter for inference workloads:
| Specification | RTX 5080 | RTX 5090 | RTX 4090 | RTX 3090 |
|---|---|---|---|---|
| Architecture | Blackwell | Blackwell | Ada Lovelace | Ampere |
| VRAM | 16 GB GDDR7 | 32 GB GDDR7 | 24 GB GDDR6X | 24 GB GDDR6X |
| Memory Bandwidth | 960 GB/s | 1,792 GB/s | 1,008 GB/s | 936 GB/s |
| CUDA Cores | 10,752 | 21,760 | 16,384 | 10,496 |
| Tensor Cores (5th Gen) | 336 | 680 | 512 (4th Gen) | 328 (3rd Gen) |
| FP4 Tensor TOPS | 1,801 | 3,352 | N/A | N/A |
| TDP | 360W | 575W | 450W | 350W |
| MSRP | $999 | $1,999 | $1,599 | $1,499 |
The standout figure is the RTX 5090’s 32 GB of GDDR7 — the largest VRAM pool on any consumer GPU. For a full comparison across all cards we offer, see our best GPU for LLM inference benchmark.
What Blackwell Architecture Changes for AI
Three Blackwell features matter most for inference workloads on dedicated servers:
1. Fifth-generation Tensor Cores with FP4 support. Previous generations bottomed out at FP8 (Ada) or FP16/INT8 (Ampere). FP4 inference halves the memory footprint of quantised models compared to FP8, letting you fit larger models in the same VRAM or double your batch size. For 7B-parameter LLMs, FP4 on the RTX 5080’s 16 GB delivers throughput that previously required 24 GB cards.
2. GDDR7 memory with higher bandwidth. LLM inference is almost always memory-bandwidth-bound: each generated token requires reading every model weight once, so a 16 GB FP16 model on the RTX 5090 tops out near 1,792 ÷ 16 ≈ 112 tokens per second, which lines up with the mid-90s tok/s we measure in the next section. The RTX 5090’s 1,792 GB/s represents a 78% improvement over the RTX 4090’s 1,008 GB/s, translating directly into more tokens per second. Even the RTX 5080 nearly matches the 4090 at 960 GB/s despite having less VRAM.
3. PCIe Gen 5 for multi-GPU setups. Consumer Blackwell cards lack NVLink entirely; the B200’s high-speed fabric stays in the datacentre. But the move to PCIe Gen 5 doubles inter-GPU transfer bandwidth over the Gen 4 links of previous consumer generations, making multi-GPU inference with vLLM tensor parallelism more practical on consumer hardware (see the sketch below).
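To make that concrete, here is a minimal sketch of two-card tensor-parallel inference with vLLM’s offline Python API. The model name is an illustrative choice (an AWQ-quantised 72B checkpoint small enough to shard across two 32 GB cards), not a recommendation drawn from our benchmarks:

```python
# Minimal two-GPU tensor-parallel inference sketch with vLLM.
# Assumes two CUDA-visible GPUs; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # ~40 GB quantised; fits across 2x 32 GB
    tensor_parallel_size=2,                 # shard weights across both cards over PCIe
    gpu_memory_utilization=0.90,            # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```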
Inference Performance: Blackwell vs Ada vs Ampere
We benchmarked all four GPUs on common inference workloads using vLLM. All tests ran on identical server configurations in our UK datacentre:
| Workload | RTX 5090 | RTX 5080 | RTX 4090 | RTX 3090 |
|---|---|---|---|---|
| LLaMA 3 8B (FP16, tok/s) | 95 | 68 | 62 | 42 |
| Mistral 7B (FP16, tok/s) | 100 | 72 | 66 | 45 |
| DeepSeek 7B (FP16, tok/s) | 88 | 65 | 58 | 40 |
| LLaMA 3 8B (FP4, tok/s) | 142 | 110 | N/A | N/A |
| SDXL (images/min, 1024px) | 14.2 | 9.8 | 8.1 | 4.3 |
The FP4 row is where Blackwell truly separates itself: a 49% throughput gain over FP16 on the same card (142 vs 95 tok/s on the RTX 5090), with negligible quality loss on most LLM benchmarks. Full token-level data is on our tokens per second benchmark page.
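Our published figures come from a larger harness, but the basic shape of such a measurement is simple with vLLM’s offline API. A minimal sketch, with the model name, prompts, and batch size as illustrative choices:

```python
# Minimal offline tokens-per-second measurement with vLLM.
# Model, prompts, and batch size are illustrative; batched runs
# report aggregate throughput, which exceeds single-stream figures.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarise the benefits of FP4 inference."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s aggregate across the batch")
```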
For image generation workloads like Stable Diffusion XL, the RTX 5090 delivers over three times the throughput of the RTX 3090. Teams running vision model hosting or multimodal model hosting pipelines will see the biggest gains here.
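For context on the SDXL numbers, this is roughly what a batch-of-4 1024px generation looks like with Hugging Face diffusers; the model ID, step count, and prompt are illustrative, not our exact benchmark settings:

```python
# Minimal SDXL text-to-image sketch using Hugging Face diffusers.
# Assumes a CUDA GPU with torch and diffusers installed; settings
# are illustrative, not our benchmark harness.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # FP16 halves weight memory vs FP32
).to("cuda")

# Batch of 4 at 1024px, matching the benchmark workload above.
images = pipe(
    prompt=["a lighthouse at dusk, photoreal"] * 4,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images

for i, img in enumerate(images):
    img.save(f"sdxl_{i}.png")
```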
VRAM Implications for LLM and Image Generation Workloads
VRAM determines which models fit on a single card. Here is how the Blackwell consumer lineup compares for common deployments:
| Model / Workload | VRAM Required (FP16) | VRAM Required (FP4) | Fits on 5080 (16 GB)? | Fits on 5090 (32 GB)? |
|---|---|---|---|---|
| Mistral 7B | ~14 GB | ~4 GB | Yes (FP16) | Yes |
| LLaMA 3 8B | ~16 GB | ~5 GB | Tight (FP16) | Yes |
| DeepSeek-V3 16B | ~32 GB | ~9 GB | FP4 only | Yes (FP16) |
| Qwen 2.5 72B | ~144 GB | ~40 GB | No | No (multi-GPU) |
| SDXL (1024px, batch 4) | ~12 GB | — | Yes | Yes |
The RTX 5090 is the first consumer GPU that can run a 16B-parameter model at full FP16 precision on a single card. That opens the door to hosting Qwen 2.5 and DeepSeek-V3 distilled variants without quantisation, which matters for tasks where output quality is paramount.
The RTX 5080 at 16 GB is more constrained, but FP4 quantisation makes it a strong option for 7B-8B models where the raw speed advantage over the RTX 3090 justifies the VRAM trade-off. See our RTX 3090 vs RTX 5090 comparison for context on how previous generations handled this trade-off.
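The FP16 column in the table above follows a simple rule of thumb: weight memory is parameter count times bytes per parameter (2 for FP16, 0.5 for FP4). A quick sketch of that arithmetic; note that KV cache, activations, and quantisation scale factors come on top, which is why 16 GB is “tight” for LLaMA 3 8B at FP16:

```python
# Back-of-envelope weight-memory estimate behind the table above:
# parameter count times bytes per parameter. KV cache, activations,
# and quantisation scale factors come on top of these figures.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Weights-only VRAM in GB, before KV cache and activation overhead."""
    return params_billions * BYTES_PER_PARAM[precision]

for name, size_b in [("Mistral 7B", 7), ("LLaMA 3 8B", 8),
                     ("DeepSeek-V3 16B", 16), ("Qwen 2.5 72B", 72)]:
    print(f"{name}: ~{weight_vram_gb(size_b, 'fp16'):.0f} GB FP16, "
          f"~{weight_vram_gb(size_b, 'fp4'):.1f} GB FP4")
```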
Blackwell GPU Servers Now Available
RTX 5080 and RTX 5090 dedicated servers with full root access, NVMe storage, and 1Gbps networking — deployed same-day from our UK datacentre.
Browse GPU Servers
Availability on GigaGPU
Both RTX 5080 and RTX 5090 servers are available for immediate deployment on GigaGPU. Every server ships with Ubuntu 22.04, pre-installed NVIDIA drivers, and full root access. Our self-hosting guide walks you from zero to a production LLM API in under an hour.
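As a taste of what that guide covers: vLLM exposes an OpenAI-compatible HTTP API, so once the server process is running you can query it with the standard openai client. A minimal sketch, with the model name and port as illustrative choices:

```python
# Minimal client for a self-hosted vLLM OpenAI-compatible endpoint.
# First start the server on the GPU host (model name is illustrative):
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your server's address
    api_key="EMPTY",                      # vLLM requires no key by default
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello from a dedicated GPU server."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```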
We have maintained stock throughout the initial launch period by working directly with UK distributors. Unlike cloud GPU platforms that charge by the hour, GigaGPU offers fixed monthly pricing — you know exactly what you will spend. For cost comparisons with hourly providers, see our RunPod alternatives analysis.
Which Card Should You Choose?
Choose the RTX 5090 if:
- You need to run 13B-16B models at FP16 precision on a single card
- You are serving high-throughput LLM APIs where every token per second counts
- You run image generation or speech model workloads that benefit from large VRAM pools
Choose the RTX 5080 if:
- Your primary workload is 7B-8B models where FP4 quantisation is acceptable
- You want Blackwell’s speed advantage over Ada/Ampere at a lower price point
- Budget matters and you can work within 16 GB of VRAM
Keep the RTX 3090 if:
- Cost per token is your top priority and raw speed is secondary
- You need 24 GB VRAM for FP16 7B models with room for KV cache (see the sizing sketch after this list)
- You are running workloads where the cost per million tokens matters more than latency
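To put a number on the KV-cache point, here is a quick sizing sketch using LLaMA 3 8B’s published geometry (32 layers, 8 grouped-query KV heads, head dimension 128) with an FP16 cache:

```python
# KV-cache sizing sketch for LLaMA 3 8B (32 layers, 8 KV heads,
# head dim 128 -- the model's published GQA geometry), FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

def kv_cache_gb(seq_len: int, batch: int) -> float:
    # 2x for the separate K and V tensors in every layer
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # 131,072 B = 128 KiB
    return per_token * seq_len * batch / 1e9

# ~16 GB of FP16 weights leaves ~8 GB free on a 24 GB RTX 3090:
print(f"{kv_cache_gb(8192, 4):.1f} GB for batch 4 at 8K context")  # ~4.3 GB
print(f"{kv_cache_gb(8192, 8):.1f} GB for batch 8 at 8K context")  # ~8.6 GB
```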
Blackwell is not a universal upgrade. The RTX 3090 remains the cost-efficiency champion for teams optimising spend. But for latency-sensitive workloads, larger models, or image generation, the RTX 5080 and 5090 set a new standard for what consumer GPUs can do in a dedicated hosting environment.