The A100 80GB remains one of the most-deployed accelerators in production AI infrastructure. It pairs 80 GB of HBM2e at 2.0 TB/s with 600 GB/s NVLink and MIG support, but its tensor cores stop at FP16/BF16: there is no native FP8. The RTX 4090 24GB brings native FP8 to a smaller VRAM envelope with roughly half the bandwidth, plus an 80% larger L2 cache. On UK GPU hosting the choice depends entirely on whether your workload is bandwidth-bound, capacity-bound, or precision-bound, and the answer is rarely obvious. This post explains which workloads pay back the A100 premium and which are better served by the cheaper, FP8-native consumer card.
Contents
- Spec sheet side by side
- The FP8 question — Ampere’s missing trick
- 80GB vs 24GB — capacity advantage
- Throughput across nine workloads
- Per-token economics
- Per-workload winner table
- vLLM serving examples
- Production gotchas
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | A100 80GB SXM (Ampere GA100) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC N7 | Two nodes ahead |
| SM count | 128 | 108 | +19% Ada |
| CUDA cores | 16,384 | 6,912 | 2.37x Ada |
| Tensor cores | 512 (4th gen, FP8) | 432 (3rd gen, no FP8) | Ada has FP8 |
| Boost clock | 2.52 GHz | 1.41 GHz | +79% Ada |
| VRAM | 24 GB GDDR6X | 80 GB HBM2e | 3.33x A100 |
| Memory bandwidth | 1008 GB/s | 2039 GB/s | 2.02x A100 |
| L2 cache | 72 MB | 40 MB | +80% Ada |
| FP16 dense TFLOPS | 165 | 312 | 1.89x A100 |
| FP8 dense TFLOPS | 660 | None (FP16 fallback ~312) | 2.12x Ada |
| NVLink | None | NVLink 600 GB/s | A100 |
| MIG | None | 7-way | A100 |
| TDP | 450W | 400W (SXM) | +13% Ada |
| Approx UK price (2026) | £1,300 | £8,000-10,000 | ~7x A100 |
The interesting numbers: the 4090 has 79% higher clock, 80% larger L2, and native FP8 — but only half the bandwidth and a third of the VRAM. The A100 has the bigger, faster pool of memory but cannot use FP8, so its effective throughput on FP8-native workloads is closer than the bandwidth ratio suggests.
The FP8 question — Ampere’s missing trick
The A100 has no FP8 tensor instruction. When vLLM is asked for FP8 on an A100, it falls back to FP16 — halving the effective throughput per tensor-core op. There is no software fix. This is the single most important point of the comparison: for any modern LLM workload that can run FP8 (Llama, Mistral, Qwen, Phi, Gemma all have FP8 weights available), the 4090 has a precision-format advantage that partially offsets its bandwidth deficit.
For BF16 workloads (training, some research), the A100 wins because it has 2x the bandwidth and the format is its native sweet spot. For FP8 inference, the comparison is much closer than headline specs suggest. AWQ INT4 sits in between: bandwidth-dominated, so the A100’s HBM2e is decisive — Llama 70B AWQ on an A100 sustains ~52 t/s decode versus the 4090’s 22-24 t/s.
80GB vs 24GB — capacity advantage
| Model / configuration | RTX 4090 24GB | A100 80GB |
|---|---|---|
| Llama 3.1 8B FP8 | Comfortable | Comfortable (FP16 fallback) |
| Llama 3.1 70B AWQ INT4 | Tight | Comfortable |
| Llama 3.1 70B BF16 (140 GB) | OOM | OOM (single) |
| Llama 3.1 70B BF16 NVLink pair | n/a | Comfortable |
| Qwen 2.5 72B AWQ | OOM | Comfortable |
| Mixtral 8x22B AWQ (74 GB) | OOM | Comfortable |
| FLUX.1-dev FP16 (22 GB peak) | Comfortable | Trivial |
| 50 concurrent Llama 8B sessions | OOM at KV | Comfortable |
| Llama 8B QLoRA + grad accumulation | Tight | Comfortable |
| MIG 4-way Llama 8B isolated tenants | n/a | Comfortable |
Throughput across nine workloads
| Workload | RTX 4090 | A100 80GB | Winner |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | ~95 t/s (FP16 fallback) | 4090 +108% |
| Llama 3.1 8B AWQ decode b1 | 225 t/s | ~175 t/s | 4090 +29% |
| Llama 3.1 8B BF16 b1 | ~95 t/s | ~145 t/s | A100 +53% |
| Llama 3.1 70B AWQ b1 | 22-24 t/s | ~52 t/s | A100 +120% |
| Llama 3.1 70B FP8 | OOM | OOM (FP16 fallback needs ~140 GB) | Neither single |
| Mixtral 8x22B AWQ | OOM | ~46 t/s | A100 only |
| SDXL 1024×1024 | 2.0s | ~1.6s | A100 +25% |
| FLUX.1-dev FP16 | 6.2s | ~4.2s | A100 +48% |
| 50 concurrent Llama 8B | OOM | ~2200 t/s agg | A100 only |
The 4090 wins decisively when FP8 is on the path. The A100 wins on bandwidth-dominated AWQ at large models, BF16 workloads, and anything needing 80GB.
Per-token economics
| Metric | RTX 4090 | A100 80GB SXM |
|---|---|---|
| TDP | 450W | 400W |
| Sustained LLM b32 | 360W | 320W |
| UK list price (2026) | £1,300 | £8,000-10,000 |
| £/aggregate t/s b32 (Llama 8B) | £1.18 | ~£10.00 |
| £/aggregate t/s for 70B AWQ | £59 | ~£155 |
| UK cloud rental (typical) | £0.70-1.20/hr | £1.80-2.50/hr |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £505 |
For workloads the 4090 can run, it is dramatically cheaper per token. For Llama 70B and beyond, the A100’s larger VRAM and bandwidth shift the picture.
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B FP8 | 4090 | FP8 native, 7x cheaper |
| 12-engineer Qwen Coder 32B AWQ | 4090 | Fits, FP8-friendly path |
| Llama 70B AWQ production endpoint | A100 | HBM2e bandwidth helps |
| Mixtral 8x22B AWQ | A100 | 4090 OOM |
| Multi-tenant 8B with MIG isolation | A100 | 4090 has no MIG |
| FLUX.1-dev studio at scale | A100 | 80GB headroom |
| SDXL freelance studio | 4090 | 2.0s vs 1.6s — 7x cheaper |
| Voice agent (Whisper + 8B) | 4090 | FP8 path is decisive |
| Production training (BF16) | A100 | NVLink + bandwidth |
| Capex-bounded MVP under £2k | 4090 | Only option |
vLLM serving examples
```bash
# RTX 4090 — Llama 3 8B FP8, native, 32-way batch
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# A100 — same model, AWQ INT4 path (FP8 falls back to FP16, slower)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

# A100 — Llama 70B AWQ at 32k context, the model the 4090 cannot serve well
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.92
```
Production gotchas
- FP8 silently falls back on A100. vLLM accepts `--quantization fp8` on A100 but runs FP16 underneath at half the throughput. Use AWQ Marlin for 8-bit-equivalent performance.
- A100 SXM is HGX-only. SXM modules ship only on HGX baseboards; the PCIe variant exists at a slightly lower 1.94 TB/s vs SXM's 2.04 TB/s.
- A100 cooling is a real cost. 400W per card × 4-8 cards in an HGX node needs serious airflow.
- NVLink topology matters. 4-way and 8-way NVLink configurations give different all-reduce performance. Plan workload to match.
- MIG partitioning is per-boot. Repartitioning requires a GPU reset; a minimal 4-way setup is sketched after this list.
- A100 used market is active. Ex-cloud A100 SXM modules appear regularly at 40-60% of new price; verify warranty status.
- 4090 is not warranted in datacentre use. Strictly speaking, NVIDIA does not warrant 4090 for server deployment; the A100 (and 6000 Pro) is the supported choice.
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB at FP8 or AWQ; you serve fewer than ~50 concurrent users; you want the lowest £/token; or you need on-prem UK hosting.
- Pick the A100 80GB if you need 70B AWQ at high concurrency, Mixtral 8x22B, multi-tenant MIG isolation, or are doing BF16 training/fine-tuning at scale.
- Pick neither if you need native FP8 at 80GB capacity: go to the H100 80GB, which combines A100-class capacity with Hopper's native FP8.
For a 200-MAU SaaS, the 4090 is the right answer. For a research lab fine-tuning Llama 70B BF16 on a 4-card NVLink node, the A100 is the established workhorse.
Pick the FP8-native card for FP8-native workloads
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB with 4th-gen tensor cores — native FP8, no fallbacks, at a fraction of A100 cost.
Order the RTX 4090 24GB. See also: vs H100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs RTX 3090 24GB, FP8 tensor cores on Ada, RTX 4090 spec breakdown, 2026 tier positioning.