The A100 80GB remains one of the most-deployed accelerators in production AI infrastructure. It pairs 80 GB of HBM2e at 2.0 TB/s with 600 GB/s NVLink and MIG support, but its tensor cores stop at FP16/BF16: there is no native FP8. The RTX 4090 24GB brings native FP8 to a smaller VRAM envelope with roughly half the bandwidth, plus an 80% larger L2 cache. On UK GPU hosting the choice depends entirely on whether your workload is bandwidth-bound, capacity-bound, or precision-bound, and the answer is rarely obvious. This post explains which workloads pay back the A100 premium and which are better served by the cheaper, FP8-native consumer card.
Contents
- Spec sheet side by side
- The FP8 question — Ampere’s missing trick
- 80GB vs 24GB — capacity advantage
- Throughput across nine workloads
- Per-token economics
- Per-workload winner table
- vLLM serving examples
- Production gotchas
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | A100 80GB SXM (Ampere GA100) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC N7 | Two nodes ahead |
| SM count | 128 | 108 | +19% Ada |
| CUDA cores | 16,384 | 6,912 | 2.37x Ada |
| Tensor cores | 512 (4th gen, FP8) | 432 (3rd gen, no FP8) | Ada has FP8 |
| Boost clock | 2.52 GHz | 1.41 GHz | +79% Ada |
| VRAM | 24 GB GDDR6X | 80 GB HBM2e | 3.33x A100 |
| Memory bandwidth | 1008 GB/s | 2039 GB/s | 2.02x A100 |
| L2 cache | 72 MB | 40 MB | +80% Ada |
| FP16 dense TFLOPS | 165 | 312 | 1.89x A100 |
| FP8 dense TFLOPS | 660 | None (FP16 fallback ~312) | 2.12x Ada |
| NVLink | None | NVLink 600 GB/s | A100 |
| MIG | None | 7-way | A100 |
| TDP | 450W | 400W (SXM) | +13% Ada |
| Approx UK price (2026) | £1,300 | £8,000-10,000 | ~7x A100 |
The interesting numbers: the 4090 has 79% higher clock, 80% larger L2, and native FP8 — but only half the bandwidth and a third of the VRAM. The A100 has the bigger, faster pool of memory but cannot use FP8, so its effective throughput on FP8-native workloads is closer than the bandwidth ratio suggests.
The FP8 question — Ampere’s missing trick
The A100 has no FP8 tensor instruction. When vLLM is asked for FP8 on an A100, it falls back to FP16 — halving the effective throughput per tensor-core op. There is no software fix. This is the single most important point of the comparison: for any modern LLM workload that can run FP8 (Llama, Mistral, Qwen, Phi, Gemma all have FP8 weights available), the 4090 has a precision-format advantage that partially offsets its bandwidth deficit.
For BF16 workloads (training, some research), the A100 wins because it has 2x the bandwidth and the format is its native sweet spot. For FP8 inference, the comparison is much closer than headline specs suggest. AWQ INT4 sits in between: bandwidth-dominated, so the A100’s HBM2e is decisive — Llama 70B AWQ on an A100 sustains ~52 t/s decode versus the 4090’s 22-24 t/s.
80GB vs 24GB — capacity advantage
| Model / configuration | RTX 4090 24GB | A100 80GB |
|---|---|---|
| Llama 3.1 8B FP8 | Comfortable | Comfortable (FP16 fallback) |
| Llama 3.1 70B AWQ INT4 | Tight | Comfortable |
| Llama 3.1 70B BF16 (140 GB) | OOM | OOM (single) |
| Llama 3.1 70B BF16 NVLink pair | n/a | Comfortable |
| Qwen 2.5 72B AWQ | OOM | Comfortable |
| Mixtral 8x22B AWQ (74 GB) | OOM | Comfortable |
| FLUX.1-dev FP16 (22 GB peak) | Comfortable | Trivial |
| 50 concurrent Llama 8B sessions | OOM at KV | Comfortable |
| Llama 8B QLoRA + grad accumulation | Tight | Comfortable |
| MIG 4-way Llama 8B isolated tenants | n/a | Comfortable |
Throughput across nine workloads
| Workload | RTX 4090 | A100 80GB | Winner |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | ~95 t/s (FP16 fallback) | 4090 +108% |
| Llama 3.1 8B AWQ decode b1 | 225 t/s | ~175 t/s | 4090 +29% |
| Llama 3.1 8B BF16 b1 | ~95 t/s | ~145 t/s | A100 +53% |
| Llama 3.1 70B AWQ b1 | 22-24 t/s | ~52 t/s | A100 +120% |
| Llama 3.1 70B FP8 | OOM | OOM (FP16 fallback needs ~140 GB) | Neither single |
| Mixtral 8x22B AWQ | OOM | ~46 t/s | A100 only |
| SDXL 1024×1024 | 2.0s | ~1.6s | A100 +25% |
| FLUX.1-dev FP16 | 6.2s | ~4.2s | A100 +48% |
| 50 concurrent Llama 8B | OOM | ~2200 t/s agg | A100 only |
The 4090 wins decisively when FP8 is on the path. The A100 wins on bandwidth-dominated AWQ at large models, BF16 workloads, and anything needing 80GB.
Per-token economics
| Metric | RTX 4090 | A100 80GB SXM |
|---|---|---|
| TDP | 450W | 400W |
| Sustained LLM b32 | 360W | 320W |
| UK list price (2026) | £1,300 | £8,000-10,000 |
| £/aggregate t/s b32 (Llama 8B) | £1.18 | ~£10.00 |
| £/aggregate t/s for 70B AWQ | £59 | ~£155 |
| UK cloud rental (typical) | £0.70-1.20/hr | £1.80-2.50/hr |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £505 |
For workloads the 4090 can run, it is dramatically cheaper per token. For Llama 70B and beyond, the A100’s larger VRAM and bandwidth shift the picture.
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B FP8 | 4090 | FP8 native, 7x cheaper |
| 12-engineer Qwen Coder 32B AWQ | 4090 | Fits, FP8-friendly path |
| Llama 70B AWQ production endpoint | A100 | HBM2e bandwidth helps |
| Mixtral 8x22B AWQ | A100 | 4090 OOM |
| Multi-tenant 8B with MIG isolation | A100 | 4090 has no MIG |
| FLUX.1-dev studio at scale | A100 | 80GB headroom |
| SDXL freelance studio | 4090 | 2.0s vs 1.6s — 7x cheaper |
| Voice agent (Whisper + 8B) | 4090 | FP8 path is decisive |
| Production training (BF16) | A100 | NVLink + bandwidth |
| Capex-bounded MVP under £2k | 4090 | Only option |
vLLM serving examples
```bash
# RTX 4090 — Llama 3 8B FP8, native, 32-way batch
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# A100 — same model, AWQ INT4 path (FP8 falls back to FP16, slower)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

# A100 — Llama 70B AWQ at 32k context, the model the 4090 cannot serve well
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.92
```
Production gotchas
- FP8 silently falls back on A100. vLLM accepts `--quantization fp8` on A100 but runs FP16 underneath at half the throughput. Use AWQ Marlin for 8-bit-equivalent performance.
- A100 SXM is HGX-only. SXM modules ship only on HGX baseboards; the PCIe variant exists at a slightly lower 1.94 TB/s vs SXM's 2.04 TB/s.
- A100 cooling is a real cost. 400W per card × 4-8 cards in an HGX node needs serious airflow.
- NVLink topology matters. 4-way and 8-way NVLink configurations give different all-reduce performance. Plan workload to match.
- MIG partitioning is per-boot. Repartitioning requires a GPU reset; a minimal 4-way setup is sketched after this list.
- A100 used market is active. Ex-cloud A100 SXM modules appear regularly at 40-60% of new price; verify warranty status.
- 4090 is not warranted in datacentre use. Strictly speaking, NVIDIA does not warrant 4090 for server deployment; the A100 (and 6000 Pro) is the supported choice.
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB at FP8 or AWQ; you serve fewer than ~50 concurrent users; you want the lowest £/token; or you need on-prem UK hosting.
- Pick the A100 80GB if you need 70B AWQ at high concurrency, Mixtral 8x22B, multi-tenant MIG isolation, or are doing BF16 training/fine-tuning at scale.
- Pick neither if you need native FP8 at 80GB capacity: go to the H100 80GB, which combines A100-class capacity with Hopper's native FP8.
For a 200-MAU SaaS, the 4090 is the right answer. For a research lab fine-tuning Llama 70B BF16 on a 4-card NVLink node, the A100 is the established workhorse.
Pick the FP8-native card for FP8-native workloads
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB with 4th-gen tensor cores — native FP8, no fallbacks, at a fraction of A100 cost.
Order the RTX 4090 24GB. See also: vs H100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs RTX 3090 24GB, FP8 tensor cores on Ada, RTX 4090 spec breakdown, 2026 tier positioning.