RTX 4090 24GB vs A100 80GB: Consumer Ada FP8 vs Ampere Datacentre HBM2e

How a 4th-gen tensor core consumer card with native FP8 compares to the Ampere-era A100 80GB with 2 TB/s HBM2e bandwidth and no native FP8 — capacity vs precision, two different eras, one sharp choice.

The A100 80GB remains one of the most-deployed accelerators in production AI infrastructure. It pairs 80 GB of HBM2e at 2.0 TB/s with NVLink at 600 GB/s and MIG support — but its tensor cores go no lower than FP16/BF16: there is no native FP8. The RTX 4090 24GB brings native FP8 to a smaller VRAM envelope with roughly half the bandwidth, plus an 80% larger L2 cache. On UK GPU hosting the choice depends entirely on whether your workload is bandwidth-bound, capacity-bound, or precision-bound — and the answer is rarely obvious. This post explains which workloads pay back the A100 premium and which are better served by the cheaper, FP8-native consumer card.

Spec sheet side by side

Spec | RTX 4090 (Ada AD102) | A100 80GB SXM (Ampere GA100) | Delta
Process | TSMC 4N | TSMC N7 | Two nodes ahead (Ada)
SM count | 128 | 108 | +19% Ada
CUDA cores | 16,384 | 6,912 | 2.37x Ada
Tensor cores | 512 (4th gen, FP8) | 432 (3rd gen, no FP8) | Ada has FP8
Boost clock | 2.52 GHz | 1.41 GHz | +79% Ada
VRAM | 24 GB GDDR6X | 80 GB HBM2e | 3.33x A100
Memory bandwidth | 1008 GB/s | 2039 GB/s | 2.02x A100
L2 cache | 72 MB | 40 MB | +80% Ada
FP16 dense TFLOPS | 165 | 312 | 1.89x A100
FP8 dense TFLOPS | 660 | None (FP16 fallback ~312) | 2.12x Ada
NVLink | None | NVLink 600 GB/s | A100
MIG | None | 7-way | A100
TDP | 450W | 400W (SXM) | +13% Ada
Approx UK price (2026) | £1,300 | £8,000-10,000 | ~7x A100

The interesting numbers: the 4090 has 79% higher clock, 80% larger L2, and native FP8 — but only half the bandwidth and a third of the VRAM. The A100 has the bigger, faster pool of memory but cannot use FP8, so its effective throughput on FP8-native workloads is closer than the bandwidth ratio suggests.

The FP8 question — Ampere’s missing trick

The A100 has no FP8 tensor instruction. When vLLM is asked for FP8 on an A100, it falls back to FP16 — halving the effective throughput per tensor-core op. There is no software fix. This is the single most important point of the comparison: for any modern LLM workload that can run FP8 (Llama, Mistral, Qwen, Phi, Gemma all have FP8 weights available), the 4090 has a precision-format advantage that partially offsets its bandwidth deficit.
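
A quick way to confirm which path you are on is to check the CUDA compute capability the driver reports: FP8 tensor instructions need 8.9 (Ada) or newer, while the A100 reports 8.0. A minimal check, assuming a driver recent enough to expose the compute_cap query field:

# FP8 tensor instructions need compute capability 8.9+ (Ada) or 9.0 (Hopper).
# The RTX 4090 reports 8.9; the A100 reports 8.0 and takes the FP16 path.
nvidia-smi --query-gpu=name,compute_cap --format=csv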

For BF16 workloads (training, some research), the A100 wins because it has 2x the bandwidth and the format is its native sweet spot. For FP8 inference, the comparison is much closer than headline specs suggest. AWQ INT4 sits in between: bandwidth-dominated, so the A100’s HBM2e is decisive — Llama 70B AWQ on an A100 sustains ~52 t/s decode versus the 4090’s 22-24 t/s.
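
Those AWQ figures line up with a simple roofline estimate: single-stream decode cannot exceed memory bandwidth divided by the bytes of weights streamed per token. A back-of-envelope sketch, assuming roughly 37 GB of INT4 weights for Llama 70B AWQ (an illustrative figure, not a measurement):

# Decode ceiling ~ memory bandwidth (GB/s) / weight bytes read per token (GB)
awk 'BEGIN {
  weights_gb = 37                                  # assumed Llama 70B AWQ INT4 footprint
  printf "RTX 4090:  ~%.0f t/s ceiling\n", 1008 / weights_gb
  printf "A100 80GB: ~%.0f t/s ceiling\n", 2039 / weights_gb
}'

The ceilings come out around 27 and 55 t/s, which is why the measured 22-24 t/s and ~52 t/s track the bandwidth ratio so closely.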

80GB vs 24GB — capacity advantage

Model / configuration | RTX 4090 24GB | A100 80GB
Llama 3.1 8B FP8 | Comfortable | Fits (FP8 falls back to FP16)
Llama 3.1 70B AWQ INT4 | Tight | Comfortable
Llama 3.1 70B BF16 (140 GB) | OOM | OOM (single card)
Llama 3.1 70B BF16, NVLink pair | n/a | Comfortable
Qwen 2.5 72B AWQ | OOM | Comfortable
Mixtral 8x22B AWQ (74 GB) | OOM | Comfortable
FLUX.1-dev FP16 (22 GB peak) | Comfortable | Trivial
50 concurrent Llama 8B sessions | OOM at KV cache | Comfortable
Llama 8B QLoRA + grad accumulation | Tight | Comfortable
MIG 4-way Llama 8B isolated tenants | n/a | Comfortable
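
The concurrency rows are driven by KV cache rather than weights. A rough sketch, assuming Llama 3.1 8B's published geometry (32 layers, 8 KV heads, head dimension 128) and FP16 KV entries:

# KV cache per token = 2 (K+V) x layers x kv_heads x head_dim x 2 bytes (FP16)
awk 'BEGIN {
  per_token = 2 * 32 * 8 * 128 * 2                 # ~128 KiB per token
  total_gb  = per_token * 8192 * 50 / 1024^3       # 50 sessions at 8k context
  printf "KV cache for 50 x 8k-token sessions: ~%.0f GB\n", total_gb
}'

That is roughly 50 GB of KV cache before counting weights, which is why 50 concurrent sessions are comfortable on 80 GB and OOM on 24 GB; switching the 4090 to an FP8 KV cache halves the figure but still does not close the gap at this concurrency.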

Throughput across nine workloads

Workload | RTX 4090 | A100 80GB | Winner
Llama 3.1 8B FP8 decode, b1 | 198 t/s | ~95 t/s (FP16 fallback) | 4090 +108%
Llama 3.1 8B AWQ decode, b1 | 225 t/s | ~175 t/s | 4090 +29%
Llama 3.1 8B BF16, b1 | ~95 t/s | ~145 t/s | A100 +53%
Llama 3.1 70B AWQ, b1 | 22-24 t/s | ~52 t/s | A100 +120%
Llama 3.1 70B FP8 | OOM | OOM (FP16 fallback exceeds 80 GB) | Neither (single card)
Mixtral 8x22B AWQ | OOM | ~46 t/s | A100 only
SDXL 1024×1024 | 2.0s | ~1.6s | A100 +25%
FLUX.1-dev FP16 | 6.2s | ~4.2s | A100 +48%
50 concurrent Llama 8B | OOM | ~2200 t/s aggregate | A100 only

The 4090 wins decisively when FP8 is on the path. The A100 wins on bandwidth-dominated AWQ at large models, BF16 workloads, and anything needing 80GB.

Per-token economics

Metric | RTX 4090 | A100 80GB SXM
TDP | 450W | 400W
Sustained draw, LLM serving b32 | 360W | 320W
UK list price (2026) | £1,300 | £8,000-10,000
£ per aggregate t/s, Llama 8B b32 | £1.18 | ~£10.00
£ per t/s, Llama 70B AWQ | £59 | ~£155
UK cloud rental (typical) | £0.70-1.20/hr | £1.80-2.50/hr
Annual electricity (24/7 @ £0.18/kWh) | £568 | £505
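
The electricity line follows directly from the sustained-draw row; a quick check, assuming 24/7 operation at the quoted £0.18/kWh tariff:

# Annual energy cost = sustained kW x 8,760 hours x tariff
awk 'BEGIN {
  printf "RTX 4090: £%.0f/yr\n", 0.360 * 8760 * 0.18
  printf "A100:     £%.0f/yr\n", 0.320 * 8760 * 0.18
}'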

For workloads the 4090 can run, it is dramatically cheaper per token. For Llama 70B and beyond, the A100’s larger VRAM and bandwidth shift the picture.

Per-workload winner table

Workload | Winner | Why
200-MAU SaaS RAG on Llama 8B FP8 | 4090 | FP8 native, ~7x cheaper
12-engineer Qwen Coder 32B AWQ | 4090 | Fits, FP8-friendly path
Llama 70B AWQ production endpoint | A100 | HBM2e bandwidth helps
Mixtral 8x22B AWQ | A100 | 4090 OOMs
Multi-tenant 8B with MIG isolation | A100 | 4090 has no MIG
FLUX.1-dev studio at scale | A100 | 80 GB headroom
SDXL freelance studio | 4090 | 2.0s vs 1.6s, ~7x cheaper
Voice agent (Whisper + 8B) | 4090 | FP8 path is decisive
Production training (BF16) | A100 | NVLink + bandwidth
Capex-bounded MVP under £2k | 4090 | Only option

vLLM serving examples

# RTX 4090 — Llama 3.1 8B FP8, native, 32-way batch
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92
# A100 — same model, AWQ INT4 path (FP8 emulates to FP16, slower)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92
# A100 — Llama 70B AWQ at 32k context, a model the 4090 cannot serve well
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.92
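
Whichever card is serving, the container exposes the same OpenAI-compatible API, so a smoke test looks identical on both. A minimal example, assuming the first (Llama 8B FP8) deployment above is running locally:

# Smoke-test the OpenAI-compatible endpoint once the container reports it is ready
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'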

Production gotchas

  • FP8 silently falls back on A100. vLLM accepts --quantization fp8 on A100 but runs FP16 underneath at roughly half the throughput. Use the AWQ Marlin path instead for fast low-precision serving.
  • A100 SXM is HGX-only. The PCIe variant exists at 1.94 TB/s vs SXM's 2.04 TB/s and fits standard PCIe servers.
  • A100 cooling is a real cost. 400W per card x 4-8 in an HGX node needs serious airflow.
  • NVLink topology matters. 4-way and 8-way NVLink configurations give different all-reduce performance. Plan workload to match.
  • MIG partitioning is per-boot. Repartitioning requires a GPU reset; see the sketch after this list.
  • A100 used market is active. Ex-cloud A100 SXM modules appear regularly at 40-60% of new price; verify warranty status.
  • 4090 is not warranted for datacentre use. NVIDIA does not warrant the 4090 for server deployment; the A100 (and the RTX 6000 Pro) is the supported choice.
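
On the MIG point, the workflow is short but it does interrupt the card: mode changes only take effect after a GPU reset, and instances are carved with nvidia-smi. A sketch of the 4-way layout from the capacity table; the profile ID is illustrative and varies by driver, so list the profiles first:

# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1
# List the GPU-instance profiles this card and driver actually offer
sudo nvidia-smi mig -lgip
# Example only: create four 1g instances, each with its own compute instance
sudo nvidia-smi mig -cgi 19,19,19,19 -C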

Verdict

  • Pick the RTX 4090 24GB if your model fits in 24GB at FP8 or AWQ; you serve fewer than ~50 concurrent users; you want the lowest £/token; or you need on-prem UK hosting.
  • Pick the A100 80GB if you need 70B AWQ at high concurrency, Mixtral 8x22B, multi-tenant MIG isolation, or are doing BF16 training/fine-tuning at scale.
  • Pick neither if you need native FP8 at 80GB capacity — go to H100 80GB instead. The H100 closes the gap.

For a 200-MAU SaaS, the 4090 is the right answer. For a research lab fine-tuning Llama 70B BF16 on a 4-card NVLink node, the A100 is the established workhorse.

Pick the FP8-native card for FP8-native workloads

GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB with 4th-gen tensor cores — native FP8, no fallbacks, at a fraction of A100 cost.

Order the RTX 4090 24GB

See also: vs H100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs RTX 3090 24GB, FP8 tensor cores on Ada, RTX 4090 spec breakdown, 2026 tier positioning.
