
RTX 4090 24GB and Ada’s 4th-Gen Tensor Cores: Native FP8 Explained

A senior infra engineer's tour of how Ada Lovelace's 4th-generation tensor cores execute FP8 (E4M3 and E5M2) natively on the RTX 4090 24GB, the Transformer Engine's role, kernel selection (Marlin, Machete, FlashAttention 3, cuBLASLt), and where it lands against Ampere's no-FP8 design and Blackwell's FP4 follow-up.

FP8 is the precision that defines modern LLM inference, and the RTX 4090 24GB was the first consumer card to ship native FP8 support via Ada Lovelace’s 4th-generation tensor cores. On UK dedicated GPU hosting this single architectural feature is what decides whether a card runs frontier inference workloads at full throttle or has to fall back on FP16 emulation that throws away half the silicon. This piece walks through what FP8 actually is at the bit-pattern level, how the 4th-gen tensor cores execute it, what the Transformer Engine adds on top, the kernel families that ship FP8 on Ada today (Marlin, Machete, cuBLASLt, FlashAttention 3), and how the picture changes when you look back at Ampere or forward at Blackwell.


What FP8 actually is at the bit level

FP8 is an 8-bit floating-point format. Each value uses one sign bit and seven bits split between exponent and mantissa. Compared to FP16 it halves memory footprint and roughly doubles tensor core throughput; compared to INT8 it preserves dynamic range, which makes it well-suited to attention scores and activations that span many orders of magnitude in a single tensor. The 2022 joint paper from NVIDIA, Arm and Intel codified two layouts: E4M3, with 4 exponent bits and 3 mantissa bits, and E5M2, with 5 exponent bits and 2 mantissa bits. They are tuned for different parts of the network.

For a 7B model, weights drop from roughly 14 GB at FP16 to 7 GB at FP8. The KV cache halves too. That is the headline reason FP8 unlocks larger context windows and higher batch sizes on the same 24 GB envelope: it is the cheapest way to double effective VRAM without changing the model. On a 4090 you can serve Llama 3.1 8B FP8 with 65k context and 32 concurrent sequences, whereas the FP16 build of the same model has to choose between long context and any meaningful batch at all.

The bandwidth implication is equally important. Decode is bandwidth-bound on a 4090 (1008 GB/s of GDDR6X feeds tensor cores capable of 660 dense FP8 TFLOPS). Halving the bytes-per-parameter doubles the bandwidth ceiling for autoregressive generation. The naive ceiling for an 8 GB FP8 model is 1008 / 8 = 126 t/s; a 16 GB FP16 model caps at 63 t/s. Real measured numbers exceed both because of L2 caching, but the doubling shows up cleanly in production logs. See the GDDR6X bandwidth deep-dive for the per-token math.
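
As a sanity check, the ceiling is just bandwidth divided by bytes streamed per token. A minimal sketch, using the figures quoted above:

# Back-of-envelope decode ceiling: every generated token streams the full
# weight set from VRAM at least once, so tokens/s <= bandwidth / model bytes.
BANDWIDTH_GB_S = 1008  # RTX 4090 GDDR6X

def decode_ceiling_tps(model_size_gb: float) -> float:
    """Naive bandwidth-bound upper limit on single-stream decode."""
    return BANDWIDTH_GB_S / model_size_gb

print(decode_ceiling_tps(16))  # ~8B model at FP16 (~16 GB) -> 63 t/s
print(decode_ceiling_tps(8))   # same model at FP8  (~8 GB) -> 126 t/s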

E4M3 and E5M2 formats explained

Two formats exist because no single 8-bit FP layout can carry both the precision needed by weights and the dynamic range needed by gradients. E4M3 carries more mantissa, so it is more accurate but covers a smaller range; E5M2 carries one extra exponent bit, so it covers a wider range at the cost of mantissa resolution.

Format | Sign / Exp / Mantissa | Max representable | Min normal | Used for
E4M3   | 1 / 4 / 3             | ±448              | 2^-6       | Weights, forward activations, attention output
E5M2   | 1 / 5 / 2             | ±57,344           | 2^-14      | Gradients, FP8 KV cache option, large-range tensors
FP16   | 1 / 5 / 10            | ±65,504           | 2^-14      | Reference precision
BF16   | 1 / 8 / 7             | ±3.4e38           | 2^-126     | Training default on Ada

The 4090’s 4th-gen tensor cores execute both formats natively, with FP16 or FP32 accumulators. Inference typically uses E4M3 throughout the forward pass; training and FP8 KV in long-context inference often use E5M2 because attention values can drift outside the E4M3 envelope as context lengthens. The Transformer Engine (TE) library swaps formats per layer based on the tensor’s observed range, so application code rarely has to pick by hand.
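
The table values fall straight out of the bit layouts. A short sketch that recomputes the maxima and minimum normals from the format parameters; the one subtlety is that E4M3 has no infinities and reserves only the all-ones code for NaN, which is why its maximum is 448 rather than 480:

def fp8_range(exp_bits: int, man_bits: int, finite_only: bool):
    """Return (max finite, min normal) for an FP8 layout.

    finite_only=True  models E4M3: no infinities, only the all-ones bit
                      pattern is NaN, so the top exponent keeps almost all
                      mantissa codes.
    finite_only=False models E5M2, which reserves its top exponent for
                      inf/NaN like IEEE 754.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if finite_only:                                 # E4M3-style
        top_exp = (2 ** exp_bits - 1) - bias
        top_mant = 2.0 - 2.0 ** (1 - man_bits)      # largest non-NaN mantissa
    else:                                           # E5M2-style (IEEE)
        top_exp = (2 ** exp_bits - 2) - bias
        top_mant = 2.0 - 2.0 ** (-man_bits)
    max_finite = top_mant * 2.0 ** top_exp
    min_normal = 2.0 ** (1 - bias)
    return max_finite, min_normal

print(fp8_range(4, 3, finite_only=True))   # E4M3 -> (448.0, 0.015625)  i.e. ±448, 2^-6
print(fp8_range(5, 2, finite_only=False))  # E5M2 -> (57344.0, 6.1e-05) i.e. ±57,344, 2^-14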

The Transformer Engine and per-tensor scaling

NVIDIA’s Transformer Engine is a software layer (the transformer_engine Python package plus a CUDA library) that exists because plain FP8 multiplications in transformer math overflow or underflow if you simply truncate values from FP16. TE adds the missing piece: per-tensor or per-channel scaling that keeps every input matrix in the representable range. It does so by:

  • Tracking the absolute max (amax) of each tensor across recent training or inference steps.
  • Computing a per-tensor scaling factor that pushes the tensor to use the full ±448 (E4M3) or ±57344 (E5M2) range.
  • Applying that scale on the way into the tensor core and unscaling on the way out.
  • Selecting E4M3 or E5M2 per layer to maximise accuracy for that layer’s behaviour.
  • Falling back to FP16 for layers where FP8 would degrade quality (typically softmax inputs, layer norm, embedding output).

On the 4090 the scaling is partly hardware (the tensor cores accept a scale operand directly), partly software (the amax tracker and per-step calibration loop). Hopper added native support for the amax history in hardware; Ada handles it in software with negligible overhead. TE is exposed through TensorRT-LLM, vLLM (FP8 weights and FP8 KV cache), SGLang, and the standalone library for custom training loops. For a complete production deployment that uses all of this, see the FP8 Llama deployment guide.
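
As an illustration of what the amax-and-scale loop amounts to, here is a minimal per-tensor sketch in plain PyTorch. It is not the Transformer Engine implementation, just the same idea, and it assumes a PyTorch build (2.1+) that exposes the torch.float8_e4m3fn dtype:

import torch

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_e4m3(t: torch.Tensor):
    """Per-tensor FP8: scale so the observed amax lands at the E4M3 limit."""
    amax = t.abs().max().float().clamp(min=1e-6)   # observed absolute max
    scale = E4M3_MAX / amax                        # per-tensor scaling factor
    t_fp8 = (t * scale).to(torch.float8_e4m3fn)    # scale first, then cast
    return t_fp8, scale

def dequantize(t_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Undo the scale on the way out of the matmul."""
    return t_fp8.to(torch.float16) / scale

x = torch.randn(1024, 1024, dtype=torch.float16) * 100  # amax well above 448
x_fp8, scale = quantize_e4m3(x)
x_back = dequantize(x_fp8, scale)
print((x - x_back).abs().mean())   # small rounding error, no range blow-up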

Ampere vs Ada vs Hopper vs Blackwell

The single most useful framing of FP8 on a 4090 is to position it against the generation behind and the generation ahead. Ampere shipped without FP8 support entirely; Hopper introduced it with extra hardware help that Ada then inherited in cut-down form; Blackwell extends the same idea down to FP4 with a second-generation Transformer Engine.

Generation | Card           | Tensor gen | Native FP8       | Native FP4 | Defining change
Ampere     | RTX 3090, A100 | 3rd        | No               | No         | BF16 + TF32, no FP8 path
Hopper     | H100, H200     | 4th        | Yes (E4M3, E5M2) | No         | First Transformer Engine, FP8 amax in HW
Ada        | RTX 4090, L40S | 4th        | Yes (E4M3, E5M2) | No         | FP8 in consumer envelope, 72 MB L2
Blackwell  | RTX 5090, B200 | 5th        | Yes              | Yes (E2M1) | 2nd-gen TE, FP4 added, GDDR7 / HBM3e

The 4090 shares its 4th-gen tensor core architecture with the H100. That makes it the cheapest card capable of running Hopper-class FP8 workloads (compare in the 4090 vs H100 80GB piece). Where Ada differs from Hopper is the L2 cache geometry (72 MB on the 4090 vs 50 MB on H100 SXM, but Hopper has 3 TB/s of HBM3 to compensate), the lack of FP8 amax tracking in dedicated silicon, and the absence of NVLink. The lack of NVLink rarely matters for FP8 inference because the format already shrinks weights enough to keep most workloads on a single card. The L2 advantage is real and surfaces clearly in small-model batched decode.

Going forward, Blackwell’s 5th-gen tensor cores add native FP4 (E2M1, ±6), which approximately doubles throughput again on a 5090. FP4 today is most useful for weight-only quantisation rather than activations, which means 4090 FP8 and 5090 FP4 will coexist as the right choices for different workloads for the next several years. The 4090 vs 5090 piece breaks down where each is the better economic call.

Kernel selection: Marlin, Machete, cuBLASLt, FlashAttention

Ada’s FP8 throughput is only realised by the right kernel for the right operation. The vLLM 0.6 and SGLang 0.3 stacks now ship a small library of specialised FP8 GEMM and attention kernels; picking one over another can swing realised throughput by 30 to 50 percent at the same batch size.

Kernel               | Operation              | Format                | When to use                                | 4090 peak utilisation
Marlin FP8           | GEMM (decode)          | FP8 weights, FP16 act | Default for FP8 weight-only models in vLLM | ~36% of dense FP8
Machete              | GEMM (mixed precision) | FP8 / INT4 weights    | vLLM 0.7+ for AWQ + FP8 KV combos          | ~40% of dense FP8
cuBLASLt FP8         | GEMM (prefill)         | E4M3 + E4M3           | Large batch prefill, TensorRT-LLM          | ~70% of dense FP8
FlashAttention 3 FP8 | Attention              | FP8 Q/K/V             | Long context decode, all stacks            | ~32% (BW bound)
cuDNN FP8 fused      | Conv + GEMM            | E4M3 + FP16           | Diffusion U-Net (FLUX, SDXL)               | ~58% of dense FP8

Three rules of thumb decide kernel choice. First, prefill (large-batch matmul against a long prompt) benefits most from cuBLASLt FP8 because operand reuse keeps the tensor cores fed; expect 60-70 percent of dense peak. Second, decode is bandwidth-bound and the kernel choice shifts to Marlin or Machete, both designed to dequantise weights into FP16 registers on the fly while keeping FP8 KV in HBM; expect 30-40 percent of dense peak. Third, FlashAttention 3 with FP8 Q, K and V is now mature on Ada (FA3 v2.6+) and is the right choice for any context above 8k. The Marlin path is the default in vLLM when you pass --quantization fp8, and the Machete path takes over when you combine --quantization awq_marlin with --kv-cache-dtype fp8.
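
For offline use, the same two paths can be selected through vLLM's Python LLM API rather than the CLI. A sketch, with the AWQ checkpoint name as a placeholder:

from vllm import LLM, SamplingParams

# FP8 weights + FP8 KV cache: dispatches the Marlin FP8 decode path.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",        # same switch as --quantization fp8 on the CLI
    kv_cache_dtype="fp8",      # same switch as --kv-cache-dtype fp8
)

# For the Machete path instead, point `model` at an AWQ group_size=128
# checkpoint (placeholder name) and swap the kwargs to:
#   quantization="awq_marlin", kv_cache_dtype="fp8"

out = llm.generate(["FP8 on Ada in one sentence:"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)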

Throughput impact on real workloads

The headline FP8 speedup numbers below were captured on the gigagpu.com fleet running vLLM 0.6.3, FlashAttention 3, CUDA 12.4 and driver 555. Each row is a single 4090 24GB host with no other workload on the card.

Workload                             | FP16 on 4090 | FP8 on 4090 | FP8 speedup | VRAM saved
Llama 3.1 8B batch 1 decode          | 95 t/s       | 198 t/s     | 2.08x       | 8 GB
Llama 3.1 8B batch 32 aggregate      | ~640 t/s     | 1100 t/s    | 1.72x       | 8 GB
Mistral 7B v0.3 batch 1              | ~108 t/s     | 215 t/s     | 1.99x       | 7 GB
Mistral Nemo 12B batch 1             | 72 t/s       | 145 t/s     | 2.01x       | 12 GB
Phi-3-mini 3.8B batch 1              | ~245 t/s     | 480 t/s     | 1.96x       | 4 GB
Qwen 2.5 7B batch 1                  | ~105 t/s     | 210 t/s     | 2.00x       | 7 GB
FLUX.1-schnell 4-step                | 2.6 s        | 1.8 s       | 1.44x       | 6 GB
Llama 3.1 8B prefill (1 seq, 8k tok) | 6800 t/s     | 12000 t/s   | 1.76x       | 8 GB

Decode-bound LLM workloads see roughly 2x throughput with FP8 because the bandwidth ceiling roughly doubles (halved bytes per token). Diffusion sees 1.4-1.5x because the U-Net is more compute-bound and a chunk of the workload is still in FP16 (text encoder, VAE). Memory savings are universal and roughly halve weights regardless of workload class. For a real 12-engineer coding team running Qwen 2.5 14B AWQ behind Continue.dev, switching from FP16 to FP8 weights is the difference between 4 concurrent active streams (FP16) and 16 (FP8) at the same 32k context budget.

FP8 KV cache: a separate decision

The flags --quantization fp8 and --kv-cache-dtype fp8 are independent. Quantising weights halves model size; quantising the KV cache halves attention memory. On a 200-MAU SaaS RAG running Mistral Nemo 12B with 32k context, FP8 KV is the headline win because it doubles the number of long-context users that can sit in memory concurrently. Expect a measurable but small quality hit: 0.1-0.3 MMLU points and roughly 1 perplexity unit at 128k context. For most chat and tooling pipelines that is invisible; for long-form summarisation or code review it is worth checking on your own eval set first. The vLLM setup guide walks through the calibration step.
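
To see why the KV side dominates at that context length, a rough per-sequence sizing sketch helps; the layer and head counts below are the commonly published Mistral Nemo figures and should be treated as assumptions:

def kv_gib_per_seq(layers: int, kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int) -> float:
    """KV cache footprint for one sequence: K and V, every layer, every token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 2**30

# Assumed Mistral Nemo 12B shape: 40 layers, 8 KV heads (GQA), head_dim 128
print(kv_gib_per_seq(40, 8, 128, 32_768, bytes_per_elem=2))  # FP16 KV -> ~5.0 GiB/seq
print(kv_gib_per_seq(40, 8, 128, 32_768, bytes_per_elem=1))  # FP8 KV  -> ~2.5 GiB/seq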

Using FP8 in practice on the 4090

The minimal vLLM configuration to get full FP8 on Ada looks like this:

# Native FP8 weights and FP8 KV cache, Llama 3.1 8B
# --quantization fp8            E4M3 weights, Marlin kernel
# --kv-cache-dtype fp8          E5M2 KV (vLLM default)
# --max-model-len 65536         leaves ~6 GB for batched KV
# --max-num-seqs 32             batched decode amortises BW
# --enable-chunked-prefill      caps p99 TTFT
# --enable-prefix-caching       shared system prompt
# --gpu-memory-utilization 0.92 headroom for cuBLAS workspace
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92

Two flags do most of the heavy lifting. --quantization fp8 dispatches Marlin FP8 kernels and loads the weight checkpoint either as pre-quantised FP8 (preferred) or quantises on the fly from FP16 with a quick calibration pass. --kv-cache-dtype fp8 stores the KV blocks at E5M2 in HBM, halving attention memory. The --enable-chunked-prefill flag breaks long prompts into 512-token chunks that interleave with decode, which keeps p99 first-token latency bounded even when a 32k-token prompt arrives. Prefix caching deduplicates shared system prompts across requests and is essentially free. For the AWQ INT4 + FP8 KV combo that unlocks 70B on a single 4090, see the 70B INT4 deployment.
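
Once the server is up, nothing FP8-specific leaks into the client: requests go through the standard OpenAI-compatible endpoint. A sketch with the openai Python package, assuming the default local port and a dummy key:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise E4M3 vs E5M2 in two lines."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)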

For TensorRT-LLM, the FP8 path is more involved because you build a static engine ahead of time:

# TensorRT-LLM FP8 build (one-time, then serve)
# --gemm_plugin fp8          cuBLASLt FP8 GEMM
# --use_fp8_context_fmha     FP8 attention
trtllm-build \
  --checkpoint_dir ./llama-3.1-8b-fp8-ckpt \
  --output_dir ./engine_fp8 \
  --gemm_plugin fp8 \
  --use_fp8_context_fmha \
  --max_batch_size 32 \
  --max_input_len 32768 \
  --max_output_len 2048

TensorRT-LLM extracts another 10-20 percent throughput over vLLM at the cost of a 5-10 minute build per model and per shape regime. Most production stacks settle on vLLM unless they need the absolute throughput ceiling and are happy to manage the engine cache. See the TFLOPS class piece for measured utilisation by kernel.

Production gotchas

  1. Pre-quantised checkpoints beat runtime calibration. Use neuralmagic’s Meta-Llama-3.1-8B-Instruct-FP8 rather than runtime quant; the calibration pass takes 30-90 seconds at startup and produces slightly worse scaling factors.
  2. FP32 accumulator silently halves your FP16 throughput. Stock PyTorch nn.Linear uses FP32 accum even when inputs are FP16. The 330 TFLOPS FP16-accum number only shows up when you go through vLLM, FlashAttention or torch.compile FP16-accum paths. FP8 always uses FP16 accum on Ada so the hit does not apply once you switch.
  3. FP8 KV at very long context can degrade. E5M2 KV at 128k context can lose 1-2 perplexity points on summarisation. Validate on your eval set before pushing past 32k. For very long context, FP8 weights with FP16 KV is a safer middle ground.
  4. Driver and CUDA version matter. CUDA 12.4+ and driver 550+ are required for the FA3 FP8 kernels. Older drivers silently fall back to FP16 attention with no warning.
  5. Marlin requires AWQ checkpoints with group_size=128. Other group sizes fall through to a slower kernel (~50 percent throughput drop). Always check the model card.
  6. Single-stream decode wastes 90 percent of your tensor cores. FP8 doubles the bandwidth ceiling but does not change the underlying bandwidth-bound regime. Batch your traffic with vLLM continuous batching to extract the silicon you paid for; do not run llama.cpp single-stream in production.
  7. Quantisation is a deployment concern, not a research one. The interesting accuracy work (AWQ, GPTQ, SmoothQuant, FP8 calibration) is mostly done. Use the published checkpoints; do not roll your own unless you have a real reason.

Verdict and when FP8 is the right call

Pick FP8 on a 4090 if any of the following apply: your model is a transformer 7-70B with FP8 weights or AWQ+FP8 KV available; you are serving 1-32 concurrent users where decode is the dominant cost; you need 32k+ context and cannot afford to batch only 1-2 sequences at FP16; you are running a real production LLM workload where the 2x speedup is the difference between one card and two. For a 12-engineer coding team behind a Llama 3.1 8B FP8 backend the answer is unambiguous: FP8 doubles concurrent users at the same hardware cost.

Skip FP8 only when: your model has no FP8 checkpoint and runtime quantisation hurts your eval more than the speedup justifies; you are doing research on training quality where BF16 is the safer reference; your batch is consistently 1 with very short context, where the bandwidth ceiling is already reached at FP16. For a tier map of where FP8 fits in the modern lineup see tier positioning 2026; for a head-to-head with the Ampere-era no-FP8 baseline see 4090 vs 3090.

4th-gen tensor cores, native FP8, hosted in the UK

Hopper-class precision in a consumer envelope, vLLM and Marlin pre-built. UK dedicated hosting.

Order the RTX 4090 24GB

See also: spec breakdown, FP8 Llama deployment, vs RTX 3090 (no FP8), vs RTX 5090 (FP4), Llama 3 8B benchmark, 70B INT4 deployment, AWQ guide.
