FlashAttention-3 (released July 2024 by Tri Dao and collaborators at Colfax, Meta, NVIDIA, Princeton, and Together AI) brings major throughput improvements over FA-2 on Hopper / Blackwell GPUs: async warp specialisation, native FP8 attention, and better tensor-core utilisation. The vLLM integration arrived in late 2024 and is mature in 2026.
FA-3 vs FA-2 on Hopper / Blackwell: roughly 1.5-2× faster attention compute and ~40% lower memory use, with a native FP8 path. Available via vLLM 0.6.3+ on compatible GPUs (H100 / H200 / B200 / RTX 5090 / RTX 6000 Pro). The throughput gain matters most for long-context inference, where attention dominates.
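As a starting point, here is a minimal sketch of serving a model with FP8 weights through vLLM's offline API. The model name, context length, and sampling settings are illustrative; on supported GPUs, recent vLLM builds pick FA-3 kernels automatically.

```python
# Minimal sketch: Llama 3.1 8B with FP8 weights via vLLM's offline API.
# Model and parameters are illustrative, not a tuned production config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",                        # dynamic FP8 weight quantization
    max_model_len=32768,                       # long context is where FA-3 pays off most
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FlashAttention-3 in one paragraph."], params)
print(outputs[0].outputs[0].text)
```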
What changed
- Async warp specialisation: separate warps for compute vs memory; better tensor core utilisation
- Native FP8 attention: previous FA implementations computed attention in FP16 even with FP8 weights; FA-3 supports FP8 throughout (see the sketch after this list)
- Better Hopper async pipelining: more overlap of compute and memory operations
- Memory layout improvements: smaller KV-cache memory footprint
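A hedged sketch of opting into the FlashAttention backend and the FP8 path from Python: `VLLM_ATTENTION_BACKEND` and `kv_cache_dtype` are real vLLM knobs, while `VLLM_FLASH_ATTN_VERSION` should be treated as an assumption and checked against your vLLM version's documentation.

```python
# Sketch: requesting FlashAttention kernels and an FP8 KV cache in vLLM.
import os

# Ask vLLM for the FlashAttention backend; set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
# Request FA-3 where the hardware supports it (assumption: present in recent vLLM builds).
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",                        # FP8 weights
    kv_cache_dtype="fp8",                      # FP8 KV cache, pairs with FA-3's FP8 path
)
# vLLM logs the selected attention backend at startup; confirm FA-3 kernels are
# active on your GPU before trusting any benchmark numbers.
```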
Performance
On an H100 / RTX 5090 / RTX 6000 Pro running Llama 3.1 8B in FP8:
- FA-2: ~1,800 tok/s aggregate at batch size 16
- FA-3: ~3,000 tok/s aggregate at batch size 16 (~1.7× on this workload)
The advantage scales with context length (a rough way to probe this is sketched after the list):
- Short context (1K): ~1.3× faster
- Medium context (8K): ~1.6× faster
- Long context (32K): ~1.9× faster
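The following is a rough throughput probe across context lengths, under the same batch-16 setup as the numbers above. The synthetic prompt construction is a crude stand-in for real inputs, and the absolute numbers will vary with GPU, vLLM version, and whether FA-3 or FA-2 kernels are active.

```python
# Rough sketch: aggregate output-token throughput at increasing context lengths.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    quantization="fp8",
    max_model_len=32768,
)
params = SamplingParams(temperature=0.0, max_tokens=128)

for ctx_words in (1_000, 8_000, 32_000):
    prompt = "word " * ctx_words   # crude long-prompt stand-in (~1 token per word)
    batch = [prompt] * 16          # batch size 16, matching the figures above

    start = time.perf_counter()
    outputs = llm.generate(batch, params)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{ctx_words:>6}-word context: {generated / elapsed:,.0f} output tok/s aggregate")
```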
Availability
- Hopper (H100, H200): full FA-3 support
- Blackwell consumer / workstation (RTX 5090, RTX 6000 Pro): full support; vLLM 0.6.4+
- Blackwell server (B100, B200): full support
- Ada Lovelace (RTX 4090, L40S, etc.): not supported; stays on FA-2
- Ampere (RTX 3090, A100): not supported; stays on FA-2 (see the capability check after this list)
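One way to encode this support matrix at deploy time is a compute-capability check: Hopper is sm_90 and Blackwell is sm_100 / sm_120, while Ada (sm_89) and Ampere (sm_80 / sm_86) fall below the cutoff. The helper name and threshold are illustrative, not part of any library API.

```python
# Sketch: decide whether a GPU can use FA-3 based on its compute capability.
import torch


def fa3_capable(device: int = 0) -> bool:
    """True if the GPU is Hopper (sm_90) or newer; Ada/Ampere stay on FA-2."""
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (9, 0)


if __name__ == "__main__":
    name = torch.cuda.get_device_name(0)
    verdict = "FA-3 capable" if fa3_capable() else "falls back to FA-2"
    print(f"{name}: {verdict}")
```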
Verdict
For Hopper / Blackwell deployments, FA-3 is a substantial throughput win — mostly automatic via vLLM. For older hardware, FA-2 remains the standard. The gap will widen as more workloads adopt FA-3-aware tooling. Plan new deployments around Blackwell for the FA-3 advantage; older hardware works but loses the ~1.5-2× multiplier.
Bottom line
FA-3 is a large part of Blackwell's throughput advantage. See Blackwell vs Ada.