
FlashAttention-3 Impact

FlashAttention-3 (2024) brings ~1.5-2× throughput improvement over FA-2 on Hopper / Blackwell. Real numbers and what changed.

FlashAttention-3 (released July 2024 by Tri Dao and collaborators at Colfax, Meta, NVIDIA, Princeton and Together AI) brings major throughput improvements over FA-2 on Hopper / Blackwell GPUs. The gains come from async warp specialisation, native FP8 attention, and better tensor-core utilisation. The vLLM integration arrived in late 2024 and is mature in 2026.

TL;DR

FA-3 vs FA-2 on Hopper / Blackwell: ~1.5-2× faster attention compute, ~40% lower memory. Native FP8 path. Available via vLLM 0.6.3+ for compatible GPUs (H100 / H200 / B200 / RTX 5090 / RTX 6000 Pro). Throughput improvement is meaningful for long-context inference where attention dominates.

What changed

  • Async warp specialisation: separate warps for compute vs memory; better tensor core utilisation
  • Native FP8 attention: previous FA implementations did FP16 attention even with FP8 weights; FA-3 supports FP8 throughout
  • Better Hopper async pipelining: more overlap of compute and memory operations
  • Memory layout improvements: lower KV cache memory
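To put a rough number on the memory side, here is a back-of-envelope KV-cache sizing sketch (my own arithmetic, not figures from the FA-3 paper; model shape assumed to match Llama 3.1 8B: 32 layers, 8 KV heads under GQA, head dim 128). Storing keys/values in FP8 rather than FP16 halves the cache:

```python
# Back-of-envelope KV-cache sizing (illustrative; shape assumed for
# Llama 3.1 8B: 32 layers, 8 KV heads under GQA, head dim 128).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 32_768
fp16 = kv_cache_bytes(ctx, bytes_per_elem=2)
fp8 = kv_cache_bytes(ctx, bytes_per_elem=1)
print(f"FP16 KV cache @ 32K ctx: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
print(f"FP8  KV cache @ 32K ctx: {fp8 / 2**30:.1f} GiB")   # 2.0 GiB
```

That is the per-request cache at full context; multiply by the batch size for aggregate pressure.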

Performance

On H100 / RTX 5090 / RTX 6000 Pro, Llama 3.1 8B FP8:

  • FA-2: ~1,800 tok/s aggregate batch 16
  • FA-3: ~3,000 tok/s aggregate batch 16 (~1.7× on this workload)
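Sanity-checking the headline figures is simple arithmetic on the numbers above:

```python
# Figures from the benchmark above (Llama 3.1 8B FP8, batch 16).
fa2_aggregate = 1_800  # tok/s
fa3_aggregate = 3_000  # tok/s
batch = 16

speedup = fa3_aggregate / fa2_aggregate
per_stream_fa3 = fa3_aggregate / batch  # tok/s per concurrent request

print(f"speedup: {speedup:.2f}x")                      # 1.67x
print(f"per-stream FA-3: {per_stream_fa3:.1f} tok/s")  # 187.5 tok/s
```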

The advantage scales with context length:

  • Short context (1K): ~1.3× faster
  • Medium context (8K): ~1.6× faster
  • Long context (32K): ~1.9× faster
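The scaling makes sense from first principles: attention's score/value matmuls grow quadratically with sequence length while the projections and MLP grow linearly, so attention's share of total FLOPs (and hence the leverage of a faster attention kernel) rises with context. A rough per-layer FLOP model sketches this (assumptions: hidden size 4096, 4× MLP expansion, GQA and other details ignored):

```python
# Rough per-layer matmul FLOP model for one forward pass over a sequence.
# Assumed shape: hidden size 4096, 4x MLP expansion; GQA ignored.
def attention_flop_fraction(seq_len, d_model=4096, mlp_mult=4):
    quadratic = 4 * seq_len**2 * d_model           # QK^T scores + attn @ V
    linear = 8 * seq_len * d_model**2              # Q/K/V/O projections
    mlp = 4 * mlp_mult * seq_len * d_model**2      # up + down projections
    return quadratic / (quadratic + linear + mlp)

for n in (1024, 8192, 32768):
    frac = attention_flop_fraction(n)
    print(f"{n:>6} ctx: attention = {frac:.0%} of layer FLOPs")
# Attention goes from a few percent of FLOPs at 1K context to the
# majority at 32K, which is where a ~2x attention kernel pays off most.
```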

Availability

  • Hopper (H100, H200): full FA-3 support
  • Blackwell consumer / workstation (RTX 5090, RTX 6000 Pro): full support from vLLM 0.6.4+
  • Blackwell server (B100, B200): full support
  • Ada Lovelace (RTX 4090, etc.): not supported; stays on FA-2
  • Ampere (RTX 3090, A100): not supported
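If you want to pin the backend explicitly rather than rely on auto-detection, vLLM exposes environment variables for this. The variable names below are taken from the vLLM docs as I understand them; treat them as assumptions and verify against your installed version:

```python
# Sketch: steering vLLM's attention backend via environment variables.
# Variable names assumed from vLLM documentation; verify for your version.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # use the FlashAttention backend
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"          # prefer FA-3 where supported

# On unsupported GPUs (Ada / Ampere), vLLM falls back to FA-2 on its own.
# from vllm import LLM  # needs a compatible GPU to actually run
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```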

Verdict

For Hopper / Blackwell deployments, FA-3 is a substantial throughput win — mostly automatic via vLLM. For older hardware, FA-2 remains the standard. The gap will widen as more workloads adopt FA-3-aware tooling. Plan new deployments around Blackwell for the FA-3 advantage; older hardware works but loses the ~1.5-2× multiplier.

Bottom line

FA-3 is a big part of Blackwell's throughput advantage over Ada. See Blackwell vs Ada.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers
