FlashAttention-3 (released July 2024 by Tri Dao and collaborators at Colfax, Meta, NVIDIA, Princeton, and Together AI) brings major throughput improvements over FA-2 on Hopper / Blackwell GPUs: async warp specialisation, native FP8 attention, and better tensor-core utilisation. The vLLM integration arrived in late 2024 and is mature in 2026.
FA-3 vs FA-2 on Hopper / Blackwell: roughly 1.5-2× faster attention compute and ~40% lower memory use, with a native FP8 path. Available via vLLM 0.6.3+ on compatible GPUs (H100 / H200 / B200 / RTX 5090 / RTX 6000 Pro). The throughput gain matters most for long-context inference, where attention dominates.
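As a starting point, here is a minimal sketch of serving a model with FP8 weights through vLLM's offline API. The model name, context length, and sampling settings are illustrative; on supported GPUs, recent vLLM builds pick FA-3 kernels automatically.

```python
# Minimal sketch: Llama 3.1 8B with FP8 weights via vLLM's offline API.
# Model and parameters are illustrative, not a tuned production config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",                        # dynamic FP8 weight quantization
    max_model_len=32768,                       # long context is where FA-3 pays off most
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FlashAttention-3 in one paragraph."], params)
print(outputs[0].outputs[0].text)
```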
What changed
- Async warp specialisation: separate warps for compute vs memory; better tensor core utilisation
- Native FP8 attention: previous FA implementations computed attention in FP16 even with FP8 weights; FA-3 supports FP8 throughout (see the sketch after this list)
- Better Hopper async pipelining: more overlap of compute and memory operations
- Memory layout improvements: smaller KV-cache memory footprint
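A hedged sketch of opting into the FlashAttention backend and the FP8 path from Python: `VLLM_ATTENTION_BACKEND` and `kv_cache_dtype` are real vLLM knobs, while `VLLM_FLASH_ATTN_VERSION` should be treated as an assumption and checked against your vLLM version's documentation.

```python
# Sketch: requesting FlashAttention kernels and an FP8 KV cache in vLLM.
import os

# Ask vLLM for the FlashAttention backend; set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
# Request FA-3 where the hardware supports it (assumption: present in recent vLLM builds).
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",                        # FP8 weights
    kv_cache_dtype="fp8",                      # FP8 KV cache, pairs with FA-3's FP8 path
)
# vLLM logs the selected attention backend at startup; confirm FA-3 kernels are
# active on your GPU before trusting any benchmark numbers.
```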
Performance
On an H100 / RTX 5090 / RTX 6000 Pro running Llama 3.1 8B in FP8:
- FA-2: ~1,800 tok/s aggregate at batch size 16
- FA-3: ~3,000 tok/s aggregate at batch size 16 (~1.7× on this workload)
The advantage scales with context length (a rough way to probe this is sketched after the list):
- Short context (1K): ~1.3× faster
- Medium context (8K): ~1.6× faster
- Long context (32K): ~1.9× faster
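The following is a rough throughput probe across context lengths, under the same batch-16 setup as the numbers above. The synthetic prompt construction is a crude stand-in for real inputs, and the absolute numbers will vary with GPU, vLLM version, and whether FA-3 or FA-2 kernels are active.

```python
# Rough sketch: aggregate output-token throughput at increasing context lengths.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    quantization="fp8",
    max_model_len=32768,
)
params = SamplingParams(temperature=0.0, max_tokens=128)

for ctx_words in (1_000, 8_000, 32_000):
    prompt = "word " * ctx_words   # crude long-prompt stand-in (~1 token per word)
    batch = [prompt] * 16          # batch size 16, matching the figures above

    start = time.perf_counter()
    outputs = llm.generate(batch, params)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{ctx_words:>6}-word context: {generated / elapsed:,.0f} output tok/s aggregate")
```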
Availability
- Hopper (H100, H200): full FA-3 support
- Blackwell consumer / workstation (RTX 5090, RTX 6000 Pro): full support; vLLM 0.6.4+
- Blackwell server (B100, B200): full support
- Ada Lovelace (RTX 4090, L40S, etc.): not supported; stays on FA-2
- Ampere (RTX 3090, A100): not supported; stays on FA-2 (see the capability check after this list)
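One way to encode this support matrix at deploy time is a compute-capability check: Hopper is sm_90 and Blackwell is sm_100 / sm_120, while Ada (sm_89) and Ampere (sm_80 / sm_86) fall below the cutoff. The helper name and threshold are illustrative, not part of any library API.

```python
# Sketch: decide whether a GPU can use FA-3 based on its compute capability.
import torch


def fa3_capable(device: int = 0) -> bool:
    """True if the GPU is Hopper (sm_90) or newer; Ada/Ampere stay on FA-2."""
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (9, 0)


if __name__ == "__main__":
    name = torch.cuda.get_device_name(0)
    verdict = "FA-3 capable" if fa3_capable() else "falls back to FA-2"
    print(f"{name}: {verdict}")
```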
Verdict
For Hopper / Blackwell deployments, FA-3 is a substantial throughput win — mostly automatic via vLLM. For older hardware, FA-2 remains the standard. The gap will widen as more workloads adopt FA-3-aware tooling. Plan new deployments around Blackwell for the FA-3 advantage; older hardware works but loses the ~1.5-2× multiplier.
Bottom line
FA-3 is a large part of Blackwell's throughput advantage. See Blackwell vs Ada.