Tensor cores are the units that make GPU AI cheap. Understanding the generations explains why one card is faster than another at the same TFLOPS rating.
Tensor cores accelerate matrix multiplications, which dominate LLM inference compute. 3rd gen (Ampere): FP16/BF16/INT8. 4th gen (Ada and Hopper): adds native FP8. 5th gen (Blackwell): adds native FP4. Each generation roughly doubles useful tensor throughput.
What tensor cores do
Tensor cores are specialised matrix-multiplication accelerators. A single tensor core performs one small fused multiply-accumulate, D = A×B + C, on 4×4 matrices per clock cycle — far faster than running the same maths on CUDA cores.
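The operation above can be sketched in plain Python for illustration (this is not GPU code; real kernels issue these ops through libraries such as cuBLAS, and the hardware does the whole tile in one cycle):

```python
def tensor_core_op(A, B, C):
    """Compute D = A @ B + C for 4x4 matrices given as lists of lists.

    This is the fused multiply-accumulate a single tensor core performs
    in one clock cycle; here it is spelled out loop by loop.
    """
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]          # start from the accumulator tile C
            for k in range(n):
                acc += A[i][k] * B[k][j]  # dot product of row i and column j
            D[i][j] = acc
    return D

# Quick check: identity times identity plus zeros gives the identity back.
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z = [[0.0] * 4 for _ in range(4)]
print(tensor_core_op(I, I, Z)[0][0])  # -> 1.0
```

The point of the hardware is that all 64 multiplies and the additions happen at once, which is where the large speedup over general-purpose CUDA cores comes from.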
Generations
| Gen | Cards | Native precisions | Notes |
|---|---|---|---|
| 3rd (Ampere) | A100, RTX 30-series | FP16, BF16, INT8 | 2:4 structured sparsity supported |
| 4th (Ada) | RTX 40-series, L40S | FP16, BF16, INT8, FP8 | FP8 is native per NVIDIA's Ada whitepaper |
| 4th (Hopper) | H100 | FP16, BF16, FP8 | Datacenter only; Transformer Engine |
| 5th (Blackwell) | RTX 50-series, RTX 6000 Pro | FP16, BF16, FP8, FP4 | FP4 is the new headline |
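The table can be turned into a small lookup, handy when deciding whether a quantised model will run in hardware on a given card. This is an illustrative sketch, not an NVIDIA API; the architecture names and precision sets are taken from the table above:

```python
# Native tensor-core precisions per architecture, per the table above.
# (FP8 on Ada is listed as native per NVIDIA's Ada whitepaper.)
NATIVE_PRECISIONS = {
    "ampere":    {"fp16", "bf16", "int8"},
    "ada":       {"fp16", "bf16", "int8", "fp8"},
    "hopper":    {"fp16", "bf16", "fp8"},
    "blackwell": {"fp16", "bf16", "fp8", "fp4"},
}

def runs_natively(arch: str, dtype: str) -> bool:
    """True if the tensor cores of `arch` accelerate `dtype` in hardware.

    Unknown architectures return False rather than raising, since the
    safe assumption for an unrecognised card is a software fallback.
    """
    return dtype in NATIVE_PRECISIONS.get(arch.lower(), set())

print(runs_natively("ada", "fp4"))        # -> False
print(runs_natively("blackwell", "fp4"))  # -> True
```

A dtype that is absent from the set still runs, but through a slower software path — which is exactly the gap the generational comparison is about.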
Verdict
Tensor-core generation matters more for AI workloads than raw CUDA core count. Native FP8 and FP4 hardware is the practical AI advantage of newer cards.
Bottom line
Pick by tensor-core generation, not just TFLOPS. See Blackwell architecture overview.