Tensor cores are the silicon blocks that actually accelerate AI workloads. On the RTX 5060 Ti 16GB they are 5th-generation Blackwell cores, bringing new data formats and roughly 2x per-clock throughput over Ada on our dedicated hosting.
Generations
| Generation | Arch | Key Feature Added |
|---|---|---|
| 1st | Volta | FP16 matmul acceleration |
| 2nd | Turing | INT8 / INT4 support |
| 3rd | Ampere | TF32, BF16, structured sparsity |
| 4th | Ada / Hopper | FP8 on Hopper (H100), not Ada consumer |
| 5th (current) | Blackwell | Native FP8 on consumer, improved sparsity, FP4 preview |
Format Support
| Format | Support | Use Case |
|---|---|---|
| FP32 | Scalar only (not tensor) | Legacy, rarely used |
| TF32 | Yes | Mixed precision training (Ampere+) |
| BF16 | Yes, improved | Training default |
| FP16 | Yes | Legacy inference |
| FP8 E4M3 | Native | Inference weights + activations |
| FP8 E5M2 | Native | Training gradients |
| INT8 | Native fast path | AWQ/GPTQ quantised inference |
| INT4 | Marlin kernels | Aggressive quantisation |
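To make the table concrete in memory terms, here is a minimal sketch (the constants and helper name are ours, not from any library) that converts a parameter count into a weight footprint per format:

```python
# Approximate bytes per parameter for the formats above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "TF32": 4.0,   # stored as FP32, computed at reduced precision
    "BF16": 2.0,
    "FP16": 2.0,
    "FP8": 1.0,    # E4M3 or E5M2
    "INT8": 1.0,
    "INT4": 0.5,
}

def weight_footprint_gb(n_params: float, fmt: str) -> float:
    """Weight memory in GB; ignores KV cache, activations and runtime overhead."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

# A 7B-parameter model at three precisions:
for fmt in ("FP16", "FP8", "INT4"):
    print(fmt, weight_footprint_gb(7e9, fmt), "GB")
# FP16 14.0 GB / FP8 7.0 GB / INT4 3.5 GB
```

The halving from FP16 to FP8 is what makes 7B-class models comfortable on a 16GB card.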
Sparsity
2:4 structured sparsity means exactly half the weights in each group of 4 are zero. Nvidia’s tensor cores skip the zeros, delivering 2x effective throughput for compatible models. Few production models use this yet, but the hardware supports it for:
- Models published with built-in 2:4 sparsity (emerging)
- Post-training sparsification of existing models
- Future architectures that target sparse compute
Not a factor today but hardware-ready for when it becomes relevant.
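The pruning rule itself is simple. A pure-Python sketch (our own illustration, not a real kernel — hardware implementations like cuSPARSELt work on packed tensors) that enforces the 2:4 pattern by keeping the two largest-magnitude weights in each group of four:

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four (2:4 pattern)."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # The two largest-magnitude entries in the group survive.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]))
# -> [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the zero positions are constrained to two per group of four, the hardware can store the surviving weights densely plus a small index, which is what enables the 2x throughput.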
Work With Standard CUDA Kernels
Tensor cores are accessed through cuBLAS, cuDNN, and custom kernels in libraries like Flash Attention and Triton. Your Python code using PyTorch automatically dispatches to tensor cores when shapes and dtypes match supported formats. vLLM, TGI, and SGLang all use tensor cores transparently.
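In practice this dispatch is invisible. A minimal PyTorch sketch (assuming PyTorch is installed; it falls back to CPU autocast when no GPU is present) showing the two knobs involved — opting in to TF32 for FP32 matmuls, and autocast for mixed precision:

```python
import torch

# Opt in to TF32 tensor-core paths for FP32 matmuls (Ampere and later).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)

# autocast selects a reduced-precision kernel for the matmul; on a GPU
# with tensor cores, cuBLAS routes it to them automatically.
with torch.autocast(device_type=device, dtype=amp_dtype):
    c = a @ b

print(c.shape, c.dtype)
```

No kernel code is written by hand here; the library stack chooses the tensor-core path whenever the operation qualifies.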
AI Impact
For mainstream workloads today, the biggest win is FP8. Running Mistral 7B or Llama 3 8B in FP8 on the 5060 Ti delivers ~1.7-2x the throughput of FP16 while using half the memory. The next biggest win is improved BF16 training speed for fine-tuning – 5th-gen tensor cores are ~15% faster than 4th-gen at the same clock.
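The memory half of that win is easy to check with arithmetic. A rough sketch (the helper and its 2 GB runtime/KV-cache allowance are our own assumptions, not measured figures) of whether a model's weights fit in 16GB:

```python
def fits_in_vram(n_params, bytes_per_param, vram_gb=16, overhead_gb=2.0):
    """Rough check: weights plus a fixed allowance for runtime and KV cache."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb + overhead_gb <= vram_gb, weights_gb

print(fits_in_vram(8e9, 2))  # Llama 3 8B in FP16: 16 GB of weights alone -> too tight
print(fits_in_vram(8e9, 1))  # same model in FP8: 8 GB of weights -> ample headroom
```

The throughput half comes from FP8 tensor-core matmuls being natively faster, so the ~1.7-2x figure combines both effects.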
Blackwell Tensor Cores
FP8 native, modern formats, production speed. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: FP8 deep dive, TFLOPS comparison.