
Mixed Precision – BF16 vs FP16 for Training

BF16 is the right default on modern GPUs. FP16 is legacy. The difference matters for numerical stability in LLM training.

Mixed-precision training runs most of the matrix math in a 16-bit floating-point format while keeping a 32-bit master copy of the weights and accumulating in 32-bit for stability. Two 16-bit formats exist: FP16 (IEEE 754 half precision) and BF16 (brain floating point). On our dedicated GPU hosting, BF16 is the right default for any modern GPU.
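To see why accumulation stays in 32-bit, here is a minimal sketch that emulates the rounding in pure Python. The helpers `to_bf16` and `to_fp32` are illustrative names written for this example, not a library API, and real hardware accumulates inside the tensor cores rather than in a Python loop.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to BF16 by keeping the top 16 bits of the FP32 encoding
    (round-to-nearest-even). NaN inputs are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = to_bf16(0.001)   # a small update, already stored in BF16
acc_bf16 = 1.0            # accumulator kept in BF16
acc_fp32 = 1.0            # accumulator kept in FP32

for _ in range(1000):
    acc_bf16 = to_bf16(acc_bf16 + update)  # each sum rounds back to 1.0
    acc_fp32 = to_fp32(acc_fp32 + update)  # each sum is retained

print("BF16 accumulator:", acc_bf16)
print("FP32 accumulator:", acc_fp32)
```

Near 1.0 the spacing between adjacent BF16 values is about 0.0078, so an update of roughly 0.001 is rounded away on every step, while the FP32 accumulator ends up close to 2.0.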

The Difference

Both are 16-bit. They allocate bits differently:

Format  Sign  Exponent  Mantissa  Range
FP32    1     8         23        ~1e-38 to ~3e38
FP16    1     5         10        ~6e-5 to ~65504
BF16    1     8         7         Same range as FP32

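BF16 has the same dynamic range as FP32 but lower precision (7 mantissa bits instead of 23, roughly two to three significant decimal digits). FP16 has a narrow range: values below ~6e-5 fall into subnormals and lose precision, reaching zero below ~6e-8, and values above ~65504 overflow to infinity.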
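The range and precision trade-off can be checked with a small emulation that uses only the Python standard library. This is a sketch under assumptions: `to_fp16` and `to_bf16` are hypothetical helper names, FP16 rounding relies on the `struct` module's half-precision `e` format, and BF16 is emulated by rounding the FP32 bit pattern to its top 16 bits.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Round to BF16 (top 16 bits of FP32, round-to-nearest-even).
    NaN inputs are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A tiny value, a large value, and a value that needs fine precision.
for value in (1e-8, 70000.0, 1.001):
    print(f"{value!r:>10}  fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

The tiny and large inputs survive in BF16 but not in FP16, while 1.001 keeps more of its digits in FP16 than in BF16, which is the trade-off the table describes.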

Stability

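LLM gradients and activations frequently contain very small or very large values. In FP16 these underflow or overflow and training diverges. FP16 training has historically relied on loss scaling to work around this: multiply the loss by a constant such as 1024 before the backward pass so that small gradients land inside FP16's representable range, then divide the gradients by the same constant before the optimizer step. BF16 has the range to represent these values directly, so no loss scaling is needed.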
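The loss-scaling workaround can be illustrated with a single emulated gradient value. This is a simplified sketch, not a training loop: `to_fp16` is a hypothetical helper built on the standard library's half-precision `struct` format, the gradient value is made up for illustration, and the scale of 1024 is the example constant from the text.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

LOSS_SCALE = 1024.0
true_grad = 1e-8  # illustrative gradient, too small for FP16

# Without scaling, the gradient is flushed to zero when stored in FP16.
naive = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (backprop is
# linear), so the FP16 tensor holds true_grad * LOSS_SCALE instead.
scaled = to_fp16(true_grad * LOSS_SCALE)

# Unscale in higher precision before the optimizer step.
recovered = scaled / LOSS_SCALE

print("stored without scaling:", naive)
print("stored with scaling:   ", scaled)
print("after unscaling:       ", recovered)
```

The recovered value is close to the original gradient but not identical, because the scaled value is still rounded to FP16 before it is unscaled.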

On Llama-class models, FP16 without loss scaling diverges within 100 steps. BF16 trains cleanly.

Hardware

Architecture                        BF16 Native
Ampere (3090)                       Yes
Ada (4060 Ti)                       Yes
Blackwell (5080, 5090, 6000 Pro)    Yes, full-speed
AMD CDNA / RDNA 3+                  Yes
Pascal / Volta                      No, fall back to FP16 with scaling
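For NVIDIA cards, the table above can be turned into a simple check on compute capability. This is a sketch under assumptions: `supports_bf16` and `pick_precision` are hypothetical helper names, the rule "major version 8 or higher" reflects Ampere being compute capability 8.x, and AMD hardware is not covered. In PyTorch the capability tuple could come from `torch.cuda.get_device_capability()`, and `torch.cuda.is_bf16_supported()` offers a direct check.

```python
def supports_bf16(capability: tuple[int, int]) -> bool:
    """Return True for NVIDIA compute capability 8.0 and later.

    Assumption: Ampere (8.x), Ada (8.9) and Blackwell support BF16
    natively, while Pascal (6.x) and Volta (7.0) do not.
    """
    major, _minor = capability
    return major >= 8

def pick_precision(capability: tuple[int, int]) -> dict[str, bool]:
    """Choose the bf16/fp16 flags to pass to the trainer config."""
    if supports_bf16(capability):
        return {"bf16": True, "fp16": False}
    # Older cards: fall back to FP16, which needs loss scaling.
    return {"bf16": False, "fp16": True}

# Example capabilities: RTX 3090 is (8, 6), V100 is (7, 0).
for name, cap in (("RTX 3090", (8, 6)), ("V100", (7, 0))):
    print(name, pick_precision(cap))
```

The returned dictionary can be unpacked into a config constructor with `**`, so one script can run on both newer and older cards.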

Configuration

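from trl import SFTConfig

training_args = SFTConfig(
    bf16=True,   # train with BF16 mixed precision
    fp16=False,  # keep FP16 mixed precision off
    # ... remaining arguments unchanged
)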

In 2026 every GPU in our lineup supports BF16 natively. Always prefer bf16=True over fp16=True. The only reason to use FP16 today is legacy code or running on pre-Ampere hardware outside our lineup.

See gradient checkpointing and Flash Attention setup.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.