Mixed-precision training runs the forward and backward passes in a 16-bit floating-point format while keeping master weights and accumulations in 32-bit for stability. Two 16-bit formats are in common use: FP16 (IEEE half precision) and BF16 (brain floating point). On our dedicated GPU hosting, BF16 is the right default for any modern GPU.
The Difference
Both are 16-bit. They allocate bits differently:
| Format | Sign | Exponent | Mantissa | Range |
|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | ~1e-38 to ~3e38 |
| FP16 | 1 | 5 | 10 | ~6e-5 to ~65504 |
| BF16 | 1 | 8 | 7 | Same range as FP32 |
BF16 has the same dynamic range as FP32 but fewer bits of precision. FP16 has a narrow range – values much below ~6e-5 underflow toward zero, and values above ~65504 overflow to infinity.
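A quick way to see the difference is to cast the same out-of-range values into each format. The sketch below uses PyTorch purely for illustration (any recent torch install, no GPU required):

```python
import torch

# One value far below FP16's representable range, one above its ~65504 maximum.
tiny, huge = 1e-8, 1e5

for dtype in (torch.float16, torch.bfloat16):
    t = torch.tensor([tiny, huge], dtype=dtype)
    print(dtype, t)

# Expected behaviour:
#   float16  -> tensor([0., inf])  – underflow and overflow
#   bfloat16 -> both values survive, just with ~3 significant digits of precision
```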
Stability
LLM gradients and activations frequently contain very small or very large values. In FP16 these underflow or overflow and training diverges. FP16 training historically required loss scaling tricks (scale loss by 1024, unscale gradients) to work around this. BF16 has the range to represent these values directly – no loss scaling needed.
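For context, the legacy FP16 path looks roughly like this – a minimal sketch of one training step using PyTorch's autocast and GradScaler, where `model`, `optimizer`, and `dataloader` are placeholders rather than anything from this guide:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling, needed for FP16 only

for batch in dataloader:              # placeholder dataloader
    optimizer.zero_grad()
    # Forward pass in FP16; GradScaler multiplies the loss by a large factor
    # so small gradients do not underflow to zero during backward.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss    # placeholder model
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscales gradients, skips the step on inf/NaN
    scaler.update()                   # grows or shrinks the scale factor
```

With BF16 none of this machinery is required – the unscaled loss and gradients already fit in range.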
On Llama-class models, FP16 without loss scaling diverges within 100 steps. BF16 trains cleanly.
Hardware
| Architecture | BF16 Native |
|---|---|
| Ampere (3090) | Yes |
| Ada (4060 Ti) | Yes |
| Blackwell (5080, 5090, 6000 Pro) | Yes, full-speed |
| AMD CDNA / RDNA 3+ | Yes |
| Pascal / Volta | No – fall back to FP16 with scaling |
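If you are unsure what a given card supports, PyTorch exposes a runtime check – a small sketch assuming a CUDA-enabled torch build:

```python
import torch

if torch.cuda.is_available():
    # True on Ampere and newer NVIDIA cards, and on supported AMD parts.
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Compute capability:", torch.cuda.get_device_capability())
```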
Configuration
```python
from trl import SFTConfig

training_args = SFTConfig(
    bf16=True,
    fp16=False,
    # ...
)
```
In 2026 every GPU in our lineup supports BF16 natively. Always prefer bf16=True over fp16=True. The only reasons to use FP16 today are legacy code paths or pre-Ampere hardware outside our lineup.
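If a script does have to run on both modern and pre-Ampere cards, one common pattern is to pick the flag at runtime – a sketch reusing the same SFTConfig as above:

```python
import torch
from trl import SFTConfig

# Prefer BF16 wherever the hardware supports it; otherwise fall back to FP16,
# in which case the Trainer applies loss scaling internally.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = SFTConfig(
    bf16=use_bf16,
    fp16=not use_bf16,
    # ...
)
```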
BF16-Ready GPU Servers
Every UK dedicated card supports BF16 natively – no loss-scaling workarounds needed.
Browse GPU Servers