The Wrong Precision Format Costs You Either Speed or Accuracy
Your model produces garbage outputs in FP16 because of overflow, runs unnecessarily slowly in FP32, or you're unsure whether FP8 degrades quality too much. Each floating-point format trades numerical range against precision and speed differently, and modern GPUs have specialized hardware for specific precisions. Picking the right format determines whether your GPU server runs inference at its theoretical maximum or leaves performance on the table.
Floating-Point Format Specifications
# FP32 (IEEE 754 single precision)
# Bits: 1 sign + 8 exponent + 23 mantissa = 32 bits
# Range: ±3.4 × 10^38
# Precision: ~7 decimal digits
# Use: Training baseline, when accuracy is paramount
# Size per parameter: 4 bytes
# FP16 (IEEE 754 half precision)
# Bits: 1 sign + 5 exponent + 10 mantissa = 16 bits
# Range: ±65,504
# Precision: ~3.3 decimal digits
# Use: Inference, mixed-precision training
# Size per parameter: 2 bytes
# Risk: Overflow beyond 65,504 → Inf, underflow near zero
# BF16 (Brain Floating Point)
# Bits: 1 sign + 8 exponent + 7 mantissa = 16 bits
# Range: ±3.4 × 10^38 (same as FP32!)
# Precision: ~2.4 decimal digits
# Use: Training and inference where range matters more than precision
# Size per parameter: 2 bytes
# Advantage: Same range as FP32, no overflow issues
# FP8 E4M3 (4 exponent, 3 mantissa)
# Bits: 1 sign + 4 exponent + 3 mantissa = 8 bits
# Range: ±448
# Precision: ~1.2 decimal digits
# Use: Inference on Hopper GPUs, forward pass
# Size per parameter: 1 byte
# FP8 E5M2 (5 exponent, 2 mantissa)
# Bits: 1 sign + 5 exponent + 2 mantissa = 8 bits
# Range: ±57,344
# Precision: ~0.9 decimal digits
# Use: Gradients in FP8 training (wider range needed)
# Size per parameter: 1 byte
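The ranges above follow directly from the bit layouts. A small sketch that recomputes the maximum finite value for each format; note that the FP8 E4M3 used by NVIDIA is the non-IEEE "E4M3FN" variant, which keeps the all-ones exponent for normal numbers (reserving only one NaN pattern) and therefore tops out at 448 instead of 480:

```python
def max_finite(exp_bits, man_bits, ieee=True):
    """Largest finite value for a 1-sign/exp/man floating-point layout.

    IEEE formats reserve the all-ones exponent for Inf/NaN.
    The FP8 E4M3FN variant keeps that exponent for normal numbers and
    only drops the all-ones mantissa (NaN), so its max is (2 - 2*2^-m).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee:
        max_exp = (2 ** exp_bits - 2) - bias       # all-ones exponent reserved
        return (2 - 2 ** -man_bits) * 2.0 ** max_exp
    max_exp = (2 ** exp_bits - 1) - bias           # E4M3FN: all-ones exponent usable
    return (2 - 2 * 2 ** -man_bits) * 2.0 ** max_exp

print(max_finite(8, 23))             # FP32       -> ~3.40e38
print(max_finite(5, 10))             # FP16       -> 65504.0
print(max_finite(8, 7))              # BF16       -> ~3.39e38
print(max_finite(4, 3, ieee=False))  # FP8 E4M3FN -> 448.0
print(max_finite(5, 2))              # FP8 E5M2   -> 57344.0
```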
Throughput Comparison
# Theoretical TFLOPS by precision (NVIDIA GPUs)
#
# GPU        FP32   TF32   FP16   BF16   FP8
# ---------------------------------------------------
# A100       19.5   156    312    312    -
# H100       67     ~500   990    990    1,979
# RTX 4090   82.6   -      330    330    660
#
# FP16 and BF16 have identical throughput on the same hardware
# FP8 doubles throughput again on Hopper/Ada GPUs
# LLM inference tokens/sec comparison (Llama-3-8B, single GPU)
#
# Format   Model Size   A100 tok/s   H100 tok/s
# -----------------------------------------------
# FP32 32 GB ~30 ~45 (often doesn't fit)
# FP16 16 GB ~127 ~200
# BF16 16 GB ~127 ~200
# FP8 8 GB N/A ~380
# INT8 8 GB ~220 ~350
# INT4 4 GB ~350 ~550
#
# FP8 on H100 nearly doubles FP16 throughput
# But requires Hopper Tensor Cores
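The model sizes in the table are just parameter count times bytes per parameter (weights only; the KV cache and activations need additional VRAM on top). A rough estimator:

```python
# Bytes per parameter for each storage format (INT4 packs two params per byte)
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1, "int4": 0.5}

def weight_gb(params_billions, fmt):
    """Approximate weight memory in GB (decimal), weights only."""
    return params_billions * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    print(f"Llama-3-8B in {fmt}: {weight_gb(8, fmt):g} GB")
```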
# Benchmark on your hardware
import torch, time

def precision_benchmark(dtype, label, size=4096, iters=200):
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    torch.mm(a, b)  # warmup so launch/autotune cost isn't timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    tflops = 2 * size**3 * iters / elapsed / 1e12
    print(f"{label}: {tflops:.1f} TFLOPS ({elapsed/iters*1000:.2f} ms)")

precision_benchmark(torch.float32, "FP32")
precision_benchmark(torch.float16, "FP16")
precision_benchmark(torch.bfloat16, "BF16")
Accuracy Tradeoffs
# FP16 overflow problem:
# Values > 65,504 become Inf
# Gradient scaling needed during training
# Some activations in large models naturally exceed this range
import torch
# Demonstrate FP16 overflow
x = torch.tensor(70000.0)
print(x.half()) # tensor(inf) — data lost!
print(x.bfloat16()) # tensor(70144.) — slightly imprecise but finite
# BF16 precision loss:
# Only 7 mantissa bits vs 10 in FP16
# Fine for weights and activations (rarely need 4+ decimal digits)
# Problematic for values that need exact representation
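The mantissa tradeoff is easy to see without GPU dtypes: BF16 conversion can be emulated in pure Python by rounding a float32 bit pattern down to its top 16 bits (round-to-nearest-even), a sketch:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to BF16 by keeping the top 16 bits of its float32
    encoding, with round-to-nearest-even on the discarded lower half."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

print(to_bf16(1.001))    # 1.0, the 0.001 is below BF16's ~2.4-digit resolution
print(to_bf16(70000.0))  # 70144.0, finite where FP16 would overflow to inf
```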
# Practical accuracy comparison (perplexity on validation set):
# Llama-3-8B:
# FP32: 6.14 (baseline)
# BF16: 6.14 (identical — within noise)
# FP16: 6.14 (identical for well-trained models)
# FP8: 6.18 (~0.7% degradation, acceptable)
# INT8: 6.17 (~0.5% degradation)
# INT4: 6.45 (~5% degradation, model-dependent)
# When FP16 causes problems:
# 1. Models trained in FP32 only (rare for modern LLMs)
# 2. Very deep networks with large activation values
# 3. Finetuning with high learning rates
# Solution: Use BF16 instead — same speed, no overflow
Choosing Precision for Your Workload
# Decision tree:
#
# Q: Do you have an FP8-capable GPU (Hopper or Ada Lovelace, e.g. H100, L40S, RTX 4090)?
# └─ Yes → Use FP8 for inference (2x speed, minimal quality loss)
# └─ vllm serve --dtype float16 --quantization fp8
#
# Q: Model trained in BF16?
# └─ Yes → Serve in BF16 (matches training precision)
# └─ vllm serve --dtype bfloat16
#
# Q: Model trained in FP16?
# └─ Yes → Serve in FP16
# └─ vllm serve --dtype float16
#
# Q: Seeing NaN or Inf in outputs?
# └─ Switch from FP16 to BF16 (fixes overflow)
#
# Q: Need maximum throughput, can tolerate ~1% quality loss?
# └─ Use INT8 quantization (AWQ or GPTQ)
# └─ Works on any GPU with INT8 Tensor Cores (Turing and newer)
#
# Q: Need to fit larger model in limited VRAM?
# └─ Use INT4 quantization (AWQ-4bit or GPTQ-4bit)
# └─ ~5% quality loss, 4x memory reduction
# vLLM precision configuration
# FP16 (default, safe choice)
vllm serve meta-llama/Llama-3-70B-Instruct --dtype float16
# BF16 (preferred for models trained in BF16)
vllm serve meta-llama/Llama-3-70B-Instruct --dtype bfloat16
# FP8 on Hopper/Ada GPUs (maximum throughput)
vllm serve meta-llama/Llama-3-70B-Instruct --quantization fp8
# Auto-detect based on model config
vllm serve meta-llama/Llama-3-70B-Instruct --dtype auto
GPU Support Matrix
# Precision support by GPU architecture
#
# Precision   Volta(V100)   Turing(T4)   Ampere(A100)   Hopper(H100)
# -----------------------------------------------------------------
# FP32 CUDA CUDA CUDA CUDA
# TF32 No No Tensor Core Tensor Core
# FP16 Tensor Core Tensor Core Tensor Core Tensor Core
# BF16 No No Tensor Core Tensor Core
# FP8 No No No Tensor Core
# INT8 No Tensor Core Tensor Core Tensor Core
# INT4 No Tensor Core Tensor Core Tensor Core
#
# BF16 requires Ampere or newer
# FP8 requires Hopper or Ada Lovelace
# FP16 works on everything since Volta
# Check your GPU's capabilities
python3 -c "
import torch
props = torch.cuda.get_device_properties(0)
print(f'GPU: {props.name}')
print(f'Compute capability: {props.major}.{props.minor}')
print(f'BF16 support: {props.major >= 8}')
print(f'FP8 support: {(props.major, props.minor) >= (8, 9)}')
"
Precision format directly determines inference throughput and model quality on your GPU server. See real throughput numbers in our token benchmarks. Deploy with the right precision in vLLM using the production guide. Set up PyTorch with our installation guide. Monitor utilization via our monitoring setup. Explore more benchmarks and tutorials.
Precision-Optimized GPU Servers
GigaGPU dedicated servers with RTX 6000 Pro GPUs supporting FP16, BF16, and FP8. Run inference at the speed your precision allows.
Browse GPU Servers