The Wrong Precision Format Costs You Either Speed or Accuracy
Your model produces garbage outputs in FP16 because of overflow, runs unnecessarily slowly in FP32, or you're unsure whether FP8 degrades quality too much. Each floating-point format trades numerical range against precision and speed differently, and modern GPUs have specialized hardware for specific precisions. Picking the right format determines whether your GPU server runs inference at its theoretical maximum or leaves performance on the table.
Floating-Point Format Specifications
# FP32 (IEEE 754 single precision)
# Bits: 1 sign + 8 exponent + 23 mantissa = 32 bits
# Range: ±3.4 × 10^38
# Precision: ~7 decimal digits
# Use: Training baseline, when accuracy is paramount
# Size per parameter: 4 bytes
# FP16 (IEEE 754 half precision)
# Bits: 1 sign + 5 exponent + 10 mantissa = 16 bits
# Range: ±65,504
# Precision: ~3.3 decimal digits
# Use: Inference, mixed-precision training
# Size per parameter: 2 bytes
# Risk: Overflow beyond 65,504 → Inf, underflow near zero
# BF16 (Brain Floating Point)
# Bits: 1 sign + 8 exponent + 7 mantissa = 16 bits
# Range: ±3.4 × 10^38 (same as FP32!)
# Precision: ~2.4 decimal digits
# Use: Training and inference where range matters more than precision
# Size per parameter: 2 bytes
# Advantage: Same range as FP32, no overflow issues
# FP8 E4M3 (4 exponent, 3 mantissa)
# Bits: 1 sign + 4 exponent + 3 mantissa = 8 bits
# Range: ±448
# Precision: ~1.2 decimal digits
# Use: Inference on Hopper GPUs, forward pass
# Size per parameter: 1 byte
# FP8 E5M2 (5 exponent, 2 mantissa)
# Bits: 1 sign + 5 exponent + 2 mantissa = 8 bits
# Range: ±57,344
# Precision: ~0.9 decimal digits
# Use: Gradients in FP8 training (wider range needed)
# Size per parameter: 1 byte
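The ranges above follow directly from the bit layouts. A small sketch that recomputes the maximum finite value for each format; note that the FP8 E4M3 used by NVIDIA is the non-IEEE "E4M3FN" variant, which keeps the all-ones exponent for normal numbers (reserving only one NaN pattern) and therefore tops out at 448 instead of 480:

```python
def max_finite(exp_bits, man_bits, ieee=True):
    """Largest finite value for a 1-sign/exp/man floating-point layout.

    IEEE formats reserve the all-ones exponent for Inf/NaN.
    The FP8 E4M3FN variant keeps that exponent for normal numbers and
    only drops the all-ones mantissa (NaN), so its max is (2 - 2*2^-m).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee:
        max_exp = (2 ** exp_bits - 2) - bias       # all-ones exponent reserved
        return (2 - 2 ** -man_bits) * 2.0 ** max_exp
    max_exp = (2 ** exp_bits - 1) - bias           # E4M3FN: all-ones exponent usable
    return (2 - 2 * 2 ** -man_bits) * 2.0 ** max_exp

print(max_finite(8, 23))             # FP32       -> ~3.40e38
print(max_finite(5, 10))             # FP16       -> 65504.0
print(max_finite(8, 7))              # BF16       -> ~3.39e38
print(max_finite(4, 3, ieee=False))  # FP8 E4M3FN -> 448.0
print(max_finite(5, 2))              # FP8 E5M2   -> 57344.0
```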
Throughput Comparison
# Theoretical TFLOPS by precision (NVIDIA GPUs)
#
# GPU        FP32   TF32   FP16   BF16   FP8
# ---------------------------------------------------
# A100       19.5   156    312    312    -
# H100       67     ~500   990    990    1,979
# RTX 4090   82.6   -      330    330    660
#
# FP16 and BF16 have identical throughput on the same hardware
# FP8 doubles throughput again on Hopper/Ada GPUs
# LLM inference tokens/sec comparison (Llama-3-8B, single GPU)
#
# Format   Model Size   A100 tok/s   H100 tok/s
# -----------------------------------------------
# FP32 32 GB ~30 ~45 (often doesn't fit)
# FP16 16 GB ~127 ~200
# BF16 16 GB ~127 ~200
# FP8 8 GB N/A ~380
# INT8 8 GB ~220 ~350
# INT4 4 GB ~350 ~550
#
# FP8 on H100 nearly doubles FP16 throughput
# But requires Hopper Tensor Cores
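The model sizes in the table are just parameter count times bytes per parameter (weights only; the KV cache and activations need additional VRAM on top). A rough estimator:

```python
# Bytes per parameter for each storage format (INT4 packs two params per byte)
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1, "int4": 0.5}

def weight_gb(params_billions, fmt):
    """Approximate weight memory in GB (decimal), weights only."""
    return params_billions * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    print(f"Llama-3-8B in {fmt}: {weight_gb(8, fmt):g} GB")
```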
# Benchmark on your hardware
import torch, time

def precision_benchmark(dtype, label, size=4096, iters=200):
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    torch.mm(a, b)  # warmup so launch/autotune cost isn't timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    tflops = 2 * size**3 * iters / elapsed / 1e12
    print(f"{label}: {tflops:.1f} TFLOPS ({elapsed/iters*1000:.2f} ms)")

precision_benchmark(torch.float32, "FP32")
precision_benchmark(torch.float16, "FP16")
precision_benchmark(torch.bfloat16, "BF16")
Accuracy Tradeoffs
# FP16 overflow problem:
# Values > 65,504 become Inf
# Gradient scaling needed during training
# Some activations in large models naturally exceed this range
import torch
# Demonstrate FP16 overflow
x = torch.tensor(70000.0)
print(x.half()) # tensor(inf) — data lost!
print(x.bfloat16()) # tensor(70144.) — slightly imprecise but finite
# BF16 precision loss:
# Only 7 mantissa bits vs 10 in FP16
# Fine for weights and activations (rarely need 4+ decimal digits)
# Problematic for values that need exact representation
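The mantissa tradeoff is easy to see without GPU dtypes: BF16 conversion can be emulated in pure Python by rounding a float32 bit pattern down to its top 16 bits (round-to-nearest-even), a sketch:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to BF16 by keeping the top 16 bits of its float32
    encoding, with round-to-nearest-even on the discarded lower half."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

print(to_bf16(1.001))    # 1.0, the 0.001 is below BF16's ~2.4-digit resolution
print(to_bf16(70000.0))  # 70144.0, finite where FP16 would overflow to inf
```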
# Practical accuracy comparison (perplexity on validation set):
# Llama-3-8B:
# FP32: 6.14 (baseline)
# BF16: 6.14 (identical — within noise)
# FP16: 6.14 (identical for well-trained models)
# FP8: 6.18 (~0.7% degradation, acceptable)
# INT8: 6.17 (~0.5% degradation)
# INT4: 6.45 (~5% degradation, model-dependent)
# When FP16 causes problems:
# 1. Models trained in FP32 only (rare for modern LLMs)
# 2. Very deep networks with large activation values
# 3. Finetuning with high learning rates
# Solution: Use BF16 instead — same speed, no overflow
Choosing Precision for Your Workload
# Decision tree:
#
# Q: Do you have an FP8-capable GPU (Hopper or Ada Lovelace, e.g. H100, L40S, RTX 4090)?
# └─ Yes → Use FP8 for inference (2x speed, minimal quality loss)
# └─ vllm serve --dtype float16 --quantization fp8
#
# Q: Model trained in BF16?
# └─ Yes → Serve in BF16 (matches training precision)
# └─ vllm serve --dtype bfloat16
#
# Q: Model trained in FP16?
# └─ Yes → Serve in FP16
# └─ vllm serve --dtype float16
#
# Q: Seeing NaN or Inf in outputs?
# └─ Switch from FP16 to BF16 (fixes overflow)
#
# Q: Need maximum throughput, can tolerate ~1% quality loss?
# └─ Use INT8 quantization (AWQ or GPTQ)
# └─ Works on any GPU with INT8 Tensor Cores (Turing and newer)
#
# Q: Need to fit larger model in limited VRAM?
# └─ Use INT4 quantization (AWQ-4bit or GPTQ-4bit)
# └─ ~5% quality loss, 4x memory reduction
# vLLM precision configuration
# FP16 (default, safe choice)
vllm serve meta-llama/Llama-3-70B-Instruct --dtype float16
# BF16 (preferred for models trained in BF16)
vllm serve meta-llama/Llama-3-70B-Instruct --dtype bfloat16
# FP8 on Hopper/Ada GPUs (maximum throughput)
vllm serve meta-llama/Llama-3-70B-Instruct --quantization fp8
# Auto-detect based on model config
vllm serve meta-llama/Llama-3-70B-Instruct --dtype auto
GPU Support Matrix
# Precision support by GPU architecture
#
# Precision   Volta(V100)   Turing(T4)   Ampere(A100)   Hopper(H100)
# -----------------------------------------------------------------
# FP32 CUDA CUDA CUDA CUDA
# TF32 No No Tensor Core Tensor Core
# FP16 Tensor Core Tensor Core Tensor Core Tensor Core
# BF16 No No Tensor Core Tensor Core
# FP8 No No No Tensor Core
# INT8 No Tensor Core Tensor Core Tensor Core
# INT4 No Tensor Core Tensor Core Tensor Core
#
# BF16 requires Ampere or newer
# FP8 requires Hopper or Ada Lovelace
# FP16 works on everything since Volta
# Check your GPU's capabilities
python3 -c "
import torch
props = torch.cuda.get_device_properties(0)
print(f'GPU: {props.name}')
print(f'Compute capability: {props.major}.{props.minor}')
print(f'BF16 support: {props.major >= 8}')
print(f'FP8 support: {(props.major, props.minor) >= (8, 9)}')
"
Precision format directly determines inference throughput and model quality on your GPU server. See real throughput numbers in our token benchmarks. Deploy with the right precision in vLLM using the production guide. Set up PyTorch with our installation guide. Monitor utilization via our monitoring setup. Explore more benchmarks and tutorials.
Precision-Optimized GPU Servers
GigaGPU dedicated servers with RTX 6000 Pro GPUs supporting FP16, BF16, and FP8. Run inference at the speed your precision allows.
Browse GPU Servers