Can RTX 5090 Run a 70B Model in FP16?


Can RTX 5090 Run 70B in FP16?

No, the RTX 5090 cannot run a 70B parameter model in FP16. A 70B model at FP16 requires approximately 140 GB of VRAM for weights alone, and the RTX 5090 has 32 GB of GDDR7. You would need roughly 4.4 RTX 5090s to hold the model weights. However, the 5090 can run 70B at 4-bit quantization with limited context, making it the best single consumer GPU for this task on a dedicated GPU server.

The RTX 5090 is NVIDIA’s flagship Blackwell consumer GPU with 32 GB of GDDR7 at approximately 1,792 GB/s bandwidth. This makes it the highest-VRAM consumer card available, but 70B in FP16 remains firmly in data center GPU territory.

The VRAM Math: 32 GB vs 140 GB

Here is a clear breakdown of why FP16 does not work:

| Component | FP16 Size | INT8 Size | 4-bit Size |
|---|---|---|---|
| 70B model weights | 140 GB | 70 GB | ~38 GB |
| KV cache (2K context) | ~2.5 GB | ~2.5 GB | ~2.5 GB |
| Activation memory | ~1 GB | ~1 GB | ~1 GB |
| Total required | ~143 GB | ~73 GB | ~41 GB |
| RTX 5090 VRAM | 32 GB | 32 GB | 32 GB |
| Deficit | -111 GB | -41 GB | -9 GB |

Even INT8 quantization (which preserves very high quality) needs 73 GB, more than double the 5090’s capacity. Standard 4-bit quantization at ~41 GB also exceeds 32 GB, though aggressive 3-bit quantization can squeeze the model in. See our LLaMA 3 VRAM requirements guide for details on each precision level.
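The arithmetic behind the table is plain bytes-per-parameter math. A quick sketch — the ~3.5 GB of KV-cache-plus-activation overhead is the rough estimate from the table, and 4.3 effective bits per weight is an assumption reflecting GPTQ's per-group scale overhead:

```python
def weight_vram_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory: params (billions) x bits per param / 8 -> GB."""
    return params_b * bits_per_param / 8

VRAM_5090_GB = 32
OVERHEAD_GB = 2.5 + 1.0  # KV cache at 2K context + activations, per the table

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4.3)]:
    total = weight_vram_gb(70, bits) + OVERHEAD_GB
    print(f"{label}: ~{total:.0f} GB total, deficit ~{total - VRAM_5090_GB:.0f} GB")
```

The billions in the parameter count cancel against the giga in GB, which is why the formula is so compact: 70 × 16 / 8 = 140 GB at FP16.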

What 70B Configurations Fit on 32 GB?

| Quantization | Weight Size | Total with KV | Fits 32 GB? | Quality vs FP16 |
|---|---|---|---|---|
| FP16 | 140 GB | ~143 GB | No | 100% |
| FP8 | 70 GB | ~73 GB | No | ~99% |
| INT8 | 70 GB | ~73 GB | No | ~98% |
| GPTQ 4-bit | ~38 GB | ~41 GB | No | ~94% |
| GGUF Q3_K_M | ~32 GB | ~34 GB | No | ~89% |
| GGUF Q2_K | ~26 GB | ~28 GB | Yes (tight) | ~83% |
| GGUF IQ3_XXS | ~28 GB | ~30 GB | Yes (minimal ctx) | ~86% |

The RTX 5090 can fit 70B at 2-3 bit quantization. GGUF IQ3_XXS is the best balance, offering roughly 86% of FP16 quality with a short context window. Q2_K fits more comfortably but quality drops further. Read our quantization format guide for details on each format.
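To sanity-check the "fits" column, you can fold each format's weight size together with the same ~3.5 GB KV-plus-activation overhead (a rough assumption carried over from the breakdown above; real KV usage scales with context length):

```python
KV_PLUS_ACT_GB = 3.5  # ~2.5 GB KV at 2K ctx + ~1 GB activations (rough estimate)
VRAM_GB = 32

weights_gb = {  # weight sizes from the table above
    "FP16": 140, "FP8": 70, "INT8": 70,
    "GPTQ 4-bit": 38, "Q3_K_M": 32, "Q2_K": 26, "IQ3_XXS": 28,
}
for name, w in weights_gb.items():
    total = w + KV_PLUS_ACT_GB
    print(f"{name:10s} ~{total:5.1f} GB -> {'fits' if total <= VRAM_GB else 'too big'}")
```

Note how close IQ3_XXS sits to the limit (~31.5 of 32 GB): that is why it only fits with a minimal context window.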

Performance at Reduced Precision

Expected performance for 70B models on the RTX 5090:

| Configuration | Prompt (tok/s) | Generation (tok/s) | Context |
|---|---|---|---|
| 70B IQ3_XXS (all GPU) | ~80 | ~12-15 | ~1024 |
| 70B Q2_K (all GPU) | ~90 | ~14-17 | ~2048 |
| 70B Q4_K_M + CPU offload | ~30 | ~5-7 | ~2048 |

At 12-17 tok/s with extreme quantization, the 5090 delivers a usable but compromised experience for 70B. The Blackwell architecture’s high bandwidth helps significantly compared to older cards. Check our tokens per second benchmark for real-time comparisons.
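Those generation figures are consistent with decoding being memory bound: each generated token must stream roughly the full weight file from VRAM, so bandwidth divided by weight size gives a hard ceiling. A back-of-envelope model, not a benchmark:

```python
BANDWIDTH_GBPS = 1792  # RTX 5090 GDDR7 memory bandwidth

for name, weights_gb in [("IQ3_XXS", 28), ("Q2_K", 26)]:
    ceiling = BANDWIDTH_GBPS / weights_gb
    print(f"{name}: theoretical ceiling ~{ceiling:.0f} tok/s")
```

Measured throughput of 12-17 tok/s sits well below the ~64-69 tok/s ceiling because low-bit GGUF kernels spend significant compute on dequantizing weights, not just reading them.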

What Models Can Run in FP16 on RTX 5090?

The 5090’s 32 GB excels at FP16 inference for models up to about 14-15B parameters:

| Model | FP16 VRAM | Fits 32 GB FP16? | Gen Speed |
|---|---|---|---|
| LLaMA 3 8B | ~16 GB | Yes (with batching) | ~70-80 tok/s |
| Mistral 7B | ~14 GB | Yes (comfortable) | ~80-90 tok/s |
| Qwen2 14B | ~28 GB | Yes (tight) | ~35-40 tok/s |
| Phi-3 14B | ~28 GB | Yes (tight) | ~35-40 tok/s |
| LLaMA 3 70B | ~140 GB | No | N/A at FP16 |

For models up to 14B, the RTX 5090 is exceptional. 32 GB gives you FP16 quality plus room for long context and batching. See related pages on Qwen VRAM requirements and Phi VRAM requirements.
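The same two-bytes-per-parameter rule also shows how much of the 32 GB is left over for KV cache and batching with each model (parameter counts are nominal; real checkpoints vary slightly):

```python
VRAM_GB = 32

models_b = {"Mistral 7B": 7, "LLaMA 3 8B": 8, "Qwen2 14B": 14, "Phi-3 14B": 14}
for name, params_b in models_b.items():
    fp16_gb = params_b * 2        # FP16 = 2 bytes per parameter
    headroom = VRAM_GB - fp16_gb  # free VRAM for KV cache, batching, overhead
    print(f"{name}: ~{fp16_gb} GB weights, ~{headroom} GB headroom")
```

An 8B model leaves ~16 GB of headroom, which is what makes long context and multi-sequence batching comfortable; the 14B models leave only ~4 GB, hence "tight" in the table.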

Setup Commands

70B at Extreme Quantization

# Ollama with 70B (will auto-quantize)
ollama run llama3:70b

# llama.cpp with IQ3_XXS for best quality that fits
./llama-server -m llama-3-70b-IQ3_XXS.gguf \
  -ngl 80 -c 1024 --host 0.0.0.0 --port 8080

8B-14B at FP16 (Recommended for 5090)

# vLLM serving LLaMA 3 8B at FP16 with batching
pip install vllm
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype float16 --max-model-len 8192 \
  --max-num-seqs 8 --gpu-memory-utilization 0.90

For deployment guides, visit our Ollama hosting and vLLM hosting pages.
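Once vLLM is up, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal request body looks like this — the prompt text is just an illustration:

```python
import json

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarise GDDR7 in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
# POST this body to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the openai Python client).
print(body)
```

The same request shape works against llama-server's OpenAI-compatible endpoint from the snippet above, just on port 8080.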

GPUs That Can Run 70B in FP16

If you absolutely need 70B in FP16, here are the options:

| Setup | Total VRAM | 70B FP16? | Practical? |
|---|---|---|---|
| RTX 5090 (single) | 32 GB | No | 2-3 bit only |
| 2x RTX 3090 | 48 GB | No | 4-bit OK |
| 2x RTX 5090 | 64 GB | No | 4-bit with full context |
| RTX 6000 Pro 96 GB (single) | 96 GB | No | INT8 or 4-bit |
| 2x RTX 6000 Pro 96 GB | 192 GB | Yes | Full FP16 |
| 4x RTX 5090 | 128 GB | No (marginal) | Overhead issues |

Running 70B in true FP16 requires data center or high-end workstation GPUs. For most practical purposes, INT8 or 4-bit quantization on consumer GPUs delivers 94-98% of FP16 quality at a fraction of the cost. Explore our multi-GPU cluster options or compare costs using the LLM cost calculator. Also see our RTX 3090 70B analysis and best GPU for LLM inference guide.
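A quick way to size these setups is to divide the ~143 GB requirement by the usable VRAM per card. The 0.9 utilization factor below is an assumption, mirroring the --gpu-memory-utilization 0.90 used in the vLLM command earlier:

```python
import math

def gpus_needed(required_gb: float, per_gpu_gb: float,
                usable_frac: float = 0.9) -> int:
    """Minimum card count, reserving a slice of each GPU for framework overhead."""
    return math.ceil(required_gb / (per_gpu_gb * usable_frac))

print(gpus_needed(143, 32))  # RTX 5090s needed for 70B FP16
print(gpus_needed(143, 96))  # 96 GB cards needed for 70B FP16
```

This yields five 5090s rather than four — consistent with the table's note that 4x 128 GB is marginal — and two 96 GB cards.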

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
