Can RTX 5090 Run 70B in FP16?
No, the RTX 5090 cannot run a 70B parameter model in FP16. A 70B model at FP16 requires approximately 140 GB of VRAM for the weights alone, and the RTX 5090 has 32 GB of GDDR7. You would need roughly 4.4 RTX 5090s just to hold the model weights. However, the 5090 can run 70B at aggressive 2-3 bit quantization with a limited context window, making it the best single consumer GPU for that job on a dedicated GPU server.
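The arithmetic behind those figures is simple: FP16 stores 2 bytes per parameter. A quick sanity check (weights only, ignoring KV cache and activations) looks like this:

```bash
# 70 billion parameters x 2 bytes per FP16 parameter
python3 -c "print(70e9 * 2 / 1e9, 'GB of FP16 weights')"   # 140.0 GB
# How many 32 GB cards just to hold the weights?
python3 -c "print(round(140 / 32, 1), 'x RTX 5090')"       # ~4.4
```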
The RTX 5090 is NVIDIA’s flagship Blackwell consumer GPU with 32 GB of GDDR7 at approximately 1,792 GB/s bandwidth. This makes it the highest-VRAM consumer card available, but 70B in FP16 remains firmly in data center GPU territory.
The VRAM Math: 32 GB vs 140 GB
Here is a clear breakdown of why FP16 does not work:
| Component | FP16 Size | INT8 Size | 4-bit Size |
|---|---|---|---|
| 70B model weights | 140 GB | 70 GB | ~38 GB |
| KV cache (2K context) | ~2.5 GB | ~2.5 GB | ~2.5 GB |
| Activation memory | ~1 GB | ~1 GB | ~1 GB |
| Total required | ~143 GB | ~73 GB | ~41 GB |
| RTX 5090 VRAM | 32 GB | 32 GB | 32 GB |
| Deficit | -111 GB | -41 GB | -9 GB |
Even INT8 quantization (which preserves very high quality) needs 73 GB, more than double the 5090’s capacity. Standard 4-bit quantization at ~41 GB also exceeds 32 GB, though aggressive 3-bit quantization can squeeze the model in. See our LLaMA 3 VRAM requirements guide for details on each precision level.
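The same per-parameter arithmetic explains the other columns: INT8 stores roughly 1 byte per parameter and 4-bit roughly half a byte, plus a few GB of overhead for quantization scales and metadata. A rough sketch:

```bash
# INT8: ~1 byte per parameter
python3 -c "print(70e9 * 1 / 1e9, 'GB')"     # 70.0 GB
# 4-bit: ~0.5 bytes per parameter, plus scales/metadata overhead
python3 -c "print(70e9 * 0.5 / 1e9, 'GB')"   # 35.0 GB -> ~38 GB in practice
```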
What 70B Configurations Fit on 32 GB?
| Quantization | Weight Size | Total with KV | Fits 32 GB? | Quality vs FP16 |
|---|---|---|---|---|
| FP16 | 140 GB | ~143 GB | No | 100% |
| FP8 | 70 GB | ~73 GB | No | ~99% |
| INT8 | 70 GB | ~73 GB | No | ~98% |
| GPTQ 4-bit | ~38 GB | ~41 GB | No | ~94% |
| GGUF Q3_K_M | ~32 GB | ~34 GB | No | ~89% |
| GGUF Q2_K | ~26 GB | ~28 GB | Yes (tight) | ~83% |
| GGUF IQ3_XXS | ~28 GB | ~30 GB | Yes (minimal ctx) | ~86% |
The RTX 5090 can fit 70B at 2-3 bit quantization. GGUF IQ3_XXS is the best balance, offering roughly 86% of FP16 quality with a short context window. Q2_K fits more comfortably but quality drops further. Read our quantization format guide for details on each format.
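Before committing to a particular quantization, it is worth confirming that the GGUF file you downloaded is actually smaller than the VRAM you have free. A minimal check (the filename here is the same IQ3_XXS build used in the setup commands below):

```bash
# Compare the GGUF file size against total and free VRAM before loading it fully on-GPU
ls -lh llama-3-70b-IQ3_XXS.gguf
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```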
Performance at Reduced Precision
Expected performance for 70B models on the RTX 5090:
| Configuration | Prompt (tok/s) | Generation (tok/s) | Context |
|---|---|---|---|
| 70B IQ3_XXS (all GPU) | ~80 | ~12-15 | ~1024 |
| 70B Q2_K (all GPU) | ~90 | ~14-17 | ~2048 |
| 70B Q4_K_M + CPU offload | ~30 | ~5-7 | ~2048 |
At 12-17 tok/s with extreme quantization, the 5090 delivers a usable but compromised experience for 70B. The Blackwell architecture’s high bandwidth helps significantly compared to older cards. Check our tokens per second benchmark for real-time comparisons.
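If you want to verify these numbers on your own hardware, llama.cpp ships a benchmarking tool, llama-bench, which reports prompt-processing and generation throughput separately. A typical invocation (not the only option) looks like this:

```bash
# Benchmark prompt processing (-p tokens) and generation (-n tokens) with all layers on GPU
./llama-bench -m llama-3-70b-IQ3_XXS.gguf -ngl 99 -p 512 -n 128
```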
What Models Can Run in FP16 on RTX 5090?
The 5090’s 32 GB excels at FP16 inference for models up to about 14-15B parameters:
| Model | FP16 VRAM | Fits 32 GB FP16? | Gen Speed |
|---|---|---|---|
| LLaMA 3 8B | ~16 GB | Yes (with batching) | ~70-80 tok/s |
| Mistral 7B | ~14 GB | Yes (comfortable) | ~80-90 tok/s |
| Qwen2.5 14B | ~28 GB | Yes (tight) | ~35-40 tok/s |
| Phi-3 14B | ~28 GB | Yes (tight) | ~35-40 tok/s |
| LLaMA 3 70B | ~140 GB | No | N/A at FP16 |
For models up to 14B, the RTX 5090 is exceptional. 32 GB gives you FP16 quality plus room for long context and batching. See related pages on Qwen VRAM requirements and Phi VRAM requirements.
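The headroom claim is easy to quantify for LLaMA 3 8B, which uses grouped-query attention with 32 layers, 8 KV heads, and a head dimension of 128; at FP16 the KV cache works out to about 128 KiB per token:

```bash
# FP16 KV cache for LLaMA 3 8B: 2 (K+V) x 32 layers x 8 KV heads x 128 dims x 2 bytes
python3 -c "print(2*32*8*128*2 / 1024, 'KiB per token')"                  # 128.0 KiB
python3 -c "print(2*32*8*128*2 * 8192 / 2**30, 'GiB for an 8K context')"  # ~1.0 GiB
```

Even an 8K context costs only about 1 GiB per sequence, so most of the ~16 GB left after the weights can go toward longer contexts or batched requests.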
Setup Commands
70B at Extreme Quantization
```bash
# Ollama with 70B (pulls a pre-quantized ~4-bit build by default; layers that
# do not fit in 32 GB of VRAM are offloaded to system RAM)
ollama run llama3:70b

# llama.cpp with IQ3_XXS, the best quality that fits entirely on the GPU
./llama-server -m llama-3-70b-IQ3_XXS.gguf \
  -ngl 80 -c 1024 --host 0.0.0.0 --port 8080
```
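Once llama-server is up, it exposes an OpenAI-compatible HTTP API on the port you chose, so a quick smoke test can be as simple as the following (the prompt is just an example):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'
```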
8B-14B at FP16 (Recommended for 5090)
```bash
# vLLM serving LLaMA 3 8B at FP16 with batching
pip install vllm
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype float16 --max-model-len 8192 \
  --max-num-seqs 8 --gpu-memory-utilization 0.90
```
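vLLM also serves an OpenAI-compatible API, by default on port 8000, so the same kind of curl check works (the model name must match the one you served; the prompt is just an example):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Summarize FP16 vs 4-bit in one sentence."}]}'
```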
For deployment guides, visit our Ollama hosting and vLLM hosting pages.
GPUs That Can Run 70B in FP16
If you absolutely need 70B in FP16, here are the options:
| Setup | Total VRAM | 70B FP16? | Practical? |
|---|---|---|---|
| RTX 5090 (single) | 32 GB | No | 2-3 bit only |
| 2x RTX 3090 | 48 GB | No | 4-bit OK |
| 2x RTX 5090 | 64 GB | No | 4-bit with full context |
| RTX 6000 Pro 96 GB (single) | 96 GB | No | INT8 or 4-bit |
| 2x RTX 6000 Pro 96 GB | 192 GB | Yes | Full FP16 |
| 4x RTX 5090 | 128 GB | No (marginal) | Overhead issues |
Running 70B in true FP16 requires data center or high-end workstation GPUs. For most practical purposes, INT8 or 4-bit quantization on much cheaper hardware delivers 94-98% of FP16 quality at a fraction of the cost. Explore our multi-GPU cluster options or compare costs using the LLM cost calculator. Also see our RTX 3090 70B analysis and best GPU for LLM inference guide.
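On a dual-GPU setup with enough combined VRAM (for example the 2x RTX 6000 Pro row above), vLLM can shard the weights across cards with tensor parallelism. A sketch of the FP16 launch, assuming both GPUs are visible to the process:

```bash
# Split the FP16 70B weights across two GPUs with tensor parallelism
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --dtype float16 --tensor-parallel-size 2 \
  --max-model-len 8192 --gpu-memory-utilization 0.90
```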
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers