Yes, the RTX 5090 can run LLaMA 3 70B in INT4 on a single GPU. With 32GB of GDDR7 VRAM, it is one of the few consumer cards that can hold the entire LLaMA 3 70B model in VRAM when quantised to 4-bit precision. This unlocks 70B-class reasoning on a single card without multi-GPU complexity.
The Short Answer
YES. LLaMA 3 70B in INT4 (GPTQ/AWQ) needs ~28GB, fitting within the 5090’s 32GB.
LLaMA 3 70B has 70 billion parameters. In FP16, the weights alone consume roughly 140GB, far beyond any single GPU. INT4 quantisation cuts the raw weight size to about 35GB, and optimised formats such as GPTQ and AWQ with group quantisation bring the effective footprint down to approximately 26-28GB. Adding a KV cache for a 2048-token context puts total VRAM at around 29-30GB, which the RTX 5090 handles with 2-3GB to spare. For more detail, see our LLaMA 3 VRAM requirements guide.
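The weight arithmetic above is simple to reproduce: bytes per weight is the bit width divided by 8, so a sketch of the back-of-envelope calculation (decimal GB, weights only, ignoring quantisation scales and overhead) looks like:

```shell
# Back-of-envelope weight memory: params (billions) x bits per weight / 8
PARAMS_B=70
echo "FP16 weights: $(( PARAMS_B * 16 / 8 )) GB"   # 140 GB
echo "INT4 weights: $(( PARAMS_B * 4 / 8 )) GB"    # 35 GB
```

The gap between the raw 35GB and the 26-28GB practical figures comes from the packed storage layouts the GPTQ/AWQ formats use.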
VRAM Analysis
| Configuration | Weights | KV Cache (2K ctx) | Total | RTX 5090 (32GB) |
|---|---|---|---|---|
| LLaMA 3 70B FP16 | ~140GB | ~3GB | ~143GB | No |
| LLaMA 3 70B INT8 | ~70GB | ~2.5GB | ~72.5GB | No |
| LLaMA 3 70B INT4 (AWQ) | ~26GB | ~2GB | ~28GB | Fits |
| LLaMA 3 70B INT4 (GPTQ) | ~27GB | ~2GB | ~29GB | Fits (tight) |
| LLaMA 3 70B INT4 (4K ctx) | ~26GB | ~4GB | ~30GB | Very tight |
Context length is the main constraint. At 2048 tokens, the fit is comfortable. At 4096, you are pushing close to the limit. For longer contexts, you would need to reduce KV cache precision or use a smaller model. The AWQ format tends to be slightly more compact than GPTQ and is recommended for this card.
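To see why KV cache grows linearly with context, you can compute the raw cache from the model dimensions (LLaMA 3 70B: 80 layers, 8 KV heads via grouped-query attention, head dimension 128). Note this gives only the raw tensor size; serving frameworks pre-allocate and add overhead, which is why the practical figures in the table above are higher:

```shell
# Raw FP16 KV cache for LLaMA 3 70B at a 2048-token context
LAYERS=80; KV_HEADS=8; HEAD_DIM=128; CTX=2048; DTYPE_BYTES=2
BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * DTYPE_BYTES ))  # 2x: keys + values
echo "Raw KV cache: $(( BYTES / 1024 / 1024 )) MiB"                # 640 MiB
```

Doubling `CTX` doubles the cache, which is exactly the 2K-vs-4K squeeze the table shows.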
Performance Benchmarks
| GPU | LLaMA 3 70B INT4 (tok/s) | Notes |
|---|---|---|
| RTX 3090 (24GB) | N/A | Insufficient VRAM |
| RTX 5080 (16GB) | N/A | Insufficient VRAM |
| RTX 5090 (32GB) | ~18-22 | Single GPU, batch 1 |
| 2x RTX 3090 (48GB) | ~15-18 | Tensor parallel |
At 18-22 tokens per second, the RTX 5090 provides usable speed for interactive chat with a 70B model. This is slower than running 7B-13B models at 80+ tok/s, but 70B models deliver significantly better reasoning, coding, and instruction-following quality. The single-GPU setup eliminates the complexity and latency of tensor parallelism. More throughput comparisons are available on our benchmarks page.
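To put those throughput numbers in user-facing terms, here is a rough latency sketch for a typical 300-token chat reply (decode time only, ignoring prompt processing; the 95 tok/s figure for 8B is from the alternative discussed below):

```shell
# Rough interactive latency: reply tokens / decode speed
REPLY_TOKENS=300
echo "70B at 20 tok/s: ~$(( REPLY_TOKENS / 20 )) s"   # ~15 s
echo "8B at 95 tok/s:  ~$(( REPLY_TOKENS / 95 )) s"   # ~3 s
```

Fifteen seconds per reply is workable for chat, but it is the trade you make for 70B-class quality on one card.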
Setup Guide
Use vLLM with AWQ quantisation for the best production experience:
```bash
# vLLM with AWQ quantised 70B
vllm serve TheBloke/Llama-3-70B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
```
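Once the server is up, you can smoke-test it through the OpenAI-compatible API that vLLM exposes. This is a sketch, not a definitive client: the `model` value must match whatever you passed to `vllm serve`, and the prompt is only an illustration:

```shell
# Quick smoke test against the local vLLM OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Llama-3-70B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32
  }'
```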
For local testing with Ollama:
```bash
# Ollama with INT4 quantisation
ollama run llama3:70b-instruct-q4_K_M
```
Keep `--max-model-len` at 2048 initially and increase gradually while monitoring VRAM usage with `nvidia-smi`. Setting `--gpu-memory-utilization 0.95` maximises the space available for KV cache.
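For the VRAM monitoring mentioned above, nvidia-smi's query flags give a compact readout you can leave running in a second terminal while ramping the context length (requires the NVIDIA driver to be installed):

```shell
# Poll VRAM usage once per second while increasing --max-model-len
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'
```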
Recommended Alternative
If you need longer context (8K+) with 70B, consider multi-GPU setups with two RTX 3090 cards for 48GB combined VRAM. For a single-GPU alternative with better speed, LLaMA 3 8B in FP16 delivers 95+ tok/s on the 5090 with far better quality per token than people expect from smaller models.
For other 5090 workloads, see whether it can run Mixtral 8x7B or multiple LLMs at once. For DeepSeek on this card, check the DeepSeek + Whisper combo guide. Browse all configurations on our dedicated GPU hosting page or in the GPU Comparisons category.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers