
Can RTX 5090 Run LLaMA 3 70B in INT4?


Yes, the RTX 5090 can run LLaMA 3 70B in INT4 on a single GPU. With 32GB GDDR7 VRAM, the RTX 5090 is one of the few consumer-class cards that can fit the LLaMA 3 70B model entirely in VRAM when quantised to 4-bit precision. This unlocks 70B-class reasoning on a single card without multi-GPU complexity.

The Short Answer

YES. LLaMA 3 70B in INT4 (GPTQ/AWQ) needs ~28GB, fitting within the 5090’s 32GB.

LLaMA 3 70B has 70 billion parameters. In FP16, the weights alone consume approximately 140GB, far beyond any single GPU. INT4 quantisation (4-bit) cuts the raw weight footprint to roughly 35GB, and optimised formats such as GPTQ and AWQ with group quantisation bring the effective footprint to approximately 26-28GB. Adding KV cache for a 2048-token context puts total VRAM around 29-30GB, which the RTX 5090 handles with 2-3GB to spare. For more detail, see our LLaMA 3 VRAM requirements guide.
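The raw figures above follow directly from the parameter count; a quick back-of-envelope sketch (the tighter 26-28GB numbers come from the published GPTQ/AWQ checkpoints, not from this formula):

```python
PARAMS = 70e9  # LLaMA 3 70B parameter count

def weight_gb(bits_per_param: float) -> float:
    """Raw weight footprint in GB for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP16: {weight_gb(16):.0f} GB")  # 140 GB
print(f"INT8: {weight_gb(8):.0f} GB")   # 70 GB
print(f"INT4: {weight_gb(4):.0f} GB")   # 35 GB raw, before format-level packing
```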

VRAM Analysis

| Configuration | Weights | KV Cache | Total | Fits on RTX 5090 (32GB)? |
|---|---|---|---|---|
| LLaMA 3 70B FP16 | ~140GB | ~3GB (2K ctx) | ~143GB | No |
| LLaMA 3 70B INT8 | ~70GB | ~2.5GB (2K ctx) | ~72.5GB | No |
| LLaMA 3 70B INT4 (AWQ) | ~26GB | ~2GB (2K ctx) | ~28GB | Fits |
| LLaMA 3 70B INT4 (GPTQ) | ~27GB | ~2GB (2K ctx) | ~29GB | Fits (tight) |
| LLaMA 3 70B INT4 (AWQ, 4K ctx) | ~26GB | ~4GB | ~30GB | Very tight |

Context length is the main constraint. At 2048 tokens, the fit is comfortable. At 4096, you are pushing close to the limit. For longer contexts, you would need to reduce KV cache precision or use a smaller model. The AWQ format tends to be slightly more compact than GPTQ and is recommended for this card.
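The table's KV cache figures scale roughly linearly with context, about 1GB per 1024 tokens. A rough extrapolation using the ~26GB AWQ weight figure from the table shows where the 32GB ceiling lands (runtime overhead will eat into this in practice, so real headroom ends sooner):

```python
WEIGHTS_GB = 26.0   # AWQ INT4 weights (figure from the table above)
VRAM_GB = 32.0      # RTX 5090 capacity
KV_GB_PER_1K = 1.0  # extrapolated from the table: ~2GB at 2K ctx, ~4GB at 4K

def total_vram(ctx_tokens: int) -> float:
    """Estimated total VRAM in GB at a given context length."""
    return WEIGHTS_GB + KV_GB_PER_1K * ctx_tokens / 1024

for ctx in (2048, 4096, 6144, 8192):
    fits = total_vram(ctx) <= VRAM_GB
    print(f"{ctx} ctx: {total_vram(ctx):.0f} GB -> {'fits' if fits else 'does not fit'}")
```

By this estimate the theoretical cutoff is around 6K tokens, consistent with the article's advice to treat 4K as the practical limit.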

Performance Benchmarks

| GPU | LLaMA 3 70B INT4 (tok/s) | Notes |
|---|---|---|
| RTX 3090 (24GB) | N/A | Insufficient VRAM |
| RTX 5080 (16GB) | N/A | Insufficient VRAM |
| RTX 5090 (32GB) | ~18-22 | Single GPU, batch 1 |
| 2x RTX 3090 (48GB) | ~15-18 | Tensor parallel |

At 18-22 tokens per second, the RTX 5090 provides usable speed for interactive chat with a 70B model. This is slower than running 7B-13B models at 80+ tok/s, but 70B models deliver significantly better reasoning, coding, and instruction-following quality. The single-GPU setup eliminates the complexity and latency of tensor parallelism. More throughput comparisons are available on our benchmarks page.
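To put 18-22 tok/s in concrete terms, here is a rough interactive-latency estimate (the reply lengths are illustrative, not benchmarked):

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to generate a reply at a given throughput."""
    return tokens / tok_per_s

# Short chat replies arrive in seconds; long-form answers take under half a minute.
for reply_tokens in (100, 500):
    fast, slow = seconds_for(reply_tokens, 22), seconds_for(reply_tokens, 18)
    print(f"{reply_tokens}-token reply: ~{fast:.0f}-{slow:.0f}s")
```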

Setup Guide

Use vLLM with AWQ quantisation for the best production experience:

# vLLM with AWQ quantised 70B
vllm serve TheBloke/Llama-3-70B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
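Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal stdlib-only client sketch, assuming the host, port, and model name from the command above:

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": "TheBloke/Llama-3-70B-Instruct-AWQ",  # served checkpoint name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base: str = "http://localhost:8000") -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the vLLM server above to be running):
# print(chat("Summarise the benefits of INT4 quantisation."))
```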

For local testing with Ollama:

# Ollama with INT4 quantisation
ollama run llama3:70b-instruct-q4_K_M

Keep --max-model-len at 2048 initially and increase gradually while monitoring VRAM usage with nvidia-smi. Setting --gpu-memory-utilization 0.95 maximises available KV cache space.

If you need longer context (8K+) with 70B, consider a multi-GPU setup such as two RTX 3090 cards for 48GB of combined VRAM. For a faster single-GPU alternative, LLaMA 3 8B in FP16 delivers 95+ tok/s on the 5090, with stronger quality than its size suggests.

For other 5090 workloads, see whether it can run Mixtral 8x7B or multiple LLMs at once. For DeepSeek on this card, check the DeepSeek + Whisper combo guide. Browse all configurations on our dedicated GPU hosting page or in the GPU Comparisons category.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
