
8B LLM VRAM Requirements: Llama 3, Mistral 7B, Qwen 2.5 7B

Exact VRAM math for 8B-class LLMs at FP16, FP8 and AWQ INT4, including KV cache formulas and which GPUs fit at which context lengths.

The 8B class (Llama 3.1 8B, Mistral 7B v0.3, Qwen 2.5 7B, Gemma 2 9B) is where most self-hosted LLM traffic sits: small enough to fit on a 16 GB card like the RTX 5060 Ti, yet large enough to do real work. This article gives the exact weight math, the KV cache formula, and a GPU compatibility table, all measured on our UK dedicated GPU servers.


Weight size at each precision

A nominal 8B model is 8 billion parameters; Llama 3.1 8B is 8.03B, Mistral 7B is 7.24B, Qwen 2.5 7B is 7.62B, Gemma 2 9B is 9.24B. Weight footprint follows directly from bytes per parameter:

| Model | Params | FP16 | FP8 | AWQ INT4 |
|---|---|---|---|---|
| Llama 3.1 8B | 8.03B | 16.1 GB | 8.1 GB | 5.4 GB |
| Mistral 7B v0.3 | 7.24B | 14.5 GB | 7.3 GB | 4.9 GB |
| Qwen 2.5 7B | 7.62B | 15.2 GB | 7.7 GB | 5.1 GB |
| Gemma 2 9B | 9.24B | 18.4 GB | 9.2 GB | 6.1 GB |

Note: Llama 3.1 8B at FP16 is already 16.1 GB, so it cannot fit on a 16 GB card at full precision even before activations and KV cache are counted; FP8 or AWQ is mandatory on the 5060 Ti. Gemma 2 9B (18.4 GB at FP16) likewise needs at least FP8 to fit in 16 GB.
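The weight column is a one-line calculation; here is a minimal sketch. The FP16 figures reproduce exactly; the FP8 and AWQ columns in the table run slightly above the pure bytes-per-parameter math because real checkpoints also store quantization scales and keep the embedding/output layers in higher precision.

```python
# Weight footprint in GB (decimal): parameters in billions * bytes per
# parameter. FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes nominal.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

print(round(weight_gb(8.03, 2), 1))  # Llama 3.1 8B @ FP16 -> 16.1
print(round(weight_gb(7.24, 2), 1))  # Mistral 7B v0.3 @ FP16 -> 14.5
```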

KV cache formula

The formula is: KV per token = 2 * bytes * num_layers * num_kv_heads * head_dim. The leading 2 covers the separate K and V tensors; at FP16, bytes = 2.

| Model | Layers | KV heads (GQA) | Head dim | KV/token (FP16) |
|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | 131 KB |
| Mistral 7B v0.3 | 32 | 8 | 128 | 131 KB |
| Qwen 2.5 7B | 28 | 4 | 128 | 57 KB |
| Gemma 2 9B | 42 | 8 | 256 | 172 KB* |

*Gemma 2 uses head_dim 256 and interleaves 4k sliding-window layers with global layers; the per-token figure counts only the 21 global layers, since local-layer KV stops growing past 4k tokens.

Qwen 2.5 7B is notably KV-efficient thanks to aggressive GQA: 57 KB/token vs 131 KB for Llama 3.1 8B. This means Qwen supports 2.3x the context per GB of KV cache.
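The per-token figures are easy to sanity-check in code; a quick sketch using the Llama and Qwen configurations from the table:

```python
# KV cache bytes per token: 2 (one K tensor, one V tensor) * bytes per
# element * layers * KV heads * head dim.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_el: int = 2) -> int:
    return 2 * bytes_per_el * layers * kv_heads * head_dim

llama = kv_bytes_per_token(32, 8, 128)  # 131,072 B ~ 131 KB
qwen = kv_bytes_per_token(28, 4, 128)   # 57,344 B ~ 57 KB
print(round(llama / qwen, 1))           # Qwen's context-per-GB advantage
```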

Context length impact

Take Llama 3.1 8B with FP8 weights (8.1 GB). On a 16 GB card that leaves roughly 7.4 GB for KV cache once ~500 MB of CUDA context overhead is deducted:

| Context | KV/sequence (FP16) | Concurrent sequences (Llama 3.1 8B) | Concurrent (Qwen 2.5 7B) |
|---|---|---|---|
| 4,096 | 0.52 GB | ~14 | ~32 |
| 8,192 | 1.05 GB | ~7 | ~16 |
| 16,384 | 2.10 GB | ~3 | ~8 |
| 32,768 | 4.20 GB | ~1 | ~4 |
| 65,536 | 8.40 GB | Does not fit | ~2 |
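The concurrency column is just free VRAM divided by per-sequence KV. A sketch for the Llama case, assuming a 16 GB card, 8.1 GB of FP8 weights, ~0.5 GB CUDA overhead, and 131 KB/token FP16 KV; floor division lands within one sequence of the rounded table figures:

```python
KV_PER_TOKEN = 131072                # Llama 3.1 8B, FP16 KV cache, bytes
FREE_BYTES = (16 - 8.1 - 0.5) * 1e9  # VRAM minus weights and CUDA overhead

for ctx in (4096, 8192, 16384, 32768, 65536):
    fits = int(FREE_BYTES // (ctx * KV_PER_TOKEN))
    print(f"{ctx:>6} ctx: {fits or 'does not fit'}")
```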

FP8 KV cache (available in vLLM 0.6+) halves these numbers with a < 0.1 MMLU drop. For a deep dive see our context budget article.
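In vLLM this is a single engine option; a minimal sketch, with the model ID and context length as illustrative assumptions (check your vLLM version's docs for exact option support):

```python
from vllm import LLM

# FP8 weights plus FP8 KV cache: roughly halves the KV figures above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    quantization="fp8",    # FP8 weight quantization
    kv_cache_dtype="fp8",  # FP8 KV cache (e4m3 by default on CUDA)
    max_model_len=16384,
)
```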

Which GPUs fit 8B LLMs

| GPU | VRAM | FP16 + 8k ctx | FP8 + 8k ctx | AWQ + 32k ctx |
|---|---|---|---|---|
| RTX 3050 8GB | 8 GB | No | Tight (bs=1, 2k ctx) | Yes, bs=1 |
| RTX 4060 Ti 16GB | 16 GB | No | n/a (no FP8 support) | Yes |
| RTX 3090 24GB | 24 GB | Yes, bs=2 | n/a (no native FP8) | Yes, bs=8 |
| RTX 5060 Ti 16GB | 16 GB | No | Yes, bs=6 | Yes, bs=4 |
| RTX 5090 32GB | 32 GB | Yes, bs=4 | Yes, bs=16 | Yes, bs=32 |
| RTX 6000 Pro 96GB | 96 GB | Yes, bs=32 | Yes, bs=64+ | Yes, bs=128+ |

Throughput per card

Decode throughput for Llama 3.1 8B, single-stream (bs=1) and batched (bs=16):

| GPU | Precision | Tokens/s (bs=1) | Tokens/s (bs=16) |
|---|---|---|---|
| RTX 5060 Ti 16GB | FP8 | 100 | ~720 |
| RTX 3090 24GB | FP16 | 90 | ~520 |
| RTX 5090 32GB | FP8 | 175 | ~1,800 |
| RTX 6000 Pro 96GB | FP8 | 140 | ~2,100 |
| H100 80GB | FP8 | 210 | ~3,400 |

Picking the right 8B model

  • Llama 3.1 8B: safest default, 128k context, strong tool use. See benchmark.
  • Qwen 2.5 7B: best KV efficiency, strong maths and coding.
  • Mistral 7B v0.3: compact, 32k context (sliding-window attention was dropped after v0.1), good for low-VRAM work.
  • Gemma 2 9B: highest English MMLU in the class. See Gemma 2 guide.

Host 8B LLMs with headroom for real context

FP8 weights, FP8 KV cache, 16 GB GDDR7. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: Llama 3 8B benchmark, Gemma 9B benchmark, context budget, 32B VRAM requirements, 6 GB VRAM models.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

