
8B LLM VRAM Requirements: Llama 3, Mistral 7B, Qwen 2.5 7B

Exact VRAM math for 8B-class LLMs at FP16, FP8 and AWQ INT4, including KV cache formulas and which GPUs fit at which context lengths.

The 8B class (Llama 3.1 8B, Mistral 7B v0.3, Qwen 2.5 7B, Gemma 2 9B) is where most self-hosted LLM traffic sits: small enough to fit on a 16 GB card like the RTX 5060 Ti, yet large enough to do real work. This article gives the exact weight math, the KV cache formula, and a GPU compatibility table, all measured on our UK dedicated GPU servers.


Weight size at each precision

A nominal 8B model is 8 billion parameters; Llama 3.1 8B is 8.03B, Mistral 7B is 7.24B, Qwen 2.5 7B is 7.62B, Gemma 2 9B is 9.24B. Weight footprint follows directly from bytes per parameter:

| Model | Params | FP16 | FP8 | AWQ INT4 |
|---|---|---|---|---|
| Llama 3.1 8B | 8.03B | 16.1 GB | 8.1 GB | 5.4 GB |
| Mistral 7B v0.3 | 7.24B | 14.5 GB | 7.3 GB | 4.9 GB |
| Qwen 2.5 7B | 7.62B | 15.2 GB | 7.7 GB | 5.1 GB |
| Gemma 2 9B | 9.24B | 18.4 GB | 9.2 GB | 6.1 GB |

Note: Llama 3.1 8B at FP16 is already 16.1 GB, so it cannot fit on a 16 GB card at full precision even before activations and KV cache are counted; FP8 or AWQ is mandatory on the 5060 Ti. Gemma 2 9B (18.4 GB at FP16) likewise needs at least FP8 to fit in 16 GB.
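The weight column is a one-line calculation; here is a minimal sketch. The FP16 figures reproduce exactly; the FP8 and AWQ columns in the table run slightly above the pure bytes-per-parameter math because real checkpoints also store quantization scales and keep the embedding/output layers in higher precision.

```python
# Weight footprint in GB (decimal): parameters in billions * bytes per
# parameter. FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes nominal.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

print(round(weight_gb(8.03, 2), 1))  # Llama 3.1 8B @ FP16 -> 16.1
print(round(weight_gb(7.24, 2), 1))  # Mistral 7B v0.3 @ FP16 -> 14.5
```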

KV cache formula

The formula is: KV per token = 2 * bytes * num_layers * num_kv_heads * head_dim. The leading 2 covers the separate K and V tensors; at FP16, bytes = 2.

| Model | Layers | KV heads (GQA) | Head dim | KV/token (FP16) |
|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | 131 KB |
| Mistral 7B v0.3 | 32 | 8 | 128 | 131 KB |
| Qwen 2.5 7B | 28 | 4 | 128 | 57 KB |
| Gemma 2 9B | 42 | 8 | 256 | 172 KB* |

*Gemma 2 uses head_dim 256 and interleaves 4k sliding-window layers with global layers; the per-token figure counts only the 21 global layers, since local-layer KV stops growing past 4k tokens.

Qwen 2.5 7B is notably KV-efficient thanks to aggressive GQA: 57 KB/token vs 131 KB for Llama 3.1 8B. This means Qwen supports 2.3x the context per GB of KV cache.
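The per-token figures are easy to sanity-check in code; a quick sketch using the Llama and Qwen configurations from the table:

```python
# KV cache bytes per token: 2 (one K tensor, one V tensor) * bytes per
# element * layers * KV heads * head dim.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_el: int = 2) -> int:
    return 2 * bytes_per_el * layers * kv_heads * head_dim

llama = kv_bytes_per_token(32, 8, 128)  # 131,072 B ~ 131 KB
qwen = kv_bytes_per_token(28, 4, 128)   # 57,344 B ~ 57 KB
print(round(llama / qwen, 1))           # Qwen's context-per-GB advantage
```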

Context length impact

Take Llama 3.1 8B with FP8 weights (8.1 GB). On a 16 GB card that leaves roughly 7.4 GB for KV cache once ~500 MB of CUDA context overhead is deducted:

| Context | KV/sequence (FP16) | Concurrent sequences (Llama 3.1 8B) | Concurrent (Qwen 2.5 7B) |
|---|---|---|---|
| 4,096 | 0.52 GB | ~14 | ~32 |
| 8,192 | 1.05 GB | ~7 | ~16 |
| 16,384 | 2.10 GB | ~3 | ~8 |
| 32,768 | 4.20 GB | ~1 | ~4 |
| 65,536 | 8.40 GB | Does not fit | ~2 |
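The concurrency column is just free VRAM divided by per-sequence KV. A sketch for the Llama case, assuming a 16 GB card, 8.1 GB of FP8 weights, ~0.5 GB CUDA overhead, and 131 KB/token FP16 KV; floor division lands within one sequence of the rounded table figures:

```python
KV_PER_TOKEN = 131072                # Llama 3.1 8B, FP16 KV cache, bytes
FREE_BYTES = (16 - 8.1 - 0.5) * 1e9  # VRAM minus weights and CUDA overhead

for ctx in (4096, 8192, 16384, 32768, 65536):
    fits = int(FREE_BYTES // (ctx * KV_PER_TOKEN))
    print(f"{ctx:>6} ctx: {fits or 'does not fit'}")
```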

FP8 KV cache (available in vLLM 0.6+) halves these numbers with a < 0.1 MMLU drop. For a deep dive see our context budget article.
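In vLLM this is a single engine option; a minimal sketch, with the model ID and context length as illustrative assumptions (check your vLLM version's docs for exact option support):

```python
from vllm import LLM

# FP8 weights plus FP8 KV cache: roughly halves the KV figures above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    quantization="fp8",    # FP8 weight quantization
    kv_cache_dtype="fp8",  # FP8 KV cache (e4m3 by default on CUDA)
    max_model_len=16384,
)
```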

Which GPUs fit 8B LLMs

| GPU | VRAM | FP16 + 8k ctx | FP8 + 8k ctx | AWQ + 32k ctx |
|---|---|---|---|---|
| RTX 3050 8GB | 8 GB | No | Tight (bs=1, 2k ctx) | Yes, bs=1 |
| RTX 4060 Ti 16GB | 16 GB | No | n/a (no FP8 support) | Yes |
| RTX 3090 24GB | 24 GB | Yes, bs=2 | n/a (no native FP8) | Yes, bs=8 |
| RTX 5060 Ti 16GB | 16 GB | No | Yes, bs=6 | Yes, bs=4 |
| RTX 5090 32GB | 32 GB | Yes, bs=4 | Yes, bs=16 | Yes, bs=32 |
| RTX 6000 Pro 96GB | 96 GB | Yes, bs=32 | Yes, bs=64+ | Yes, bs=128+ |

Throughput per card

Decode throughput for Llama 3.1 8B, single-stream (bs=1) and batched (bs=16):

| GPU | Precision | Tokens/s (bs=1) | Tokens/s (bs=16) |
|---|---|---|---|
| RTX 5060 Ti 16GB | FP8 | 100 | ~720 |
| RTX 3090 24GB | FP16 | 90 | ~520 |
| RTX 5090 32GB | FP8 | 175 | ~1,800 |
| RTX 6000 Pro 96GB | FP8 | 140 | ~2,100 |
| H100 80GB | FP8 | 210 | ~3,400 |

Picking the right 8B model

  • Llama 3.1 8B: safest default, 128k context, strong tool use. See benchmark.
  • Qwen 2.5 7B: best KV efficiency, strong maths and coding.
  • Mistral 7B v0.3: compact, 32k context (sliding-window attention was dropped after v0.1), good for low-VRAM work.
  • Gemma 2 9B: highest English MMLU in the class. See Gemma 2 guide.

Host 8B LLMs with headroom for real context

FP8 weights, FP8 KV cache, 16 GB GDDR7. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: Llama 3 8B benchmark, Gemma 9B benchmark, context budget, 32B VRAM requirements, 6 GB VRAM models.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

