Code Llama (Meta) ships in 7B, 13B, 34B and 70B parameter variants, plus Python-specialised and instruct-tuned forks. It was the first credible open-weight code model, and even now in 2026 plenty of IDE integrations and CI pipelines target it specifically. Sizing it for self-hosting is straightforward — the math is roughly the same as Llama 2 — but the long-context coding use case has KV cache implications worth a careful look.
Code Llama VRAM by size: 7B → 14 GB FP16 / 5 GB INT4. 13B → 26 GB FP16 / 8 GB INT4. 34B → 68 GB FP16 / 18 GB INT4. 70B → 140 GB FP16 / 40 GB INT4. For long-context (16K) coding workflows, add ~2 GB per concurrent stream. Most teams running Code Llama 13B on dedicated hardware land on an RTX 3090 or RTX 5090.
Headline numbers
| Variant | Params | FP16 | FP8 | AWQ-INT4 | GGUF Q5_K_M |
|---|---|---|---|---|---|
| Code Llama 7B | 6.7B | 13.4 GB | 6.7 GB | 4.5 GB | 5.0 GB |
| Code Llama 13B | 13B | 26 GB | 13 GB | 8.0 GB | 9.5 GB |
| Code Llama 34B | 34B | 68 GB | 34 GB | 18 GB | 24 GB |
| Code Llama 70B | 69B | 138 GB | 69 GB | 40 GB | 49 GB |
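The FP16 column is just parameter count × 2 bytes, and the other columns follow from the bytes stored per weight. As a rough check, here is a small Python sketch that reproduces it; the FP8 and AWQ-INT4 bytes-per-parameter figures are my approximations, not exact file sizes.

```python
# Back-of-the-envelope weight footprint. FP16 = 2 bytes/param reproduces the
# table exactly; the FP8 and AWQ-INT4 factors are approximations, since real
# quantised files also carry scales, zero-points and unquantised layers.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "awq-int4": 0.6}

def weight_gb(params_billion: float, fmt: str) -> float:
    """GB for the weights alone; KV cache and runtime overhead come on top."""
    return params_billion * BYTES_PER_PARAM[fmt]

for name, p in [("7B", 6.7), ("13B", 13.0), ("34B", 34.0), ("70B", 69.0)]:
    print(f"Code Llama {name}: {weight_gb(p, 'fp16'):.1f} GB FP16, "
          f"{weight_gb(p, 'awq-int4'):.1f} GB INT4 (approx.)")
```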
VRAM by Code Llama size
Code Llama 7B
Fits on almost any modern GPU once quantised. INT4 (~5 GB) runs on an RTX 3050 6 GB; FP16 (~14 GB) needs a 16 GB+ card. Use cases: line-completion in IDEs, simple refactors, single-file code generation. Reference card: RTX 5060 8 GB at INT4 or RTX 5080 at FP16.
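For the small-GPU case, a GGUF build served through llama-cpp-python is the usual route. A minimal sketch, assuming a locally downloaded Q5_K_M file; the filename below is a placeholder.

```python
# Minimal llama-cpp-python sketch for Code Llama 7B as a local completion backend.
from llama_cpp import Llama

llm = Llama(
    model_path="./codellama-7b.Q5_K_M.gguf",  # placeholder path for your GGUF download
    n_gpu_layers=-1,   # offload all layers; lower this on a 6 GB card if it OOMs
    n_ctx=8192,        # context window; KV cache grows linearly with this
)

# Plain code completion, the 7B's main use case.
out = llm("def reverse_string(s):", max_tokens=64, temperature=0.2)
print(out["choices"][0]["text"])
```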
Code Llama 13B
The production sweet spot. ~26 GB FP16 fits a 32 GB card with room for KV cache; ~8 GB INT4 fits a 12 GB+ card. Use cases: production code-completion APIs, structured generation, code review automation.
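A minimal vLLM sketch for that setup. The Hugging Face repo id is assumed (check the exact name), and `max_model_len` should be set against your KV-cache headroom rather than left at the model's maximum.

```python
# vLLM offline-inference sketch for Code Llama 13B in FP16.
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # assumed HF repo id
    dtype="float16",
    max_model_len=4096,            # raise only if the card has KV headroom to spare
    gpu_memory_utilization=0.90,   # leave room for the CUDA context and buffers
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["def binary_search(arr, target):"], params)
print(out[0].outputs[0].text)
```

The same knobs apply to vLLM's OpenAI-compatible server if you need an HTTP endpoint rather than in-process inference.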
Code Llama 34B
~68 GB FP16. Single-card options at FP16: RTX 6000 Pro 96 GB or A100 80 GB. INT4 (~18 GB) fits a single RTX 5090. The 34B is meaningfully stronger than 13B on multi-file tasks, but the hardware jump is significant; teams that outgrow 13B usually skip straight to 70B.
Code Llama 70B
Same VRAM profile as Llama 3 70B. ~140 GB FP16 — multi-GPU only. ~40 GB INT4 — fits 2× RTX 5090. See can RTX 5090 run Llama? for the equivalent sizing on the Llama 3 family.
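For the two-card route, tensor parallelism in vLLM is the straightforward option. A sketch, assuming a community AWQ export of the 70B weights; the repo name is illustrative.

```python
# Sharding Code Llama 70B (4-bit AWQ) across two GPUs with vLLM.
from vllm import LLM

llm = LLM(
    model="TheBloke/CodeLlama-70B-Instruct-AWQ",  # illustrative AWQ export
    quantization="awq",
    tensor_parallel_size=2,       # split weights and KV cache across 2 cards
    max_model_len=16384,          # raise towards 32K for codebase Q&A if VRAM allows
    gpu_memory_utilization=0.90,
)
```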
KV cache for long-context coding
Code Llama supports up to 100K context (16K natively, scaled with RoPE). Long context is genuinely useful for codebase Q&A but expensive in VRAM. Per-request KV cache:
| Variant | 8K context | 16K context | 32K context | 64K context |
|---|---|---|---|---|
| Code Llama 7B | 0.5 GB | 1.0 GB | 2.0 GB | 4.0 GB |
| Code Llama 13B | 0.8 GB | 1.6 GB | 3.2 GB | 6.4 GB |
| Code Llama 34B | 1.6 GB | 3.2 GB | 6.4 GB | 12.8 GB |
| Code Llama 70B | 2.5 GB | 5.0 GB | 10.0 GB | 20.0 GB |
Figures are per concurrent request with an FP16 KV cache. vLLM's PagedAttention reduces fragmentation but not the total cache size.
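The underlying per-request formula is: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × context length. A sketch using the GQA configs I'd expect for the 34B and 70B checkpoints; verify the layer and head counts against each model's config.json.

```python
# FP16 KV-cache size per concurrent request.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_val: int = 2) -> float:
    """2 tensors (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * context / 1024**3

# Code Llama 70B: 80 layers, 8 KV heads (GQA), head_dim 128
print(round(kv_cache_gb(80, 8, 128, 8192), 1))   # ~2.5 GB at 8K, in line with the table
# Code Llama 34B: 48 layers, 8 KV heads, head_dim 128
print(round(kv_cache_gb(48, 8, 128, 8192), 1))   # ~1.5 GB at 8K
```

Serving stacks that support an FP8 KV cache roughly halve these figures.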
Which GPU fits which variant
| GPU | CL 7B | CL 13B | CL 34B | CL 70B |
|---|---|---|---|---|
| RTX 3050 6 GB | INT4 only | No | No | No |
| RTX 4060 8 GB | INT4 / FP8 | No | No | No |
| RTX 3060 12 GB | FP16 (tight) | INT4 | No | No |
| RTX 5060 Ti 16 GB | FP16 | INT4 | No | No |
| RTX 5080 16 GB | FP16 | INT4 / FP8 | No | No |
| RTX 3090 24 GB | FP16 | FP16 (tight) | INT4 | No |
| RTX 4090 24 GB | FP16 | FP16 (tight) | INT4 | No |
| RTX 5090 32 GB | FP16 (ample) | FP16 | INT4 | INT3 (tight) |
| RTX 6000 Pro 96 GB | Trivial | Trivial | FP16 | FP8 |
| A100 80 GB | Trivial | Trivial | FP16 | FP8 (tight) |
Real-world deployments
- IDE backend, 50 engineers — Code Llama 13B FP16 on a single RTX 5090. ~80 tok/s single-stream. £399/mo.
- Code review bot — Code Llama 34B AWQ-INT4 on a single RTX 5090. ~22 tok/s, but quality stays close to the unquantised FP16 model. £399/mo.
- Internal codebase Q&A — Code Llama 70B INT4 on 2× RTX 5090. 32K context fits. £899/mo.
- Embedded device prototype — Code Llama 7B INT4 on RTX 3050 6 GB. ~25 tok/s. £79/mo.
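Those £/mo and tok/s figures translate into a rough cost-per-token floor if you assume the card is kept busy around the clock, which is the optimistic case. A quick sketch:

```python
# Rough best-case cost per million output tokens, assuming 100% utilisation
# (real utilisation will be lower, so treat these as floors, not estimates).
def gbp_per_million_tokens(gbp_per_month: float, tokens_per_second: float) -> float:
    tokens_per_month = tokens_per_second * 60 * 60 * 24 * 30
    return gbp_per_month / tokens_per_month * 1_000_000

print(round(gbp_per_million_tokens(399, 80), 2))   # 13B FP16 on one RTX 5090: ~£1.9
print(round(gbp_per_million_tokens(399, 22), 2))   # 34B INT4 on one RTX 5090: ~£7.0
```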
Code Llama vs DeepSeek-Coder vs Codestral
Honestly, in 2026 most new code-specialised deployments don’t pick Code Llama. The two stronger open alternatives:
- DeepSeek-Coder 6.7B / 33B — usually outperforms Code Llama at similar parameter counts. See best GPU for DeepSeek.
- Codestral 22B — Mistral’s code model. Strong on multi-file Python, Apache 2.0.
Code Llama remains relevant when you need ecosystem compatibility — IDE plugins that target the Code Llama API shape, fine-tunes built on the original weights — or when you want a 70B-class code model and DeepSeek-V3 is too big for your hardware.
Bottom line
For new Code Llama deployments: 13B at FP16 on an RTX 5090 is the default. Drop to 7B INT4 on cheaper hardware if cost-driven; step up to 34B AWQ-INT4 (still on a single 5090) if quality-driven. For 70B, plan on multi-GPU: a single-card INT3 squeeze is technically possible but leaves little headroom for KV cache.
If you’re evaluating coding models, also benchmark DeepSeek-Coder — it’s often a better deployment per pound at similar quality.