CodeLlama VRAM Requirements Overview
Meta’s CodeLlama is a code-specialized version of LLaMA 2, available in 7B, 13B, and 34B parameter sizes, each in Base, Instruct, and Python variants. Thanks to long-context fine-tuning, it can work with sequences of up to roughly 100K tokens. This guide covers VRAM for every variant to help you pick the right dedicated GPU server for code model hosting.
CodeLlama uses the same architecture as LLaMA 2 with additional training on code datasets. All sizes were fine-tuned on 16K-token sequences with an increased RoPE base frequency, which allows extrapolation toward 100K tokens. Code infilling (fill-in-the-middle) is supported by the 7B and 13B Base and Instruct models, but not by the Python variants or the 34B model.
Complete VRAM Table (All Models)
| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| CodeLlama 7B | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 7B Instruct | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 7B Python | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 13B | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 13B Instruct | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 13B Python | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 34B | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |
| CodeLlama 34B Instruct | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |
| CodeLlama 34B Python | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |
The Base, Instruct, and Python variants at each size have identical VRAM requirements since they share the same architecture and parameter count. The difference is in training data and fine-tuning. For newer code models, also see our Qwen VRAM requirements page (Qwen2.5-Coder) and DeepSeek VRAM requirements (DeepSeek Coder).
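The table values can be reproduced from the parameter counts alone. A quick sketch (parameter counts from the table above; the optional overhead factor for runtime buffers is an assumption, not a measured value):

```python
# Rough weight-memory estimator for CodeLlama checkpoints.
# Parameter counts match the table above; the overhead factor is an
# assumed allowance for activations, CUDA context, and fragmentation.

PARAMS_B = {"7b": 6.7, "13b": 13.0, "34b": 33.7}       # billions of params
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(model: str, precision: str, overhead: float = 1.2) -> float:
    """Estimated GB needed to hold the weights, with serving overhead."""
    return round(PARAMS_B[model] * BYTES_PER_PARAM[precision] * overhead, 1)

for m in PARAMS_B:
    print(m, {p: weight_vram_gb(m, p) for p in BYTES_PER_PARAM})
```

With `overhead=1.0` the function reproduces the raw table figures; keep some headroom in practice, since the KV cache (next section) comes on top of the weights.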
Which GPU Do You Need?
| GPU | VRAM | Best CodeLlama | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | 7B | 4-bit / INT8 | IDE copilot |
| RTX 4060 | 8 GB | 7B | INT8 / 4-bit | Dev / personal |
| RTX 4060 Ti | 16 GB | 7B / 13B | FP16 / INT8 | Code API |
| RTX 3090 | 24 GB | 13B / 34B | INT8 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | 34B | INT8 | Best quality |
Context Length Impact on VRAM
Code completion tasks often need long context to pull in related files. Here is the approximate FP16 KV cache impact:
| Context | 7B KV Cache | 13B KV Cache | 34B KV Cache |
|---|---|---|---|
| 4,096 | ~2.1 GB | ~3.4 GB | ~0.8 GB |
| 8,192 | ~4.3 GB | ~6.7 GB | ~1.6 GB |
| 16,384 | ~8.6 GB | ~13 GB | ~3.2 GB |
| 32,768 | ~17 GB | ~27 GB | ~6.5 GB |
| 100,000 | ~52 GB | ~82 GB | ~20 GB |
Counterintuitively, the 34B model has the smallest cache: it uses grouped-query attention (8 KV heads), while 7B and 13B use full multi-head attention. At the 16K fine-tuned limit, the FP16 cache is already significant for 7B and 13B; pushing toward 100K requires either substantial additional VRAM or a quantized (8-bit/4-bit) KV cache.
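KV-cache needs can be recomputed from the model configuration. A sketch assuming an FP16 cache, with layer and head counts taken from the public CodeLlama configs (34B uses grouped-query attention with 8 KV heads); treat them as assumptions if you run a modified checkpoint:

```python
# FP16 KV-cache estimator. Per-token size = 2 (K and V) * layers *
# kv_heads * head_dim * bytes-per-value. Configs are assumed to match
# the published CodeLlama checkpoints.

CONFIGS = {  # (n_layers, n_kv_heads, head_dim)
    "7b": (32, 32, 128),
    "13b": (40, 40, 128),
    "34b": (48, 8, 128),   # grouped-query attention: only 8 KV heads
}

def kv_cache_gb(model: str, context: int, kv_bytes: int = 2) -> float:
    """KV cache in GB for one sequence of `context` tokens."""
    n_layers, n_kv_heads, head_dim = CONFIGS[model]
    per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    return round(per_token * context / 1e9, 1)

print(kv_cache_gb("7b", 16_384))    # ~8.6 GB
print(kv_cache_gb("34b", 100_000))  # ~19.7 GB
```

Passing `kv_bytes=1` models an 8-bit KV cache, which halves these figures and is often the cheapest way to extend context on a fixed GPU.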
Batch Size Impact on VRAM
For IDE-style code completion, you typically serve one user at a time. For team code APIs, batching matters. The figures below assume 4-bit weights with a 4K context window, and that each concurrent sequence holds roughly 1K cached tokens at any moment rather than a full 4K cache:
| Model (4-bit, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| CodeLlama 7B | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| CodeLlama 13B | ~9 GB | ~13 GB | ~17 GB | ~25 GB |
| CodeLlama 34B | ~22 GB | ~30 GB | ~38 GB | ~54 GB |
For a small development team (4-8 developers), CodeLlama 13B at 4-bit on an RTX 3090 handles concurrent requests well.
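A rough sketch of how these batch figures combine: 4-bit weight footprint plus one KV cache per concurrent sequence. The average-cached-tokens value and the fixed runtime overhead are assumptions for a completion workload, not measurements from any particular serving stack:

```python
# Serving-VRAM sketch: 4-bit weights + FP16 KV cache per sequence.
# Per-token KV sizes follow the public model configs; avg_tokens and
# overhead_gb are assumed workload parameters.

WEIGHTS_4BIT_GB = {"7b": 4.5, "13b": 8.0, "34b": 20.0}
KV_BYTES_PER_TOKEN = {"7b": 524_288, "13b": 819_200, "34b": 196_608}

def serving_vram_gb(model: str, batch: int, avg_tokens: int = 1024,
                    overhead_gb: float = 1.5) -> float:
    """Estimated total GB for `batch` concurrent sequences."""
    kv = batch * avg_tokens * KV_BYTES_PER_TOKEN[model] / 1e9
    return round(WEIGHTS_4BIT_GB[model] + kv + overhead_gb, 1)

print(serving_vram_gb("13b", 8))  # in the same ballpark as the table above
```

Raising `avg_tokens` (e.g. long files in every prompt) grows the KV term linearly, which is usually what pushes a batch off a 24 GB card before the weights do.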
Practical Deployment Recommendations
- Personal IDE copilot: CodeLlama 7B on RTX 4060 (INT8 or 4-bit). Fast code completion at 25-30 tok/s. Use the Instruct variant for chat, Python variant for Python-specific tasks.
- Small team code API: CodeLlama 13B on RTX 3090 (INT8). Better code quality, handles 4+ concurrent developers.
- Best code quality: CodeLlama 34B on a single RTX 3090 (4-bit) or 2x RTX 3090 (INT8; FP16 weights alone need ~68 GB). Meta's strongest CodeLlama for code generation.
- Newer alternatives: Consider Qwen2.5-Coder or DeepSeek Coder V2 for potentially better performance on recent benchmarks. See our code model hosting page for comparisons.
For cost comparisons, see our cost per 1M tokens: GPU vs API analysis and the LLM cost calculator.
Quick Setup Commands
Ollama (IDE Integration)
```
# Install and run for code completion
curl -fsSL https://ollama.com/install.sh | sh
ollama run codellama:7b-code       # Code completion
ollama run codellama:7b-instruct   # Chat about code
ollama run codellama:13b-instruct  # Better quality
```
vLLM (Team API)
```
# CodeLlama 13B API server (4-bit AWQ checkpoint)
pip install vllm
vllm serve TheBloke/CodeLlama-13B-Instruct-AWQ \
  --quantization awq --max-model-len 16384

# CodeLlama 34B on RTX 3090 (4-bit AWQ)
vllm serve TheBloke/CodeLlama-34B-Instruct-AWQ \
  --quantization awq --max-model-len 8192
```
Note that `--quantization awq` expects a pre-quantized AWQ checkpoint, so point it at an AWQ repo rather than the original FP16 weights.
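vLLM exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal stdlib sketch of a chat-completion request follows; the `model` field must match whatever you passed to `vllm serve`, and the AWQ repo name here is illustrative:

```python
import json
from urllib import request

# Build a chat-completions request for the vLLM OpenAI-compatible server.
payload = {
    "model": "TheBloke/CodeLlama-13B-Instruct-AWQ",  # must match vllm serve
    "messages": [
        {"role": "user",
         "content": "Write a Python function that reverses a linked list."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# The actual call needs the server running, so it is commented out here:
# resp = request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```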
Continue.dev Integration
```
# After starting Ollama, configure Continue.dev in VS Code:
# - autocomplete model: codellama:7b-code
# - chat model:         codellama:13b-instruct
```
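As a concrete sketch, a minimal `~/.continue/config.json` in Continue's older JSON config format might look like the following (Continue has since moved to a YAML config, so treat the exact schema as version-dependent):

```json
{
  "models": [
    {
      "title": "CodeLlama 13B Instruct",
      "provider": "ollama",
      "model": "codellama:13b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama 7B Code",
    "provider": "ollama",
    "model": "codellama:7b-code"
  }
}
```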
For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with other code models on our best GPU for LLM inference page and use the benchmark tool for speed data.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers