CodeLlama 34B Benchmark Overview
CodeLlama 34B is Meta’s specialised code generation model, fine-tuned from LLaMA 2 for programming tasks including code completion, infilling, and instruction following. At 34 billion parameters it requires roughly 68 GB of VRAM at FP16, making quantisation mandatory for deployment on a single dedicated GPU server. We benchmark inference speed across six GPUs to find the optimal setup.
Tests used vLLM on GigaGPU bare-metal servers with a 512-token code input and 256-token completion output. At INT4, CodeLlama 34B requires approximately 17 GB of VRAM. See our tokens per second benchmark hub for methodology details.
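The VRAM figures above follow from simple arithmetic: parameter count times bytes per parameter for the weights alone (KV cache and runtime overhead come on top). A minimal sketch:

```python
def estimate_weights_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate VRAM footprint of the model weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9

# CodeLlama 34B weight footprints at common precisions
fp16 = estimate_weights_gb(34e9, 16)  # 68 GB
int8 = estimate_weights_gb(34e9, 8)   # 34 GB
int4 = estimate_weights_gb(34e9, 4)   # 17 GB
print(f"FP16: {fp16:.0f} GB, INT8: {int8:.0f} GB, INT4: {int4:.0f} GB")
```

This matches the ~68 GB FP16 and ~17 GB INT4 numbers quoted above; real deployments need a few extra GB beyond the weights for the KV cache.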
Tokens/sec Results by GPU
| GPU | VRAM | CodeLlama 34B INT4 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Tight; needs offloading |
| RTX 3090 | 24 GB | 15 tok/s | INT4 fits with headroom |
| RTX 5080 | 16 GB | N/A | Needs offloading at INT4 |
| RTX 5090 | 32 GB | 30 tok/s | Comfortable fit |
CodeLlama 34B at INT4 (approximately 17 GB) fits well on the RTX 3090 and RTX 5090. The RTX 5090 at 30 tok/s provides responsive code completions suitable for IDE integration.
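Those throughput numbers translate directly into wall-clock latency for the 256-token completion used in this benchmark. A quick sanity check (decode time only; it ignores prompt processing for the 512-token input, which is comparatively small):

```python
def completion_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Decode time for a completion at a given generation rate."""
    return output_tokens / tok_per_s

for gpu, rate in [("RTX 3090", 15), ("RTX 5090", 30)]:
    secs = completion_seconds(256, rate)
    print(f"{gpu}: {secs:.1f} s per 256-token completion")
# RTX 3090: ≈17 s, RTX 5090: ≈8.5 s
```

At ~8.5 seconds for a full 256-token completion, the RTX 5090 is viable for IDE integration; shorter completions (the common case for inline suggestions) finish proportionally faster.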
Quantisation Comparison
INT8 requires roughly 34 GB, limiting it to the RTX 5090 among single consumer GPUs. For detailed quantisation trade-offs, see our quantisation speed comparison.
| GPU | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| RTX 3090 (24 GB) | N/A | 15 tok/s |
| RTX 5090 (32 GB) | 22 tok/s | 30 tok/s |
| 2x RTX 3090 (48 GB) | 20 tok/s | 28 tok/s |
For code generation tasks where precision matters, INT8 on the RTX 5090 at 22 tok/s is a strong choice. INT4 remains the speed champion, and quality degradation on code tasks is generally minimal.
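The N/A entries in both tables come down to a simple fit check: quantised weights plus headroom for the KV cache and runtime must fit in VRAM. A minimal sketch for the INT4 case (the 2 GB headroom figure is an illustrative assumption, not a measured value):

```python
INT4_WEIGHTS_GB = 17   # approximate CodeLlama 34B INT4 footprint
HEADROOM_GB = 2        # assumed allowance for KV cache and runtime overhead

def fits_int4(vram_gb: float) -> bool:
    """True if the GPU holds the INT4 weights plus working headroom."""
    return vram_gb >= INT4_WEIGHTS_GB + HEADROOM_GB

for gpu, vram in [("RTX 4060 Ti", 16), ("RTX 3090", 24), ("RTX 5090", 32)]:
    verdict = "fits" if fits_int4(vram) else "needs offloading"
    print(f"{gpu} ({vram} GB): {verdict}")
```

This reproduces the table: 16 GB cards fall just short of the ~17 GB INT4 footprint once headroom is counted, while 24 GB and 32 GB cards fit comfortably.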
Cost Efficiency Analysis
| Configuration | INT4 tok/s | Approx. Monthly Cost | Tok/s per £/month |
|---|---|---|---|
| RTX 3090 | 15 | ~£110 | 0.14 |
| RTX 5090 | 30 | ~£250 | 0.12 |
| 2x RTX 3090 | 28 | ~£210 | 0.13 |
The single RTX 3090 provides the best cost efficiency. When choosing the best GPU for CodeLlama, weigh latency requirements against your budget.
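The efficiency column is just throughput divided by monthly cost. Reproducing it from the figures above:

```python
# (INT4 tok/s, approx. monthly cost in GBP) from the table above
configs = {
    "RTX 3090": (15, 110),
    "RTX 5090": (30, 250),
    "2x RTX 3090": (28, 210),
}

for name, (tok_s, monthly_gbp) in configs.items():
    efficiency = tok_s / monthly_gbp
    print(f"{name}: {efficiency:.2f} tok/s per £/month")
# RTX 3090: 0.14, RTX 5090: 0.12, 2x RTX 3090: 0.13
```

The spread is narrow (0.12 to 0.14), which is why the recommendation hinges on latency needs rather than cost alone.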
GPU Recommendations
- Budget: RTX 3090 — 15 tok/s at INT4 is workable for batch code generation and CI/CD integration.
- Recommended: RTX 5090 — 30 tok/s delivers responsive completions for real-time IDE plugins.
- Multi-GPU: 2x RTX 3090 — 28 tok/s with INT8 option for higher code accuracy.
For smaller code models, check the Phi-3 Mini benchmark (Phi-3 includes strong coding capabilities). For other large models, see the LLaMA 3 70B benchmark. Browse all results in the Benchmarks category.
Conclusion
CodeLlama 34B is the go-to model for production code generation that demands more capability than 7-13B models can offer. With INT4 quantisation, it runs well on the RTX 3090 and excels on the RTX 5090. For teams needing fast, accurate code completions, a dedicated GPU server with the right hardware is a worthwhile investment.
Deploy CodeLlama 34B on Dedicated Servers
High-VRAM GPU servers built for code generation workloads. Full root access, NVMe, and UK hosting.
Browse GPU Servers