
CodeLlama 34B Tokens/sec by GPU

Benchmark data for CodeLlama 34B inference speed across GPUs, with INT4 and INT8 quantisation results and a cost analysis for dedicated GPU hosting.

CodeLlama 34B Benchmark Overview

CodeLlama 34B is Meta’s specialised code generation model, fine-tuned from LLaMA 2 for programming tasks including code completion, infilling, and instruction following. At 34 billion parameters, it requires roughly 68 GB of VRAM at FP16, making quantisation mandatory for single-GPU dedicated server deployments. We benchmark inference speed across six GPUs to find the optimal setup.

Tests used vLLM on GigaGPU bare-metal servers with a 512-token code input and 256-token completion output. At INT4, CodeLlama 34B requires approximately 17 GB of VRAM. See our tokens per second benchmark hub for methodology details.
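The 68 GB and 17 GB figures above follow from a simple bytes-per-parameter estimate. A quick sketch of the arithmetic (weights only — KV cache, activations, and framework overhead are not counted, so treat these as lower bounds):

```python
# Back-of-envelope VRAM estimate for model weights at a given precision.
# Weights only: KV cache, activations, and runtime overhead are ignored,
# so real usage is a few GB higher than this lower bound.

def weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate VRAM for weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(f"FP16: {weight_vram_gb(34, 16):.0f} GB")  # ~68 GB
print(f"INT8: {weight_vram_gb(34, 8):.0f} GB")   # ~34 GB
print(f"INT4: {weight_vram_gb(34, 4):.0f} GB")   # ~17 GB
```

The same function gives a first-pass feasibility check for any model size and quantisation level before renting hardware.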

Tokens/sec Results by GPU

| GPU | VRAM | CodeLlama 34B INT4 (tok/s) | Notes |
| --- | --- | --- | --- |
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Tight; needs offloading |
| RTX 3090 | 24 GB | 15 tok/s | INT4 fits with headroom |
| RTX 5080 | 16 GB | N/A | Needs offloading at INT4 |
| RTX 5090 | 32 GB | 30 tok/s | Comfortable fit |

CodeLlama 34B at INT4 (approximately 17 GB) fits well on the RTX 3090 and RTX 5090. The RTX 5090 at 30 tok/s provides responsive code completions suitable for IDE integration.
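To translate these decode speeds into perceived latency for the 256-token completions used in the tests, simple division gives a rough figure (this ignores prefill time for the 512-token prompt, so real end-to-end latency is somewhat higher):

```python
# Rough time to generate a fixed-length completion at a given decode speed.
# Prefill time for the 512-token input prompt is ignored.

def completion_seconds(output_tokens: int, tok_per_sec: float) -> float:
    return output_tokens / tok_per_sec

for gpu, speed in [("RTX 3090", 15), ("RTX 5090", 30)]:
    print(f"{gpu}: {completion_seconds(256, speed):.1f} s per 256-token completion")
```

At 30 tok/s, a full 256-token completion takes about 8.5 s of decode time; typical inline IDE completions are much shorter and land proportionally faster.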

Quantisation Comparison

INT8 requires roughly 34 GB, limiting it to the RTX 5090 among single consumer GPUs. For detailed quantisation trade-offs, see our quantisation speed comparison.

| GPU | INT8 (tok/s) | INT4 (tok/s) |
| --- | --- | --- |
| RTX 3090 (24 GB) | N/A | 15 tok/s |
| RTX 5090 (32 GB) | 22 tok/s | 30 tok/s |
| 2x RTX 3090 (48 GB) | 20 tok/s | 28 tok/s |

For code generation tasks where precision matters, INT8 on the RTX 5090 at 22 tok/s is a strong choice. INT4 remains the speed champion, and quality degradation on code tasks is generally minimal.

Cost Efficiency Analysis

| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
| --- | --- | --- | --- |
| RTX 3090 | 15 | ~£110 | 0.14 |
| RTX 5090 | 30 | ~£250 | 0.12 |
| 2x RTX 3090 | 28 | ~£210 | 0.13 |
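The tok/s-per-pound column is simply throughput divided by monthly cost; reproducing it with the table's own figures:

```python
# Reproduce the cost-efficiency column: INT4 tok/s divided by monthly cost.
configs = {
    "RTX 3090": (15, 110),       # (tok/s, approx. £/month)
    "RTX 5090": (30, 250),
    "2x RTX 3090": (28, 210),
}

for name, (tok_s, monthly_gbp) in configs.items():
    print(f"{name}: {tok_s / monthly_gbp:.2f} tok/s per £/month")
```

The same one-liner makes it easy to re-rank configurations as hosting prices change.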

The single RTX 3090 provides the best cost efficiency. To pick the best GPU for CodeLlama, weigh your latency requirements against your budget.

GPU Recommendations

  • Budget: RTX 3090 — 15 tok/s at INT4 is workable for batch code generation and CI/CD integration.
  • Recommended: RTX 5090 — 30 tok/s delivers responsive completions for real-time IDE plugins.
  • Multi-GPU: 2x RTX 3090 — 28 tok/s with INT8 option for higher code accuracy.
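For the batch and CI/CD scenario, sustained throughput matters more than per-request latency. A rough single-stream ceiling, assuming the decode speeds above hold continuously (vLLM's continuous batching would push aggregate throughput well beyond this):

```python
# Sustained single-stream throughput ceiling over a day.
# Assumes the benchmark decode speed holds continuously; concurrent
# requests with batching would raise aggregate throughput considerably.

def tokens_per_day(tok_per_sec: float, utilisation: float = 1.0) -> int:
    return int(tok_per_sec * 86_400 * utilisation)

print(f"RTX 3090: {tokens_per_day(15):,} tokens/day")
print(f"RTX 5090: {tokens_per_day(30):,} tokens/day")
```

Even the budget RTX 3090 sustains over a million generated tokens per day at 15 tok/s, which is ample for most nightly-batch code generation pipelines.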

For smaller code models, check the Phi-3 Mini benchmark (Phi-3 includes strong coding capabilities). For other large models, see the LLaMA 3 70B benchmark. Browse all results in the Benchmarks category.

Conclusion

CodeLlama 34B is the go-to model for production code generation that demands more capability than 7-13B models can offer. With INT4 quantisation, it runs well on the RTX 3090 and excels on the RTX 5090. For teams needing fast, accurate code completions, a dedicated GPU server with the right hardware is a worthwhile investment.

Deploy CodeLlama 34B on Dedicated Servers

High-VRAM GPU servers built for code generation workloads. Full root access, NVMe, and UK hosting.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
