Quantisation Benchmark Overview
LLaMA 3 8B is one of the most popular open LLMs for inference. Running it at different precision levels, from FP16 (full precision) down to INT8 (8-bit) and INT4 (4-bit), trades a little quality for speed and VRAM savings. Testing on a dedicated GPU server reveals exactly how much speed quantisation buys you.
We benchmarked LLaMA 3 8B Instruct on GigaGPU servers using vLLM (FP16, AWQ INT4) and llama.cpp (GGUF Q8_0, Q4_K_M) at batch size 1 with a 512-token prompt and 256-token generation. VRAM requirements range from ~5GB (INT4) to ~16GB (FP16). For detailed memory analysis, see our LLaMA 3 VRAM requirements guide.
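For reference, here is a minimal sketch of the batch-1 timing loop on the vLLM side (the GGUF runs use llama.cpp instead). The FP16 model ID is Meta's public checkpoint; the AWQ repo name is a hypothetical stand-in for whichever INT4 build you deploy.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical model IDs: the FP16 ID is Meta's public checkpoint; the AWQ
# repo is a stand-in for whichever INT4 build you use.
MODELS = {
    "FP16": "meta-llama/Meta-Llama-3-8B-Instruct",
    "INT4 (AWQ)": "casperhansen/llama-3-8b-instruct-awq",
}

params = SamplingParams(max_tokens=256, temperature=0.0)  # 256-token generation
prompt = " ".join(["benchmark"] * 512)  # stand-in for the 512-token prompt

def bench(model_id: str) -> float:
    llm = LLM(model=model_id)  # vLLM picks up AWQ quantisation from the repo config
    llm.generate([prompt], params)  # warm-up request, excluded from timing
    start = time.perf_counter()
    out = llm.generate([prompt], params)
    elapsed = time.perf_counter() - start
    return len(out[0].outputs[0].token_ids) / elapsed

for label, model_id in MODELS.items():
    # In practice, run each precision in a separate process so VRAM is fully freed.
    print(f"{label}: {bench(model_id):.1f} tok/s")
```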
Tokens/sec by Precision and GPU
| GPU | VRAM | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|---|---|
| RTX 3050 | 6 GB | N/A (OOM) | N/A (OOM) | 28 |
| RTX 4060 | 8 GB | N/A (OOM) | 32 | 48 |
| RTX 4060 Ti | 16 GB | 35 | 45 | 58 |
| RTX 3090 | 24 GB | 48 | 60 | 72 |
| RTX 5080 | 16 GB | 72 | 88 | 105 |
| RTX 5090 | 32 GB | 98 | 118 | 138 |
INT4 quantisation delivers roughly 40-65% more tokens per second than FP16 on the same GPU. The speedup comes from reduced memory bandwidth demand: at batch size 1, decoding is bound by how fast the weights stream from VRAM, and 4-bit weights transfer twice as fast as 8-bit and four times as fast as 16-bit. The RTX 5080 at INT4 (105 tok/s) outperforms the RTX 3090 at FP16 (48 tok/s) by more than 2x.
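A quick roofline estimate makes the bandwidth argument concrete: at batch size 1, an upper bound on decode speed is memory bandwidth divided by weight size. The sketch below uses the RTX 3090's published 936 GB/s bandwidth and the weight footprints from the tables; measured numbers always sit below these ceilings.

```python
# Roofline sketch: at batch 1 every decoded token streams all weight bytes
# from VRAM, so tok/s is capped at (memory bandwidth) / (weight size).
# 936 GB/s is the RTX 3090's published bandwidth; weight sizes are from above.
BANDWIDTH_GB_S = 936

for label, weight_gb in [("FP16", 16.0), ("INT8", 8.5), ("INT4", 5.0)]:
    print(f"{label}: ceiling of {BANDWIDTH_GB_S / weight_gb:.0f} tok/s")

# FP16 ceiling ~59 tok/s vs 48 measured (about 80% of roofline). The INT4
# ceiling (~187 tok/s) is far above the measured 72, suggesting dequantisation
# and other overheads, not bandwidth, set the limit at 4-bit.
```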
Quality Impact of Quantisation
Quantisation reduces model quality. Below we compare perplexity (lower is better) and MMLU accuracy across precision levels.
| Precision | VRAM (weights) | Perplexity (WikiText) | MMLU Accuracy | Speed Gain vs FP16 |
|---|---|---|---|---|
| FP16 | ~16GB | 6.14 | 66.5% | Baseline |
| INT8 (Q8_0) | ~8.5GB | 6.18 | 66.2% | ~20-30% |
| INT4 (Q4_K_M) | ~5GB | 6.42 | 64.8% | ~40-65% |
INT8 quantisation is nearly lossless: perplexity rises by just 0.04 and MMLU drops by 0.3 points. INT4 shows more degradation but remains usable for most applications. For tasks that demand maximum accuracy (coding, reasoning), prefer FP16 or INT8; for chatbots and general Q&A, INT4 is often indistinguishable in practice.
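If you want to sanity-check the perplexity column yourself, it is straightforward to reproduce with Hugging Face transformers. The sketch below assumes Meta's public FP16 checkpoint and a simple non-overlapping 2048-token window; it is not the exact harness behind the table, and windowing choices shift absolute perplexity, so compare precisions under identical settings.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # swap in a quantised build to compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Concatenate the WikiText-2 test split into one token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to("cuda")

window = 2048  # evaluation context; window choice shifts absolute PPL
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for i in range(0, ids.size(1), window):
        chunk = ids[:, i : i + window]
        if chunk.size(1) < 2:
            break
        loss = model(chunk, labels=chunk).loss  # mean NLL; labels shift internally
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```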
Cost Efficiency Analysis
| GPU | INT4 tok/s | Approx. Monthly Cost | tok/s per £ of monthly cost |
|---|---|---|---|
| RTX 3050 | 28 | ~£45 | 0.62 |
| RTX 4060 | 48 | ~£60 | 0.80 |
| RTX 4060 Ti | 58 | ~£75 | 0.77 |
| RTX 3090 | 72 | ~£110 | 0.65 |
| RTX 5080 | 105 | ~£160 | 0.66 |
| RTX 5090 | 138 | ~£250 | 0.55 |
The RTX 4060 leads on cost efficiency for INT4 LLaMA 3 8B inference. For FP16 quality, the RTX 4060 Ti is the cheapest viable option since the 4060’s 8GB cannot fit FP16 weights.
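The same figures can be reframed as cost per million generated tokens, which is handy when comparing against per-token API pricing. A quick sketch, assuming round-the-clock utilisation (a best case that real workloads will not hit):

```python
# £ per million generated tokens at full utilisation, from the table above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59M seconds

gpus = {  # GPU: (INT4 tok/s, approx. monthly cost in £)
    "RTX 3050": (28, 45),
    "RTX 4060": (48, 60),
    "RTX 4060 Ti": (58, 75),
    "RTX 3090": (72, 110),
    "RTX 5080": (105, 160),
    "RTX 5090": (138, 250),
}

for gpu, (tok_s, cost) in gpus.items():
    mtok_per_month = tok_s * SECONDS_PER_MONTH / 1e6
    print(f"{gpu}: £{cost / mtok_per_month:.2f} per million tokens")

# The RTX 4060 again comes out cheapest, at roughly £0.48 per million tokens.
```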
GPU Recommendations
- INT4 budget: RTX 4060 — best tok/s per pound, 48 tok/s is fast enough for chat.
- FP16 budget: RTX 4060 Ti — cheapest card that fits LLaMA 3 8B in FP16.
- Speed without flagship pricing: RTX 5080 — 105 tok/s at INT4 or 72 tok/s at FP16 for responsive apps.
- Maximum quality + speed: RTX 5090 — 98 tok/s at FP16, no quality compromise.
For larger LLaMA models, see the RTX 5090 LLaMA 3 70B INT4 guide. For batch size effects on throughput, check our batch size impact benchmark. Compare with other models in our best GPU for LLM inference guide or the tokens per second benchmark. Browse all results in the Benchmarks category.
Conclusion
Quantisation is the single most effective way to speed up LLM inference. INT4 delivers roughly 40-65% more tokens per second than FP16 while using about 70% less VRAM. For most applications, INT8 offers the best balance of quality and speed. Choose FP16 only when you need maximum accuracy and have the VRAM to spare, or when serving at high concurrency, where small quality losses compound across many users.
Run LLaMA 3 on Dedicated GPU Servers
Deploy at any precision on bare-metal GPU hardware. UK hosting with full root access.
Browse GPU Servers