
LLaMA 3 8B: FP16 vs INT8 vs INT4 Tokens/sec

Benchmark comparison of LLaMA 3 8B inference speed at FP16, INT8, and INT4 precision across six GPUs with quality and cost trade-off analysis.

Quantisation Benchmark Overview

LLaMA 3 8B is one of the most popular open LLMs for inference. Running it at different precision levels (FP16 full precision, INT8 8-bit, and INT4 4-bit) trades quality for speed and VRAM savings. Testing on a dedicated GPU server reveals exactly how much speed each step of quantisation buys.

We benchmarked LLaMA 3 8B Instruct on GigaGPU servers using vLLM (FP16, AWQ INT4) and llama.cpp (GGUF Q8_0, Q4_K_M) at batch size 1 with a 512-token prompt and 256-token generation. VRAM requirements range from ~5GB (INT4) to ~16GB (FP16). For detailed memory analysis, see our LLaMA 3 VRAM requirements guide.
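As a rough sketch of this setup (file paths below are illustrative placeholders, not the exact artefacts we tested), the two runners can be launched like so:

```shell
# Batch 1, 512-token prompt, 256-token generation, as in the benchmark.
# Model paths are placeholders.

# llama.cpp's built-in benchmark tool for the GGUF quants (Q8_0 / Q4_K_M):
./llama-bench -m models/llama-3-8b-instruct.Q4_K_M.gguf -p 512 -n 256

# vLLM server at FP16 (for an AWQ INT4 checkpoint, add --quantization awq):
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 1024
```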

Tokens/sec by Precision and GPU

| GPU | VRAM | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|---|---|
| RTX 3050 | 6 GB | N/A (OOM) | N/A (OOM) | 28 |
| RTX 4060 | 8 GB | N/A (OOM) | 32 | 48 |
| RTX 4060 Ti | 16 GB | 35 | 45 | 58 |
| RTX 3090 | 24 GB | 48 | 60 | 72 |
| RTX 5080 | 16 GB | 72 | 88 | 105 |
| RTX 5090 | 32 GB | 98 | 118 | 138 |

INT4 quantisation delivers roughly 40-65% more tokens per second than FP16 on the same GPU, depending on the card. The speedup comes from reduced memory bandwidth requirements: 4-bit weights transfer twice as fast as 8-bit and four times as fast as 16-bit. The RTX 5080 at INT4 (105 tok/s) outperforms the RTX 3090 at FP16 (48 tok/s) by more than 2x.
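The bandwidth argument can be sanity-checked with a back-of-envelope model. A sketch: the ~936 GB/s figure is the RTX 3090's published memory bandwidth, and real decoding adds KV-cache and activation traffic on top, so measured numbers land below this ceiling.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Rough upper bound on batch-1 decode speed: generating one token
    streams the entire weight set from VRAM once, so throughput is
    capped at memory bandwidth divided by model size."""
    return bandwidth_gb_s / weight_gb

# RTX 3090 (~936 GB/s) with ~16 GB of FP16 weights:
print(round(decode_tps_ceiling(936, 16)))  # ceiling of ~58 tok/s
# Our measured 48 tok/s sits below that, consistent with a memory-bound
# decode; shrinking weights to ~5 GB (INT4) lifts the ceiling to ~187.
```

This is why quantisation speeds up single-stream inference even though the arithmetic still runs at higher precision internally: the bottleneck is moving weights, not computing with them.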

Quality Impact of Quantisation

Quantisation reduces model quality. Below we compare perplexity (lower is better) and MMLU accuracy across precision levels.

| Precision | VRAM (weights) | Perplexity (WikiText) | MMLU Accuracy | Speed Gain vs FP16 |
|---|---|---|---|---|
| FP16 | ~16 GB | 6.14 | 66.5% | Baseline |
| INT8 (Q8_0) | ~8.5 GB | 6.18 | 66.2% | ~20-30% |
| INT4 (Q4_K_M) | ~5 GB | 6.42 | 64.8% | ~40-65% |

INT8 quantisation is nearly lossless, with perplexity increasing by just 0.04 and MMLU dropping by 0.3 percentage points. INT4 shows more degradation but remains useful for most applications. For tasks requiring maximum accuracy (coding, reasoning), prefer FP16 or INT8. For chatbots and general Q&A, INT4 is often indistinguishable in practice.
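For reference, perplexity is just the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal illustration (the figures in our table come from a standard WikiText harness, not this sketch):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token).
    Lower means the model is less 'surprised' by the text."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If a model assigned every token probability 1/2, each NLL is ln(2)
# and perplexity is 2 (equivalent to guessing between two options):
print(perplexity([math.log(2)] * 4))  # ~2.0
```

Seen this way, the INT8 shift from 6.14 to 6.18 is a change in effective "branching factor" of well under 1%, which matches the near-identical MMLU scores.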

Cost Efficiency Analysis

| GPU | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| RTX 3050 | 28 | ~£45 | 0.62 |
| RTX 4060 | 48 | ~£60 | 0.80 |
| RTX 4060 Ti | 58 | ~£75 | 0.77 |
| RTX 3090 | 72 | ~£110 | 0.65 |
| RTX 5080 | 105 | ~£160 | 0.66 |
| RTX 5090 | 138 | ~£250 | 0.55 |

The RTX 4060 leads on cost efficiency for INT4 LLaMA 3 8B inference. For FP16 quality, the RTX 4060 Ti is the cheapest viable option since the 4060’s 8GB cannot fit FP16 weights.
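The efficiency column is simply INT4 throughput divided by monthly cost; recomputing it from the table confirms the RTX 4060's lead:

```python
# (INT4 tok/s, approx. monthly cost in £) from the table above.
gpus = {
    "RTX 3050": (28, 45), "RTX 4060": (48, 60), "RTX 4060 Ti": (58, 75),
    "RTX 3090": (72, 110), "RTX 5080": (105, 160), "RTX 5090": (138, 250),
}

# Tokens per second per pound per month, rounded to two places.
efficiency = {name: round(tps / cost, 2) for name, (tps, cost) in gpus.items()}
best = max(efficiency, key=efficiency.get)
print(best, efficiency[best])  # RTX 4060 0.8
```

Note the metric rewards cheap cards with just enough VRAM: the 5090 is the fastest card in absolute terms but the least efficient per pound.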

GPU Recommendations

  • INT4 budget: RTX 4060 — best tok/s per pound, 48 tok/s is fast enough for chat.
  • FP16 budget: RTX 4060 Ti — cheapest card that fits LLaMA 3 8B in FP16.
  • High-end value: RTX 5080 — 105 tok/s INT4 or 72 tok/s FP16 for responsive apps.
  • Maximum quality + speed: RTX 5090 — 98 tok/s at FP16, no quality compromise.

For larger LLaMA models, see the RTX 5090 LLaMA 3 70B INT4 guide. For batch size effects on throughput, check our batch size impact benchmark. Compare with other models in our best GPU for LLM inference guide or the tokens per second benchmark. Browse all results in the Benchmarks category.

Conclusion

Quantisation is the single most effective way to speed up LLM inference. INT4 delivers roughly 40-65% more tokens per second than FP16 while using about 70% less VRAM. For most applications, INT8 offers the best balance of quality and speed. Choose FP16 only when you need maximum accuracy and have the VRAM to spare, or when serving at high concurrency where quality degradation compounds.

Run LLaMA 3 on Dedicated GPU Servers

Deploy at any precision on bare-metal GPU hardware. UK hosting with full root access.

Browse GPU Servers
