Quantisation Benchmark Overview
LLaMA 3 8B is one of the most popular open LLMs for inference. Running it at different precision levels, from FP16 (full precision) down to INT8 (8-bit) and INT4 (4-bit), trades a little quality for speed and VRAM savings. Testing on a dedicated GPU server reveals exactly how much speed quantisation buys you.
We benchmarked LLaMA 3 8B Instruct on GigaGPU servers using vLLM (FP16, AWQ INT4) and llama.cpp (GGUF Q8_0, Q4_K_M) at batch size 1 with a 512-token prompt and 256-token generation. VRAM requirements range from ~5GB (INT4) to ~16GB (FP16). For detailed memory analysis, see our LLaMA 3 VRAM requirements guide.
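For reference, here is a minimal sketch of the batch-1 timing loop on the vLLM side (the GGUF runs use llama.cpp instead). The FP16 model ID is Meta's public checkpoint; the AWQ repo name is a hypothetical stand-in for whichever INT4 build you deploy.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical model IDs: the FP16 ID is Meta's public checkpoint; the AWQ
# repo is a stand-in for whichever INT4 build you use.
MODELS = {
    "FP16": "meta-llama/Meta-Llama-3-8B-Instruct",
    "INT4 (AWQ)": "casperhansen/llama-3-8b-instruct-awq",
}

params = SamplingParams(max_tokens=256, temperature=0.0)  # 256-token generation
prompt = " ".join(["benchmark"] * 512)  # stand-in for the 512-token prompt

def bench(model_id: str) -> float:
    llm = LLM(model=model_id)  # vLLM picks up AWQ quantisation from the repo config
    llm.generate([prompt], params)  # warm-up request, excluded from timing
    start = time.perf_counter()
    out = llm.generate([prompt], params)
    elapsed = time.perf_counter() - start
    return len(out[0].outputs[0].token_ids) / elapsed

for label, model_id in MODELS.items():
    # In practice, run each precision in a separate process so VRAM is fully freed.
    print(f"{label}: {bench(model_id):.1f} tok/s")
```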
Tokens/sec by Precision and GPU
| GPU | VRAM | FP16 (tok/s) | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|---|---|
| RTX 3050 | 6 GB | N/A (OOM) | N/A (OOM) | 28 |
| RTX 4060 | 8 GB | N/A (OOM) | 32 | 48 |
| RTX 4060 Ti | 16 GB | 35 | 45 | 58 |
| RTX 3090 | 24 GB | 48 | 60 | 72 |
| RTX 5080 | 16 GB | 72 | 88 | 105 |
| RTX 5090 | 32 GB | 98 | 118 | 138 |
INT4 quantisation delivers roughly 40-65% more tokens per second than FP16 on the same GPU. The speedup comes from reduced memory bandwidth demand: at batch size 1, decoding is bound by how fast the weights stream from VRAM, and 4-bit weights transfer twice as fast as 8-bit and four times as fast as 16-bit. The RTX 5080 at INT4 (105 tok/s) outperforms the RTX 3090 at FP16 (48 tok/s) by more than 2x.
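A quick roofline estimate makes the bandwidth argument concrete: at batch size 1, an upper bound on decode speed is memory bandwidth divided by weight size. The sketch below uses the RTX 3090's published 936 GB/s bandwidth and the weight footprints from the tables; measured numbers always sit below these ceilings.

```python
# Roofline sketch: at batch 1 every decoded token streams all weight bytes
# from VRAM, so tok/s is capped at (memory bandwidth) / (weight size).
# 936 GB/s is the RTX 3090's published bandwidth; weight sizes are from above.
BANDWIDTH_GB_S = 936

for label, weight_gb in [("FP16", 16.0), ("INT8", 8.5), ("INT4", 5.0)]:
    print(f"{label}: ceiling of {BANDWIDTH_GB_S / weight_gb:.0f} tok/s")

# FP16 ceiling ~59 tok/s vs 48 measured (about 80% of roofline). The INT4
# ceiling (~187 tok/s) is far above the measured 72, suggesting dequantisation
# and other overheads, not bandwidth, set the limit at 4-bit.
```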
Quality Impact of Quantisation
Quantisation reduces model quality. Below we compare perplexity (lower is better) and MMLU accuracy across precision levels.
| Precision | VRAM (weights) | Perplexity (WikiText) | MMLU Accuracy | Speed Gain vs FP16 |
|---|---|---|---|---|
| FP16 | ~16GB | 6.14 | 66.5% | Baseline |
| INT8 (Q8_0) | ~8.5GB | 6.18 | 66.2% | ~20-30% |
| INT4 (Q4_K_M) | ~5GB | 6.42 | 64.8% | ~40-65% |
INT8 quantisation is nearly lossless: perplexity rises by just 0.04 and MMLU drops by 0.3 points. INT4 shows more degradation but remains usable for most applications. For tasks that demand maximum accuracy (coding, reasoning), prefer FP16 or INT8; for chatbots and general Q&A, INT4 is often indistinguishable in practice.
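If you want to sanity-check the perplexity column yourself, it is straightforward to reproduce with Hugging Face transformers. The sketch below assumes Meta's public FP16 checkpoint and a simple non-overlapping 2048-token window; it is not the exact harness behind the table, and windowing choices shift absolute perplexity, so compare precisions under identical settings.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # swap in a quantised build to compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Concatenate the WikiText-2 test split into one token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to("cuda")

window = 2048  # evaluation context; window choice shifts absolute PPL
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for i in range(0, ids.size(1), window):
        chunk = ids[:, i : i + window]
        if chunk.size(1) < 2:
            break
        loss = model(chunk, labels=chunk).loss  # mean NLL; labels shift internally
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```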
Cost Efficiency Analysis
| GPU | INT4 tok/s | Approx. Monthly Cost | tok/s per £ of monthly cost |
|---|---|---|---|
| RTX 3050 | 28 | ~£45 | 0.62 |
| RTX 4060 | 48 | ~£60 | 0.80 |
| RTX 4060 Ti | 58 | ~£75 | 0.77 |
| RTX 3090 | 72 | ~£110 | 0.65 |
| RTX 5080 | 105 | ~£160 | 0.66 |
| RTX 5090 | 138 | ~£250 | 0.55 |
The RTX 4060 leads on cost efficiency for INT4 LLaMA 3 8B inference. For FP16 quality, the RTX 4060 Ti is the cheapest viable option since the 4060’s 8GB cannot fit FP16 weights.
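The same figures can be reframed as cost per million generated tokens, which is handy when comparing against per-token API pricing. A quick sketch, assuming round-the-clock utilisation (a best case that real workloads will not hit):

```python
# £ per million generated tokens at full utilisation, from the table above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59M seconds

gpus = {  # GPU: (INT4 tok/s, approx. monthly cost in £)
    "RTX 3050": (28, 45),
    "RTX 4060": (48, 60),
    "RTX 4060 Ti": (58, 75),
    "RTX 3090": (72, 110),
    "RTX 5080": (105, 160),
    "RTX 5090": (138, 250),
}

for gpu, (tok_s, cost) in gpus.items():
    mtok_per_month = tok_s * SECONDS_PER_MONTH / 1e6
    print(f"{gpu}: £{cost / mtok_per_month:.2f} per million tokens")

# The RTX 4060 again comes out cheapest, at roughly £0.48 per million tokens.
```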
GPU Recommendations
- INT4 budget: RTX 4060 — best tok/s per pound, 48 tok/s is fast enough for chat.
- FP16 budget: RTX 4060 Ti — cheapest card that fits LLaMA 3 8B in FP16.
- Speed without flagship pricing: RTX 5080 — 105 tok/s at INT4 or 72 tok/s at FP16 for responsive apps.
- Maximum quality + speed: RTX 5090 — 98 tok/s at FP16, no quality compromise.
For larger LLaMA models, see the RTX 5090 LLaMA 3 70B INT4 guide. For batch size effects on throughput, check our batch size impact benchmark. Compare with other models in our best GPU for LLM inference guide or the tokens per second benchmark. Browse all results in the Benchmarks category.
Conclusion
Quantisation is the single most effective way to speed up LLM inference. INT4 delivers roughly 40-65% more tokens per second than FP16 while using about 70% less VRAM. For most applications, INT8 offers the best balance of quality and speed. Choose FP16 only when you need maximum accuracy and have the VRAM to spare, or when serving at high concurrency, where small quality losses compound across many users.
Run LLaMA 3 on Dedicated GPU Servers
Deploy at any precision on bare-metal GPU hardware. UK hosting with full root access.
Browse GPU Servers