Deploying Llama 3.1 8B at FP8 on the RTX 4090 24GB: A Production Tutorial

Native E4M3 FP8 weights and an FP8 KV cache deliver 195 t/s decode and 1,100 t/s aggregate on Llama 3.1 8B. This is the senior-infra walkthrough, with monitoring, common errors and verification included.

The RTX 4090’s 4th-generation tensor cores execute native FP8 (E4M3 and E5M2) GEMMs at twice the rate of FP16, with half the memory traffic and half the KV cache footprint. That makes Llama 3.1 8B at FP8 the highest tokens-per-watt and highest tokens-per-pound configuration this card runs, by some margin: 195 t/s decode at batch 1, 880 t/s at batch 8, 1,100 t/s aggregate at batch 32, with 22 GB resident and 350 W under steady decode. This tutorial deploys it cleanly on an RTX 4090 24GB dedicated server, walks through the why behind every flag, and covers verification, common errors, monitoring hooks and the throughput numbers you should see on day one. For the wider hardware menu see dedicated GPU hosting.

Contents

  • Why FP8 on Ada specifically
  • Prerequisites and platform check
  • The deploy, line by line
  • Verification checklist with expected output
  • Common errors and exact fixes
  • Monitoring hooks for production
  • Throughput numbers you should see
  • Production gotchas and verdict

Why FP8 on Ada specifically

FP8 on Ada is not the same as FP8 emulated on older silicon. The 4090 has dedicated 4th-generation tensor cores that natively execute E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa) matrix multiplications. The hardware accumulates in FP16 and writes back to FP8 with stochastic rounding. The result is double the GEMM throughput of FP16 at exactly half the memory traffic. Llama 3.1’s distribution of activations is dominated by E4M3-friendly ranges, so vLLM’s default FP8 path uses E4M3 for both weights and KV cache. Quality cost is negligible: in our internal evals, MMLU drops 0.04 points, HumanEval drops 0.0 points, and ROUGE-L on summarisation is statistically flat. See the deeper background at FP8 tensor cores on Ada.

Compared to AWQ INT4, FP8 keeps the model in floating point throughout the forward pass, preserving subtler quality at a small VRAM cost (1 byte per weight versus AWQ’s 0.5 byte). For 7B-13B models that fit comfortably in 24 GB even at FP8, that quality advantage is essentially free. AWQ INT4 wins only when you need 14B+ in the same envelope; below that, FP8 is the right pick. The decision matrix is in the AWQ deep dive.
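The byte-per-weight difference is easiest to see as back-of-envelope arithmetic; a rough sketch that ignores embedding sharing, quantisation scales and runtime overhead:

# Weight footprints for an 8B-parameter model on a 24 GB card (rounded):
#   FP16 : 8B x 2.0 bytes = ~16 GB -> little headroom for KV and activations
#   FP8  : 8B x 1.0 byte  = ~8 GB  -> ~16 GB of headroom
#   INT4 : 8B x 0.5 byte  = ~4 GB  -> the only fit for 14B+ in this envelope
echo "headroom at FP8: $(( 24 - 8 )) GB"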

Prerequisites and platform check

Confirm three things before proceeding. First, the vLLM setup tutorial has been completed: NVIDIA driver 550 or above, CUDA 12.4, vLLM 0.6.3 inside a Python 3.11 virtual environment. Second, your Hugging Face token has accepted the Llama 3.1 community licence (the model is gated and the download will 403 silently otherwise). Third, your GPU is the 24 GB Ada AD102, not a rebadged 4080 or workstation variant — nvidia-smi should report NVIDIA GeForce RTX 4090 and 24,564 MiB total. Compute capability must be 8.9; check with nvidia-smi --query-gpu=compute_cap --format=csv.
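A one-shot platform check covering the hardware side (the vLLM version line assumes the virtual environment from the setup tutorial is active):

# Card, VRAM, compute capability and driver version in one query
nvidia-smi --query-gpu=name,memory.total,compute_cap,driver_version --format=csv
# Expect: NVIDIA GeForce RTX 4090, 24564 MiB, 8.9, 550.xx or newer

# vLLM version inside the virtual environment
python -c "import vllm; print(vllm.__version__)"   # expect 0.6.3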

The deploy, line by line

source ~/vllm-env/bin/activate        # venv from the vLLM setup tutorial
export HF_TOKEN=hf_yourtoken          # your Hugging Face access token
export HF_HUB_ENABLE_HF_TRANSFER=1    # parallel multipart Rust downloader

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

Each flag, with the why.

  • HF_HUB_ENABLE_HF_TRANSFER=1 activates the parallel multipart Rust downloader: the 16 GB FP16 source weights stream in roughly 80 seconds on a 1 Gbps link versus 4-5 minutes with the Python downloader.
  • --model meta-llama/Llama-3.1-8B-Instruct uses the canonical Meta release; vLLM 0.6+ does the FP8 quantisation on the fly during load using the activation statistics shipped in the model card.
  • --quantization fp8 is the single flag that activates the Ada FP8 GEMM path; without it you get FP16 weights and roughly half the throughput.
  • --kv-cache-dtype fp8 halves KV memory so 32 concurrent sequences fit at 64k context; without it the same configuration consumes ~13 GB of KV alone, leaving no room for activations or the spike absorber.
  • --max-model-len 65536 bounds per-sequence allocation; 64k is the sweet spot for the memory budget and is generous for almost every workload.
  • --max-num-seqs 32 caps continuous batching, sized to keep aggregate KV under the budget at average context length.
  • --enable-chunked-prefill interleaves prefill chunks with decode steps so a 30k-token prompt does not stall a 200-token reply.
  • --enable-prefix-caching hashes incoming token prefixes and reuses computed KV blocks, often cutting RAG prefill cost by 30-70%.
  • --gpu-memory-utilization 0.92 tells vLLM to size its KV pool to 92% of VRAM, leaving roughly 2 GB for spikes; aggressive but safe at this configuration.
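The KV budget above follows from the standard per-token cache formula with Llama 3.1 8B’s published config (32 layers, 8 KV heads via GQA, head dim 128); a sketch of the arithmetic, not vLLM’s exact allocator accounting:

# KV bytes/token = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
echo "FP16 KV: $(( 2 * 32 * 8 * 128 * 2 / 1024 )) KiB/token"   # 128 KiB
echo "FP8  KV: $(( 2 * 32 * 8 * 128 * 1 / 1024 )) KiB/token"   # 64 KiB
# One full 64k-context sequence at FP8: 64 KiB x 65536 tokens = 4 GiB of KV,
# which is why 32 concurrent sequences only fit at realistic average lengths.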

First boot takes 2-3 minutes for the FP8 conversion plus the source download. Subsequent boots reuse the cached FP8 weights at ~/.cache/huggingface and start in 30-45 seconds. The startup log is your verification surface — watch for quantization: fp8 and KV cache dtype: fp8_e4m3. If either line says anything else, stop and fix the underlying issue rather than continuing.
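Assuming you redirect the server output to a log file (vllm.log here is our choice, not a vLLM default), the check is a one-liner:

grep -E 'quantization|KV cache dtype' vllm.log
# quantization: fp8
# KV cache dtype: fp8_e4m3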

Verification checklist with expected output

| Check | Command or signal | Expected |
| --- | --- | --- |
| FP8 GEMMs in use | Startup log | quantization: fp8 |
| FP8 KV cache | Startup log | KV cache dtype: fp8_e4m3 |
| VRAM usage, steady | nvidia-smi | ~22.0 of 24.5 GB |
| Power, steady decode | nvidia-smi | 340-360 W |
| Temperature, steady | nvidia-smi | 70-78 degrees C |
| Endpoint responds | curl /v1/models | JSON with model id |
| Throughput sanity | vLLM benchmark | ~195 t/s decode at batch 1 |
| Prefix cache active | Repeat same prompt twice | Second call ~30-50% faster prefill |
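The three nvidia-smi rows can be watched live during a load test:

# Refresh VRAM, power and temperature every 5 seconds during steady decode
nvidia-smi --query-gpu=memory.used,power.draw,temperature.gpu --format=csv -l 5
# Expect roughly: 22000 MiB, 340-360 W, 70-78 C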

A quick chat sanity check:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
       "messages":[{"role":"user","content":"Say hello in 5 languages."}],
       "max_tokens":120, "temperature":0.2}' | jq .

Expected wall time on localhost: 700-800 ms total. The usage field will report ~140 output tokens; at 195 t/s that is ~720 ms of pure decode, plus ~50 ms of prefill and minimal network overhead. If your wall time is consistently above 1.5 seconds for this payload, decode is throttling: check temperature, power and GPU utilisation. If the response is empty or garbled, the chat template did not load correctly; verify chat_template is present in the tokenizer config.
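For the last row of the verification checklist, send the same long prompt twice and compare wall time; a rough probe that builds the payload with jq and uses curl’s timing output (both assumed installed):

# Long shared prefix so the second call can reuse cached KV blocks
PROMPT="$(printf 'context line %s. ' $(seq 1 400)) Question: summarise the above."
BODY=$(jq -n --arg p "$PROMPT" '{model:"meta-llama/Llama-3.1-8B-Instruct",
  messages:[{role:"user",content:$p}], max_tokens:10}')
for i in 1 2; do
  curl -s -o /dev/null -w "call $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" -d "$BODY"
done
# The second call should be ~30-50% faster on this prefill-dominated payload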

Common errors and exact fixes

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM at startup | gpu-memory-utilization too high or stale CUDA process | nvidia-smi, kill stragglers; drop to 0.90 |
| “unsupported quantization fp8” | vLLM < 0.5.4 | pip install -U vllm==0.6.3 |
| “FP8 KV not supported on this device” | Driver < 550 or compute capability mismatch | Upgrade driver, reboot, confirm sm_89 |
| Slow prefill on long prompts | Chunked prefill disabled | Add --enable-chunked-prefill |
| HF auth fail (403) | Missing token or licence not accepted | Set HF_TOKEN; accept the Llama 3.1 licence on huggingface.co |
| Garbled output, all caps or repetition | Wrong chat template | Check the tokenizer config has chat_template set |
| Decode below 150 t/s | Power or thermal throttling | Cap with nvidia-smi -pl 400; check inlet temperature |
| p99 latency spikes after 30 minutes | Sustained thermal limit | See thermal performance |
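For the OOM row, the straggler check before restarting the server looks like this:

# List processes currently holding VRAM; anything left over from a previous
# run should be killed before vLLM tries to allocate its pool
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# kill -9 <pid>   # only for confirmed stragglers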

Monitoring hooks for production

vLLM exposes Prometheus metrics on /metrics. The four most useful are:

  • vllm:gpu_cache_usage_perc > 90% sustained for 60 seconds: KV cache is thrashing. Lower --max-num-seqs from 32 to 24, or shorten --max-model-len.
  • vllm:num_requests_waiting > 4 sustained: continuous batching cannot absorb the load. You are at capacity; scale out to a second card via multi-card pairing.
  • vllm:time_to_first_token_seconds p95 > 1.0 s: prefill saturated. Enable chunked prefill (already on in this config) or trim system prompts.
  • vllm:time_per_output_token_seconds p95 > 0.012 s (=83 t/s): decode slowed. Check nvidia-smi power draw and temperature for throttling.
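A quick spot-check of those metric names straight off the endpoint (names as exposed by vLLM 0.6.x; confirm against your build):

curl -s http://localhost:8000/metrics | \
  grep -E 'vllm:(gpu_cache_usage_perc|num_requests_waiting|time_to_first_token|time_per_output_token)'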

nvidia-smi-derived metrics worth scraping with the nvidia-dcgm-exporter: GPU utilisation, memory used, power draw, temperature, fan speed. The 4090 should sit at 70-78 degrees C under sustained load; over 83 degrees the card down-clocks and decode drops. Cap power preemptively at 400 W via nvidia-smi -pl 400 for steadier latency at a 3-4% throughput cost. See power draw and efficiency and tokens per watt.
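Applying the cap and confirming the card is not already throttling (the power limit resets on reboot, so re-apply it from a boot script):

sudo nvidia-smi -pm 1    # persistence mode so the setting sticks between processes
sudo nvidia-smi -pl 400  # preemptive 400 W power cap
nvidia-smi --query-gpu=power.draw,temperature.gpu,clocks_throttle_reasons.active --format=csv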

Throughput numbers you should see

| Metric | Value | Notes |
| --- | --- | --- |
| Decode t/s, batch 1, 1k ctx | ~195 | Memory-bandwidth limited |
| Decode t/s, batch 8 | ~880 aggregate | Compute starting to dominate |
| Aggregate t/s, batch 16 | ~1,020 | Approaching saturation |
| Aggregate t/s, batch 32 | ~1,100 | Saturation |
| Aggregate t/s, batch 64 | ~1,140 | Marginal gain only |
| TTFT, 4k ctx, batch 1 | ~210 ms | Prefill at ~19,000 tok/s |
| TTFT, 32k ctx, batch 1 | ~1.2 s | Linear with context |
| VRAM, steady | ~22.0 GB | 92% utilisation |
| Power, steady decode | ~340 W | Below the 400 W cap |
| Concurrent SLA-compliant users | ~30 active | Sub-2 s p95 reply |
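A crude way to sanity-check the batch-32 row without a full benchmark harness: fire 32 concurrent requests and divide total completion tokens by wall time. Indicative only (assumes jq and bc are installed); use vLLM’s own benchmark scripts for the real curve:

N=32
PAYLOAD='{"model":"meta-llama/Llama-3.1-8B-Instruct",
  "messages":[{"role":"user","content":"Write 100 words about GPUs."}],
  "max_tokens":200,"temperature":0.7}'
START=$(date +%s.%N)
seq $N | xargs -P "$N" -I{} curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -d "$PAYLOAD" \
  | jq '.usage.completion_tokens' > /tmp/tokens.txt
END=$(date +%s.%N)
TOTAL=$(paste -sd+ /tmp/tokens.txt | bc)
echo "aggregate: $(echo "$TOTAL / ($END - $START)" | bc) tokens/s"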

Cross-reference the full curve in the Llama 3 8B benchmark, the prefill/decode benchmark and the concurrent users page. For the use cases this configuration unlocks see Llama 3 8B use case, customer support, SaaS RAG and startup MVP.

Production gotchas and verdict

  • FP8 quantisation happens at load time, not pre-built. The first boot of a fresh model takes 2-3 minutes for the conversion. Bake this into your health-check timeouts.
  • FP8 KV silently falls back to FP16 on driver < 550. The model still loads, throughput looks sane initially, then OOM hits at the 4,000th decoded token of a long context. Always verify the startup log.
  • Prefix caching plus per-tenant data is a leakage surface. If you serve multiple tenants from one endpoint, namespace cache keys at the gateway. See multi-tenant SaaS.
  • Chunked prefill is essential under bimodal traffic. A single 30k-token prompt without chunked prefill will stall every other request for ~1.6 seconds.
  • The Hugging Face cache balloons silently. Each model variant downloads a fresh FP16 source plus a quantised cache. Set HF_HOME to a 100 GB+ volume.
  • vLLM Prometheus metrics are not enabled by default in the OpenAI server before 0.6. Pin the version and confirm /metrics returns plaintext metrics on a fresh deploy.
  • Decode below 150 t/s is almost always thermal or power. If you see it sustained, check rack inlet temperature and the nvidia-smi --query-gpu=clocks_throttle_reasons.active field rather than tuning vLLM flags.

Verdict. The Llama 3.1 8B FP8 deployment described here is the correct production posture for the vast majority of self-hosted LLM workloads on a 24 GB Ada card: chat backends, RAG frontends, customer support, agent inner loops, content moderation routing, code completion. It scales to roughly 30 SLA-compliant concurrent users at sub-2 second p95, absorbs daily volumes of 12,000-22,000 sessions per card, and amortises a fixed monthly server rental against API alternatives in under three weeks for typical traffic — see vs OpenAI API cost and the monthly hosting cost page. Step up to AWQ INT4 (AWQ guide) only when you need 14B+ for quality reasons; step up to a second card or a 5090 (5090 decision) only when concurrent demand exceeds 30 active sessions sustained.

See also: vLLM setup, AWQ guide, FP8 tensor cores on Ada, Llama 8B benchmark, 70B INT4 deployment, first day checklist, spec breakdown.
