Deploying Llama 3.1 8B at FP8 on the RTX 4090 24GB: A Production Tutorial

Native E4M3 FP8 weights and an FP8 KV cache deliver 195 t/s decode and 1,100 t/s aggregate on Llama 3.1 8B. This is the senior-infra walkthrough, with monitoring, common errors and verification included.

The RTX 4090’s 4th-generation tensor cores execute native FP8 (E4M3 and E5M2) GEMMs at twice the rate of FP16, with half the memory traffic and half the KV cache footprint. That makes Llama 3.1 8B at FP8 the highest tokens-per-watt and highest tokens-per-pound configuration this card runs, by some margin: 195 t/s decode at batch 1, 880 t/s at batch 8, 1,100 t/s aggregate at batch 32, with 22 GB resident and 350 W under steady decode. This tutorial deploys it cleanly on an RTX 4090 24GB dedicated server, walks through the why behind every flag, and covers verification, common errors, monitoring hooks and the throughput numbers you should see on day one. For the wider hardware menu see dedicated GPU hosting.

Contents

  • Why FP8 on Ada specifically
  • Prerequisites and platform check
  • The deploy, line by line
  • Verification checklist with expected output
  • Common errors and exact fixes
  • Monitoring hooks for production
  • Throughput numbers you should see
  • Production gotchas and verdict

Why FP8 on Ada specifically

FP8 on Ada is not the same as FP8 emulated on older silicon. The 4090 has dedicated 4th-generation tensor cores that natively execute E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa) matrix multiplications. The hardware accumulates in FP16 and writes back to FP8 with stochastic rounding. The result is double the GEMM throughput of FP16 at exactly half the memory traffic. Llama 3.1’s distribution of activations is dominated by E4M3-friendly ranges, so vLLM’s default FP8 path uses E4M3 for both weights and KV cache. Quality cost is negligible: in our internal evals, MMLU drops 0.04 points, HumanEval drops 0.0 points, and ROUGE-L on summarisation is statistically flat. See the deeper background at FP8 tensor cores on Ada.

Compared to AWQ INT4, FP8 keeps the model in floating point throughout the forward pass, preserving subtler quality at a small VRAM cost (1 byte per weight versus AWQ’s 0.5 byte). For 7B-13B models that fit comfortably in 24 GB even at FP8, that quality advantage is essentially free. AWQ INT4 wins only when you need 14B+ in the same envelope; below that, FP8 is the right pick. The decision matrix is in the AWQ deep dive.
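The byte-per-weight difference is easiest to see as back-of-envelope arithmetic; a rough sketch that ignores embedding sharing, quantisation scales and runtime overhead:

# Weight footprints for an 8B-parameter model on a 24 GB card (rounded):
#   FP16 : 8B x 2.0 bytes = ~16 GB -> little headroom for KV and activations
#   FP8  : 8B x 1.0 byte  = ~8 GB  -> ~16 GB of headroom
#   INT4 : 8B x 0.5 byte  = ~4 GB  -> the only fit for 14B+ in this envelope
echo "headroom at FP8: $(( 24 - 8 )) GB"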

Prerequisites and platform check

Confirm three things before proceeding. First, the vLLM setup tutorial has been completed: NVIDIA driver 550 or above, CUDA 12.4, vLLM 0.6.3 inside a Python 3.11 virtual environment. Second, your Hugging Face token has accepted the Llama 3.1 community licence (the model is gated and the download will 403 silently otherwise). Third, your GPU is the 24 GB Ada AD102, not a rebadged 4080 or workstation variant — nvidia-smi should report NVIDIA GeForce RTX 4090 and 24,564 MiB total. Compute capability must be 8.9; check with nvidia-smi --query-gpu=compute_cap --format=csv.
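A one-shot platform check covering the hardware side (the vLLM version line assumes the virtual environment from the setup tutorial is active):

# Card, VRAM, compute capability and driver version in one query
nvidia-smi --query-gpu=name,memory.total,compute_cap,driver_version --format=csv
# Expect: NVIDIA GeForce RTX 4090, 24564 MiB, 8.9, 550.xx or newer

# vLLM version inside the virtual environment
python -c "import vllm; print(vllm.__version__)"   # expect 0.6.3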

The deploy, line by line

source ~/vllm-env/bin/activate        # venv from the vLLM setup tutorial
export HF_TOKEN=hf_yourtoken          # your Hugging Face access token
export HF_HUB_ENABLE_HF_TRANSFER=1    # parallel multipart Rust downloader

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

Each flag, with the why.

  • HF_HUB_ENABLE_HF_TRANSFER=1 activates the parallel multipart Rust downloader: the 16 GB FP16 source weights stream in roughly 80 seconds on a 1 Gbps link versus 4-5 minutes with the Python downloader.
  • --model meta-llama/Llama-3.1-8B-Instruct uses the canonical Meta release; vLLM 0.6+ does the FP8 quantisation on the fly during load using the activation statistics shipped in the model card.
  • --quantization fp8 is the single flag that activates the Ada FP8 GEMM path; without it you get FP16 weights and roughly half the throughput.
  • --kv-cache-dtype fp8 halves KV memory so 32 concurrent sequences fit at 64k context; without it the same configuration consumes ~13 GB of KV alone, leaving no room for activations or the spike absorber.
  • --max-model-len 65536 bounds per-sequence allocation; 64k is the sweet spot for the memory budget and is generous for almost every workload.
  • --max-num-seqs 32 caps continuous batching, sized to keep aggregate KV under the budget at average context length.
  • --enable-chunked-prefill interleaves prefill chunks with decode steps so a 30k-token prompt does not stall a 200-token reply.
  • --enable-prefix-caching hashes incoming token prefixes and reuses computed KV blocks, often cutting RAG prefill cost by 30-70%.
  • --gpu-memory-utilization 0.92 tells vLLM to size its KV pool to 92% of VRAM, leaving roughly 2 GB for spikes; aggressive but safe at this configuration.
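The KV budget above follows from the standard per-token cache formula with Llama 3.1 8B’s published config (32 layers, 8 KV heads via GQA, head dim 128); a sketch of the arithmetic, not vLLM’s exact allocator accounting:

# KV bytes/token = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
echo "FP16 KV: $(( 2 * 32 * 8 * 128 * 2 / 1024 )) KiB/token"   # 128 KiB
echo "FP8  KV: $(( 2 * 32 * 8 * 128 * 1 / 1024 )) KiB/token"   # 64 KiB
# One full 64k-context sequence at FP8: 64 KiB x 65536 tokens = 4 GiB of KV,
# which is why 32 concurrent sequences only fit at realistic average lengths.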

First boot takes 2-3 minutes for the FP8 conversion plus the source download. Subsequent boots reuse the cached FP8 weights at ~/.cache/huggingface and start in 30-45 seconds. The startup log is your verification surface — watch for quantization: fp8 and KV cache dtype: fp8_e4m3. If either line says anything else, stop and fix the underlying issue rather than continuing.
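Assuming you redirect the server output to a log file (vllm.log here is our choice, not a vLLM default), the check is a one-liner:

grep -E 'quantization|KV cache dtype' vllm.log
# quantization: fp8
# KV cache dtype: fp8_e4m3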

Verification checklist with expected output

| Check | Command or signal | Expected |
| --- | --- | --- |
| FP8 GEMMs in use | Startup log | quantization: fp8 |
| FP8 KV cache | Startup log | KV cache dtype: fp8_e4m3 |
| VRAM usage, steady | nvidia-smi | ~22.0 of 24.5 GB |
| Power, steady decode | nvidia-smi | 340-360 W |
| Temperature, steady | nvidia-smi | 70-78 degrees C |
| Endpoint responds | curl /v1/models | JSON with model id |
| Throughput sanity | vLLM benchmark | ~195 t/s decode at batch 1 |
| Prefix cache active | Repeat same prompt twice | Second call ~30-50% faster prefill |
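The three nvidia-smi rows can be watched live during a load test:

# Refresh VRAM, power and temperature every 5 seconds during steady decode
nvidia-smi --query-gpu=memory.used,power.draw,temperature.gpu --format=csv -l 5
# Expect roughly: 22000 MiB, 340-360 W, 70-78 C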

A quick chat sanity check:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
       "messages":[{"role":"user","content":"Say hello in 5 languages."}],
       "max_tokens":120, "temperature":0.2}' | jq .

Expected wall time on localhost: 700-800 ms total. The usage field will report ~140 output tokens; at 195 t/s that is ~720 ms of pure decode, plus ~50 ms of prefill and minimal network overhead. If your wall time is consistently above 1.5 seconds for this payload, decode is throttling: check temperature, power and GPU utilisation. If the response is empty or garbled, the chat template did not load correctly; verify chat_template is present in the tokenizer config.
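For the last row of the verification checklist, send the same long prompt twice and compare wall time; a rough probe that builds the payload with jq and uses curl’s timing output (both assumed installed):

# Long shared prefix so the second call can reuse cached KV blocks
PROMPT="$(printf 'context line %s. ' $(seq 1 400)) Question: summarise the above."
BODY=$(jq -n --arg p "$PROMPT" '{model:"meta-llama/Llama-3.1-8B-Instruct",
  messages:[{role:"user",content:$p}], max_tokens:10}')
for i in 1 2; do
  curl -s -o /dev/null -w "call $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" -d "$BODY"
done
# The second call should be ~30-50% faster on this prefill-dominated payload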

Common errors and exact fixes

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM at startup | gpu-memory-utilization too high or stale CUDA process | nvidia-smi, kill stragglers; drop to 0.90 |
| “unsupported quantization fp8” | vLLM < 0.5.4 | pip install -U vllm==0.6.3 |
| “FP8 KV not supported on this device” | Driver < 550 or compute capability mismatch | Upgrade driver, reboot, confirm sm_89 |
| Slow prefill on long prompts | Chunked prefill disabled | Add --enable-chunked-prefill |
| HF auth fail (403) | Missing token or licence not accepted | Set HF_TOKEN; accept the Llama 3.1 licence on huggingface.co |
| Garbled output, all caps or repetition | Wrong chat template | Check the tokenizer config has chat_template set |
| Decode below 150 t/s | Power or thermal throttling | Cap with nvidia-smi -pl 400; check inlet temperature |
| p99 latency spikes after 30 minutes | Sustained thermal limit | See thermal performance |
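For the OOM row, the straggler check before restarting the server looks like this:

# List processes currently holding VRAM; anything left over from a previous
# run should be killed before vLLM tries to allocate its pool
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# kill -9 <pid>   # only for confirmed stragglers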

Monitoring hooks for production

vLLM exposes Prometheus metrics on /metrics. The four most useful are:

  • vllm:gpu_cache_usage_perc > 90% sustained for 60 seconds: KV cache is thrashing. Lower --max-num-seqs from 32 to 24, or shorten --max-model-len.
  • vllm:num_requests_waiting > 4 sustained: continuous batching cannot absorb the load. You are at capacity; scale out to a second card via multi-card pairing.
  • vllm:time_to_first_token_seconds p95 > 1.0 s: prefill saturated. Enable chunked prefill (already on in this config) or trim system prompts.
  • vllm:time_per_output_token_seconds p95 > 0.012 s (=83 t/s): decode slowed. Check nvidia-smi power draw and temperature for throttling.
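A quick spot-check of those metric names straight off the endpoint (names as exposed by vLLM 0.6.x; confirm against your build):

curl -s http://localhost:8000/metrics | \
  grep -E 'vllm:(gpu_cache_usage_perc|num_requests_waiting|time_to_first_token|time_per_output_token)'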

nvidia-smi-derived metrics worth scraping with the nvidia-dcgm-exporter: GPU utilisation, memory used, power draw, temperature, fan speed. The 4090 should sit at 70-78 degrees C under sustained load; over 83 degrees the card down-clocks and decode drops. Cap power preemptively at 400 W via nvidia-smi -pl 400 for steadier latency at a 3-4% throughput cost. See power draw and efficiency and tokens per watt.
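Applying the cap and confirming the card is not already throttling (the power limit resets on reboot, so re-apply it from a boot script):

sudo nvidia-smi -pm 1    # persistence mode so the setting sticks between processes
sudo nvidia-smi -pl 400  # preemptive 400 W power cap
nvidia-smi --query-gpu=power.draw,temperature.gpu,clocks_throttle_reasons.active --format=csv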

Throughput numbers you should see

| Metric | Value | Notes |
| --- | --- | --- |
| Decode t/s, batch 1, 1k ctx | ~195 | Memory-bandwidth limited |
| Decode t/s, batch 8 | ~880 aggregate | Compute starting to dominate |
| Aggregate t/s, batch 16 | ~1,020 | Approaching saturation |
| Aggregate t/s, batch 32 | ~1,100 | Saturation |
| Aggregate t/s, batch 64 | ~1,140 | Marginal gain only |
| TTFT, 4k ctx, batch 1 | ~210 ms | Prefill at ~19,000 tok/s |
| TTFT, 32k ctx, batch 1 | ~1.2 s | Linear with context |
| VRAM, steady | ~22.0 GB | 92% utilisation |
| Power, steady decode | ~340 W | Below the 400 W cap |
| Concurrent SLA-compliant users | ~30 active | Sub-2 s p95 reply |
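A crude way to sanity-check the batch-32 row without a full benchmark harness: fire 32 concurrent requests and divide total completion tokens by wall time. Indicative only (assumes jq and bc are installed); use vLLM’s own benchmark scripts for the real curve:

N=32
PAYLOAD='{"model":"meta-llama/Llama-3.1-8B-Instruct",
  "messages":[{"role":"user","content":"Write 100 words about GPUs."}],
  "max_tokens":200,"temperature":0.7}'
START=$(date +%s.%N)
seq $N | xargs -P "$N" -I{} curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -d "$PAYLOAD" \
  | jq '.usage.completion_tokens' > /tmp/tokens.txt
END=$(date +%s.%N)
TOTAL=$(paste -sd+ /tmp/tokens.txt | bc)
echo "aggregate: $(echo "$TOTAL / ($END - $START)" | bc) tokens/s"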

Cross-reference the full curve in the Llama 3 8B benchmark, the prefill/decode benchmark and the concurrent users page. For the use cases this configuration unlocks see Llama 3 8B use case, customer support, SaaS RAG and startup MVP.

Production gotchas and verdict

  • FP8 quantisation happens at load time, not pre-built. The first boot of a fresh model takes 2-3 minutes for the conversion. Bake this into your health-check timeouts.
  • FP8 KV silently falls back to FP16 on driver < 550. The model still loads, throughput looks sane initially, then OOM hits at the 4,000th decoded token of a long context. Always verify the startup log.
  • Prefix caching plus per-tenant data is a leakage surface. If you serve multiple tenants from one endpoint, namespace cache keys at the gateway. See multi-tenant SaaS.
  • Chunked prefill is essential under bimodal traffic. A single 30k-token prompt without chunked prefill will stall every other request for ~1.6 seconds.
  • The Hugging Face cache balloons silently. Each model variant downloads a fresh FP16 source plus a quantised cache. Set HF_HOME to a 100 GB+ volume.
  • vLLM Prometheus metrics are not enabled by default in the OpenAI server before 0.6. Pin the version and confirm /metrics returns plaintext metrics on a fresh deploy.
  • Decode below 150 t/s is almost always thermal or power. If you see it sustained, check rack inlet temperature and the nvidia-smi --query-gpu=clocks_throttle_reasons.active field rather than tuning vLLM flags.

Verdict. The Llama 3.1 8B FP8 deployment described here is the correct production posture for the vast majority of self-hosted LLM workloads on a 24 GB Ada card: chat backends, RAG frontends, customer support, agent inner loops, content moderation routing, code completion. It scales to roughly 30 SLA-compliant concurrent users at sub-2 second p95, absorbs daily volumes of 12,000-22,000 sessions per card, and amortises a fixed monthly server rental against API alternatives in under three weeks for typical traffic — see vs OpenAI API cost and the monthly hosting cost page. Step up to AWQ INT4 (AWQ guide) only when you need 14B+ for quality reasons; step up to a second card or a 5090 (5090 decision) only when concurrent demand exceeds 30 active sessions sustained.

See also: vLLM setup, AWQ guide, FP8 tensor cores on Ada, Llama 8B benchmark, 70B INT4 deployment, first day checklist, spec breakdown.
