Llama 3.3 70B Instruct is Meta’s refresh of the 70B line, bringing performance close to the original Llama 3.1 405B on many benchmarks. On an RTX 6000 Pro 96GB from our dedicated GPU hosting, it is the flagship single-card deployment in 2026.
Memory Fit
| Precision | Weights | KV cache at 16k ctx | Total |
|---|---|---|---|
| FP16 | ~140 GB | – | Does not fit |
| FP8 | ~70 GB | ~20 GB for 16 concurrent | ~90 GB |
| AWQ INT4 | ~40 GB | ~40 GB for 32 concurrent | ~80 GB |
FP8 is the sweet spot on the 6000 Pro: it uses Blackwell’s FP8 tensor cores natively and leaves 20+ GB free for KV cache. INT4 packs in more concurrent sequences at a slight quality cost.
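As a rough sanity check on the KV cache column, the sketch below estimates KV memory from the commonly cited Llama 3 70B shape (80 layers, 8 KV heads via GQA, head dim 128 – these constants are assumptions, not taken from the table). It computes the worst case where every sequence holds a full 16k window, so treat the output as an upper bound; real deployments with shorter average sequences land well below it.

```python
# Rough KV-cache sizing for Llama 3.3 70B.
# Assumed model shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CTX, CONCURRENT = 16_384, 16

for label, bytes_per_elem in (("fp16 KV cache", 2), ("fp8 KV cache", 1)):
    # Each token stores K and V per layer: 2 * kv_heads * head_dim elements.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem
    per_seq_gib = per_token * CTX / 1024**3
    print(f"{label}: {per_token / 1024:.0f} KiB/token, "
          f"{per_seq_gib:.1f} GiB per full 16k sequence, "
          f"{per_seq_gib * CONCURRENT:.0f} GiB for {CONCURRENT} full sequences")
```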
Launch
```bash
python -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Llama-3.3-70B-Instruct-FP8 \
  --quantization fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --enable-prefix-caching \
  --max-num-seqs 24
```
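Once the server is up it exposes an OpenAI-compatible API, so any OpenAI client works against it. A minimal sketch with the official Python client, assuming vLLM’s default port 8000 and the model name from the launch command:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is ignored but must be non-empty.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.3-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in two sentences."}],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)
```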
Concurrency
| Concurrent requests | Aggregate tokens/s | Per-request tokens/s |
|---|---|---|
| 1 | ~38 | ~38 |
| 8 | ~220 | ~27 |
| 16 | ~360 | ~22 |
| 24 | ~450 | ~19 |
Per-request throughput declines as concurrency rises, but aggregate throughput keeps climbing until the KV cache saturates.
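To see where your own workload lands on this curve, a quick probe can fire batches of identical requests at the endpoint and compare aggregate versus per-request throughput. A rough sketch using the async OpenAI client, assuming the server from the Launch section on localhost:8000 (this is not the harness behind the table above):

```python
import asyncio
import time

from openai import AsyncOpenAI

# Assumes the FP8 server from the Launch section is listening on localhost:8000.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "neuralmagic/Llama-3.3-70B-Instruct-FP8"


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in about 200 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def probe(concurrency: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    print(f"{concurrency:>2} concurrent: {total / elapsed:6.1f} tok/s aggregate, "
          f"{total / elapsed / concurrency:5.1f} tok/s per request")


async def main() -> None:
    for c in (1, 8, 16, 24):
        await probe(c)


asyncio.run(main())
```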
Alternatives
For lower cost, serve AWQ INT4 instead of FP8: marginal quality loss, the same throughput. For higher quality at higher cost, two 6000 Pros can serve FP16. For budget 70B hosting without the 6000 Pro, see dual 5090 Llama 70B.
Llama 3.3 70B on a Single Card
RTX 6000 Pro UK dedicated hosting with FP8 preconfigured.
Browse GPU Servers
Compare against Qwen 2.5 72B, a close competitor at a similar size.