A single RTX 4090 24GB cannot run Llama 3.1 70B in FP8 – the model needs ~38GB. Two 4090s with vLLM tensor parallelism can. This article walks through the pairing in detail: how the model splits across both cards, what throughput you actually get, why scaling caps near 1.6-1.7x rather than 2x, the all-reduce overhead of running without NVLink, the vLLM launch commands, the heterogeneous-pair anti-pattern, and when 2x 4090 is the right answer versus a single 5090, 6000 Pro, or H100. Both cards are available via the UK dedicated GPU range.
Contents
- Why pair 4090s
- Tensor parallel for 70B FP8
- Pipeline parallel as an alternative
- The 1.6x scaling cap explained
- vLLM launch commands
- Heterogeneous cards: do not tensor-parallel
- Cost vs alternatives
- Production gotchas
- Verdict and decision criteria
Why pair 4090s
Three legitimate reasons to put a second 4090 in the chassis:
- Run Llama 70B FP8 quality on consumer-derived hardware. Single 4090 OOMs. Two cards split ~35GB of weights cleanly across TP=2.
- Double aggregate throughput on small-model workloads where there is enough independent traffic to keep both cards busy (replica mode, near-linear scaling).
- Cheaper than one 6000 Pro for ~equivalent VRAM – 2×4090 gives 48GB at ~£1,150/mo versus 96GB 6000 Pro at ~£2,200/mo for similar 70B FP8 throughput.
If none of these apply – your model fits on one card and you do not need replica throughput – skip the pairing entirely and either stay on a single 4090 (covered in the FP8 deployment guide) or upgrade to a 5090 (see the 5090 upgrade post).
Tensor parallel for 70B FP8
vLLM with --tensor-parallel-size 2 shards each transformer layer’s weight matrices across two devices, Megatron-style: the first projection of each pair is split column-wise, the second row-wise. Each card holds half the attention heads and half the MLP intermediate dimension. After the attention sub-block and again after the MLP, an all-reduce collective sums the partial activations across both cards – two collectives per layer. For Llama 70B FP8 the per-card footprint becomes:
| Component | Per-card VRAM | Notes |
|---|---|---|
| Model weights (FP8 split) | ~17.5GB | 35GB of weights / 2 cards |
| KV cache (FP16, 4k context, batch 4) | ~3.5GB | FP8 KV halves this |
| Activation buffers | ~1.5GB | Hidden states, residual connections |
| NCCL communication buffers | ~0.8GB | All-reduce staging |
| CUDA reserved + framework | ~1GB | cuBLAS, cuDNN workspace |
| Per-card total | ~24.3GB | At the absolute edge |
That sum is already past the 24GB limit. Practical deployments use --kv-cache-dtype fp8 to halve KV memory (dropping the per-card total to ~22.5GB), cap context with --max-model-len 16384, and limit concurrency with --max-num-seqs 4 to keep headroom. The 70B INT4 deployment guide uses AWQ instead and avoids the squeeze entirely.
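The same budget arithmetic is easy to script before committing to a deployment. A minimal sketch using the figures from the table above – the constants are this article's estimates, not measured values:

# Rough per-card VRAM estimator for a TP deployment.
# Constants mirror the table above -- estimates, not measurements.
def per_card_vram_gb(weights_gb: float, tp: int, kv_gb: float,
                     kv_fp8: bool = False) -> float:
    activations_gb, nccl_gb, framework_gb = 1.5, 0.8, 1.0
    kv = kv_gb / 2 if kv_fp8 else kv_gb
    return weights_gb / tp + kv + activations_gb + nccl_gb + framework_gb

print(per_card_vram_gb(35, tp=2, kv_gb=3.5))               # 24.3 -- over budget
print(per_card_vram_gb(35, tp=2, kv_gb=3.5, kv_fp8=True))  # 22.55 -- fits with FP8 KV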
Pipeline parallel as an alternative
Pipeline parallel splits the model layer-wise instead of tensor-wise. Card 0 holds layers 0-39, card 1 holds layers 40-79. Tokens flow through card 0, get handed to card 1, finish, and stream back. The handoff is a single tensor copy per micro-batch rather than an all-reduce per layer, so PP is dramatically less bandwidth-sensitive than TP – useful when PCIe is the bottleneck.
| Mode | Comms per token | 70B FP8 t/s | Aggregate batch 8 | Best for |
|---|---|---|---|---|
| TP=2 (vLLM default) | ~160 all-reduces (two per layer) | 38-42 t/s decode | ~165 t/s | Throughput-bound serving, best per-token latency |
| PP=2 | ~2 tensor copies | 30-32 t/s decode | ~110 t/s | PCIe-only, sequential workloads |
| Replica (TP=1, two instances) | None | n/a (single-card model) | ~390 t/s for 8B | Small models, separate traffic |
For 70B FP8 production quality on dual 4090 the right answer is usually TP=2 with FP8 KV – the all-reduce overhead is real but you get near-best per-token latency. PP is a fallback when topology is bad.
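To make the comms-pattern difference concrete, a toy sketch – an illustration of the partitioning arithmetic, not vLLM internals:

# Toy comparison of the two partitioning schemes. Illustration only.
NUM_LAYERS = 80

# PP=2: contiguous layer ranges, one boundary handoff per micro-batch.
pp_assignment = {0: range(0, 40), 1: range(40, 80)}
pp_copies = len(pp_assignment) - 1       # 1 tensor copy, card 0 -> card 1

# TP=2: every layer split across both cards, with an all-reduce after
# attention and another after the MLP.
tp_collectives = NUM_LAYERS * 2          # 160 all-reduces per token

print(pp_copies, tp_collectives)         # 1 vs 160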
The 1.6x scaling cap explained
Two 4090s do not give 2x throughput on a TP=2 70B workload. The scaling caps near 1.6-1.74x and the lost ~0.3x is real silicon time. Here is where it goes:
- All-reduce after every layer. Llama 70B has 80 transformer blocks. TP=2 issues an all-reduce after attention and another after the MLP. That is 160 collectives per forward pass.
- PCIe Gen4 x16 peer bandwidth is ~32 GB/s; NVLink on an H100 is 900 GB/s. The 4090 has zero NVLink lanes, so all-reduce traffic crosses PCIe – and, depending on slot layout, the CPU root complex.
- Collective latency, not raw bandwidth. For 70B at hidden dim 8192, each all-reduce moves ~16 KB per token in FP16 – ~32 KB per token per layer across the two collectives, or ~2.6 MB per decoded token in total. At 40 t/s that is only ~0.1 GB/s, a small fraction of the PCIe Gen4 budget; the decode-time cost is the round-trip latency of 160 small synchronous collectives per token. Bandwidth does bite during long-context prefill, where a 4k-token prompt pushes ~10 GB through the link in one forward pass. The sketch after this list puts numbers on it.
- Synchronisation stalls. If one card finishes its half a microsecond ahead of the other, both wait for the all-reduce barrier. Variance compounds over 160 collectives.
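A back-of-envelope calculator for the decode-time numbers above. The per-collective latency is an assumed round-trip figure for PCIe – real NCCL latency varies with topology and driver:

# Back-of-envelope TP=2 comms cost for Llama 70B decode.
LAYERS, HIDDEN, BYTES_FP16 = 80, 8192, 2
COLLECTIVES_PER_LAYER = 2                     # after attention, after MLP

per_token_bytes = LAYERS * COLLECTIVES_PER_LAYER * HIDDEN * BYTES_FP16
print(per_token_bytes / 2**20)                # ~2.5 MB moved per decoded token

tokens_per_s = 40
print(per_token_bytes * tokens_per_s / 1e9)   # ~0.1 GB/s vs ~32 GB/s PCIe Gen4

pcie_rtt_s = 25e-6                            # assumed ~25us per collective
comms_ms = LAYERS * COLLECTIVES_PER_LAYER * pcie_rtt_s * 1e3
print(comms_ms)                               # ~4 ms of the 25 ms per-token budget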
| Workload | 1x 4090 t/s | 2x 4090 t/s | Scaling | Why |
|---|---|---|---|---|
| Llama 8B FP8 batch 1 (replica mode) | 198 | 396 | 2.00x | No comms – independent traffic |
| Llama 8B FP8 aggregate batch 32 (replica) | 1,100 | 2,180 | 1.98x | Near-linear |
| Llama 70B FP8 batch 1 (TP=2) | OOM | ~40 | n/a (only TP works) | Splits weights across cards |
| Llama 70B AWQ INT4 batch 1 (TP=2) | 22 | ~38 | 1.73x | 27% comms tax |
| Llama 70B AWQ INT4 concurrency 8 (TP=2) | ~95 aggr | ~165 aggr | 1.74x | Comms tax compounds with batch |
| Llama 70B FP8 concurrency 4 (TP=2) | OOM | ~155 aggr | n/a | ~38 t/s per stream |
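The comms-tax column above reads as the fraction of one card's worth of silicon lost to communication. The arithmetic, for clarity:

# "Comms tax": the shortfall from ideal 2x scaling, in units of one card.
def comms_tax(single_tps: float, dual_tps: float) -> float:
    return 2.0 - dual_tps / single_tps

print(comms_tax(22, 38))     # ~0.27 -> the 27% tax on 70B AWQ INT4
print(comms_tax(198, 396))   # 0.0  -> replica mode pays no tax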
For 70B FP8 production quality, dual 4090 lands ahead of a single 5090 32GB on decode (~40 vs ~30 t/s in the cost table below) and brings more VRAM headroom for batch and KV. It also matches or beats a single 6000 Pro 96GB at roughly half the price. See the single-5090 comparison for the alternative.
vLLM launch commands
The standard tensor-parallel launch for Llama 70B FP8 on dual 4090:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 16384 --max-num-seqs 4 \
--gpu-memory-utilization 0.92
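Once the server is up (vLLM exposes an OpenAI-compatible API, port 8000 by default), a quick smoke test from Python – the model name must match what the server was launched with:

# Smoke-test the OpenAI-compatible endpoint. Default vLLM port is 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "One sentence: why tensor parallelism?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)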
Replica mode for 8B FP8 doubling throughput on independent traffic – launch two single-card instances on different ports:
# Card 0
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct-FP8 \
--port 8001 --gpu-memory-utilization 0.92 &
# Card 1
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct-FP8 \
--port 8002 --gpu-memory-utilization 0.92 &
# Round-robin via Caddy or nginx upstream
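A reverse proxy is the production answer, but for testing, client-side round-robin across the two ports is a few lines – a minimal sketch:

# Minimal client-side round-robin across the two replica instances.
import itertools
from openai import OpenAI

clients = itertools.cycle([
    OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY"),
    OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY"),
])

def chat(prompt: str) -> str:
    client = next(clients)                 # alternate cards per request
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content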
Pipeline parallel for the bandwidth-constrained case:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct-FP8 \
--pipeline-parallel-size 2 --tensor-parallel-size 1 \
--max-model-len 16384 --max-num-seqs 8 \
--gpu-memory-utilization 0.92
NCCL environment knobs that matter on PCIe-only hosts: NCCL_P2P_LEVEL=PHB allows peer-to-peer transfers as far as the host bridge, and NCCL_DEBUG=INFO shows which transport was selected. Verify topology with nvidia-smi topo -m – you want PIX or PHB between the two GPUs, not SYS (which routes via QPI/UPI between sockets and slows collectives by ~30%). Full setup is in the vLLM setup guide.
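A rough script to catch the SYS case before deploying – the parsing is approximate, since nvidia-smi topo -m output varies across driver versions:

# Flag SYS links between GPUs (cross-socket routing). Approximate parsing.
import subprocess

out = subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout
gpu_rows = [line for line in out.splitlines() if line.startswith("GPU")]
if any("SYS" in row for row in gpu_rows):
    print("WARNING: SYS link found -- GPUs sit on different sockets")
else:
    print("OK: GPUs connected via PIX/PHB/NODE")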
Heterogeneous cards: do not tensor-parallel
If you mix a 4090 with a 5060 Ti, 5090, or 3090 in the same chassis, never put them in the same TP group. Tensor parallel synchronises after every layer, so the slower card stalls the faster one – the pair performs like two copies of the slower card, which is often worse than running the faster card alone.
| Configuration | Throughput | Verdict |
|---|---|---|
| 2x 4090 TP=2 | ~38 t/s 70B AWQ | OK – matched cards |
| 1x 4090 + 1x 5060 Ti TP=2 | ~22 t/s, 5060 Ti stalls 4090 | Worse than single 4090 |
| 1x 4090 + 1x 5090 TP=2 | ~30 t/s, 4090 stalls 5090 | Worse than single 5090 |
| 1x 4090 + 1x 5060 Ti separate services | Each at full speed independently | Correct pattern |
The right pattern for heterogeneous pairs is service partitioning: LLM on the 4090, image generation or embeddings on the 5060 Ti. The hybrid pairing post walks through the routing patterns.
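The simplest routing layer is a static task-to-endpoint map in front of two independent servers. A hypothetical sketch – the ports and the task split are placeholders, not a prescribed layout:

# Service partitioning for a heterogeneous pair: route by task,
# never tensor-parallel across mismatched cards. Ports are placeholders.
ENDPOINTS = {
    "chat":       "http://localhost:8001/v1",   # LLM on the 4090
    "embeddings": "http://localhost:8002/v1",   # embeddings on the 5060 Ti
}

def endpoint_for(task: str) -> str:
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task {task!r}")
    return ENDPOINTS[task]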
Cost vs alternatives
| Setup | VRAM | £/mo | Llama 70B FP8 | Llama 8B aggregate t/s | £/M tokens 8B |
|---|---|---|---|---|---|
| 1x 4090 | 24GB | £575 | OOM | ~1,100 | £0.039 |
| 2x 4090 (TP=2 70B FP8) | 48GB | £1,150 | ~38-42 t/s | n/a (TP) | n/a |
| 2x 4090 (replica 8B) | 48GB | £1,150 | n/a | ~2,180 | £0.040 |
| 1x 5090 | 32GB | £900 | ~30 t/s (tight, may OOM) | ~1,700 | £0.041 |
| 1x 6000 Pro | 96GB | £2,200 | ~28-30 t/s comfortable | ~1,400 | £0.060 |
| 1x H100 80GB | 80GB | £2,800 | ~70 t/s, NVLink | ~5,000 | £0.038 |
2x 4090 is the cheapest way to run 70B FP8 on dedicated hardware – cheaper than a 6000 Pro and dramatically cheaper than an H100. Throughput on 70B is comparable to the 6000 Pro, just split across two PCIe slots with a 27% comms tax. For replica-mode small-model traffic, 2x 4090 is also the best £/M-tokens in the table.
Production gotchas
- PCIe topology matters. Two cards on the same root complex (PIX/PHB) get up to 32 GB/s peer; cards on different sockets route via QPI/UPI and slow by ~30%. Check nvidia-smi topo -m before deploying.
- PSU sizing. 2x 450W plus host overhead pushes a 1200W PSU close to 90% sustained, and the 4090 has transient spikes to ~600W. Insist on 1600W+ with quality 12V rails – quantified in the sketch below.
- Cooling. Two 4090s in a 4U chassis need active rear exhaust. Stacked-card temps climb fast and the second card thermal-throttles before the first.
- NCCL version. Pin NCCL 2.20+ for best collective performance. Earlier versions have known regressions on PCIe-only topologies.
- vLLM --max-num-seqs. Tempting to push high, but every extra sequence costs KV memory on both shards. Start at 4 and grow only after measuring.
- Driver MIG warnings. Some driver branches log harmless MIG warnings at boot on consumer cards. Suppress with NVIDIA_LOG_LEVEL=warn in production.
- OOM during warmup. vLLM's warmup tries the largest possible batch and can OOM at startup even when steady state would fit. Use --enforce-eager initially to skip CUDA graph capture during debugging.
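The PSU bullet, quantified. The GPU figures come from the bullet above; the host draw is an assumed number:

# Power budget for the PSU-sizing bullet. Host draw is an assumption.
GPU_TDP_W, GPU_SPIKE_W, HOST_W = 450, 600, 180

sustained = 2 * GPU_TDP_W + HOST_W            # 1080W
transient = 2 * GPU_SPIKE_W + HOST_W          # 1380W
for psu_w in (1200, 1600):
    print(psu_w, f"sustained {sustained / psu_w:.0%}",
          f"transient {transient / psu_w:.0%}")
# 1200W: 90% sustained, 115% on transients -- hence the 1600W+ advice.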
Verdict and decision criteria
Pair two 4090s when: you need 70B FP8 quality but cannot justify a 6000 Pro; you have small-model traffic that exceeds what a single card can serve (~1,100 t/s aggregate on 8B); you want the cheapest dedicated path to ~40 t/s decode on 70B.
Skip the pair when: you only need 70B AWQ INT4 (fits on one 4090); you need >50 t/s decode on 70B (jump to 6000 Pro or H100); you would mix card generations (use them as separate services instead); your host cannot deliver clean PCIe topology between the two slots.
Decision matrix:
| Need | Best option |
|---|---|
| 70B AWQ INT4 only | 1x 4090 |
| 70B FP8 quality, modest scale | 2x 4090 TP=2 |
| 70B FP8 high concurrency | 6000 Pro 96GB or H100 80GB |
| Mixtral 8x22B / Qwen 72B | 6000 Pro 96GB (2x 4090 OOMs) |
| Replica throughput on 8B | 2x 4090 (1.98x scaling) |
| NVLink-bridged training | H100 SXM |
| Mixed model menu | Hybrid 4090 + 5060 Ti, services partitioned |
Two cards, one workload
UK dedicated multi-GPU hosting with 4090 pairs available.
Order the RTX 4090 24GB
See also: Llama 70B INT4 benchmark, hybrid 4090 + 5060 Ti, upgrade to 6000 Pro, vs cloud H100, 70B INT4 VRAM, when to upgrade, vLLM setup, FP8 Llama deployment.