
2x RTX 4090 24GB Pairing for Tensor Parallel Inference

Pairing two RTX 4090s for tensor-parallel Llama 70B FP8 inference - VRAM split, the 1.6x scaling cap explained, vLLM commands, the PCIe coordination tax, and the no-NVLink caveat.

A single RTX 4090 24GB cannot run Llama 3.1 70B in FP8 – the model needs ~38GB. Two 4090s with vLLM tensor parallelism can. This article walks through the pairing in detail: how the model splits across both cards, what throughput you actually get, why the scaling caps near 1.6-1.7x rather than 2x, the all-reduce overhead from the absent NVLink, the vLLM launch commands, the heterogeneous-pair anti-pattern, and when 2x 4090 is the right answer versus a single 5090, 6000 Pro or H100. Both cards available via the UK dedicated GPU range.

Why pair 4090s

Three legitimate reasons to put a second 4090 in the chassis:

  1. Run Llama 70B at FP8 quality on consumer hardware. A single 4090 OOMs; two cards split ~35GB of weights cleanly across TP=2.
  2. Double aggregate throughput on small-model workloads where you have spare traffic to absorb (replica mode, near-linear scaling).
  3. Cheaper than one 6000 Pro for ~equivalent VRAM – 2×4090 gives 48GB at ~£1,150/mo versus 96GB 6000 Pro at ~£2,200/mo for similar 70B FP8 throughput.

If none of these apply – your model fits on one card and you do not need replica throughput – skip the pairing entirely and either stay on a single 4090 (covered in the FP8 deployment guide) or upgrade to a 5090 (see the 5090 upgrade post).

Tensor parallel for 70B FP8

vLLM with --tensor-parallel-size 2 shards each transformer layer’s weight matrices column-wise across two devices. Each card holds half the attention heads and half the MLP intermediate dimension. After every transformer block, an all-reduce collective sums the partial activations across both cards. For Llama 70B FP8 the per-card footprint becomes:

| Component | Per-card VRAM | Notes |
|---|---|---|
| Model weights (FP8 split) | ~17.5GB | 35GB of weights / 2 cards |
| KV cache (FP16, 4k context, batch 4) | ~3.5GB | FP8 KV halves this |
| Activation buffers | ~1.5GB | Hidden states, residual connections |
| NCCL communication buffers | ~0.8GB | All-reduce staging |
| CUDA reserved + framework | ~1GB | cuBLAS, cuDNN workspace |
| Per-card total | ~24.3GB | At the absolute edge |

That is right at the 24GB limit. Practical deployments use --kv-cache-dtype fp8 to halve KV memory, cap context with --max-model-len 16384, and limit concurrency with --max-num-seqs 4 to keep headroom. The 70B INT4 deployment guide uses AWQ instead and avoids the squeeze entirely.
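The per-card budget in the table can be sanity-checked with a small sketch. The figures below are the table's estimates, not measured values:

```python
# Rough per-card VRAM budget (GB) for Llama 70B FP8 under tensor parallel.
# Component figures are the article's estimates, not measurements.
def per_card_budget(tp=2):
    weights_total = 35.0                     # FP8 weights for the full model
    budget = {
        "weights": weights_total / tp,       # sharded across TP ranks
        "kv_cache": 3.5,                     # FP16 KV, 4k context, batch 4
        "activations": 1.5,                  # hidden states, residuals
        "nccl_buffers": 0.8,                 # all-reduce staging
        "cuda_reserved": 1.0,                # cuBLAS/cuDNN workspace
    }
    budget["total"] = sum(budget.values())
    return budget

b = per_card_budget()
print(f"per-card total: {b['total']:.1f} GB of 24 GB")  # ~24.3 GB
```

Switching to --kv-cache-dtype fp8 halves the kv_cache line to ~1.75GB, which is where the practical headroom comes from.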

Pipeline parallel as an alternative

Pipeline parallel splits the model layer-wise instead of tensor-wise. Card 0 holds layers 0-39, card 1 holds layers 40-79. Tokens flow through card 0, get handed to card 1, finish, and stream back. The handoff is a single tensor copy per micro-batch rather than an all-reduce per layer, so PP is dramatically less bandwidth-sensitive than TP – useful when PCIe is the bottleneck.
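The layer-wise split above can be sketched in a few lines (assuming an even split, which vLLM uses when the layer count divides cleanly by the stage count):

```python
def pipeline_stages(num_layers=80, stages=2):
    """Contiguous (first, last) layer ranges per pipeline stage, inclusive.
    Assumes num_layers divides evenly by stages."""
    per = num_layers // stages
    return [(i * per, (i + 1) * per - 1) for i in range(stages)]

# Llama 70B has 80 transformer blocks: card 0 gets 0-39, card 1 gets 40-79.
print(pipeline_stages())  # [(0, 39), (40, 79)]
```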

| Mode | Comms per token | 70B FP8 t/s | Aggregate batch 8 | Best for |
|---|---|---|---|---|
| TP=2 (vLLM default) | ~160 all-reduces (two per layer) | 38-42 t/s decode | ~165 t/s | Throughput-bound, NVLink-equipped hosts |
| PP=2 | ~2 tensor copies | 30-32 t/s decode | ~110 t/s | PCIe-only, sequential workloads |
| Replica (TP=1, two instances) | None | n/a (single-card model) | ~390 t/s for 8B | Small models, separate traffic |

For 70B FP8 production quality on dual 4090 the right answer is usually TP=2 with FP8 KV – the all-reduce overhead is real but you get near-best per-token latency. PP is a fallback when topology is bad.

The 1.6x scaling cap explained

Two 4090s do not give 2x throughput on a TP=2 70B workload. The scaling caps near 1.6-1.74x and the lost ~0.3x is real silicon time. Here is where it goes:

  • All-reduce after every layer. Llama 70B has 80 transformer blocks. TP=2 issues an all-reduce after attention and another after the MLP. That is 160 collectives per forward pass.
  • PCIe Gen4 x16 peer bandwidth ~32 GB/s. NVLink on H100 is 900 GB/s peer. The 4090 has zero NVLink lanes, so all-reduce traffic crosses the CPU root complex and PCIe.
  • Activation tensor size. For 70B at hidden dim 8192, each all-reduce moves ~32 KB per token per layer. Decode-time bandwidth is modest, but every collective pays a PCIe round trip, and a long-context prefill pushes tens of gigabytes across the link in a single burst.
  • Synchronisation stalls. If one card finishes its half a microsecond ahead of the other, both wait for the all-reduce barrier. Variance compounds over 160 collectives.
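A back-of-envelope estimate using the ~32 KB-per-token-per-layer figure above (a sketch, not a measurement) shows why decode is latency-bound while prefill is bandwidth-bound:

```python
KB, GB = 1024, 1024 ** 3

def allreduce_traffic(layers=80, per_layer_kb=32, tokens_per_s=40,
                      prefill_tokens=16384):
    """Rough all-reduce traffic over the PCIe link, per the article's figures."""
    per_token = layers * per_layer_kb * KB        # bytes moved per decoded token
    decode_gbps = per_token * tokens_per_s / GB   # steady-state decode bandwidth
    prefill_gb = per_token * prefill_tokens / GB  # burst for one full 16k prefill
    return per_token, decode_gbps, prefill_gb

per_token, decode_gbps, prefill_gb = allreduce_traffic()
# ~2.5 MB/token, ~0.1 GB/s during decode, ~40 GB moved per 16k prefill
print(f"{per_token / 1024 / 1024:.1f} MB/token, "
      f"{decode_gbps:.2f} GB/s decode, {prefill_gb:.0f} GB per 16k prefill")
```

Decode barely dents the ~32 GB/s PCIe Gen4 budget in bandwidth terms; the decode-side cost is 160 collective latencies per token, while a long prefill can saturate the link for seconds.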

| Workload | 1x 4090 t/s | 2x 4090 t/s | Scaling | Why |
|---|---|---|---|---|
| Llama 8B FP8 batch 1 (replica mode) | 198 | 396 | 2.00x | No comms – independent traffic |
| Llama 8B FP8 aggregate batch 32 (replica) | 1,100 | 2,180 | 1.98x | Near-linear |
| Llama 70B FP8 batch 1 (TP=2) | OOM | ~40 | n/a (only TP works) | Splits weights across cards |
| Llama 70B AWQ INT4 batch 1 (TP=2) | 22 | ~38 | 1.73x | 27% comms tax |
| Llama 70B AWQ INT4 concurrency 8 (TP=2) | ~95 aggr | ~165 aggr | 1.74x | Comms tax compounds with batch |
| Llama 70B FP8 concurrency 4 (TP=2) | OOM | ~155 aggr | n/a | ~38 t/s per stream |
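The scaling and comms-tax figures follow directly from the measured single- and dual-card throughputs; a one-liner reproduces them:

```python
def scaling(single_tps, dual_tps):
    """Observed scaling factor and the 'comms tax' relative to an ideal 2x."""
    s = dual_tps / single_tps
    return s, 2.0 - s

# 70B AWQ INT4 batch 1 row: 22 t/s single, ~38 t/s dual.
s, tax = scaling(22, 38)
print(f"{s:.2f}x scaling, {tax:.0%} comms tax")  # 1.73x scaling, 27% comms tax
```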

For 70B FP8 production quality, dual 4090 lands at roughly the same throughput as a single 5090 32GB but with more VRAM headroom for batch and KV. It is also roughly the same throughput as a single 6000 Pro 96GB at half the price. Compare to single 5090 for the alternative.

vLLM launch commands

The standard tensor-parallel launch for Llama 70B FP8 on dual 4090:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.92

Replica mode for 8B FP8, doubling throughput on independent traffic – launch two single-card instances on different ports:

# Card 0
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --port 8001 --gpu-memory-utilization 0.92 &

# Card 1
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --port 8002 --gpu-memory-utilization 0.92 &

# Round-robin via Caddy or nginx upstream
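In place of a full Caddy or nginx config, the round-robin step can be sketched client-side. This is a minimal illustration, not a production balancer; the ports match the two replica launches above:

```python
from itertools import cycle

# The two replica endpoints started above (ports 8001 and 8002).
UPSTREAMS = cycle(["http://localhost:8001", "http://localhost:8002"])

def next_upstream():
    """Return the next replica endpoint in round-robin order."""
    return next(UPSTREAMS)

# Alternating assignment spreads independent requests across both cards.
print([next_upstream() for _ in range(4)])
```

An nginx `upstream` block with both ports achieves the same thing server-side, which is the better pattern once multiple clients are involved.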

Pipeline parallel for the bandwidth-constrained case:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct-FP8 \
  --pipeline-parallel-size 2 --tensor-parallel-size 1 \
  --max-model-len 16384 --max-num-seqs 8 \
  --gpu-memory-utilization 0.92

NCCL environment knobs that matter on PCIe-only hosts: NCCL_P2P_LEVEL=PHB forces peer-to-peer over the host bridge, NCCL_DEBUG=INFO shows which transport is selected. Verify topology with nvidia-smi topo -m – you want PIX or PHB between the two GPUs, not SYS (which routes via QPI/UPI between sockets and slows by ~30%). Full setup is in the vLLM setup guide.
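The topology check can be scripted against captured `nvidia-smi topo -m` output. This sketch assumes the standard matrix layout (GPU0 row: self marker `X`, then the link class to GPU1) and uses a canned sample rather than a live query:

```python
def gpu_link_type(topo_output: str) -> str:
    """Extract the GPU0<->GPU1 link class from `nvidia-smi topo -m` output."""
    for line in topo_output.splitlines():
        cols = line.split()
        # The GPU0 data row starts "GPU0  X  <link-to-GPU1> ..."
        if len(cols) > 2 and cols[0] == "GPU0" and cols[1] == "X":
            return cols[2]
    raise ValueError("GPU0 row not found")

# Illustrative sample output for a dual-GPU PCIe host.
sample = """\
     GPU0  GPU1  CPU Affinity
GPU0  X    PHB   0-15
GPU1  PHB  X     0-15
"""
link = gpu_link_type(sample)
print(link, "- OK" if link in ("PIX", "PHB")
      else "- crosses sockets, expect ~30% slower collectives")
```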

Heterogeneous cards: do not tensor-parallel

If you mix a 4090 with a 5060 Ti, 5090, or 3090 in the same chassis, never put them in the same TP group. Tensor parallel synchronises after every layer, so the slower card stalls the faster one – effective throughput caps at roughly twice the slower card's solo rate, which in practice lands at or below the faster card running alone.

| Configuration | Throughput | Verdict |
|---|---|---|
| 2x 4090 TP=2 | ~38 t/s 70B AWQ | OK – matched cards |
| 1x 4090 + 1x 5060 Ti TP=2 | ~22 t/s, 5060 Ti stalls 4090 | Worse than single 4090 |
| 1x 4090 + 1x 5090 TP=2 | ~30 t/s, 4090 stalls 5090 | Worse than single 5090 |
| 1x 4090 + 1x 5060 Ti separate services | Each at full speed independently | Correct pattern |

The right pattern for heterogeneous pairs is service partitioning: LLM on the 4090, image generation or embeddings on the 5060 Ti. The hybrid pairing post walks through the routing patterns.
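Service partitioning can be as simple as a static model-to-endpoint map in front of two single-card instances. A minimal sketch – the model names and ports here are illustrative, not fixed conventions:

```python
# Hypothetical model -> endpoint map: LLM traffic to the 4090,
# embeddings and image work to the 5060 Ti. Ports are illustrative.
SERVICES = {
    "llama-70b-awq": "http://localhost:8001",  # 4090
    "embeddings":    "http://localhost:8002",  # 5060 Ti
    "sdxl":          "http://localhost:8002",  # 5060 Ti
}

def route(model: str) -> str:
    """Return the endpoint serving the requested model."""
    try:
        return SERVICES[model]
    except KeyError:
        raise ValueError(f"no service registered for {model!r}")

print(route("llama-70b-awq"))
```

Each card runs its own workload at full speed; nothing ever waits on the other GPU.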

Cost vs alternatives

| Setup | VRAM | £/mo | Llama 70B FP8 | Llama 8B aggregate t/s | £/M tokens 8B |
|---|---|---|---|---|---|
| 1x 4090 | 24GB | £575 | OOM | ~1,100 | £0.039 |
| 2x 4090 (TP=2 70B FP8) | 48GB | £1,150 | ~38-42 t/s | n/a (TP) | n/a |
| 2x 4090 (replica 8B) | 48GB | £1,150 | n/a | ~2,180 | £0.040 |
| 1x 5090 | 32GB | £900 | ~30 t/s (tight, may OOM) | ~1,700 | £0.041 |
| 1x 6000 Pro | 96GB | £2,200 | ~28-30 t/s comfortable | ~1,400 | £0.060 |
| 1x H100 80GB | 80GB | £2,800 | ~70 t/s, NVLink | ~5,000 | £0.038 |

2x 4090 is the cheapest way to run 70B FP8 on dedicated hardware – cheaper than a 6000 Pro and dramatically cheaper than an H100. Throughput on 70B is comparable to the 6000 Pro, just split across two PCIe slots with a 27% comms tax. For replica-mode small-model traffic, 2x 4090 is also the best £/M-tokens in the table.

Production gotchas

  1. PCIe topology matters. Two cards on the same root complex (PIX/PHB) get up to 32 GB/s peer; cards on different sockets route via QPI/UPI and slow by ~30%. Check nvidia-smi topo -m before deploying.
  2. PSU sizing. 2x 450W plus host overhead pushes a 1200W PSU close to 90% sustained. Insist on 1600W+ with quality 12V rails. The 4090 also has transient spikes to ~600W.
  3. Cooling. Two 4090s in a 4U chassis need active rear exhaust. Stack temps climb fast and the second card thermal-throttles before the first.
  4. NCCL version. Pin NCCL 2.20+ for best collective performance. Earlier versions have known regressions on PCIe-only topologies.
  5. vLLM --max-num-seqs. Tempting to push high but each sequence’s KV is duplicated across both TP shards. Start at 4 and grow only after measuring.
  6. Driver MIG warnings. Some driver branches log harmless MIG warnings at boot on consumer cards. Suppress with NVIDIA_LOG_LEVEL=warn in production.
  7. OOM during warmup. vLLM warmup tries the largest possible batch, can OOM at startup even when steady-state would fit. Use --enforce-eager initially to skip CUDA graph capture during debugging.
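Gotcha 5 is easy to quantify. A rough per-sequence KV estimate, assuming Llama 70B's GQA shape (80 layers, 8 KV heads, head dim 128) – a sketch only, since vLLM's own paged-KV accounting adds block overhead:

```python
def kv_bytes_per_seq(ctx_len, layers=80, kv_heads=8, head_dim=128,
                     dtype_bytes=2, tp=2):
    """Approximate per-card KV cache for one sequence under tensor parallel.
    dtype_bytes=2 is FP16 KV; fp8 KV halves it. KV heads shard across TP ranks."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V planes
    return ctx_len * per_token // tp

gb = kv_bytes_per_seq(16384) / 1024 ** 3
print(f"~{gb:.1f} GB per card per 16k sequence at FP16 KV")  # ~2.5 GB
```

Four full-length sequences already want ~10GB per card at FP16 KV, which is why --kv-cache-dtype fp8 and --max-num-seqs 4 are the starting point here.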

Verdict and decision criteria

Pair two 4090s when: you need 70B FP8 quality but cannot justify a 6000 Pro; you have small-model traffic that exceeds 1,800 t/s aggregate on a single card; you want the cheapest dedicated path to ~40 t/s decode on 70B.

Skip the pair when: you only need 70B AWQ INT4 (fits on one 4090); you need >50 t/s decode on 70B (jump to 6000 Pro or H100); you would mix card generations (use them as separate services instead); your host cannot deliver clean PCIe topology between the two slots.

Decision matrix:

| Need | Best option |
|---|---|
| 70B AWQ INT4 only | 1x 4090 |
| 70B FP8 quality, modest scale | 2x 4090 TP=2 |
| 70B FP8 high concurrency | 6000 Pro 96GB or H100 80GB |
| Mixtral 8x22B / Qwen 72B | 6000 Pro 96GB (2x 4090 OOMs) |
| Replica throughput on 8B | 2x 4090 (1.98x scaling) |
| NVLink-bridged training | H100 SXM |
| Mixed model menu | Hybrid 4090 + 5060 Ti, services partitioned |

Two cards, one workload

UK dedicated multi-GPU hosting with 4090 pairs available.

Order the RTX 4090 24GB

See also: Llama 70B INT4 benchmark, hybrid 4090 + 5060 Ti, upgrade to 6000 Pro, vs cloud H100, 70B INT4 VRAM, when to upgrade, vLLM setup, FP8 Llama deployment.
