A single RTX 4090 24GB cannot run Llama 3.1 70B in FP8 – the model needs ~38GB. Two 4090s with vLLM tensor parallelism can. This article walks through the pairing in detail: how the model splits across both cards, what throughput you actually get, why scaling caps near 1.6-1.7x rather than 2x, the all-reduce overhead of running without NVLink, the vLLM launch commands, the heterogeneous-pair anti-pattern, and when 2x 4090 is the right answer versus a single 5090, 6000 Pro, or H100. Both cards are available via the UK dedicated GPU range.
Contents
- Why pair 4090s
- Tensor parallel for 70B FP8
- Pipeline parallel as an alternative
- The 1.6x scaling cap explained
- vLLM launch commands
- Heterogeneous cards: do not tensor-parallel
- Cost vs alternatives
- Production gotchas
- Verdict and decision criteria
Why pair 4090s
Three legitimate reasons to put a second 4090 in the chassis:
- Run Llama 70B FP8 quality on consumer-derived hardware. Single 4090 OOMs. Two cards split ~35GB of weights cleanly across TP=2.
- Double aggregate throughput on small-model workloads where there is enough independent traffic to keep both cards busy (replica mode, near-linear scaling).
- Cheaper than one 6000 Pro for ~equivalent VRAM – 2×4090 gives 48GB at ~£1,150/mo versus 96GB 6000 Pro at ~£2,200/mo for similar 70B FP8 throughput.
If none of these apply – your model fits on one card and you do not need replica throughput – skip the pairing entirely and either stay on a single 4090 (covered in the FP8 deployment guide) or upgrade to a 5090 (see the 5090 upgrade post).
Tensor parallel for 70B FP8
vLLM with --tensor-parallel-size 2 shards each transformer layer’s weight matrices across two devices, Megatron-style: the first projection of each pair is split column-wise, the second row-wise. Each card holds half the attention heads and half the MLP intermediate dimension. After the attention sub-block and again after the MLP, an all-reduce collective sums the partial activations across both cards – two collectives per layer. For Llama 70B FP8 the per-card footprint becomes:
| Component | Per-card VRAM | Notes |
|---|---|---|
| Model weights (FP8 split) | ~17.5GB | 35GB of weights / 2 cards |
| KV cache (FP16, 4k context, batch 4) | ~3.5GB | FP8 KV halves this |
| Activation buffers | ~1.5GB | Hidden states, residual connections |
| NCCL communication buffers | ~0.8GB | All-reduce staging |
| CUDA reserved + framework | ~1GB | cuBLAS, cuDNN workspace |
| Per-card total | ~24.3GB | At the absolute edge |
That sum is already past the 24GB limit. Practical deployments use --kv-cache-dtype fp8 to halve KV memory (dropping the per-card total to ~22.5GB), cap context with --max-model-len 16384, and limit concurrency with --max-num-seqs 4 to keep headroom. The 70B INT4 deployment guide uses AWQ instead and avoids the squeeze entirely.
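The same budget arithmetic is easy to script before committing to a deployment. A minimal sketch using the figures from the table above – the constants are this article's estimates, not measured values:

# Rough per-card VRAM estimator for a TP deployment.
# Constants mirror the table above -- estimates, not measurements.
def per_card_vram_gb(weights_gb: float, tp: int, kv_gb: float,
                     kv_fp8: bool = False) -> float:
    activations_gb, nccl_gb, framework_gb = 1.5, 0.8, 1.0
    kv = kv_gb / 2 if kv_fp8 else kv_gb
    return weights_gb / tp + kv + activations_gb + nccl_gb + framework_gb

print(per_card_vram_gb(35, tp=2, kv_gb=3.5))               # 24.3 -- over budget
print(per_card_vram_gb(35, tp=2, kv_gb=3.5, kv_fp8=True))  # 22.55 -- fits with FP8 KV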
Pipeline parallel as an alternative
Pipeline parallel splits the model layer-wise instead of tensor-wise. Card 0 holds layers 0-39, card 1 holds layers 40-79. Tokens flow through card 0, get handed to card 1, finish, and stream back. The handoff is a single tensor copy per micro-batch rather than an all-reduce per layer, so PP is dramatically less bandwidth-sensitive than TP – useful when PCIe is the bottleneck.
| Mode | Comms per token | 70B FP8 t/s | Aggregate batch 8 | Best for |
|---|---|---|---|---|
| TP=2 (vLLM default) | ~160 all-reduces (two per layer) | 38-42 t/s decode | ~165 t/s | Throughput-bound serving, best per-token latency |
| PP=2 | ~2 tensor copies | 30-32 t/s decode | ~110 t/s | PCIe-only, sequential workloads |
| Replica (TP=1, two instances) | None | n/a (single-card model) | ~390 t/s for 8B | Small models, separate traffic |
For 70B FP8 production quality on dual 4090 the right answer is usually TP=2 with FP8 KV – the all-reduce overhead is real but you get near-best per-token latency. PP is a fallback when topology is bad.
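To make the comms-pattern difference concrete, a toy sketch – an illustration of the partitioning arithmetic, not vLLM internals:

# Toy comparison of the two partitioning schemes. Illustration only.
NUM_LAYERS = 80

# PP=2: contiguous layer ranges, one boundary handoff per micro-batch.
pp_assignment = {0: range(0, 40), 1: range(40, 80)}
pp_copies = len(pp_assignment) - 1       # 1 tensor copy, card 0 -> card 1

# TP=2: every layer split across both cards, with an all-reduce after
# attention and another after the MLP.
tp_collectives = NUM_LAYERS * 2          # 160 all-reduces per token

print(pp_copies, tp_collectives)         # 1 vs 160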
The 1.6x scaling cap explained
Two 4090s do not give 2x throughput on a TP=2 70B workload. The scaling caps near 1.6-1.74x and the lost ~0.3x is real silicon time. Here is where it goes:
- All-reduce after every layer. Llama 70B has 80 transformer blocks. TP=2 issues an all-reduce after attention and another after the MLP. That is 160 collectives per forward pass.
- PCIe Gen4 x16 peer bandwidth is ~32 GB/s; NVLink on an H100 is 900 GB/s. The 4090 has zero NVLink lanes, so all-reduce traffic crosses PCIe – and, depending on slot layout, the CPU root complex.
- Collective latency, not raw bandwidth. For 70B at hidden dim 8192, each all-reduce moves ~16 KB per token in FP16 – ~32 KB per token per layer across the two collectives, or ~2.6 MB per decoded token in total. At 40 t/s that is only ~0.1 GB/s, a small fraction of the PCIe Gen4 budget; the decode-time cost is the round-trip latency of 160 small synchronous collectives per token. Bandwidth does bite during long-context prefill, where a 4k-token prompt pushes ~10 GB through the link in one forward pass. The sketch after this list puts numbers on it.
- Synchronisation stalls. If one card finishes its half a microsecond ahead of the other, both wait for the all-reduce barrier. Variance compounds over 160 collectives.
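A back-of-envelope calculator for the decode-time numbers above. The per-collective latency is an assumed round-trip figure for PCIe – real NCCL latency varies with topology and driver:

# Back-of-envelope TP=2 comms cost for Llama 70B decode.
LAYERS, HIDDEN, BYTES_FP16 = 80, 8192, 2
COLLECTIVES_PER_LAYER = 2                     # after attention, after MLP

per_token_bytes = LAYERS * COLLECTIVES_PER_LAYER * HIDDEN * BYTES_FP16
print(per_token_bytes / 2**20)                # ~2.5 MB moved per decoded token

tokens_per_s = 40
print(per_token_bytes * tokens_per_s / 1e9)   # ~0.1 GB/s vs ~32 GB/s PCIe Gen4

pcie_rtt_s = 25e-6                            # assumed ~25us per collective
comms_ms = LAYERS * COLLECTIVES_PER_LAYER * pcie_rtt_s * 1e3
print(comms_ms)                               # ~4 ms of the 25 ms per-token budget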
| Workload | 1x 4090 t/s | 2x 4090 t/s | Scaling | Why |
|---|---|---|---|---|
| Llama 8B FP8 batch 1 (replica mode) | 198 | 396 | 2.00x | No comms – independent traffic |
| Llama 8B FP8 aggregate batch 32 (replica) | 1,100 | 2,180 | 1.98x | Near-linear |
| Llama 70B FP8 batch 1 (TP=2) | OOM | ~40 | n/a (only TP works) | Splits weights across cards |
| Llama 70B AWQ INT4 batch 1 (TP=2) | 22 | ~38 | 1.73x | 27% comms tax |
| Llama 70B AWQ INT4 concurrency 8 (TP=2) | ~95 aggr | ~165 aggr | 1.74x | Comms tax compounds with batch |
| Llama 70B FP8 concurrency 4 (TP=2) | OOM | ~155 aggr | n/a | ~38 t/s per stream |
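The comms-tax column above reads as the fraction of one card's worth of silicon lost to communication. The arithmetic, for clarity:

# "Comms tax": the shortfall from ideal 2x scaling, in units of one card.
def comms_tax(single_tps: float, dual_tps: float) -> float:
    return 2.0 - dual_tps / single_tps

print(comms_tax(22, 38))     # ~0.27 -> the 27% tax on 70B AWQ INT4
print(comms_tax(198, 396))   # 0.0  -> replica mode pays no tax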
For 70B FP8 production quality, dual 4090 lands ahead of a single 5090 32GB on decode (~40 vs ~30 t/s in the cost table below) and brings more VRAM headroom for batch and KV. It also matches or beats a single 6000 Pro 96GB at roughly half the price. See the single-5090 comparison for the alternative.
vLLM launch commands
The standard tensor-parallel launch for Llama 70B FP8 on dual 4090:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 16384 --max-num-seqs 4 \
--gpu-memory-utilization 0.92
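Once the server is up (vLLM exposes an OpenAI-compatible API, port 8000 by default), a quick smoke test from Python – the model name must match what the server was launched with:

# Smoke-test the OpenAI-compatible endpoint. Default vLLM port is 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "One sentence: why tensor parallelism?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)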
Replica mode for 8B FP8 doubling throughput on independent traffic – launch two single-card instances on different ports:
# Card 0
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct-FP8 \
--port 8001 --gpu-memory-utilization 0.92 &
# Card 1
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct-FP8 \
--port 8002 --gpu-memory-utilization 0.92 &
# Round-robin via Caddy or nginx upstream
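A reverse proxy is the production answer, but for testing, client-side round-robin across the two ports is a few lines – a minimal sketch:

# Minimal client-side round-robin across the two replica instances.
import itertools
from openai import OpenAI

clients = itertools.cycle([
    OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY"),
    OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY"),
])

def chat(prompt: str) -> str:
    client = next(clients)                 # alternate cards per request
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content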
Pipeline parallel for the bandwidth-constrained case:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct-FP8 \
--pipeline-parallel-size 2 --tensor-parallel-size 1 \
--max-model-len 16384 --max-num-seqs 8 \
--gpu-memory-utilization 0.92
NCCL environment knobs that matter on PCIe-only hosts: NCCL_P2P_LEVEL=PHB allows peer-to-peer transfers as far as the host bridge, and NCCL_DEBUG=INFO shows which transport was selected. Verify topology with nvidia-smi topo -m – you want PIX or PHB between the two GPUs, not SYS (which routes via QPI/UPI between sockets and slows collectives by ~30%). Full setup is in the vLLM setup guide.
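A rough script to catch the SYS case before deploying – the parsing is approximate, since nvidia-smi topo -m output varies across driver versions:

# Flag SYS links between GPUs (cross-socket routing). Approximate parsing.
import subprocess

out = subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout
gpu_rows = [line for line in out.splitlines() if line.startswith("GPU")]
if any("SYS" in row for row in gpu_rows):
    print("WARNING: SYS link found -- GPUs sit on different sockets")
else:
    print("OK: GPUs connected via PIX/PHB/NODE")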
Heterogeneous cards: do not tensor-parallel
If you mix a 4090 with a 5060 Ti, 5090, or 3090 in the same chassis, never put them in the same TP group. Tensor parallel synchronises after every layer, so the slower card stalls the faster one – the pair performs like two copies of the slower card, which is often worse than running the faster card alone.
| Configuration | Throughput | Verdict |
|---|---|---|
| 2x 4090 TP=2 | ~38 t/s 70B AWQ | OK – matched cards |
| 1x 4090 + 1x 5060 Ti TP=2 | ~22 t/s, 5060 Ti stalls 4090 | Worse than single 4090 |
| 1x 4090 + 1x 5090 TP=2 | ~30 t/s, 4090 stalls 5090 | Worse than single 5090 |
| 1x 4090 + 1x 5060 Ti separate services | Each at full speed independently | Correct pattern |
The right pattern for heterogeneous pairs is service partitioning: LLM on the 4090, image generation or embeddings on the 5060 Ti. The hybrid pairing post walks through the routing patterns.
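The simplest routing layer is a static task-to-endpoint map in front of two independent servers. A hypothetical sketch – the ports and the task split are placeholders, not a prescribed layout:

# Service partitioning for a heterogeneous pair: route by task,
# never tensor-parallel across mismatched cards. Ports are placeholders.
ENDPOINTS = {
    "chat":       "http://localhost:8001/v1",   # LLM on the 4090
    "embeddings": "http://localhost:8002/v1",   # embeddings on the 5060 Ti
}

def endpoint_for(task: str) -> str:
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task {task!r}")
    return ENDPOINTS[task]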
Cost vs alternatives
| Setup | VRAM | £/mo | Llama 70B FP8 | Llama 8B aggregate t/s | £/M tokens 8B |
|---|---|---|---|---|---|
| 1x 4090 | 24GB | £575 | OOM | ~1,100 | £0.039 |
| 2x 4090 (TP=2 70B FP8) | 48GB | £1,150 | ~38-42 t/s | n/a (TP) | n/a |
| 2x 4090 (replica 8B) | 48GB | £1,150 | n/a | ~2,180 | £0.040 |
| 1x 5090 | 32GB | £900 | ~30 t/s (tight, may OOM) | ~1,700 | £0.041 |
| 1x 6000 Pro | 96GB | £2,200 | ~28-30 t/s comfortable | ~1,400 | £0.060 |
| 1x H100 80GB | 80GB | £2,800 | ~70 t/s, NVLink | ~5,000 | £0.038 |
2x 4090 is the cheapest way to run 70B FP8 on dedicated hardware – cheaper than a 6000 Pro and dramatically cheaper than an H100. Throughput on 70B is comparable to the 6000 Pro, just split across two PCIe slots with a 27% comms tax. For replica-mode small-model traffic, 2x 4090 is also the best £/M-tokens in the table.
Production gotchas
- PCIe topology matters. Two cards on the same root complex (PIX/PHB) get up to 32 GB/s peer; cards on different sockets route via QPI/UPI and slow by ~30%. Check nvidia-smi topo -m before deploying.
- PSU sizing. 2x 450W plus host overhead pushes a 1200W PSU close to 90% sustained, and the 4090 has transient spikes to ~600W. Insist on 1600W+ with quality 12V rails – quantified in the sketch below.
- Cooling. Two 4090s in a 4U chassis need active rear exhaust. Stacked-card temps climb fast and the second card thermal-throttles before the first.
- NCCL version. Pin NCCL 2.20+ for best collective performance. Earlier versions have known regressions on PCIe-only topologies.
- vLLM --max-num-seqs. Tempting to push high, but every extra sequence costs KV memory on both shards. Start at 4 and grow only after measuring.
- Driver MIG warnings. Some driver branches log harmless MIG warnings at boot on consumer cards. Suppress with NVIDIA_LOG_LEVEL=warn in production.
- OOM during warmup. vLLM's warmup tries the largest possible batch and can OOM at startup even when steady state would fit. Use --enforce-eager initially to skip CUDA graph capture during debugging.
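The PSU bullet, quantified. The GPU figures come from the bullet above; the host draw is an assumed number:

# Power budget for the PSU-sizing bullet. Host draw is an assumption.
GPU_TDP_W, GPU_SPIKE_W, HOST_W = 450, 600, 180

sustained = 2 * GPU_TDP_W + HOST_W            # 1080W
transient = 2 * GPU_SPIKE_W + HOST_W          # 1380W
for psu_w in (1200, 1600):
    print(psu_w, f"sustained {sustained / psu_w:.0%}",
          f"transient {transient / psu_w:.0%}")
# 1200W: 90% sustained, 115% on transients -- hence the 1600W+ advice.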
Verdict and decision criteria
Pair two 4090s when: you need 70B FP8 quality but cannot justify a 6000 Pro; you have small-model traffic that exceeds what a single card can serve (~1,100 t/s aggregate on 8B); you want the cheapest dedicated path to ~40 t/s decode on 70B.
Skip the pair when: you only need 70B AWQ INT4 (fits on one 4090); you need >50 t/s decode on 70B (jump to 6000 Pro or H100); you would mix card generations (use them as separate services instead); your host cannot deliver clean PCIe topology between the two slots.
Decision matrix:
| Need | Best option |
|---|---|
| 70B AWQ INT4 only | 1x 4090 |
| 70B FP8 quality, modest scale | 2x 4090 TP=2 |
| 70B FP8 high concurrency | 6000 Pro 96GB or H100 80GB |
| Mixtral 8x22B / Qwen 72B | 6000 Pro 96GB (2x 4090 OOMs) |
| Replica throughput on 8B | 2x 4090 (1.98x scaling) |
| NVLink-bridged training | H100 SXM |
| Mixed model menu | Hybrid 4090 + 5060 Ti, services partitioned |
Two cards, one workload
UK dedicated multi-GPU hosting with 4090 pairs available.
Order the RTX 4090 24GB
See also: Llama 70B INT4 benchmark, hybrid 4090 + 5060 Ti, upgrade to 6000 Pro, vs cloud H100, 70B INT4 VRAM, when to upgrade, vLLM setup, FP8 Llama deployment.