Heterogeneous GPU pairs are an underrated cost optimisation in production AI infrastructure. An RTX 4090 24GB handles the LLM workload that needs the 24GB ceiling and the FP8 throughput, while a cheap RTX 5060 Ti 16GB takes image generation, Whisper transcription, embedding inference, classification, and any other workload that fits comfortably in 16GB. The two cards do not contend for VRAM, do not stall each other on PCIe, and the combined chassis costs less than a single 6000 Pro while covering a broader workload mix. This article walks through the workload split, three routing patterns, the cost case versus alternatives, the configuration details and the anti-patterns, with both cards drawn from the UK GPU range.
Contents
- Why mix cards rather than match them
- Workload split: who runs what
- Three routing patterns
- Cost vs single big card
- Configuration and CUDA pinning
- Anti-patterns to avoid
- When the hybrid breaks down
- Verdict and decision criteria
Why mix cards rather than match them
Most production AI products do not run a single workload type. They mix LLM chat, image generation, audio transcription, embedding, classification, and increasingly multimodal vision-language inputs. A 4090 is overkill for SDXL or Whisper, yet stretched to its limit by Llama 70B even at INT4. A second cheap card lets you specialise rather than letting workloads contend for shared VRAM.
| Workload | Best card | VRAM needed | Why |
|---|---|---|---|
| Llama 3.1 70B AWQ INT4 | 4090 | ~22GB | Needs 24GB tier |
| Llama 3.1 8B FP8 chat | 4090 | ~8GB + KV | Native FP8, 198 t/s |
| Mixtral 8x7B AWQ | 4090 | ~25GB tight | Needs 24GB tier |
| Qwen 2.5 14B FP8 | 4090 | ~15GB | Could fit on 5060 Ti tight |
| SDXL image gen | 5060 Ti | ~8GB | £0.0006/image, plenty fast |
| Flux.1 Schnell | 5060 Ti | ~12GB | Fits, fast enough for batch |
| Whisper Large v3 | 5060 Ti | ~6GB | 16GB enough, 4090 wasted |
| BGE-M3 embeddings | 5060 Ti | ~3GB | Tiny model, batch-bound |
| BGE reranker | 5060 Ti | ~2GB | Trivially small |
| Llama Guard 3 (moderation) | 5060 Ti | ~6GB FP8 | Safety classifier, low traffic |
The pattern is “expensive card runs the expensive thing, cheap card runs everything else”. Both cards are near full utilisation but on different bottlenecks: the 4090 is bandwidth-bound on token decode, the 5060 Ti is compute-bound on diffusion steps. They do not contend on PCIe because each handles its own client traffic.
Workload split: who runs what
For a typical RAG-plus-chat-plus-image SaaS the practical split looks like this:
| Card | Resident services | Approx VRAM | Approx peak utilisation |
|---|---|---|---|
| RTX 4090 24GB | vLLM Llama 3.1 70B AWQ INT4 (batch 4) OR Llama 3.1 8B FP8 (batch 16) | ~22GB | 85% SM, 95% memory |
| RTX 5060 Ti 16GB | SDXL + refiner, Whisper Large v3, BGE-M3 embedding, Llama Guard 3 FP8 | ~14GB | 80% SM during diffusion |
You can run several services on the 5060 Ti because they activate at different times. A typical traffic pattern is bursty image generation interleaved with steady embedding queries; the 5060 Ti handles the burst at peak SM utilisation and embeds during the diffusion idle. Whisper transcription is event-driven and rarely concurrent with image gen.
An alternative split for an inference-heavy product:
| Card | Service | Notes |
|---|---|---|
| 4090 | Llama 70B AWQ + Llama 8B FP8 (separate vLLM instances on same card) | 22GB + KV separation tight |
| 5060 Ti | BGE-M3 + reranker + Llama Guard 3 + Whisper | ~12GB total, no contention |
Three routing patterns
Three common ways to wire this up, in increasing operational complexity:
1. Per-endpoint reverse proxy
Caddy or nginx routes /v1/chat/completions to the 4090 backend, /v1/images/generations to the 5060 Ti, /v1/embeddings to the 5060 Ti, /v1/audio/transcriptions to the 5060 Ti. Simplest pattern, ~1ms overhead, no extra dependencies.
```
# Caddyfile
api.example.com {
    reverse_proxy /v1/chat/* localhost:8001        # 4090 – vLLM
    reverse_proxy /v1/images/* localhost:8002      # 5060 Ti – SDXL
    reverse_proxy /v1/embeddings localhost:8003    # 5060 Ti – BGE
    reverse_proxy /v1/audio/* localhost:8004       # 5060 Ti – Whisper
}
```
2. LiteLLM gateway
A LiteLLM proxy in front of two backends gives you OpenAI-compatible routing, A/B traffic splitting, fallback on backend failure, and request-level cost tracking. Adds ~5ms latency.
3. Redis job queue
Submit all work to a Redis queue tagged with gpu_class; workers on each card pull only matching jobs. Best when traffic is bursty and you want graceful queueing during peaks. Adds 10-50ms queue overhead but lets you absorb 10x traffic spikes without dropping requests.
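The tag-routed queue in pattern 3 can be sketched as follows. This is a hedged, in-memory stand-in: the deques model Redis lists (one per `gpu_class`, e.g. via LPUSH/BRPOP), and the endpoint-to-card mapping and function names are illustrative, not a fixed API.

```python
from collections import deque

# One queue per gpu_class; in production these would be Redis lists.
QUEUES = {"4090": deque(), "5060ti": deque()}

# Assumed workload-to-card mapping, mirroring the split in this article.
GPU_CLASS = {
    "chat": "4090",
    "images": "5060ti",
    "embeddings": "5060ti",
    "audio": "5060ti",
}

def submit(job_type: str, payload: dict) -> None:
    """Route a job to the queue for the card that owns this workload."""
    QUEUES[GPU_CLASS[job_type]].append({"type": job_type, **payload})

def pull(gpu_class: str):
    """Worker side: take the oldest job for this card, or None if idle."""
    q = QUEUES[gpu_class]
    return q.popleft() if q else None

submit("chat", {"prompt": "hi"})
submit("images", {"prompt": "a cat"})
submit("embeddings", {"text": "doc"})
# The 4090 worker now sees only the chat job; the 5060 Ti worker sees the rest.
```

Because each worker pulls only its own tag, a burst of image jobs piles up in the 5060 Ti queue without ever delaying chat traffic on the 4090.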
| Pattern | Latency overhead | Operational complexity | Best for |
|---|---|---|---|
| Per-endpoint reverse proxy | ~1ms | Low | Simple SaaS, predictable traffic |
| LiteLLM gateway | ~5ms | Medium | Multi-tenant, cost tracking, fallback |
| Redis job queue | 10-50ms | Higher | Bursty, async-friendly workloads |
Cost vs single big card
Indicative monthly UK dedicated:
| Setup | £/month | Capabilities | £/M tokens 8B FP8 |
|---|---|---|---|
| 1x 4090 24GB only | £575 | LLM strong, image/audio share VRAM and contend | £0.039 |
| 1x 4090 + 1x 5060 Ti hybrid | £735 | LLM full speed, image/audio dedicated, embeddings free | £0.039 |
| 1x 5090 32GB only | £900 | LLM faster, but still shares for image/audio | £0.041 |
| 1x 6000 Pro 96GB | £2,200 | Everything fits in one card, but expensive | £0.060 |
| 2x 4090 (TP=2 70B FP8) | £1,150 | Bigger LLM, but no dedicated image card | £0.040 |
| 2x 5060 Ti (16GB each) | £300 | Two image/audio cards, no big LLM | n/a |
The hybrid sits at £735/month and gives you predictable throughput on both workload classes. Adding a 5060 Ti to a 4090 costs +28% but eliminates contention – which often matters more than peak throughput. For a multi-tenant SaaS where image generation is bursty and LLM is steady, the contention removal is the difference between meeting and missing SLA. See the multi-tenant SaaS guide for the operational pattern.
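The £/M-token column above is simple arithmetic: monthly cost divided by tokens produced at sustained throughput. A minimal sketch, where the aggregate batched throughput figure is an assumption chosen for illustration rather than a measured number:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def pounds_per_million_tokens(monthly_cost_gbp: float,
                              tokens_per_second: float) -> float:
    """Monthly cost divided by millions of tokens at sustained throughput."""
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH
    return monthly_cost_gbp / (tokens_per_month / 1e6)

# e.g. a £575/month card sustaining an assumed ~5,700 t/s aggregate
# across batched requests:
print(round(pounds_per_million_tokens(575, 5700), 3))  # → 0.039
```

Note the throughput here is aggregate across the batch, not the single-stream decode rate quoted earlier; that distinction is why the per-token cost stays low even on a busy card.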
Configuration and CUDA pinning
```
# vLLM on the 4090
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq --kv-cache-dtype fp8 \
  --max-model-len 8192 --max-num-seqs 4 \
  --port 8001 --gpu-memory-utilization 0.92

# SDXL on the 5060 Ti
CUDA_VISIBLE_DEVICES=1 python -m sdnext.api \
  --port 8002 --model sdxl-1.0-base

# BGE-M3 embeddings on the 5060 Ti (different process, same device)
CUDA_VISIBLE_DEVICES=1 python -m text_embeddings_inference \
  --model-id BAAI/bge-m3 --port 8003
```
- Pin services with `CUDA_VISIBLE_DEVICES=0` for the 4090 (vLLM) and `CUDA_VISIBLE_DEVICES=1` for the 5060 Ti (SDXL/Whisper/embeddings)
- Confirm the host motherboard exposes both PCIe slots at full bandwidth – many consumer boards drop slot 2 to x4 when slot 1 is x16, which barely matters for the 5060 Ti workloads but is worth verifying
- Power budget: 4090 (450W) + 5060 Ti (180W) = 630W GPU draw, plus host overhead. PSU should be 1000W+ with quality 12V rails
- Use separate Docker compose services per card to isolate failures – a vLLM crash on the 4090 must not take down image generation
- Monitor each card independently with DCGM exporter – the 5060 Ti will hit 85% util on SDXL while the 4090 sits at 30% on light chat traffic, which is correct
- Set process-level resource limits with cgroups to prevent runaway services from contending for system memory
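The pinning in the first bullet can be done from a supervisor process rather than a shell wrapper. A hedged sketch, where the service names and device map are illustrative, not a fixed API:

```python
import os

# Assumed device map, mirroring the split in this article:
# device 0 is the 4090, device 1 is the 5060 Ti.
DEVICE_MAP = {
    "vllm": "0",        # 4090 – LLM serving
    "sdxl": "1",        # 5060 Ti – image generation
    "whisper": "1",     # 5060 Ti – transcription
    "embeddings": "1",  # 5060 Ti – BGE-M3
}

def env_for(service: str) -> dict:
    """Copy the parent environment and pin the service to its card."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = DEVICE_MAP[service]
    return env

# subprocess.Popen([...vllm command...], env=env_for("vllm")) would then
# see only GPU 0; inside that process the 4090 appears as cuda:0.
```

The same principle applies under Docker Compose: each service gets an explicit `device_ids` entry rather than inheriting all GPUs.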
Anti-patterns to avoid
The hybrid pattern works because the cards are used independently. The most common ways to break it:
- Tensor-parallel across mismatched cards. Never put a 4090 and a 5060 Ti in the same TP group. The 5060 Ti’s 448 GB/s bandwidth will stall the 4090’s 1,008 GB/s, and effective throughput drops below either card alone.
- Pipeline-parallel across mismatched cards. Same problem, slightly less severe but still bad.
- Splitting one model across both. vLLM with `--tensor-parallel-size 2` and a 4090 + 5060 Ti will detect the mismatch and warn, but the result is dominated by the slower card.
- Sharing one model with both processes. Two vLLM instances of the same Llama 8B FP8, one on each card, will work but you double-pay VRAM. If both cards can run the model, run different services on them instead.
- Letting Docker default to `--gpus all`. Both containers see both cards, fight over device 0, and CUDA OOMs intermittently. Always pin with `device_ids`.
- Embedding service on the 4090 "to save a card". Defeats the point. Embeddings are tiny and fast on the 5060 Ti; loading them on the 4090 just steals KV cache from the LLM.
Production gotchas
- Driver branch consistency. Both cards run under the single host driver, so pick one branch that supports both architectures: an Ada-era branch like 535 drives the 4090 fine but will not recognise the Blackwell 5060 Ti at all. Use a branch new enough for Blackwell across the board.
- nvidia-container-toolkit pinning. The container runtime must enumerate both devices. Verify with `docker run --rm --gpus all nvidia/cuda:12.4-base nvidia-smi`.
- SDXL VRAM creep. Loading SDXL + refiner + ControlNet + IP-Adapter on the 5060 Ti can exceed 14GB. Either disable the refiner or move IP-Adapter inference to a separate process.
- Whisper batch size. Whisper Large v3 defaults to batching 16 segments, which uses ~10GB. If running SDXL concurrently, drop to batch 4.
- Cooling asymmetry. The 4090 (450W) and 5060 Ti (180W) have very different thermal profiles. Position the 5060 Ti as the lower card so its cooler exhaust does not preheat the 4090's intake.
- PCIe Gen5 vs Gen4 lanes. 5060 Ti is Gen5 capable but on a Gen4 host slot it negotiates Gen4 fine. Do not force Gen5 in BIOS – some boards become unstable.
- Power surge on cold boot. Both cards drawing peak transient at the same boot moment can trip cheap PSUs. Stagger service starts by 10s if you see boot OOMs.
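The staggered cold boot from the last bullet can be expressed as a start-offset schedule. A minimal sketch; the service names and the 10s gap are illustrative, and a real deployment would apply the delay via systemd `ExecStartPre=/bin/sleep` or a compose entrypoint:

```python
# Hypothetical service list, ordered by start priority.
SERVICES = ["vllm-4090", "sdxl-5060ti", "whisper-5060ti", "embed-5060ti"]

def start_offsets(services, gap_s: int = 10) -> dict:
    """Map each service to a boot delay of gap_s seconds per position,
    so no two services hit peak transient draw at the same instant."""
    return {name: i * gap_s for i, name in enumerate(services)}

print(start_offsets(SERVICES))
```

Starting the 4090's vLLM instance first also means the heaviest model load (the slowest to warm) finishes earliest.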
When the hybrid breaks down
Skip the hybrid if your image traffic is so heavy that the 5060 Ti saturates – at that point a second 4090 (more SDXL throughput per card) makes more sense, or upgrade the 5060 Ti to a 5070 Ti. Skip it if your LLM workload demands 70B FP8 quality – then a 6000 Pro or 2×4090 TP is the right answer (covered in the multi-card pairing post). Skip it if you need NVLink bridged tenancy – hybrid pairs cannot be NVLinked.
The hybrid is the sweet spot when LLM dominates but image generation, embeddings, audio, or moderation are non-zero and you want them not to compete for VRAM with the LLM.
Verdict and decision criteria
Choose the hybrid when: you serve a SaaS with mixed workload classes (chat + image + embeddings); you want predictable throughput on each class; total monthly budget is in the £700-900 range; image and embedding traffic is steady-but-modest.
Skip the hybrid when: single-workload product (pure LLM chat or pure image generation); LLM workload needs 70B FP8 quality (use 2x 4090 or 6000 Pro); image traffic is so heavy it needs a peer card (use 2x 4090); strict tenancy isolation required (use H100 with MIG).
| Scenario | Best setup |
|---|---|
| RAG SaaS with chat + image + embed | 4090 + 5060 Ti hybrid |
| Pure chat with bursts | 1x 4090 (or 5090) |
| Image-generation product | 2x 5060 Ti or 1x 4090 |
| 70B FP8 production | 2x 4090 TP=2 or 1x 6000 Pro |
| Multi-tenant with isolation SLA | 1x H100 with MIG |
| Voice assistant with realtime LLM | 4090 + 5060 Ti hybrid |
Mix and match for the right workload
Pair an Ada AD102 with a Blackwell entry card for workload-specialised UK dedicated hosting.
Order the RTX 4090 24GB

See also: 2x 4090 pairing, 5060 Ti pairing, 4090 vs 5060 Ti decision, spec breakdown, when to upgrade, multi-tenant SaaS, SaaS RAG, vs 5060 Ti spec.