Hybrid RTX 4090 24GB + RTX 5060 Ti 16GB Pairing

A heterogeneous GPU pair - 4090 for big LLM, 5060 Ti for SDXL, Whisper and embeddings. Workload splitting, routing patterns, cost case and the anti-patterns to avoid.

Heterogeneous GPU pairs are an underrated cost optimisation in production AI infrastructure. An RTX 4090 24GB handles the LLM workload that needs the 24GB ceiling and the FP8 throughput, while a cheap RTX 5060 Ti 16GB takes image generation, Whisper transcription, embedding inference, classification, and any other workload that fits comfortably in 16GB. The two cards do not contend for VRAM, do not stall each other on PCIe, and the combined chassis costs less than a single 6000 Pro while covering a broader workload mix. This article walks through workload splitting, three routing patterns, the cost case versus alternatives, the configuration details and the anti-patterns, with both cards drawn from the UK GPU range.

Contents

  • Why mix cards rather than match them
  • Workload split: who runs what
  • Three routing patterns
  • Cost vs single big card
  • Configuration and CUDA pinning
  • Anti-patterns to avoid
  • Production gotchas
  • When the hybrid breaks down
  • Verdict and decision criteria

Why mix cards rather than match them

Most production AI products do not run a single workload type. They mix LLM chat, image generation, audio transcription, embedding, classification, and increasingly multimodal vision-language inputs. A 4090 is overkill for SDXL or Whisper, yet not enough on its own for Llama 70B at FP8. A second cheap card lets you specialise rather than letting workloads contend for shared VRAM.

| Workload | Best card | VRAM needed | Why |
|---|---|---|---|
| Llama 3.1 70B AWQ INT4 | 4090 | ~22GB | Needs 24GB tier |
| Llama 3.1 8B FP8 chat | 4090 | ~8GB + KV | Native FP8, 198 t/s |
| Mixtral 8x7B AWQ | 4090 | ~25GB, tight | Needs 24GB tier |
| Qwen 2.5 14B FP8 | 4090 | ~15GB | Could fit on 5060 Ti (tight) |
| SDXL image gen | 5060 Ti | ~8GB | £0.0006/image, plenty fast |
| Flux.1 Schnell | 5060 Ti | ~12GB | Fits, fast enough for batch |
| Whisper Large v3 | 5060 Ti | ~6GB | 16GB enough, 4090 wasted |
| BGE-M3 embeddings | 5060 Ti | ~3GB | Tiny model, batch-bound |
| BGE reranker | 5060 Ti | ~2GB | Trivially small |
| Llama Guard 3 (moderation) | 5060 Ti | ~6GB FP8 | Safety classifier, low traffic |

The pattern is “expensive card runs the expensive thing, cheap card runs everything else”. Both cards are near full utilisation but on different bottlenecks: the 4090 is bandwidth-bound on token decode, the 5060 Ti is compute-bound on diffusion steps. They do not contend on PCIe because each handles its own client traffic.

Workload split: who runs what

For a typical RAG-plus-chat-plus-image SaaS the practical split looks like this:

| Card | Resident services | Approx VRAM | Approx peak utilisation |
|---|---|---|---|
| RTX 4090 24GB | vLLM Llama 3.1 70B AWQ INT4 (batch 4) OR Llama 3.1 8B FP8 (batch 16) | ~22GB | 85% SM, 95% memory |
| RTX 5060 Ti 16GB | SDXL + refiner, Whisper Large v3, BGE-M3 embedding, Llama Guard 3 FP8 | ~14GB | 80% SM during diffusion |

You can run several services on the 5060 Ti because they activate at different times. A typical traffic pattern is bursty image generation interleaved with steady embedding queries; the 5060 Ti handles the burst at peak SM utilisation and embeds during the diffusion idle. Whisper transcription is event-driven and rarely concurrent with image gen.

An alternative split for an inference-heavy product:

| Card | Service | Notes |
|---|---|---|
| 4090 | Llama 70B AWQ + Llama 8B FP8 (separate vLLM instances on same card) | 22GB + KV separation, tight |
| 5060 Ti | BGE-M3 + reranker + Llama Guard 3 + Whisper | ~12GB total, no contention |

Three routing patterns

Three common ways to wire this up, in increasing operational complexity:

1. Per-endpoint reverse proxy

Caddy or nginx routes /v1/chat/completions to the 4090 backend, /v1/images/generations to the 5060 Ti, /v1/embeddings to the 5060 Ti, /v1/audio/transcriptions to the 5060 Ti. Simplest pattern, ~1ms overhead, no extra dependencies.

# Caddyfile
api.example.com {
  reverse_proxy /v1/chat/* localhost:8001  # 4090 - vLLM
  reverse_proxy /v1/images/* localhost:8002  # 5060 Ti - SDXL
  reverse_proxy /v1/embeddings localhost:8003  # 5060 Ti - BGE
  reverse_proxy /v1/audio/* localhost:8004  # 5060 Ti - Whisper
}

2. LiteLLM gateway

A LiteLLM proxy in front of two backends gives you OpenAI-compatible routing, A/B traffic splitting, fallback on backend failure, and request-level cost tracking. Adds ~5ms latency.
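
A minimal config sketch for this pattern, assuming LiteLLM's standard YAML layout; the model names, ports and backend mappings below are illustrative rather than a tested deployment:

# litellm-config.yaml
model_list:
  - model_name: chat-70b
    litellm_params:
      model: openai/llama-3.1-70b-awq        # vLLM backend on the 4090
      api_base: http://localhost:8001/v1
      api_key: "none"
  - model_name: embeddings
    litellm_params:
      model: openai/bge-m3                   # embedding backend on the 5060 Ti
      api_base: http://localhost:8003/v1
      api_key: "none"

# then: litellm --config litellm-config.yaml --port 4000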

3. Redis job queue

Submit all work to a Redis queue tagged with gpu_class; workers on each card pull only matching jobs. Best when traffic is bursty and you want graceful queueing during peaks. Adds 10-50ms queue overhead but lets you absorb 10x traffic spikes without dropping requests.
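
A sketch of the queue convention; the key names and JSON payload fields are hypothetical, chosen only to show how gpu_class tagging keeps each card pulling its own work:

# producers push jobs onto a per-gpu_class list
redis-cli LPUSH jobs:gpu-large '{"task":"chat","request_id":"r1"}'    # 4090 work
redis-cli LPUSH jobs:gpu-small '{"task":"sdxl","request_id":"r2"}'    # 5060 Ti work

# each worker blocks only on its own list, so the cards never steal each other's jobs
redis-cli BRPOP jobs:gpu-large 0    # worker process bound to the 4090
redis-cli BRPOP jobs:gpu-small 0    # worker process bound to the 5060 Ti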

| Pattern | Latency overhead | Operational complexity | Best for |
|---|---|---|---|
| Per-endpoint reverse proxy | ~1ms | Low | Simple SaaS, predictable traffic |
| LiteLLM gateway | ~5ms | Medium | Multi-tenant, cost tracking, fallback |
| Redis job queue | 10-50ms | Higher | Bursty, async-friendly workloads |

Cost vs single big card

Indicative monthly UK dedicated pricing:

| Setup | £/month | Capabilities | £/M tokens (8B FP8) |
|---|---|---|---|
| 1x 4090 24GB only | £575 | LLM strong, image/audio share VRAM and contend | £0.039 |
| 1x 4090 + 1x 5060 Ti hybrid | £735 | LLM full speed, image/audio dedicated, embeddings free | £0.039 |
| 1x 5090 32GB only | £900 | LLM faster, but still shares for image/audio | £0.041 |
| 1x 6000 Pro 96GB | £2,200 | Everything fits in one card, but expensive | £0.060 |
| 2x 4090 (TP=2, 70B FP8) | £1,150 | Bigger LLM, but no dedicated image card | £0.040 |
| 2x 5060 Ti (16GB each) | £300 | Two image/audio cards, no big LLM | n/a |

The hybrid sits at £735/month and gives you predictable throughput on both workload classes. Adding a 5060 Ti to a 4090 costs +28% but eliminates contention – which often matters more than peak throughput. For a multi-tenant SaaS where image generation is bursty and LLM is steady, the contention removal is the difference between meeting and missing SLA. See the multi-tenant SaaS guide for the operational pattern.

Configuration and CUDA pinning

# vLLM on the 4090
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq --kv-cache-dtype fp8 \
  --max-model-len 8192 --max-num-seqs 4 \
  --port 8001 --gpu-memory-utilization 0.92

# SDXL on the 5060 Ti (example: an AUTOMATIC1111/SD.Next-style server launched from its repo;
# point it at the SDXL 1.0 base checkpoint in the server's own config - flags vary by stack)
CUDA_VISIBLE_DEVICES=1 python launch.py --api --port 8002

# BGE-M3 embeddings on the 5060 Ti via text-embeddings-inference (different process, same device)
CUDA_VISIBLE_DEVICES=1 text-embeddings-router \
  --model-id BAAI/bge-m3 --port 8003

  • Pin services with CUDA_VISIBLE_DEVICES=0 for the 4090 (vLLM) and CUDA_VISIBLE_DEVICES=1 for the 5060 Ti (SDXL/Whisper/embeddings)
  • Confirm the host motherboard exposes both PCIe slots at full bandwidth – many consumer boards drop slot 2 to x4 when slot 1 is x16, which barely matters for the 5060 Ti workloads but is worth verifying
  • Power budget: 4090 (450W) + 5060 Ti (180W) = 630W GPU draw, plus host overhead. PSU should be 1000W+ with quality 12V rails
  • Use separate Docker compose services per card to isolate failures – a vLLM crash on the 4090 must not take down image generation (see the compose sketch after this list)
  • Monitor each card independently with DCGM exporter – the 5060 Ti will hit 85% util on SDXL while the 4090 sits at 30% on light chat traffic, which is correct
  • Set process-level resource limits with cgroups to prevent runaway services from contending for system memory
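
A minimal Compose sketch of that per-card isolation, using the standard device-reservation syntax; the service names are arbitrary and the 5060 Ti image is a placeholder for whatever actually serves SDXL/Whisper/embeddings:

# docker-compose.yml (sketch)
services:
  vllm-4090:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.1-70B-Instruct-AWQ --quantization awq --port 8001
    network_mode: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]        # 4090 only
              capabilities: [gpu]
    restart: unless-stopped

  media-5060ti:
    image: your-sdxl-whisper-image:latest    # placeholder image
    network_mode: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]        # 5060 Ti only
              capabilities: [gpu]
    restart: unless-stopped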

Anti-patterns to avoid

The hybrid pattern works because the cards are used independently. The most common ways to break it:

  1. Tensor-parallel across mismatched cards. Never put a 4090 and a 5060 Ti in the same TP group. The 5060 Ti’s 448 GB/s bandwidth will stall the 4090’s 1,008 GB/s, and effective throughput drops below either card alone.
  2. Pipeline-parallel across mismatched cards. Same problem, slightly less severe but still bad.
  3. Splitting one model across both. vLLM with --tensor-parallel-size 2 and a 4090 + 5060 Ti will detect the mismatch and warn, but the result is dominated by the slower card.
  4. Sharing one model with both processes. Two vLLM instances of the same Llama 8B FP8, one on each card, will work but you double-pay VRAM. If both cards can run the model, run different services on them instead.
  5. Letting Docker default to --gpus all. Both containers see both cards, fight over device 0, and CUDA OOM intermittently. Always pin with device_ids.
  6. Embedding service on the 4090 “to save a card”. Defeats the point. Embeddings are tiny and fast on the 5060 Ti; loading them on the 4090 just steals KV from the LLM.

Production gotchas

  1. Driver branch consistency. The host runs a single NVIDIA driver for both cards, and it must be a branch new enough to support Blackwell. A 535-series driver that is perfectly stable for the 4090 will not drive the 5060 Ti at all, so install the latest production branch that covers both and keep it consistent across rebuilds.
  2. nvidia-container-toolkit pinning. The container runtime must enumerate both devices; verify with the sanity checks shown after this list.
  3. SDXL VRAM creep. Loading SDXL + refiner + ControlNet + IP-Adapter on the 5060 Ti can exceed 14GB. Either disable refiner or move IP-Adapter inference to a separate process.
  4. Whisper batch size. Whisper Large v3's default batch of 16 segments uses ~10GB. If running concurrent SDXL, drop to batch 4.
  5. Cooling asymmetry. The 4090 (450W) and 5060 Ti (180W) have very different thermal profiles. Position the 5060 Ti as the lower card so its exhaust does not preheat the 4090's intake.
  6. PCIe Gen5 vs Gen4 lanes. 5060 Ti is Gen5 capable but on a Gen4 host slot it negotiates Gen4 fine. Do not force Gen5 in BIOS – some boards become unstable.
  7. Power surge on cold boot. Both cards drawing their peak transient at the same boot moment can trip cheap PSUs. Stagger service starts by 10s if you see power trips or resets at boot.
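
Two sanity checks worth running after any driver or container-toolkit change, covering gotchas 1, 2 and 6 in one pass (the CUDA image tag is just an example; use whichever base image you already pull):

# both cards visible on one driver branch, plus their negotiated PCIe generation
nvidia-smi --query-gpu=index,name,driver_version,pcie.link.gen.current --format=csv

# container runtime enumerates both devices
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L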

When the hybrid breaks down

Skip the hybrid if your image traffic is so heavy that the 5060 Ti saturates – at that point a second 4090 (more SDXL throughput per card) makes more sense, or upgrade the 5060 Ti to a 5070 Ti. Skip it if your LLM workload demands 70B FP8 quality – then a 6000 Pro or 2×4090 TP is the right answer (covered in the multi-card pairing post). Skip it if you need NVLink bridged tenancy – hybrid pairs cannot be NVLinked.

The hybrid is the sweet spot when LLM dominates but image generation, embeddings, audio, or moderation are non-zero and you want them not to compete for VRAM with the LLM.

Verdict and decision criteria

Choose the hybrid when: you serve a SaaS with mixed workload classes (chat + image + embeddings); you want predictable throughput on each class; total monthly budget is in the £700-900 range; image and embedding traffic is steady-but-modest.

Skip the hybrid when: single-workload product (pure LLM chat or pure image generation); LLM workload needs 70B FP8 quality (use 2x 4090 or 6000 Pro); image traffic is so heavy it needs a peer card (use 2x 4090); strict tenancy isolation required (use H100 with MIG).

| Scenario | Best setup |
|---|---|
| RAG SaaS with chat + image + embed | 4090 + 5060 Ti hybrid |
| Pure chat with bursts | 1x 4090 (or 5090) |
| Image-generation product | 2x 5060 Ti or 1x 4090 |
| 70B FP8 production | 2x 4090 TP=2 or 1x 6000 Pro |
| Multi-tenant with isolation SLA | 1x H100 with MIG |
| Voice assistant with realtime LLM | 4090 + 5060 Ti hybrid |

Mix and match for the right workload

Pair an Ada AD102 with a Blackwell entry card for workload-specialised UK dedicated hosting.

Order the RTX 4090 24GB

See also: 2x 4090 pairing, 5060 Ti pairing, 4090 vs 5060 Ti decision, spec breakdown, when to upgrade, multi-tenant SaaS, SaaS RAG, vs 5060 Ti spec.
