If you are building an AI-first MVP and your runway is measured in months, not years, the RTX 4090 24GB is the most pragmatic backend you can rent. One card holds the chat model, the embedding model, a reranker and an image generator, with room to spare for a Whisper-based voice channel, and the predictable monthly bill on UK GPU hosting beats the 30-day variance of pay-per-token APIs at almost any serious scale. This guide describes a workload we have shipped many times: a 200-400 MAU AI SaaS MVP with chat, RAG, image generation and optional voice on one card.
Contents
- Why one card is enough at MVP scale
- The all-in-one stack
- VRAM budget and co-location
- Latency budget breakdown
- User capacity table
- Cost vs APIs and cloud
- Developer experience
- Scaling triggers
- Pitfalls we have seen
Why one card is enough at MVP scale
An MVP at 200-400 MAU generates roughly 8-22 active sessions during the busy hour, with each session producing 2-5 LLM turns and perhaps one image generation. That sums to roughly 110 LLM turns in the peak hour (~28,000 output tokens, under 10 t/s averaged) and ~6 image generations. A single 4090 running Llama 3.1 8B FP8 sustains roughly 1,100 t/s aggregate at batch 32 and renders an SDXL 1024 image in 2 seconds; both numbers leave well over 80% headroom at peak MVP load. The card is rarely the bottleneck before product-market fit; the team's iteration speed is.
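The arithmetic behind that envelope, as a back-of-envelope Python check; every input is an assumption from the paragraph above, not a measurement:

```python
# Back-of-envelope peak-load check; all inputs are assumptions
# from the text above, not measurements.
PEAK_SESSIONS = 22        # busy-hour active sessions at 400 MAU
TURNS_PER_SESSION = 5     # LLM turns per session, upper bound
TOKENS_PER_TURN = 250     # median output tokens per turn
CARD_TPS = 1_100          # aggregate t/s, Llama 3.1 8B FP8 at batch 32

turns_per_hour = PEAK_SESSIONS * TURNS_PER_SESSION   # ~110
tokens_per_hour = turns_per_hour * TOKENS_PER_TURN   # ~27,500
avg_tps = tokens_per_hour / 3600                     # ~7.6 t/s

print(f"average demand {avg_tps:.1f} t/s, headroom {1 - avg_tps / CARD_TPS:.1%}")
# -> average demand 7.6 t/s, headroom 99.3%
```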
The architecture below is deliberately boring: one VPS-style GPU host on UK dedicated hosting, OpenAI-compatible HTTP endpoints for everything, no Kubernetes. You can add complexity when you raise a Series A and the load justifies it.
The all-in-one stack
The reference MVP stack we recommend on a single 4090, with the role each component plays:
| Component | Model | Server | VRAM resident | Notes |
|---|---|---|---|---|
| Chat / agent LLM | Llama 3.1 8B FP8 | vLLM 0.6.4 | 10 GB weights + ~3 GB KV (0.55 cap) | OpenAI-compatible |
| Embeddings | BGE-large-en-v1.5 | text-embeddings-inference | 1.4 GB | 1024 dim |
| Reranker | BGE-reranker-large | TEI | 1.4 GB | For top-100 to top-6 |
| Image generator | SDXL base + refiner | diffusers | ~7 GB on demand | 2 s per 1024 image |
| Speech-to-text | Whisper Turbo INT8 | faster-whisper | 1.7 GB on demand | 80x RT |
| Vector store | Qdrant 1.10 | CPU + NVMe | 0 (host RAM) | Comfortable up to 10M chunks |
Total footprint with chat + embeddings + reranker resident (~16 GB: vLLM's 13.2 GB cap plus ~2.8 GB for the two TEI models) and image or audio invoked on demand: ~18 GB peak with Whisper, ~23 GB with SDXL. That leaves 1-6 GB of safety margin on a 24 GB card, which is why SDXL runs through a single-worker queue (see the pitfalls section).
VRAM budget and co-location
The trick is to keep the always-warm services small (the vLLM cap of 13.2 GB plus ~2.8 GB for embeddings and reranker, ~16 GB total) and treat image and audio as on-demand workers that swap into the remaining ~8 GB. With --gpu-memory-utilization 0.55 on vLLM the LLM honours its budget, and the image worker can grab 7 GB transiently for a 2-second SDXL run, then release it.
```bash
# vLLM launch with capped memory share
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.55
```
Image and Whisper run as separate processes with their own CUDA context, scheduled by a tiny Python queue (Redis + RQ works fine at MVP scale). The risk is concurrent OOM if SDXL and Whisper both fire at peak; we cover that in the gotchas section. For broader stack context see the SaaS RAG guide and the image generation studio.
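A minimal sketch of the enqueue side of that queue, assuming Redis on the host; sdxl_worker.generate is a hypothetical module:function path that the worker imports and runs in its own CUDA context. Running exactly one RQ worker on the image queue is what serialises SDXL:

```python
# jobs.py - enqueue side of the on-demand worker queue.
# One RQ queue per on-demand workload; a single worker process on the
# "image" queue serialises SDXL so two generations never hold VRAM at once.
from redis import Redis
from rq import Queue

redis = Redis(host="localhost", port=6379)
image_q = Queue("image", connection=redis)

def request_image(prompt: str) -> str:
    # "sdxl_worker.generate" is a hypothetical module:function path;
    # the worker imports it and runs SDXL in its own CUDA context.
    job = image_q.enqueue("sdxl_worker.generate", prompt, job_timeout=60)
    return job.id
```

Start the worker with rq worker image; its default of one job at a time is the point, so resist adding concurrency here.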
Latency budget breakdown
For a typical RAG + chat turn on the MVP stack, the perceived latency budget breaks down as follows:
| Stage | Median | P95 |
|---|---|---|
| Embed user query (BGE-large) | 2 ms | 5 ms |
| Qdrant top-100 (3M chunks) | 18 ms | 42 ms |
| Rerank top-100 to top-6 (BGE-rr-large) | 42 ms | 68 ms |
| LLM prefill (1500 tokens, prefix-cached) | 140 ms | 220 ms |
| LLM decode (250 tokens at 200 t/s) | 1,250 ms | 1,650 ms |
| TTFT (streamed) | ~210 ms | ~340 ms |
| Total full reply | ~1.5 s | ~2.0 s |
TTFT under 350 ms p95 is comfortably “feels instant”; the streaming reply lands inside the 2-second budget that mainstream consumer chat sets. SDXL image generation is a parallel path and adds 2 s wall-clock when invoked.
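To check these budgets against your own deployment, a minimal measurement sketch using the OpenAI Python SDK pointed at the local vLLM server; the endpoint URL and model name are the assumptions from the stack table:

```python
# Measure TTFT and full-reply latency against the local vLLM endpoint.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise our refund policy."}],
    max_tokens=250,
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # first visible token
total = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.0f} ms, full reply: {total:.2f} s")
```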
User capacity table
| Workload | Throughput | 200 MAU? | 400 MAU? | Notes |
|---|---|---|---|---|
| LLM chat (Llama 3.1 8B FP8) | 1,100 t/s aggregate | Yes (5% of card) | Yes (10%) | 30-active-user envelope |
| Embeddings (BGE-large) | 5,200 texts/s | Yes | Yes | Index + query |
| Reranker | 42 ms top-100 | Yes | Yes | Sub-100 ms |
| SDXL 1024 30-step | 2.0 s per image | ~1,800/hr capacity | ~1,800/hr | On demand |
| Whisper Turbo INT8 | 80x RT | ~60 concurrent | ~60 concurrent | If voice channel enabled |
200 MAU at a typical SaaS activity distribution generates roughly 8-12 active chat sessions in the busy hour, well within the 30-active-user envelope of Llama 3.1 8B FP8 on a 4090. 400 MAU pushes you to 16-22 active, still inside the SLA, especially with prefix caching enabled. Numbers cross-reference the concurrent users benchmark.
Cost vs APIs and cloud
A 4090 host runs in the low hundreds of GBP per month on UK dedicated hosting (see the monthly hosting cost piece). The same 200 MAU on OpenAI's GPT-4o-mini at typical chat consumption (1-2M output tokens per active user per month) plus SDXL replacement on Replicate would land in the mid hundreds, roughly doubling at 400 MAU. The break-even arrives well before product-market fit. Detail in 4090 vs OpenAI API cost and 4090 vs Anthropic API cost.
| Approach | 200 MAU monthly | 400 MAU monthly | Variance risk |
|---|---|---|---|
| 4090 dedicated | ~250 GBP fixed | ~250 GBP fixed | None |
| OpenAI GPT-4o-mini (chat only) | ~150-300 GBP | ~300-600 GBP | Spiky users break budget |
| OpenAI + Replicate SDXL | ~300-500 GBP | ~600-1,000 GBP | Image-heavy MAU explodes |
| Cloud H100 on demand | ~2,000 GBP | ~2,000 GBP | Wildly over-provisioned |
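The arithmetic behind the table as a sketch; the blended GBP-per-million-token rate is illustrative, back-solved from the ranges above rather than a quoted price, so check live API pricing before relying on it:

```python
# Fixed-cost vs pay-per-token break-even, using the article's assumptions.
FIXED_GBP = 250                 # 4090 dedicated host, per month
TOKENS_PER_USER = (1e6, 2e6)    # output tokens per active user per month
RATE_GBP_PER_M = 0.75           # illustrative blended API rate per 1M tokens

for mau in (200, 400):
    lo = mau * TOKENS_PER_USER[0] / 1e6 * RATE_GBP_PER_M
    hi = mau * TOKENS_PER_USER[1] / 1e6 * RATE_GBP_PER_M
    print(f"{mau} MAU: API ~{lo:.0f}-{hi:.0f} GBP/mo vs {FIXED_GBP} GBP fixed")
# 200 MAU: API ~150-300 GBP/mo vs 250 GBP fixed
# 400 MAU: API ~300-600 GBP/mo vs 250 GBP fixed
```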
Developer experience
vLLM exposes an OpenAI-compatible HTTP server so your existing OpenAI SDK code runs unchanged with a base URL swap. TEI exposes embeddings and reranker the same way. This means your MVP keeps the option to fall back to OpenAI for spillover or for capabilities not yet replicated locally, which is what most pragmatic teams do for the first few months. The vLLM setup guide and the FP8 Llama deployment piece have copy-paste configs.
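In practice the base-URL swap plus spillover looks like the sketch below; the local endpoint, model names and error choices are illustrative assumptions:

```python
# Same OpenAI SDK, two backends: local vLLM first, hosted API as spillover.
from openai import APIConnectionError, APITimeoutError, OpenAI

local = OpenAI(base_url="http://gpu-host:8000/v1", api_key="unused")
hosted = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(messages: list[dict]) -> str:
    try:
        r = local.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=messages,
            timeout=10,  # fail over quickly if the card is saturated
        )
    except (APIConnectionError, APITimeoutError):
        # Spillover path: hosted API when the local box is down or slow.
        r = hosted.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return r.choices[0].message.content
```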
Scaling triggers
Concrete triggers for adding capacity, in order:
- Active concurrency > 25 sustained. You are pressing against the ~30-active-user ceiling of Llama 3.1 8B FP8. Add a second 4090 behind a least-loaded balancer; vLLM scales linearly across replicas. (A watcher sketch for this trigger and the next follows the list.)
- Image generation queue depth > 10 sustained. SDXL is the most contention-prone workload on the shared card. Move SDXL to a dedicated 5060 Ti or second 4090.
- Quality complaints on ambiguous queries. Upgrade the LLM to Qwen 14B AWQ for higher answer quality at half the throughput. You can co-host both and route by classifier.
- Voice channel enabled. Whisper + LLM + TTS stresses the card; consider a dedicated voice stack on a second box, sized using the voice assistant guide.
- RAG corpus > 30M chunks. Move Qdrant to a dedicated CPU box with NVMe; the GPU host should not run a 60 GB vector index in RAM.
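A minimal watcher for the first two triggers, assuming the RQ image queue from earlier and vLLM's Prometheus /metrics endpoint; the gauge name below matches recent vLLM versions, but verify it against your version's /metrics output:

```python
# watch_triggers.py - poll the two leading scaling signals:
# active LLM sequences (vLLM Prometheus gauge) and image queue depth (RQ).
import re
import time

import requests
from redis import Redis
from rq import Queue

image_q = Queue("image", connection=Redis())

def llm_running(metrics_url: str = "http://localhost:8000/metrics") -> float:
    # vllm:num_requests_running is vLLM's gauge for in-flight sequences;
    # confirm the exact name against your vLLM version.
    text = requests.get(metrics_url, timeout=5).text
    m = re.search(r"^vllm:num_requests_running\S*\s+([\d.]+)", text, re.M)
    return float(m.group(1)) if m else 0.0

while True:
    running, depth = llm_running(), image_q.count
    if running > 25:
        print(f"WARN: {running:.0f} active sequences - plan the second card")
    if depth > 10:
        print(f"WARN: image queue depth {depth} - move SDXL off the shared card")
    time.sleep(30)
```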
Pitfalls we have seen
- SDXL OOM under concurrent load. If two SDXL requests fire while the resident services hold ~16 GB, the second one OOMs. Serialise SDXL through a single-worker queue at MVP scale.
- vLLM allocates max KV at startup. Setting --max-model-len 65536 when most prompts are 2k wastes 6 GB of KV cache memory. Pin to your real maximum.
- BGE-large is overkill for many MVPs. BGE-base at 768 dim is faster, smaller and still rerankable; switch back only if recall is genuinely insufficient.
- Forgetting prefix caching. Without --enable-prefix-caching, every chat request reprocesses the system prompt. Enabling it doubles effective concurrency for free.
- Hot-reloading the model in dev. Save 8 minutes per iteration by leaving vLLM running and swapping the OpenAI base URL in your client; do not restart the server unless you change the model file.
- Confusing TPS and t/s/user. Aggregate 1,100 t/s does not mean every user gets 1,100 t/s; at batch 30 each user gets ~36 t/s. Plan UX for the per-user rate.
- Treating the 4090 as a web server. Run nginx and the FastAPI gateway on a separate small VM; do not co-host network ingress on the GPU box.
Verdict
For the first 12-18 months of an AI startup, a single 4090 is the right backend. It runs the entire stack at MVP scale, the cost is predictable, the operational surface is small, and it scales horizontally when traffic justifies it. The decision tree only changes when you cross 5,000 MAU or need flagship-quality responses on every turn; at that point see the 4090 vs 5090 decision, the 4090 vs cloud H100 piece, and consider hybrid pairings as in the 4090 + 5060 Ti hybrid guide. For lighter MVPs see the 5060 Ti MVP comparison.
Ship your AI MVP on one card
Chat, embeddings, images and voice on a single 4090. Predictable UK dedicated hosting.
Order the RTX 4090 24GB

See also: SaaS RAG stack, chatbot backend, document Q&A, image generation studio, 4090 vs OpenAI API cost, monthly hosting cost, ROI analysis.