Sub-two-second RAG queries on a single GPU. We benchmarked BGE-M3 embeddings alongside LLaMA 3 8B at full FP16 precision on the RTX 5090 (32 GB VRAM) running on a GigaGPU dedicated server. The result is the fastest single-card RAG performance in our test suite: 1.86 seconds from query to generated answer, with enough VRAM left over to load an entire additional model.
Models tested: BGE-M3 Embedding + LLaMA 3 8B
RAG Speed Benchmarks
| Component | Metric | Value |
|---|---|---|
| BGE-M3 Embedding | Tokens/sec | 1170 |
| BGE-M3 Embedding | Doc chunks/sec (256 tok) | 4.6 |
| LLaMA 3 8B (FP16) | Generation tok/sec | 85.0 |
| End-to-end RAG query | Latency (retrieve+generate) | 1.86s |
All models loaded simultaneously in GPU memory. Throughput figures reflect concurrent operation with shared VRAM and compute.
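The end-to-end figure can be sanity-checked from the per-stage throughputs above. A minimal latency-budget sketch, assuming a 64-token query, a roughly 150-token answer, and about 50 ms for the vector search (the query length, answer length, and search time are illustrative assumptions, not measured values):

```python
# Latency budget for one RAG query, built from the benchmarked throughputs.
# Query size, answer size, and vector-search time are illustrative assumptions.

EMBED_TPS = 1170   # BGE-M3 embedding tokens/sec (measured)
GEN_TPS = 85.0     # LLaMA 3 8B FP16 generation tokens/sec (measured)

def rag_latency(query_tokens: int, answer_tokens: int, search_s: float = 0.05) -> float:
    """Embed the query, search the vector index, then generate the answer."""
    embed_s = query_tokens / EMBED_TPS
    gen_s = answer_tokens / GEN_TPS
    return embed_s + search_s + gen_s

print(f"{rag_latency(64, 150):.2f}s")  # lands close to the measured 1.86s
```

The breakdown makes the bottleneck obvious: generation dominates, embedding and search are rounding errors, so the 85 tok/s generation figure is what drives the sub-two-second result.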
VRAM to Burn
| Component | VRAM |
|---|---|
| Combined model weights | 19.2 GB |
| Total RTX 5090 VRAM | 32 GB |
| Free headroom | ~12.8 GB |
Nearly 13 GB of VRAM sits unused after both models are loaded. That is not just headroom — it is an open invitation to build a richer pipeline. Add a cross-encoder re-ranker for precision-critical retrieval. Stack a Whisper model for voice-based RAG queries. Load a larger embedding model for multilingual corpora. The 5090 gives you options that smaller cards simply cannot offer.
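As one example of using that headroom, a re-ranking stage slots in between retrieval and generation. The sketch below uses a trivial word-overlap scorer as a stand-in for a real cross-encoder loaded into the spare VRAM; the scorer and the sample chunks are illustrative, not part of the benchmark.

```python
# Re-rank retrieved chunks before generation. The overlap scorer is a
# placeholder for a cross-encoder's relevance score; swap in a real model.

def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance: fraction of query words that also appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Keep the top_k chunks by relevance score, best first."""
    return sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)[:top_k]

chunks = [
    "The RTX 5090 has 32 GB of VRAM.",
    "Cooking pasta takes about ten minutes.",
    "VRAM headroom lets you load extra models on the 5090.",
]
print(rerank("How much VRAM does the RTX 5090 have?", chunks, top_k=2))
```

The interface is the point: retrieval hands the re-ranker a candidate list, the re-ranker returns a shorter, better-ordered one, and only those chunks go into the generation prompt.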
Cost Perspective
| Cost Metric | Value |
|---|---|
| Server cost (single GPU) | £1.50/hr (£299/mo) |
| Equivalent separate GPUs | £3.00/hr |
| Savings vs separate servers | 50% |
At £299/mo, the 5090 is the premium option for self-hosted RAG. It justifies its price through raw speed: 85 tok/s generation and 1170 tok/s embedding mean both the retrieval and generation stages are faster than on any other card we tested. For high-traffic enterprise RAG deployments — internal search engines, compliance Q&A, or customer-facing knowledge assistants — the lower per-query latency directly improves user experience. See all benchmarks for the full comparison.
Enterprise-Grade RAG on One Card
The 5090 is built for RAG systems that need to scale. At 4.6 document chunks per second, it ingests new content fast enough to keep vector indices fresh in near-real-time. At 85 tok/s generation, answers come back before users lose patience. And with 12.8 GB of VRAM headroom, you have the flexibility to evolve your LangChain or custom RAG architecture without hardware constraints. This is the card for teams that have outgrown API-based RAG and need the throughput and control that only dedicated GPU infrastructure provides.
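The ingestion side of that pipeline starts with chunking. A minimal sketch of 256-token chunking with overlap, approximating tokens as whitespace-separated words (a real pipeline would count tokens with the embedding model's own tokenizer):

```python
# Split a document into ~256-token chunks with overlap so no sentence is
# stranded at a boundary. Words stand in for tokens here; use the embedding
# model's tokenizer in production.

def chunk_text(text: str, chunk_tokens: int = 256, overlap: int = 32) -> list[str]:
    words = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(words):
            break
    return chunks

doc = "token " * 600  # a 600-word stand-in document
pieces = chunk_text(doc)
print(len(pieces))  # 3 chunks: windows of 256 words stepping by 224
```

At the benchmarked 4.6 chunks/sec, each of these 256-token chunks embeds in well under a quarter of a second, which is what keeps the index fresh as new documents arrive.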
Quick deploy:

```shell
# text-embeddings-inference + llama.cpp + chromadb containers
docker compose up -d
```
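A compose file behind that command might look like the sketch below. Image tags, flags, ports, and model paths are assumptions to adapt to your own setup, not the exact configuration we benchmarked.

```yaml
# Illustrative docker-compose.yml: embedding server, LLM server, vector DB.
# Image tags, command flags, and model paths are assumptions; adjust as needed.
services:
  embeddings:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-m3
    ports: ["8080:80"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    command: -m /models/llama-3-8b.gguf --n-gpu-layers 99
    volumes: ["./models:/models"]
    ports: ["8081:8080"]
  chromadb:
    image: chromadb/chroma:latest
    ports: ["8000:8000"]
```

All three services share the one GPU, which is exactly the concurrent-load scenario the throughput figures above were measured under.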
See our LLM hosting guide, RAG hosting guide, LangChain hosting, and all benchmark results. Related benchmarks: LLaMA 3 8B on RTX 5090.
Deploy RAG Pipeline on RTX 5090
Order this exact configuration. UK datacenter, full root access.
Order RTX 5090 Server