Fast embedding and fast generation on the same GPU is what makes self-hosted RAG practical. We tested BGE-M3 and LLaMA 3 8B (INT4) running concurrently on a single RTX 5080 (16 GB VRAM) inside a GigaGPU dedicated server. The Blackwell architecture delivers strong throughput for both models at once, and the INT4 quantisation keeps the whole stack well within the 16 GB budget.
Models tested: BGE-M3 Embedding + LLaMA 3 8B
Retrieval + Generation Numbers
| Component | Metric | Value |
|---|---|---|
| BGE-M3 Embedding | Tokens/sec | 870 |
| BGE-M3 Embedding | Doc chunks/sec (256 tok) | 3.4 |
| LLaMA 3 8B (INT4) | Generation tok/sec | 69.7 |
| End-to-end RAG query | Latency (retrieve+generate) | 2.25s |
All models loaded simultaneously in GPU memory. Throughput figures reflect concurrent operation with shared VRAM and compute.
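The embedding figures in the table are internally consistent, which you can verify with a quick back-of-envelope check. The query token count below is an assumption for illustration, not a measured value:

```python
# Sanity-check the table: chunk throughput and query-embedding time
# derived from the measured 870 embedding tokens/sec.

EMBED_TOK_PER_SEC = 870      # measured BGE-M3 throughput (table above)
CHUNK_TOKENS = 256           # chunk size used in the benchmark

chunks_per_sec = EMBED_TOK_PER_SEC / CHUNK_TOKENS
print(f"{chunks_per_sec:.1f} chunks/sec")   # 3.4, matching the table

# Time to embed a typical user query (~32 tokens assumed):
query_tokens = 32
print(f"{query_tokens / EMBED_TOK_PER_SEC * 1000:.0f} ms to embed a query")
```

At well under 50 ms, query embedding is a negligible slice of the 2.25 s end-to-end latency; generation dominates.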
Lean Memory Profile
| Component | VRAM |
|---|---|
| Combined model weights | 7.7 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~8.3 GB |
The INT4-quantised LLM and the compact BGE-M3 encoder together consume under 8 GB, leaving over half the VRAM free. That generous headroom is not wasted — it accommodates large KV caches for multi-turn RAG conversations, batch embedding of multiple queries, and in-memory vector indices. You could even add a re-ranking model to improve retrieval quality without worrying about memory limits.
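To put that headroom in concrete terms, here is a rough estimate of how much LLaMA 3 8B KV cache fits in the free VRAM. The architecture constants are LLaMA 3 8B's published configuration (32 layers, 8 grouped-query-attention KV heads, head dimension 128); an FP16 cache is assumed, though llama.cpp can also quantise the cache to stretch this further:

```python
# Back-of-envelope: LLaMA 3 8B KV cache capacity in ~8.3 GB of free VRAM.

LAYERS = 32        # transformer layers
KV_HEADS = 8       # grouped-query attention KV heads
HEAD_DIM = 128     # per-head dimension
BYTES = 2          # FP16 cache assumed

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K + V
print(kv_bytes_per_token)  # 131072 bytes = 128 KiB per cached token

headroom_gib = 8.3
max_tokens = headroom_gib * 2**30 / kv_bytes_per_token
print(f"~{max_tokens / 1000:.0f}k tokens of KV cache fit in the headroom")
```

Tens of thousands of cacheable context tokens is comfortably more than the model's context window, so long multi-turn conversations and batched requests both fit.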
Economics of Self-Hosted RAG
| Cost Metric | Value |
|---|---|
| Server cost (single GPU) | £0.95/hr (£189/mo) |
| Equivalent separate GPUs | £1.90/hr |
| Savings vs separate servers | 50% |
At £189/mo, the 5080 runs the full RAG pipeline — embedding, retrieval, and generation — for a fixed monthly cost. Compare that to per-query pricing on managed RAG services and the breakeven happens fast, especially at enterprise query volumes. The 5080 also outperforms the RTX 3090 on both embedding speed (870 vs 570 tok/s) and generation speed (69.7 vs 52.7 tok/s). See all benchmarks for the complete lineup.
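The breakeven point against per-query pricing is a one-line calculation. The £189/mo figure comes from the table above; the managed per-query rate below is a hypothetical placeholder, so substitute your provider's actual pricing:

```python
# Breakeven vs per-query managed-RAG pricing.
server_cost_per_month = 189.0    # £, fixed (table above)
managed_price_per_query = 0.01   # £, ASSUMED rate for illustration only

breakeven_queries = server_cost_per_month / managed_price_per_query
print(f"breakeven at {breakeven_queries:,.0f} queries/month")  # 18,900
```

At that assumed rate, anything beyond roughly 19k queries a month (a few hundred per day) makes the fixed-cost server cheaper, and every query above breakeven is effectively free.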
Best Fit for the 5080 RAG Stack
The 5080 is the ideal card for production RAG systems that serve moderate-to-high query traffic. At 2.25 seconds per query, it handles customer-facing knowledge bases, internal documentation search, and support ticket deflection with responsive performance. The 8.3 GB of free VRAM also makes it a natural choice for LangChain-based applications that chain multiple retrieval and generation steps. For maximum throughput or FP16 LLM precision, the RTX 5090 pushes latency down to 1.86s.
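"Moderate-to-high query traffic" can be quantified from the measured latency. Treating queries as strictly sequential gives a conservative floor; real throughput is higher with batched embedding and concurrent generation:

```python
# Sequential capacity floor at the measured 2.25 s end-to-end latency.
latency_s = 2.25
queries_per_hour = 3600 / latency_s
print(f"{queries_per_hour:.0f} queries/hour per GPU")  # 1600
```

Sixteen hundred queries an hour, sustained, covers most internal knowledge bases and a good share of customer-facing ones on a single card.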
Quick deploy:
```shell
docker compose up -d   # text-embeddings-inference + llama.cpp + chromadb containers
```
See our LLM hosting guide, RAG hosting guide, LangChain hosting, and all benchmark results. Related benchmarks: LLaMA 3 8B on RTX 5080.
Deploy RAG Pipeline on RTX 5080
Order this exact configuration. UK datacenter, full root access.
Order RTX 5080 Server