One card, one box, a full production RAG stack. Here’s the end-to-end install for the RTX 5060 Ti 16GB on our hosting.
Architecture
[App] <-> [FastAPI orchestrator]
               |
               +--> vLLM (port 8000)       - Llama 3.1 8B FP8
               +--> TEI embed (port 8080)  - BGE-base
               +--> TEI rerank (port 8081) - BGE-reranker-base
               +--> Qdrant (port 6333)
docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports: ["8000:8000"]
    volumes: ["./hf-cache:/root/.cache/huggingface"]
    environment: ["HF_TOKEN=hf_xxxxx"]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --quantization fp8 --kv-cache-dtype fp8
      --max-model-len 32768 --gpu-memory-utilization 0.60
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8080:80"]
    volumes: ["./tei-embed:/data"]
    command: --model-id BAAI/bge-base-en-v1.5
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-rerank:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8081:80"]
    volumes: ["./tei-rerank:/data"]
    command: --model-id BAAI/bge-reranker-base
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: ["./qdrant-data:/qdrant/storage"]
--gpu-memory-utilization 0.60 leaves room for the two TEI servers on the same GPU.
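Once the compose file is in place, a quick way to bring the stack up and smoke-test each service is to hit the health routes these images expose (paths may vary slightly across image versions):

```shell
docker compose up -d

curl -s localhost:8000/health     # vLLM: returns 200 once the model is loaded
curl -s localhost:8080/health     # TEI embed
curl -s localhost:8081/health     # TEI rerank
curl -s localhost:6333/healthz    # Qdrant
```

The vLLM container is the slow one to come up (it has to download ~8 GB of weights on first boot), so expect its health check to fail for a few minutes while the others are already green.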
Ingest
1. Read files from your source
2. Chunk into 512-token segments
3. POST each chunk to TEI embed /embeddings
4. Upsert (chunk_id, vector, text) into Qdrant
At 10k texts/sec embedding rate, a 100k chunk corpus ingests in ~10 seconds.
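The four ingest steps can be sketched with stdlib HTTP calls against TEI's native /embed route and Qdrant's points API. The `docs` collection name, the whitespace chunker, and the sequential integer IDs are placeholders; a production pipeline would chunk with the BGE tokenizer, batch the embed requests, and use UUIDs:

```python
import json
import urllib.request

EMBED_URL = "http://localhost:8080/embed"                               # TEI embed
UPSERT_URL = "http://localhost:6333/collections/docs/points?wait=true"  # Qdrant

def call_json(url: str, payload: dict, method: str = "POST"):
    """POST/PUT a JSON payload and decode the JSON response."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def chunk(text: str, size: int = 512) -> list[str]:
    # Naive whitespace "tokens" as a stand-in for real 512-token chunking.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest(doc_text: str, start_id: int = 0) -> int:
    """Embed each chunk and upsert it into Qdrant; returns the next free ID."""
    pieces = chunk(doc_text)
    for i, piece in enumerate(pieces, start=start_id):
        vector = call_json(EMBED_URL, {"inputs": piece})[0]  # one chunk -> one vector
        call_json(UPSERT_URL, {"points": [                   # Qdrant IDs: ints or UUIDs
            {"id": i, "vector": vector, "payload": {"text": piece}}]}, method="PUT")
    return start_id + len(pieces)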
Query Path
1. Query -> TEI embed (3 ms)
2. Qdrant search top-100 (20 ms)
3. TEI rerank 100 candidates (31 ms)
4. Pick top-4
5. Build prompt with context
6. vLLM generate answer (1,000-2,500 ms)
Total is ~1-2.5 s, dominated by generation. Enable prefix caching in vLLM to shave time to first token (TTFT) when requests share a system prompt.
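The six steps above can be sketched against the same endpoints: TEI's /embed and /rerank routes, Qdrant's search API, and vLLM's OpenAI-compatible chat route. The `docs` collection name and the prompt template are assumptions, not fixed API surface:

```python
import json
import urllib.request

def call_json(url: str, payload: dict):
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def top_texts(texts: list[str], ranked: list[dict], k: int = 4) -> list[str]:
    # TEI /rerank returns [{"index": ..., "score": ...}] sorted best-first.
    return [texts[r["index"]] for r in ranked[:k]]

def answer(query: str) -> str:
    vec = call_json("http://localhost:8080/embed", {"inputs": query})[0]  # 1. embed
    hits = call_json("http://localhost:6333/collections/docs/points/search",
                     {"vector": vec, "limit": 100,
                      "with_payload": True})["result"]                    # 2. top-100
    texts = [h["payload"]["text"] for h in hits]
    ranked = call_json("http://localhost:8081/rerank",
                       {"query": query, "texts": texts})                  # 3. rerank
    context = "\n\n".join(top_texts(texts, ranked))                      # 4. top-4
    prompt = f"Answer from this context only.\n\n{context}\n\nQ: {query}" # 5. prompt
    out = call_json("http://localhost:8000/v1/chat/completions",
                    {"model": "meta-llama/Llama-3.1-8B-Instruct",
                     "messages": [{"role": "user", "content": prompt}]})  # 6. generate
    return out["choices"][0]["message"]["content"]
```

Each hop is a plain HTTP round trip, which is why the pre-generation stages add up to only ~50 ms: the orchestrator does no model work itself.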
Full RAG Stack on Blackwell 16GB
vLLM + TEI + Qdrant, one box. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: TEI embedding, TEI rerank, SaaS RAG, vLLM setup, LangChain.