
RTX 5060 Ti 16GB RAG Stack Install

Full RAG stack on Blackwell 16GB - vLLM + TEI embeddings + reranker + Qdrant, end-to-end install.

One card, one box, full production RAG stack. Here’s the end-to-end install for the RTX 5060 Ti 16GB on our hosting.

Architecture

[App] <-> [FastAPI orchestrator]
              |
              +--> vLLM (port 8000)     - Llama 3 8B FP8
              +--> TEI embed (port 8080) - BGE-base
              +--> TEI rerank (port 8081) - BGE-reranker-base
              +--> Qdrant (port 6333)

docker-compose.yml

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports: ["8000:8000"]
    volumes: ["./hf-cache:/root/.cache/huggingface"]
    environment: ["HF_TOKEN=hf_xxxxx"]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --quantization fp8 --kv-cache-dtype fp8
      --max-model-len 32768 --gpu-memory-utilization 0.60
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8080:80"]
    volumes: ["./tei-embed:/data"]
    command: --model-id BAAI/bge-base-en-v1.5
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-rerank:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8081:80"]
    volumes: ["./tei-rerank:/data"]
    command: --model-id BAAI/bge-reranker-base
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: ["./qdrant-data:/qdrant/storage"]

--gpu-memory-utilization 0.60 leaves room for the two TEI servers on the same GPU.
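After docker compose up -d, a quick sketch to confirm all four services answer. The health routes are assumptions from each project's defaults (vLLM's OpenAI-compatible /v1/models, TEI's /health, Qdrant's /readyz), not something the compose file guarantees:

```shell
#!/usr/bin/env bash
# Probe each service once; prints ok/fail per service.
check() {
  if curl -sf --max-time 5 "$2" > /dev/null; then
    echo "$1: ok"
  else
    echo "$1: fail"
  fi
}

check vllm       http://localhost:8000/v1/models
check tei-embed  http://localhost:8080/health
check tei-rerank http://localhost:8081/health
check qdrant     http://localhost:6333/readyz
```

vLLM is the slowest to come up (it has to download and load the FP8 weights on first boot), so expect it to report fail for a few minutes after the others are ready.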

Ingest

1. Read files from your source
2. Chunk into 512-token segments
3. POST each chunk to TEI embed /embeddings
4. Upsert (chunk_id, vector, text) into Qdrant

At 10k texts/sec embedding rate, a 100k chunk corpus ingests in ~10 seconds.
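The four steps above can be sketched against TEI's native /embed route and Qdrant's REST API. This is a sketch, not a hardened ingester: the docs collection name is an assumption, and the chunker approximates 512-token segments by word count rather than using the model's tokenizer.

```python
import json
import urllib.request

TEI_EMBED = "http://localhost:8080"   # tei-embed from the compose file
QDRANT = "http://localhost:6333"      # Qdrant REST port
COLLECTION = "docs"                   # hypothetical collection name


def _post_json(url: str, payload: dict, method: str = "POST"):
    """Tiny JSON-over-HTTP helper (stdlib only)."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def chunk_text(text: str, size: int = 512) -> list[str]:
    """Approximate 512-token chunks by word count; a real ingester
    would count tokens with the embedding model's tokenizer."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def ingest(chunks: list[str]) -> None:
    # BGE-base-en-v1.5 emits 768-dim vectors; cosine is the usual metric.
    _post_json(f"{QDRANT}/collections/{COLLECTION}",
               {"vectors": {"size": 768, "distance": "Cosine"}}, method="PUT")
    # TEI's native /embed route takes a batch and returns one vector per input.
    vectors = _post_json(f"{TEI_EMBED}/embed", {"inputs": chunks})
    points = [{"id": i, "vector": vec, "payload": {"text": txt}}
              for i, (txt, vec) in enumerate(zip(chunks, vectors))]
    _post_json(f"{QDRANT}/collections/{COLLECTION}/points",
               {"points": points}, method="PUT")


chunks = chunk_text("word " * 1200)   # 1200 words -> chunks of 512/512/176
```

Batch the /embed calls (a few hundred chunks per request) rather than posting one chunk at a time; per-request overhead, not the GPU, becomes the bottleneck otherwise.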

Query Path

1. Query -> TEI embed (3 ms)
2. Qdrant search top-100 (20 ms)
3. TEI rerank 100 candidates (31 ms)
4. Pick top-4
5. Build prompt with context
6. vLLM generate answer (1,000-2,500 ms)

Total ~1-2.5 s depending on output length. Enabling prefix caching (vLLM's --enable-prefix-caching flag) shaves TTFT when queries share a system prompt or context prefix.
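The six-step query path maps onto one function per hop. A sketch under the same assumptions as the compose file (local ports, a hypothetical docs collection); TEI's /rerank returns a list of {index, score} objects, which is what pick_top consumes:

```python
import json
import urllib.request

TEI_EMBED = "http://localhost:8080"
TEI_RERANK = "http://localhost:8081"
QDRANT = "http://localhost:6333"
VLLM = "http://localhost:8000"
COLLECTION = "docs"                   # hypothetical collection name


def _post_json(url: str, payload: dict):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def pick_top(scores: list[dict], k: int) -> list[int]:
    """TEI /rerank returns [{'index': i, 'score': s}, ...]; keep the k best."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return [s["index"] for s in ranked[:k]]


def answer(query: str, k: int = 4) -> str:
    qvec = _post_json(f"{TEI_EMBED}/embed", {"inputs": [query]})[0]        # step 1
    hits = _post_json(                                                     # step 2
        f"{QDRANT}/collections/{COLLECTION}/points/search",
        {"vector": qvec, "limit": 100, "with_payload": True})["result"]
    texts = [h["payload"]["text"] for h in hits]
    scores = _post_json(f"{TEI_RERANK}/rerank",                            # step 3
                        {"query": query, "texts": texts})
    context = "\n\n".join(texts[i] for i in pick_top(scores, k))           # steps 4-5
    out = _post_json(f"{VLLM}/v1/chat/completions", {                      # step 6
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user",
                      "content": f"Answer from this context:\n{context}\n\nQ: {query}"}],
    })
    return out["choices"][0]["message"]["content"]
```

The embed, search, and rerank hops together cost ~55 ms here, so any latency tuning effort belongs on step 6.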

Full RAG Stack on Blackwell 16GB

vLLM + TEI + Qdrant, one box. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: TEI embedding, TEI rerank, SaaS RAG, vLLM setup, LangChain.
