One card, one box, a full production RAG stack. Here’s the end-to-end install for the RTX 5060 Ti 16GB on our hosting.
Architecture
[App] <-> [FastAPI orchestrator]
               |
               +--> vLLM (port 8000)       - Llama 3.1 8B FP8
               +--> TEI embed (port 8080)  - BGE-base
               +--> TEI rerank (port 8081) - BGE-reranker-base
               +--> Qdrant (port 6333)
docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports: ["8000:8000"]
    volumes: ["./hf-cache:/root/.cache/huggingface"]
    environment: ["HF_TOKEN=hf_xxxxx"]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --quantization fp8 --kv-cache-dtype fp8
      --max-model-len 32768 --gpu-memory-utilization 0.60
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8080:80"]
    volumes: ["./tei-embed:/data"]
    command: --model-id BAAI/bge-base-en-v1.5
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  tei-rerank:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.5
    ports: ["8081:80"]
    volumes: ["./tei-rerank:/data"]
    command: --model-id BAAI/bge-reranker-base
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: ["./qdrant-data:/qdrant/storage"]
--gpu-memory-utilization 0.60 leaves room for the two TEI servers on the same GPU.
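Once the compose file is in place, a quick way to bring the stack up and smoke-test each service is to hit the health routes these images expose (paths may vary slightly across image versions):

```shell
docker compose up -d

curl -s localhost:8000/health     # vLLM: returns 200 once the model is loaded
curl -s localhost:8080/health     # TEI embed
curl -s localhost:8081/health     # TEI rerank
curl -s localhost:6333/healthz    # Qdrant
```

The vLLM container is the slow one to come up (it has to download ~8 GB of weights on first boot), so expect its health check to fail for a few minutes while the others are already green.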
Ingest
1. Read files from your source
2. Chunk into 512-token segments
3. POST each chunk to TEI embed /embeddings
4. Upsert (chunk_id, vector, text) into Qdrant
At 10k texts/sec embedding rate, a 100k chunk corpus ingests in ~10 seconds.
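The four ingest steps can be sketched with stdlib HTTP calls against TEI's native /embed route and Qdrant's points API. The `docs` collection name, the whitespace chunker, and the sequential integer IDs are placeholders; a production pipeline would chunk with the BGE tokenizer, batch the embed requests, and use UUIDs:

```python
import json
import urllib.request

EMBED_URL = "http://localhost:8080/embed"                               # TEI embed
UPSERT_URL = "http://localhost:6333/collections/docs/points?wait=true"  # Qdrant

def call_json(url: str, payload: dict, method: str = "POST"):
    """POST/PUT a JSON payload and decode the JSON response."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def chunk(text: str, size: int = 512) -> list[str]:
    # Naive whitespace "tokens" as a stand-in for real 512-token chunking.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest(doc_text: str, start_id: int = 0) -> int:
    """Embed each chunk and upsert it into Qdrant; returns the next free ID."""
    pieces = chunk(doc_text)
    for i, piece in enumerate(pieces, start=start_id):
        vector = call_json(EMBED_URL, {"inputs": piece})[0]  # one chunk -> one vector
        call_json(UPSERT_URL, {"points": [                   # Qdrant IDs: ints or UUIDs
            {"id": i, "vector": vector, "payload": {"text": piece}}]}, method="PUT")
    return start_id + len(pieces)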
Query Path
1. Query -> TEI embed (3 ms)
2. Qdrant search top-100 (20 ms)
3. TEI rerank 100 candidates (31 ms)
4. Pick top-4
5. Build prompt with context
6. vLLM generate answer (1,000-2,500 ms)
Total is ~1-2.5 s, dominated by generation. Enable prefix caching in vLLM to shave time to first token (TTFT) when requests share a system prompt.
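The six steps above can be sketched against the same endpoints: TEI's /embed and /rerank routes, Qdrant's search API, and vLLM's OpenAI-compatible chat route. The `docs` collection name and the prompt template are assumptions, not fixed API surface:

```python
import json
import urllib.request

def call_json(url: str, payload: dict):
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def top_texts(texts: list[str], ranked: list[dict], k: int = 4) -> list[str]:
    # TEI /rerank returns [{"index": ..., "score": ...}] sorted best-first.
    return [texts[r["index"]] for r in ranked[:k]]

def answer(query: str) -> str:
    vec = call_json("http://localhost:8080/embed", {"inputs": query})[0]  # 1. embed
    hits = call_json("http://localhost:6333/collections/docs/points/search",
                     {"vector": vec, "limit": 100,
                      "with_payload": True})["result"]                    # 2. top-100
    texts = [h["payload"]["text"] for h in hits]
    ranked = call_json("http://localhost:8081/rerank",
                       {"query": query, "texts": texts})                  # 3. rerank
    context = "\n\n".join(top_texts(texts, ranked))                      # 4. top-4
    prompt = f"Answer from this context only.\n\n{context}\n\nQ: {query}" # 5. prompt
    out = call_json("http://localhost:8000/v1/chat/completions",
                    {"model": "meta-llama/Llama-3.1-8B-Instruct",
                     "messages": [{"role": "user", "content": prompt}]})  # 6. generate
    return out["choices"][0]["message"]["content"]
```

Each hop is a plain HTTP round trip, which is why the pre-generation stages add up to only ~50 ms: the orchestrator does no model work itself.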
Full RAG Stack on Blackwell 16GB
vLLM + TEI + Qdrant, one box. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: TEI embedding, TEI rerank, SaaS RAG, vLLM setup, LangChain.