## What You’ll Build
In 45 minutes you will have a production question-answering API that accepts natural-language questions and returns accurate answers grounded in your document corpus, complete with source citations, confidence scores, and the supporting context passages. Running retrieval-augmented generation (RAG) on a dedicated GPU server with vLLM, the API answers questions from a knowledge base of 100,000+ documents in under two seconds, with zero per-query cost and all data on your own infrastructure.
Cloud QA and RAG services charge per query and per indexed document. At 10,000 queries daily against a large knowledge base, costs reach $500-$2,000 monthly. Self-hosted RAG on open-source models handles unlimited queries against unlimited documents with predictable GPU costs and complete control over retrieval quality and answer generation.
## Architecture Overview
The API combines a vector search layer with an LLM answer generator. Questions first pass through an embedding model to find the most relevant document passages from a vector database. The retrieved passages and the original question feed into the LLM, which generates an answer grounded in the provided context. The response includes the answer text, source document references, confidence estimate, and the context passages used.
The API layer accepts questions via REST or WebSocket for streaming answers. A pre-processing step rewrites ambiguous questions for better retrieval. A post-processing step verifies that answer claims appear in the retrieved context, flagging potential hallucinations. Pair with an AI chatbot frontend for multi-turn conversational QA.
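The rewriting step is commonly implemented by asking the LLM itself to turn an ambiguous follow-up into a standalone question before it is embedded for retrieval. A minimal sketch of the prompt construction — the template wording and the `build_rewrite_prompt` helper are illustrative, not from this article:

```python
REWRITE_TEMPLATE = (
    "Rewrite the follow-up question as a standalone question that can be "
    "understood without the conversation. Keep it short.\n\n"
    "Conversation so far:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)

def build_rewrite_prompt(question: str, history: list[str]) -> str:
    """Build the prompt sent to the LLM before retrieval."""
    return REWRITE_TEMPLATE.format(
        history="\n".join(history), question=question
    )
```

Send this as a user message to the same vLLM endpoint at low temperature, then embed the rewritten question instead of the raw follow-up.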
## GPU Requirements
| Model Configuration | Recommended GPU | VRAM Used | Latency (p95) |
|---|---|---|---|
| 8B LLM + embeddings | RTX 5090 | 24 GB | ~1.5 seconds |
| 70B LLM + embeddings | RTX 6000 Pro | 40 GB | ~2.5 seconds |
| 70B + reranker | RTX 6000 Pro 96 GB | 80 GB | ~2.0 seconds |
The embedding model uses 2-3 GB of VRAM, and the vector database runs on CPU, so the majority of GPU memory goes to the answer-generating LLM. Adding a reranker model between retrieval and generation improves answer quality by selecting the most relevant passages from a larger initial retrieval set. See our self-hosted LLM guide for RAG model combinations.
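A reranker slots in as a second-pass scorer: retrieve a larger candidate set (say, the top 25), score each question/passage pair, and keep the best few. The sketch below takes the scoring function as a parameter; in practice it would be a cross-encoder such as `BAAI/bge-reranker-large` loaded via `sentence_transformers.CrossEncoder` — a model choice we are assuming, not one the article names:

```python
from typing import Callable

def rerank(question: str, passages: list[str],
           score_fn: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Second-pass rerank: score each (question, passage) pair, keep the best."""
    scored = sorted(passages, key=lambda p: score_fn(question, p), reverse=True)
    return scored[:keep]

# With a cross-encoder (hypothetical model choice), score_fn would be e.g.:
#   ce = sentence_transformers.CrossEncoder("BAAI/bge-reranker-large")
#   score_fn = lambda q, p: float(ce.predict([(q, p)])[0])
```

Because the cross-encoder sees the question and passage together, it ranks relevance far better than raw embedding distance, at the cost of one forward pass per candidate.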
## Step-by-Step Build
Deploy an embedding model and vLLM on your GPU server. Set up a vector database and build the RAG pipeline with citation tracking.
```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import requests
import qdrant_client

app = FastAPI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
qdrant = qdrant_client.QdrantClient("localhost", port=6333)
VLLM_URL = "http://localhost:8000/v1/chat/completions"

class QARequest(BaseModel):
    question: str
    collection: str = "docs"
    top_k: int = 5

@app.post("/v1/qa")
async def answer_question(req: QARequest):
    # Retrieve the most relevant passages from the vector database
    query_embedding = embedder.encode(req.question).tolist()
    results = qdrant.search(
        collection_name=req.collection,
        query_vector=query_embedding,
        limit=req.top_k,
    )
    context_passages = [
        {"text": r.payload["text"], "source": r.payload["source"],
         "score": r.score}
        for r in results
    ]
    # Number the passages so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i + 1}] {p['text']}" for i, p in enumerate(context_passages)
    )

    # Generate a grounded answer with citations
    resp = requests.post(VLLM_URL, json={
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [
            {"role": "system", "content":
                "Answer based only on the provided context. Cite sources "
                "by number. If the context doesn't contain the answer, "
                "say so."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {req.question}"},
        ],
        "max_tokens": 500,
        "temperature": 0.2,
    }, timeout=60)
    answer = resp.json()["choices"][0]["message"]["content"]
    return {"answer": answer, "sources": context_passages,
            "question": req.question}
```
Add document ingestion endpoints that chunk, embed, and index new documents automatically. For multi-turn conversations, maintain conversation history and resolve coreferences (“What about its pricing?” referring to a previously discussed product). vLLM’s OpenAI-compatible endpoint lets you plug in standard chat interfaces. See production setup for optimising retrieval and generation together.
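The ingestion side can be sketched as: split each document into overlapping word-window chunks, embed each chunk, and upsert it into Qdrant with its source recorded in the payload. The chunk sizes and the `ingest_document` helper below are illustrative defaults, not values from this article; the function reuses the `embedder` and `qdrant` clients defined in the API snippet above:

```python
import uuid

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (requires chunk_size > overlap)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

def ingest_document(text: str, source: str, collection: str = "docs") -> int:
    """Chunk, embed, and index one document; returns the number of chunks."""
    from qdrant_client.models import PointStruct
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedder.encode(chunk).tolist(),
            payload={"text": chunk, "source": source},
        )
        for chunk in chunk_text(text)
    ]
    qdrant.upsert(collection_name=collection, points=points)
    return len(points)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk; tune chunk size against your embedding model’s context window.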
## Answer Quality and Hallucination Control
Ground every answer in retrieved context to minimise hallucination. The system prompt instructs the model to cite sources and admit when the context does not contain an answer. Post-generation verification checks that key claims in the answer appear in the context passages — flagging unsupported statements for human review or automatic filtering.
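One cheap verification pass is lexical: flag any answer sentence whose content words mostly do not appear in the retrieved passages. The threshold and helper below are an illustrative sketch; production systems typically use an NLI or entailment model rather than token overlap:

```python
import re

def unsupported_claims(answer: str, passages: list[str],
                       threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose content words are mostly absent from context."""
    ctx_tokens = {w.lower() for p in passages
                  for w in re.findall(r"[a-z0-9']+", p, re.I)}
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        # Keep only longer, content-bearing tokens
        toks = [w.lower() for w in re.findall(r"[a-z0-9']+", sent) if len(w) > 3]
        if toks and sum(t in ctx_tokens for t in toks) / len(toks) < threshold:
            flagged.append(sent)
    return flagged
```

Flagged sentences can be stripped from the response, or surfaced with a warning for human review.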
Measure answer quality with retrieval precision (are retrieved passages relevant?), answer faithfulness (is the answer supported by context?), and answer completeness (does the answer address the question?). Log all queries and retrieved contexts for continuous improvement of your retrieval pipeline and answer generation prompts.
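Retrieval precision can be tracked with a small labelled set of question-to-relevant-document pairs; the metric itself is a few lines. The helper below is a generic sketch, not code from this article:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                   k: int = 5) -> float:
    """Fraction of the top-k retrieved documents labelled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)
```

Run it over the logged queries after each change to chunking, embedding model, or reranking to catch retrieval regressions before they reach answer quality.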
## Deploy Your QA API
A self-hosted QA API turns your document corpus into an instantly queryable knowledge base. Power internal search, customer support, research tools, and product documentation with grounded, cited answers. Launch on GigaGPU dedicated GPU hosting and start answering questions. Browse more API use cases and tutorials in our library.