## What You’ll Build
In 45 minutes you will have a production question-answering API that accepts natural-language questions and returns accurate answers grounded in your document corpus, complete with source citations, confidence scores, and the supporting context passages. Running retrieval-augmented generation (RAG) on a dedicated GPU server with vLLM, the API answers questions from a knowledge base of 100,000+ documents in under two seconds, with zero per-query cost and all data on your own infrastructure.
Cloud QA and RAG services charge per query and per indexed document. At 10,000 queries daily against a large knowledge base, costs reach $500-$2,000 monthly. Self-hosted RAG on open-source models handles unlimited queries against unlimited documents with predictable GPU costs and complete control over retrieval quality and answer generation.
## Architecture Overview
The API combines a vector search layer with an LLM answer generator. Questions first pass through an embedding model to find the most relevant document passages from a vector database. The retrieved passages and the original question feed into the LLM, which generates an answer grounded in the provided context. The response includes the answer text, source document references, confidence estimate, and the context passages used.
The API layer accepts questions via REST or WebSocket for streaming answers. A pre-processing step rewrites ambiguous questions for better retrieval. A post-processing step verifies that answer claims appear in the retrieved context, flagging potential hallucinations. Pair with an AI chatbot frontend for multi-turn conversational QA.
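The rewriting step is commonly implemented by asking the LLM itself to turn an ambiguous follow-up into a standalone question before it is embedded for retrieval. A minimal sketch of the prompt construction — the template wording and the `build_rewrite_prompt` helper are illustrative, not from this article:

```python
REWRITE_TEMPLATE = (
    "Rewrite the follow-up question as a standalone question that can be "
    "understood without the conversation. Keep it short.\n\n"
    "Conversation so far:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)

def build_rewrite_prompt(question: str, history: list[str]) -> str:
    """Build the prompt sent to the LLM before retrieval."""
    return REWRITE_TEMPLATE.format(
        history="\n".join(history), question=question
    )
```

Send this as a user message to the same vLLM endpoint at low temperature, then embed the rewritten question instead of the raw follow-up.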
## GPU Requirements
| Model Configuration | Recommended GPU | VRAM Used | Latency (p95) |
|---|---|---|---|
| 8B LLM + embeddings | RTX 5090 | 24 GB | ~1.5 seconds |
| 70B LLM + embeddings | RTX 6000 Pro | 40 GB | ~2.5 seconds |
| 70B + reranker | RTX 6000 Pro 96 GB | 80 GB | ~2.0 seconds |
The embedding model uses 2-3 GB of VRAM, and the vector database runs on CPU, so the majority of GPU memory goes to the answer-generating LLM. Adding a reranker model between retrieval and generation improves answer quality by selecting the most relevant passages from a larger initial retrieval set. See our self-hosted LLM guide for RAG model combinations.
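A reranker slots in as a second-pass scorer: retrieve a larger candidate set (say, the top 25), score each question/passage pair, and keep the best few. The sketch below takes the scoring function as a parameter; in practice it would be a cross-encoder such as `BAAI/bge-reranker-large` loaded via `sentence_transformers.CrossEncoder` — a model choice we are assuming, not one the article names:

```python
from typing import Callable

def rerank(question: str, passages: list[str],
           score_fn: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Second-pass rerank: score each (question, passage) pair, keep the best."""
    scored = sorted(passages, key=lambda p: score_fn(question, p), reverse=True)
    return scored[:keep]

# With a cross-encoder (hypothetical model choice), score_fn would be e.g.:
#   ce = sentence_transformers.CrossEncoder("BAAI/bge-reranker-large")
#   score_fn = lambda q, p: float(ce.predict([(q, p)])[0])
```

Because the cross-encoder sees the question and passage together, it ranks relevance far better than raw embedding distance, at the cost of one forward pass per candidate.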
## Step-by-Step Build
Deploy an embedding model and vLLM on your GPU server. Set up a vector database and build the RAG pipeline with citation tracking.
```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import requests
import qdrant_client

app = FastAPI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
qdrant = qdrant_client.QdrantClient("localhost", port=6333)
VLLM_URL = "http://localhost:8000/v1/chat/completions"

class QARequest(BaseModel):
    question: str
    collection: str = "docs"
    top_k: int = 5

@app.post("/v1/qa")
async def answer_question(req: QARequest):
    # Retrieve the most relevant passages from the vector database
    query_embedding = embedder.encode(req.question).tolist()
    results = qdrant.search(
        collection_name=req.collection,
        query_vector=query_embedding,
        limit=req.top_k,
    )
    context_passages = [
        {"text": r.payload["text"], "source": r.payload["source"],
         "score": r.score}
        for r in results
    ]
    # Number the passages so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i + 1}] {p['text']}" for i, p in enumerate(context_passages)
    )

    # Generate a grounded answer with citations
    resp = requests.post(VLLM_URL, json={
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [
            {"role": "system", "content":
                "Answer based only on the provided context. Cite sources "
                "by number. If the context doesn't contain the answer, "
                "say so."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {req.question}"},
        ],
        "max_tokens": 500,
        "temperature": 0.2,
    }, timeout=60)
    answer = resp.json()["choices"][0]["message"]["content"]
    return {"answer": answer, "sources": context_passages,
            "question": req.question}
```
Add document ingestion endpoints that chunk, embed, and index new documents automatically. For multi-turn conversations, maintain conversation history and resolve coreferences (“What about its pricing?” referring to a previously discussed product). vLLM’s OpenAI-compatible endpoint lets you plug in standard chat interfaces. See production setup for optimising retrieval and generation together.
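The ingestion side can be sketched as: split each document into overlapping word-window chunks, embed each chunk, and upsert it into Qdrant with its source recorded in the payload. The chunk sizes and the `ingest_document` helper below are illustrative defaults, not values from this article; the function reuses the `embedder` and `qdrant` clients defined in the API snippet above:

```python
import uuid

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (requires chunk_size > overlap)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

def ingest_document(text: str, source: str, collection: str = "docs") -> int:
    """Chunk, embed, and index one document; returns the number of chunks."""
    from qdrant_client.models import PointStruct
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedder.encode(chunk).tolist(),
            payload={"text": chunk, "source": source},
        )
        for chunk in chunk_text(text)
    ]
    qdrant.upsert(collection_name=collection, points=points)
    return len(points)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk; tune chunk size against your embedding model’s context window.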
## Answer Quality and Hallucination Control
Ground every answer in retrieved context to minimise hallucination. The system prompt instructs the model to cite sources and admit when the context does not contain an answer. Post-generation verification checks that key claims in the answer appear in the context passages — flagging unsupported statements for human review or automatic filtering.
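One cheap verification pass is lexical: flag any answer sentence whose content words mostly do not appear in the retrieved passages. The threshold and helper below are an illustrative sketch; production systems typically use an NLI or entailment model rather than token overlap:

```python
import re

def unsupported_claims(answer: str, passages: list[str],
                       threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose content words are mostly absent from context."""
    ctx_tokens = {w.lower() for p in passages
                  for w in re.findall(r"[a-z0-9']+", p, re.I)}
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        # Keep only longer, content-bearing tokens
        toks = [w.lower() for w in re.findall(r"[a-z0-9']+", sent) if len(w) > 3]
        if toks and sum(t in ctx_tokens for t in toks) / len(toks) < threshold:
            flagged.append(sent)
    return flagged
```

Flagged sentences can be stripped from the response, or surfaced with a warning for human review.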
Measure answer quality with retrieval precision (are retrieved passages relevant?), answer faithfulness (is the answer supported by context?), and answer completeness (does the answer address the question?). Log all queries and retrieved contexts for continuous improvement of your retrieval pipeline and answer generation prompts.
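Retrieval precision can be tracked with a small labelled set of question-to-relevant-document pairs; the metric itself is a few lines. The helper below is a generic sketch, not code from this article:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                   k: int = 5) -> float:
    """Fraction of the top-k retrieved documents labelled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)
```

Run it over the logged queries after each change to chunking, embedding model, or reranking to catch retrieval regressions before they reach answer quality.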
## Deploy Your QA API
A self-hosted QA API turns your document corpus into an instantly queryable knowledge base. Power internal search, customer support, research tools, and product documentation with grounded, cited answers. Launch on GigaGPU dedicated GPU hosting and start answering questions. Browse more API use cases and tutorials in our library.