Document Q&A combines OCR, retrieval, and LLM generation. All three stages run on a single RTX 5060 Ti 16GB on our hosting.
Pipeline
- PDF upload -> PaddleOCR extracts text + layout
- Chunk into 512-token segments
- Embed with BGE-base
- Store in Qdrant / pgvector
- User query -> embed -> retrieve top-K -> rerank -> Llama 3 8B answer
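The query-time half of the pipeline can be sketched in pure Python. This is a toy illustration, not our production code: `embed` is a hypothetical stand-in for a real BGE-base call, and retrieval is a brute-force cosine scan rather than a Qdrant/pgvector lookup.

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Hypothetical stand-in for a BGE-base embedding call:
    # hashes characters into a fixed-size, L2-normalised vector.
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Embed the query, score every chunk, keep the top-K.
    # In production this scan is replaced by the vector DB's ANN search.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "invoice totals for Q3",
    "employee handbook leave policy",
    "Q3 invoice payment terms",
    "server maintenance schedule",
]
hits = retrieve("what are the Q3 invoice terms?", chunks)
print(hits)
```

The retrieved chunks would then be reranked and passed as context to Llama 3 8B for the final answer.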
Ingest Throughput
| Stage | Rate |
|---|---|
| PDF -> text (PaddleOCR) | 34 pages/s |
| Text -> chunks + embeddings | 10,000 chunks/s |
| End-to-end ingest | ~25-30 pages/s |
A 10,000-page corpus indexes in ~6 minutes on one card.
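The ~6-minute figure follows directly from the end-to-end rate in the table:

```python
pages = 10_000
rate_low, rate_high = 25, 30          # end-to-end pages/s from the table
t_worst = pages / rate_low / 60       # slowest case, minutes
t_best = pages / rate_high / 60       # fastest case, minutes
print(f"{t_best:.1f}-{t_worst:.1f} min")  # -> 5.6-6.7 min
```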
Q&A Latency
- Embed query: 3 ms
- Retrieve top-K: 20 ms
- Rerank: 31 ms
- LLM answer (400 tokens): 2,000 ms
- Total: ~2.1 s
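Summing the stage budgets confirms that generation dominates: everything before the LLM call fits in ~54 ms.

```python
stages_ms = {
    "embed_query": 3,
    "retrieve_top_k": 20,
    "rerank": 31,
    "llm_answer_400_tokens": 2000,
}
total_ms = sum(stages_ms.values())
retrieval_ms = total_ms - stages_ms["llm_answer_400_tokens"]
print(total_ms, retrieval_ms)  # -> 2054 54
```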
Enable prefix caching – repeated queries on the same document often hit cache.
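If you serve the model with vLLM (an assumption; any server with prefix/KV caching works), the flag looks like this. The model name and context length below are illustrative:

```shell
# Reuse KV-cache entries for shared prompt prefixes, e.g. the same
# document context prepended to many queries.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 8192
```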
Scale Limits
- Corpus size: unlimited (stored in vector DB, not VRAM)
- Concurrent Q&A users: ~16 active users while holding the Llama 3 8B latency SLA
- Ingest backlog: process 100k pages overnight
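A quick sanity check on the backlog figure, using the conservative end of the ingest rate from the table: pure ingest of 100k pages takes about an hour, so the overnight window leaves ample headroom for concurrent Q&A traffic on the same card.

```python
backlog_pages = 100_000
rate = 25                              # conservative end-to-end pages/s
hours = backlog_pages / rate / 3600
print(f"{hours:.1f} h")                # -> 1.1 h of uninterrupted ingest
```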
For enterprise document Q&A with 1M+ pages, dedicate the card to retrieval + LLM serving and offload OCR to a separate worker pool if ingest becomes the bottleneck.
Document Q&A on Blackwell 16GB
OCR + retrieval + LLM, one card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: PaddleOCR benchmark, SaaS RAG, RAG install, legal AI, healthcare.