Document question-answering combines OCR, chunking, embedding, retrieval, reranking and a strong generation model — six stages where any one can blow your latency budget. The RTX 4090 24GB hosts every component except the vector store, and on a single UK GPU host you can ingest several million pages per day and stream complete answers in roughly 3-4 seconds end-to-end, with the first token arriving in well under a second. This post documents the named workload — a UK legal-tech with a 1.4M-document corpus serving 80 lawyers — the pipeline architecture, and the production gotchas we have hit at scale.
Contents
- Named workload: UK legal-tech
- Pipeline architecture
- OCR throughput: PaddleOCR PP-OCRv4
- Embedding and indexing
- Generation models compared
- End-to-end latency budget
- Capacity and scaling triggers
- Production gotchas
- Verdict: when to pick a 4090 for document QA
Named workload: UK legal-tech
The reference deployment: a UK legal-tech serving 80 fee-earner lawyers across 6 offices. Corpus: 1.4 million documents (case law, contracts, regulatory filings, internal memos), averaging 12 pages per document with substantial scanned content. Daily query volume: 4,200 queries with bursts to 800/hour. SLA: median answer under 3 seconds, p95 under 5 seconds, with verbatim citations to source paragraphs.
The workload runs on three 4090s: one dedicated to ongoing OCR + embedding (new documents arrive at ~2,000/day from clients), one for query-time retrieval and reranking, one for LLM generation with a hot Qwen 2.5 14B AWQ default and a cold-loaded Llama 3.1 70B AWQ for low-confidence escalations. Vector store is Qdrant on a separate CPU box with 256GB RAM.
Pipeline architecture
| Stage | Tool | VRAM | Throughput | Latency |
|---|---|---|---|---|
| OCR | PaddleOCR PP-OCRv4 | ~1.6 GB on demand | 50 pages/s batched | 20 ms/page |
| Chunk + clean | Python (CPU) | — | fast | negligible |
| Embedding | BGE-base-en-v1.5 | 0.8 GB | 12,000 chunks/s | 2 ms/query |
| Vector store | Qdrant top-100 | CPU box | — | 18 ms p50 |
| Reranker | BGE-reranker-large | 1.4 GB | 2,400 pairs/s | 42 ms for 100→6 |
| Generation | Qwen 14B AWQ / Llama 70B AWQ | 9.5 / 21 GB | 135 / 23 t/s | see below |
Total query-path VRAM (no OCR): 0.8 + 1.4 + 9.5 = 11.7 GB on the generation card, leaving roughly 12 GB of headroom for KV cache across concurrent queries. The OCR + ingest card runs OCR (1.6 GB) and embedding (0.8 GB) co-resident, leaving over 21 GB free for batch growth and burst protection.
OCR throughput: PaddleOCR PP-OCRv4
PaddleOCR PP-OCRv4 on the 4090 processes English A4 pages at ~50 pages/s with detection + recognition batched 8-up. Multi-column scientific or legal PDFs run closer to 25 pages/s due to higher detection complexity. Scanned historical documents with poor contrast drop to 12-15 pages/s.
| Document type | Pages/s (b=8) | Pages/day per card | Notes |
|---|---|---|---|
| Clean digital PDF | 50 | 4,320,000 | Detection bypass |
| Standard scan (300 dpi) | 32 | 2,765,000 | Production default |
| Multi-column legal PDF | 25 | 2,160,000 | Layout analysis adds time |
| Historical scan (poor) | 12 | 1,036,000 | Angle classification needed |
For the named legal-tech workload (2,000 documents/day, ~24,000 pages), OCR consumes about 12 minutes of GPU time per day. That leaves the OCR card almost entirely free for re-embedding when the BGE model is upgraded, or for parallel reranking duty during peak query hours. Bulk ingest of the initial 1.4M-document, 16.8M-page corpus took roughly 93 hours of wall-clock GPU time (just under four days) at the 50 pages/s clean-PDF rate.
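For reference, a minimal sketch of the ingest-side OCR call, assuming the PaddleOCR 2.x Python API with the detection + recognition batching described above; the `ocr_page` wrapper and its plain-text handling are our own placeholders, not part of PaddleOCR:

```python
from paddleocr import PaddleOCR

# PP-OCRv4 English pipeline: angle classification off for clean scans
# (enable it for skewed historical documents, at a throughput cost),
# recognition batched 8-up as in the table above.
ocr = PaddleOCR(
    ocr_version="PP-OCRv4",
    lang="en",
    use_angle_cls=False,
    rec_batch_num=8,
)

def ocr_page(image_path: str) -> str:
    """Run detection + recognition on one page image and return plain text."""
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []  # result[0] is None when nothing is detected
    # Each line is [box, (text, confidence)]; keep the text, store confidences separately.
    return "\n".join(text for _box, (text, _conf) in lines)
```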
Embedding and indexing
BGE-base-en-v1.5 at 12,000 chunks/s on the 4090 means a million-page corpus (~8M chunks at 256 tokens average) embeds in roughly 11 minutes once OCR is feeding it. We typically run OCR and embedding as two co-resident processes on the same 4090, with VRAM budgeted at 1.6 + 0.8 = 2.4 GB; plenty of headroom for concurrent reranking work.
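A sketch of the embedding step, assuming sentence-transformers as the runner for BGE-base-en-v1.5; the batch size is illustrative, tune it to whatever keeps the card saturated:

```python
from sentence_transformers import SentenceTransformer

# BGE-base-en-v1.5 produces 768-dim vectors; normalise them so the vector
# store can use cosine (or dot-product) distance directly.
model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    embeddings = model.encode(
        chunks,
        batch_size=512,            # large batches keep the 4090 busy
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    return embeddings.tolist()
```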
The vector store choice matters more than the embedding model for query latency. Qdrant on a 256GB-RAM CPU box returns top-100 in 18ms p50 / 38ms p99 against 8M vectors with HNSW M=32, ef=200. pgvector hits similar p50 but worse tail latency. For corpora over 50M vectors, sharded Qdrant or Weaviate becomes necessary.
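A sketch of the Qdrant side with the HNSW settings quoted above, assuming the qdrant-client Python API; the host and collection name are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(host="qdrant.internal", port=6333)

# 768-dim BGE vectors, cosine distance, HNSW M=32 at build time.
client.create_collection(
    collection_name="legal_chunks",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32),
)

def retrieve_top_100(query_vector: list[float]):
    # hnsw_ef=200 at query time corresponds to the 18 ms p50 / 38 ms p99 figures.
    return client.search(
        collection_name="legal_chunks",
        query_vector=query_vector,
        limit=100,
        search_params=SearchParams(hnsw_ef=200),
    )
```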
Generation models compared
| Model | Use | VRAM | Decode b=1 | Active concurrent | MMLU |
|---|---|---|---|---|---|
| Llama 3.1 8B FP8 | Fast first-pass | 10 GB | 198 t/s | 30 | 69.4 |
| Qwen 2.5 14B AWQ | Default Q&A | 9.5 GB | 135 t/s | 16 | 79.7 |
| Qwen 2.5 32B AWQ | Hard reasoning | 18 GB | 65 t/s | 6 | 83.3 |
| Llama 3.1 70B AWQ | Citations / legal | 21 GB | 23 t/s | 3-4 | 86.0 |
| Mixtral 8x7B AWQ | Multilingual | 16.5 GB | 85 t/s | 10 | 70.6 |
Choose by tolerance for latency vs answer fidelity. Qwen 14B is the default for SaaS document QA — it sits at the sweet spot of cost, latency and quality. Llama 70B INT4 wins when verbatim citation accuracy and reasoning depth matter, at roughly one-sixth the throughput and one-fifth the concurrent capacity. The legal-tech workload defaults to Qwen 14B and routes ~8% of queries (those flagged as low-confidence by a calibrated logprob threshold) to Llama 70B.
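A sketch of that escalation route, assuming both models sit behind vLLM's OpenAI-compatible endpoints; the endpoints, served model names and the threshold value are illustrative, not the calibrated production figures:

```python
from openai import OpenAI

fast = OpenAI(base_url="http://gen-card:8000/v1", api_key="unused")  # Qwen 2.5 14B AWQ
slow = OpenAI(base_url="http://gen-card:8001/v1", api_key="unused")  # Llama 3.1 70B AWQ

CONF_THRESHOLD = -0.45  # hypothetical mean-logprob cutoff; calibrate against reviewed answers

def answer(prompt: str) -> str:
    first = fast.chat.completions.create(
        model="qwen2.5-14b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        logprobs=True,
    )
    tokens = first.choices[0].logprobs.content or []
    mean_logprob = sum(t.logprob for t in tokens) / max(len(tokens), 1)
    if mean_logprob >= CONF_THRESHOLD:
        return first.choices[0].message.content
    # Low confidence: escalate to the 70B model for citation-heavy answers.
    second = slow.chat.completions.create(
        model="llama-3.1-70b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
    )
    return second.choices[0].message.content
```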
End-to-end latency budget
For a typical query — “What does the 2019 Smith v. Jones judgement say about constructive trust?” — against the legal corpus, end-to-end with Qwen 14B AWQ generation:
| Stage | p50 | p95 | p99 |
|---|---|---|---|
| Embed query (BGE-base) | 2 ms | 4 ms | 8 ms |
| Vector store top-100 | 18 ms | 38 ms | 62 ms |
| Rerank 100 → top-6 | 42 ms | 68 ms | 95 ms |
| Build prompt (template + chunks) | 4 ms | 8 ms | 12 ms |
| LLM prefill (~3,000 tokens, Qwen 14B) | 340 ms | 510 ms | 720 ms |
| LLM decode (400 tokens at 135 t/s) | 2.96 s | 3.4 s | 3.9 s |
| Total streamed | ~3.4 s | ~4.0 s | ~4.8 s |
| TTFT (user sees first token) | ~410 ms | ~620 ms | ~890 ms |
TTFT around 410ms is well within “feels responsive” thresholds. With Llama 3.1 8B FP8 instead of Qwen 14B, total drops to ~2.1 seconds at the cost of measurably worse answer quality on legal reasoning — a trade-off the legal-tech rejected after a 2-week A/B test where senior partners marked Llama 8B answers as “needs revision” 23% more often.
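The prefill and decode rows come from client-side timing of streamed responses; a minimal sketch of that measurement, assuming the same OpenAI-compatible endpoint as the routing sketch above (it times the LLM portion only, not retrieval):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://gen-card:8000/v1", api_key="unused")

def time_generation(prompt: str) -> tuple[float, float]:
    """Return (seconds to first token, total seconds) for one streamed answer."""
    start = time.perf_counter()
    first_token_at = None
    stream = client.chat.completions.create(
        model="qwen2.5-14b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter() - start
    return first_token_at, time.perf_counter() - start
```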
Capacity and scaling triggers
| Tier | Per-4090 capacity | Notes |
|---|---|---|
| OCR ingest | 2.7M pages/day standard scans | One card sustained |
| Embedding throughput | 1B chunks/day | Mostly idle in steady state |
| Reranking | 200M pair-evaluations/day | Co-resident with embedding |
| Qwen 14B generation | ~16 active users at SLA | Production target |
| Llama 70B INT4 escalation | ~3 active users at SLA | Cold-loaded, ~10s warm-up |
| Mixed query load | ~30,000 queries/day | Burst-protected at 800/hour |
Scaling triggers for the named legal-tech workload:
- Add a generation card at 25,000 daily queries. One Qwen 14B card covers ~30k/day; beyond that p95 latency degrades.
- Promote escalation tier from cold to warm at 15% routing rate. If more than 15% of queries hit Llama 70B, the 10-second cold-load on every burst becomes painful — make it permanently warm.
- Move OCR off-cluster at 50,000+ pages/day sustained. Background ingest competes with generation if OCR runs at peak hours.
- Step generation card to 5090 32GB when escalation rate exceeds 25%. Llama 70B AWQ + KV cache for several concurrent users wants more than 24GB.
- Shard Qdrant at 50M+ vectors. Single-node HNSW p99 latency degrades visibly past this point.
For lighter-weight document QA footprints see the 5060 Ti RAG stack — works for under 5 concurrent users at SLA.
Production gotchas
- OCR confidence is not a citation guarantee. PaddleOCR returns per-character confidence; a 0.94 average can hide a 0.2 token in the middle of a citation. Always store the original page image alongside text and serve both in the answer panel.
- Chunk boundaries break on mid-sentence headers. Naive 256-token chunking splits “Section 7.2 Constructive trusts shall apply” mid-clause. Use semantic chunking (sentence/paragraph aware) for legal corpora.
- Reranker biases toward verbatim matches. BGE reranker scores literal-string overlap heavily. For paraphrased queries, augment top-100 with HyDE (hypothetical document embedding) before rerank.
- FP8 KV cache subtly reduces long-context recall. On 16k+ context with Qwen 14B FP8 KV, recall of facts in the middle of the prompt drops ~3pp. Use FP16 KV when prompt exceeds 12k.
- Prefix caching gets stale. If you template the same system prompt with a date or session ID at the top, you bust the prefix cache every request. Move dynamic content to the bottom.
- Qdrant payload writes block reads on default config. Enable WAL and set `optimizer.indexing_threshold` high during bulk ingest, low during query hours (see the sketch after this list).
- Llama 70B cold-load is 12-15 seconds. If you treat it as a fallback model loaded on demand, budget for the cold-start in your SLA. Better to keep it warm on a dedicated card.
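A sketch of that ingest-hours/query-hours toggle, assuming the qdrant-client Python API; the collection name and threshold values are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import OptimizersConfigDiff

client = QdrantClient(host="qdrant.internal", port=6333)

def set_bulk_ingest_mode(enabled: bool) -> None:
    # A high indexing_threshold defers HNSW index building during bulk upserts,
    # so writes don't contend with reads; drop it back down during query hours
    # so newly ingested points become searchable again.
    threshold = 1_000_000 if enabled else 20_000
    client.update_collection(
        collection_name="legal_chunks",
        optimizers_config=OptimizersConfigDiff(indexing_threshold=threshold),
    )
```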
Verdict: when to pick a 4090 for document QA
Pick the RTX 4090 24GB for document QA when you need self-hosted answers over a real corpus and your daily query volume is in the thousands rather than tens of thousands. The named legal-tech runs three 4090s for 4,200 daily queries against 1.4M documents at a fraction of the cost of comparable SaaS retrieval APIs. Step down to a 5060 Ti RAG stack for prototypes or sub-5-concurrent workloads. Step up to multiple 4090s before considering a single H100 — three 4090s cost less and give you natural failover. For corpora where the 70B escalation rate is high, the 5090 32GB with native FP8 KV is the next move.
Self-hosted document Q&A on one card
OCR, embed, retrieve, rerank, answer. 30,000 queries per day per card, 2.7M pages of OCR ingest. UK GPU hosting.
Order the RTX 4090 24GB
See also: SaaS RAG stack, Qwen 14B benchmark, Llama 70B INT4, Qwen 32B benchmark, PaddleOCR benchmark, 5060 Ti RAG stack, AWQ guide, vLLM setup, 4090 spec breakdown.