Document question-answering combines OCR, chunking, embedding, retrieval, reranking and a strong generation model — six stages where any one can blow your latency budget. The RTX 4090 24GB hosts every component except the vector store, and on a single UK GPU host you can ingest several million pages per day and stream complete answers in roughly 3-4 seconds end-to-end, with the first token arriving in well under a second. This post documents the named workload — a UK legal-tech with a 1.4M-document corpus serving 80 lawyers — the pipeline architecture, and the production gotchas we have hit at scale.
Contents
- Named workload: UK legal-tech
- Pipeline architecture
- OCR throughput: PaddleOCR PP-OCRv4
- Embedding and indexing
- Generation models compared
- End-to-end latency budget
- Capacity and scaling triggers
- Production gotchas
- Verdict: when to pick a 4090 for document QA
Named workload: UK legal-tech
The reference deployment: a UK legal-tech serving 80 fee-earner lawyers across 6 offices. Corpus: 1.4 million documents (case law, contracts, regulatory filings, internal memos), averaging 12 pages per document with substantial scanned content. Daily query volume: 4,200 queries with bursts to 800/hour. SLA: median answer under 3 seconds, p95 under 5 seconds, with verbatim citations to source paragraphs.
The workload runs on three 4090s: one dedicated to ongoing OCR + embedding (new documents arrive at ~2,000/day from clients), one for query-time retrieval and reranking, one for LLM generation with a hot Qwen 2.5 14B AWQ default and a cold-loaded Llama 3.1 70B AWQ for low-confidence escalations. Vector store is Qdrant on a separate CPU box with 256GB RAM.
Pipeline architecture
| Stage | Tool | VRAM | Throughput | Latency |
|---|---|---|---|---|
| OCR | PaddleOCR PP-OCRv4 | ~1.6 GB on demand | 50 pages/s batched | 20 ms/page |
| Chunk + clean | Python (CPU) | — | fast | negligible |
| Embedding | BGE-base-en-v1.5 | 0.8 GB | 12,000 chunks/s | 2 ms/query |
| Vector store | Qdrant top-100 | CPU box | — | 18 ms p50 |
| Reranker | BGE-reranker-large | 1.4 GB | 2,400 pairs/s | 42 ms for 100→6 |
| Generation | Qwen 14B AWQ / Llama 70B AWQ | 9.5 / 21 GB | 135 / 23 t/s | see below |
Total query-path VRAM (no OCR): 0.8 + 1.4 + 9.5 = 11.7 GB on the generation card, leaving roughly 12 GB of headroom for KV cache across concurrent queries. The OCR + ingest card runs OCR (1.6 GB) and embedding (0.8 GB) co-resident, leaving over 21 GB free for batch growth and burst protection.
OCR throughput: PaddleOCR PP-OCRv4
PaddleOCR PP-OCRv4 on the 4090 processes English A4 pages at ~50 pages/s with detection + recognition batched 8-up. Multi-column scientific or legal PDFs run closer to 25 pages/s due to higher detection complexity. Scanned historical documents with poor contrast drop to 12-15 pages/s.
| Document type | Pages/s (b=8) | Pages/day per card | Notes |
|---|---|---|---|
| Clean digital PDF | 50 | 4,320,000 | Detection bypass |
| Standard scan (300 dpi) | 32 | 2,765,000 | Production default |
| Multi-column legal PDF | 25 | 2,160,000 | Layout analysis adds time |
| Historical scan (poor) | 12 | 1,036,000 | Angle classification needed |
For the named legal-tech workload (2,000 documents/day, ~24,000 pages), OCR consumes about 12 minutes of GPU time per day. That leaves the OCR card almost entirely free for re-embedding when the BGE model is upgraded, or for parallel reranking duty during peak query hours. Bulk ingest of the initial 1.4M-document, 16.8M-page corpus took roughly 93 hours of wall-clock GPU time (just under four days) at the 50 pages/s clean-PDF rate.
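For reference, a minimal sketch of the ingest-side OCR call, assuming the PaddleOCR 2.x Python API with the detection + recognition batching described above; the `ocr_page` wrapper and its plain-text handling are our own placeholders, not part of PaddleOCR:

```python
from paddleocr import PaddleOCR

# PP-OCRv4 English pipeline: angle classification off for clean scans
# (enable it for skewed historical documents, at a throughput cost),
# recognition batched 8-up as in the table above.
ocr = PaddleOCR(
    ocr_version="PP-OCRv4",
    lang="en",
    use_angle_cls=False,
    rec_batch_num=8,
)

def ocr_page(image_path: str) -> str:
    """Run detection + recognition on one page image and return plain text."""
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []  # result[0] is None when nothing is detected
    # Each line is [box, (text, confidence)]; keep the text, store confidences separately.
    return "\n".join(text for _box, (text, _conf) in lines)
```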
Embedding and indexing
BGE-base-en-v1.5 at 12,000 chunks/s on the 4090 means a million-page corpus (~8M chunks at 256 tokens average) embeds in roughly 11 minutes once OCR is feeding it. We typically run OCR and embedding as two co-resident processes on the same 4090, with VRAM budgeted at 1.6 + 0.8 = 2.4 GB; plenty of headroom for concurrent reranking work.
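A sketch of the embedding step, assuming sentence-transformers as the runner for BGE-base-en-v1.5; the batch size is illustrative, tune it to whatever keeps the card saturated:

```python
from sentence_transformers import SentenceTransformer

# BGE-base-en-v1.5 produces 768-dim vectors; normalise them so the vector
# store can use cosine (or dot-product) distance directly.
model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    embeddings = model.encode(
        chunks,
        batch_size=512,            # large batches keep the 4090 busy
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    return embeddings.tolist()
```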
The vector store choice matters more than the embedding model for query latency. Qdrant on a 256GB-RAM CPU box returns top-100 in 18ms p50 / 38ms p99 against 8M vectors with HNSW M=32, ef=200. pgvector hits similar p50 but worse tail latency. For corpora over 50M vectors, sharded Qdrant or Weaviate becomes necessary.
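A sketch of the Qdrant side with the HNSW settings quoted above, assuming the qdrant-client Python API; the host and collection name are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(host="qdrant.internal", port=6333)

# 768-dim BGE vectors, cosine distance, HNSW M=32 at build time.
client.create_collection(
    collection_name="legal_chunks",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32),
)

def retrieve_top_100(query_vector: list[float]):
    # hnsw_ef=200 at query time corresponds to the 18 ms p50 / 38 ms p99 figures.
    return client.search(
        collection_name="legal_chunks",
        query_vector=query_vector,
        limit=100,
        search_params=SearchParams(hnsw_ef=200),
    )
```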
Generation models compared
| Model | Use | VRAM | Decode b=1 | Active concurrent | MMLU |
|---|---|---|---|---|---|
| Llama 3.1 8B FP8 | Fast first-pass | 10 GB | 198 t/s | 30 | 69.4 |
| Qwen 2.5 14B AWQ | Default Q&A | 9.5 GB | 135 t/s | 16 | 79.7 |
| Qwen 2.5 32B AWQ | Hard reasoning | 18 GB | 65 t/s | 6 | 83.3 |
| Llama 3.1 70B AWQ | Citations / legal | 21 GB | 23 t/s | 3-4 | 86.0 |
| Mixtral 8x7B AWQ | Multilingual | 16.5 GB | 85 t/s | 10 | 70.6 |
Choose by tolerance for latency vs answer fidelity. Qwen 14B is the default for SaaS document QA — it sits at the sweet spot of cost, latency and quality. Llama 70B INT4 wins when verbatim citation accuracy and reasoning depth matter, at roughly one-sixth the throughput and one-fifth the concurrent capacity. The legal-tech workload defaults to Qwen 14B and routes ~8% of queries (those flagged as low-confidence by a calibrated logprob threshold) to Llama 70B.
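A sketch of that escalation route, assuming both models sit behind vLLM's OpenAI-compatible endpoints; the endpoints, served model names and the threshold value are illustrative, not the calibrated production figures:

```python
from openai import OpenAI

fast = OpenAI(base_url="http://gen-card:8000/v1", api_key="unused")  # Qwen 2.5 14B AWQ
slow = OpenAI(base_url="http://gen-card:8001/v1", api_key="unused")  # Llama 3.1 70B AWQ

CONF_THRESHOLD = -0.45  # hypothetical mean-logprob cutoff; calibrate against reviewed answers

def answer(prompt: str) -> str:
    first = fast.chat.completions.create(
        model="qwen2.5-14b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        logprobs=True,
    )
    tokens = first.choices[0].logprobs.content or []
    mean_logprob = sum(t.logprob for t in tokens) / max(len(tokens), 1)
    if mean_logprob >= CONF_THRESHOLD:
        return first.choices[0].message.content
    # Low confidence: escalate to the 70B model for citation-heavy answers.
    second = slow.chat.completions.create(
        model="llama-3.1-70b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
    )
    return second.choices[0].message.content
```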
End-to-end latency budget
For a typical query — “What does the 2019 Smith v. Jones judgement say about constructive trust?” — against the legal corpus, end-to-end with Qwen 14B AWQ generation:
| Stage | p50 | p95 | p99 |
|---|---|---|---|
| Embed query (BGE-base) | 2 ms | 4 ms | 8 ms |
| Vector store top-100 | 18 ms | 38 ms | 62 ms |
| Rerank 100 → top-6 | 42 ms | 68 ms | 95 ms |
| Build prompt (template + chunks) | 4 ms | 8 ms | 12 ms |
| LLM prefill (~3,000 tokens, Qwen 14B) | 340 ms | 510 ms | 720 ms |
| LLM decode (400 tokens at 135 t/s) | 2.96 s | 3.4 s | 3.9 s |
| Total streamed | ~3.4 s | ~4.0 s | ~4.8 s |
| TTFT (user sees first token) | ~410 ms | ~620 ms | ~890 ms |
TTFT around 410ms is well within “feels responsive” thresholds. With Llama 3.1 8B FP8 instead of Qwen 14B, total drops to ~2.1 seconds at the cost of measurably worse answer quality on legal reasoning — a trade-off the legal-tech rejected after a 2-week A/B test where senior partners marked Llama 8B answers as “needs revision” 23% more often.
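The prefill and decode rows come from client-side timing of streamed responses; a minimal sketch of that measurement, assuming the same OpenAI-compatible endpoint as the routing sketch above (it times the LLM portion only, not retrieval):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://gen-card:8000/v1", api_key="unused")

def time_generation(prompt: str) -> tuple[float, float]:
    """Return (seconds to first token, total seconds) for one streamed answer."""
    start = time.perf_counter()
    first_token_at = None
    stream = client.chat.completions.create(
        model="qwen2.5-14b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter() - start
    return first_token_at, time.perf_counter() - start
```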
Capacity and scaling triggers
| Tier | Per-4090 capacity | Notes |
|---|---|---|
| OCR ingest | 2.7M pages/day standard scans | One card sustained |
| Embedding throughput | 1B chunks/day | Mostly idle in steady state |
| Reranking | 200M pair-evaluations/day | Co-resident with embedding |
| Qwen 14B generation | ~16 active users at SLA | Production target |
| Llama 70B INT4 escalation | ~3 active users at SLA | Cold-loaded, ~10s warm-up |
| Mixed query load | ~30,000 queries/day | Burst-protected at 800/hour |
Scaling triggers for the named legal-tech workload:
- Add a generation card at 25,000 daily queries. One Qwen 14B card covers ~30k/day; beyond that p95 latency degrades.
- Promote escalation tier from cold to warm at 15% routing rate. If more than 15% of queries hit Llama 70B, the 10-second cold-load on every burst becomes painful — make it permanently warm.
- Move OCR off-cluster at 50,000+ pages/day sustained. Background ingest competes with generation if OCR runs at peak hours.
- Step generation card to 5090 32GB when escalation rate exceeds 25%. Llama 70B AWQ + KV cache for several concurrent users wants more than 24GB.
- Shard Qdrant at 50M+ vectors. Single-node HNSW p99 latency degrades visibly past this point.
For lighter-weight document QA footprints see the 5060 Ti RAG stack — works for under 5 concurrent users at SLA.
Production gotchas
- OCR confidence is not a citation guarantee. PaddleOCR returns per-character confidence; a 0.94 average can hide a 0.2 token in the middle of a citation. Always store the original page image alongside text and serve both in the answer panel.
- Chunk boundaries break on mid-sentence headers. Naive 256-token chunking splits “Section 7.2 Constructive trusts shall apply” mid-clause. Use semantic chunking (sentence/paragraph aware) for legal corpora.
- Reranker biases toward verbatim matches. BGE reranker scores literal-string overlap heavily. For paraphrased queries, augment top-100 with HyDE (hypothetical document embedding) before rerank.
- FP8 KV cache subtly reduces long-context recall. On 16k+ context with Qwen 14B FP8 KV, recall of facts in the middle of the prompt drops ~3pp. Use FP16 KV when prompt exceeds 12k.
- Prefix caching gets stale. If you template the same system prompt with a date or session ID at the top, you bust the prefix cache every request. Move dynamic content to the bottom.
- Qdrant payload writes block reads on default config. Enable WAL and set `optimizer.indexing_threshold` high during bulk ingest, low during query hours (see the sketch after this list).
- Llama 70B cold-load is 12-15 seconds. If you treat it as a fallback model loaded on demand, budget for the cold-start in your SLA. Better to keep it warm on a dedicated card.
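A sketch of that ingest-hours/query-hours toggle, assuming the qdrant-client Python API; the collection name and threshold values are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import OptimizersConfigDiff

client = QdrantClient(host="qdrant.internal", port=6333)

def set_bulk_ingest_mode(enabled: bool) -> None:
    # A high indexing_threshold defers HNSW index building during bulk upserts,
    # so writes don't contend with reads; drop it back down during query hours
    # so newly ingested points become searchable again.
    threshold = 1_000_000 if enabled else 20_000
    client.update_collection(
        collection_name="legal_chunks",
        optimizers_config=OptimizersConfigDiff(indexing_threshold=threshold),
    )
```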
Verdict: when to pick a 4090 for document QA
Pick the RTX 4090 24GB for document QA when you need self-hosted answers over a real corpus and your daily query volume is in the thousands rather than tens of thousands. The named legal-tech runs three 4090s for 4,200 daily queries against 1.4M documents at a fraction of the cost of comparable SaaS retrieval APIs. Step down to a 5060 Ti RAG stack for prototypes or sub-5-concurrent workloads. Step up to multiple 4090s before considering a single H100 — three 4090s cost less and give you natural failover. For corpora where the 70B escalation rate is high, the 5090 32GB with native FP8 KV is the next move.
Self-hosted document Q&A on one card
OCR, embed, retrieve, rerank, answer. 30,000 queries per day per card, 2.7M pages of OCR ingest. UK GPU hosting.
Order the RTX 4090 24GB
See also: SaaS RAG stack, Qwen 14B benchmark, Llama 70B INT4, Qwen 32B benchmark, PaddleOCR benchmark, 5060 Ti RAG stack, AWQ guide, vLLM setup, 4090 spec breakdown.