
RTX 4090 24GB for Document Q&A Pipelines

A complete document Q&A pipeline on the RTX 4090 24GB: PaddleOCR, BGE embeddings and Qwen 14B / Llama 70B AWQ. A named legal-tech workload, end-to-end latency tables, ingest capacity, and scaling triggers.

Document question-answering combines OCR, chunking, embedding, retrieval, reranking and a strong generation model — six stages, any one of which can blow your latency budget. The RTX 4090 24GB hosts every component except the vector store, and on a single UK GPU host you can ingest several million pages per day and stream answers in around 3-4 seconds end-to-end, with the first token in well under a second. This post documents the named workload — a UK legal-tech with a 1.4M-document corpus serving 80 lawyers — the pipeline architecture, and the production gotchas we have hit at scale.

Named workload: UK legal-tech

The reference deployment: a UK legal-tech serving 80 fee-earner lawyers across 6 offices. Corpus: 1.4 million documents (case law, contracts, regulatory filings, internal memos), averaging 12 pages per document with substantial scanned content. Daily query volume: 4,200 queries with bursts to 800/hour. SLA: median answer under 3 seconds, p95 under 5 seconds, with verbatim citations to source paragraphs.

The workload runs on three 4090s: one dedicated to ongoing OCR + embedding (new documents arrive at ~2,000/day from clients), one for query-time retrieval and reranking, one for LLM generation with a hot Qwen 2.5 14B AWQ default and a cold-loaded Llama 3.1 70B AWQ for low-confidence escalations. Vector store is Qdrant on a separate CPU box with 256GB RAM.
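For reference, here is a minimal vLLM sketch of the hot Qwen 14B default. The model ID, context length and memory fraction are illustrative assumptions, not the exact production config:

```python
# Hot default generation model: Qwen 2.5 14B AWQ under vLLM on the generation card.
from vllm import LLM, SamplingParams

qwen = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # assumed Hugging Face model ID
    quantization="awq",
    max_model_len=16384,                    # room for ~3k-token RAG prompts plus the answer
    gpu_memory_utilization=0.90,            # caps weights + KV cache; lower it if other processes share the card
)

params = SamplingParams(temperature=0.1, max_tokens=400)
outputs = qwen.generate(["Summarise the indemnity clause in the attached extract."], params)
print(outputs[0].outputs[0].text)
```

The Llama 3.1 70B AWQ escalation path runs as a second, cold-loaded instance that is only started when needed, which is where the cold-start penalty discussed later comes from.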

Pipeline architecture

| Stage | Tool | VRAM | Throughput | Latency |
|---|---|---|---|---|
| OCR | PaddleOCR PP-OCRv4 | ~1.6 GB on demand | 50 pages/s batched | 20 ms/page |
| Chunk + clean | Python (CPU) | – | fast | negligible |
| Embedding | BGE-base-en-v1.5 | 0.8 GB | 12,000 chunks/s | 2 ms/query |
| Vector store | Qdrant top-100 | CPU box | – | 18 ms p50 |
| Reranker | BGE-reranker-large | 1.4 GB | 2,400 pairs/s | 42 ms for 100→6 |
| Generation | Qwen 14B AWQ / Llama 70B AWQ | 9.5 / 21 GB | 135 / 23 t/s | see below |

Total query-path VRAM (no OCR) comes to 0.8 + 1.4 + 9.5 = 11.7 GB if everything shares the generation card, leaving roughly 12 GB of headroom for KV cache across concurrent queries. The OCR + ingest card runs OCR (1.6 GB) and embedding (0.8 GB) co-resident, with 21 GB free for batch growth and burst protection.

OCR throughput: PaddleOCR PP-OCRv4

PaddleOCR PP-OCRv4 on the 4090 processes English A4 pages at ~50 pages/s with detection + recognition batched 8-up. Multi-column scientific or legal PDFs run closer to 25 pages/s due to higher detection complexity. Scanned historical documents with poor contrast drop to 12-15 pages/s.

| Document type | Pages/s (b=8) | Pages/day per card | Notes |
|---|---|---|---|
| Clean digital PDF | 50 | 4,320,000 | Detection bypass |
| Standard scan (300 dpi) | 32 | 2,765,000 | Production default |
| Multi-column legal PDF | 25 | 2,160,000 | Layout analysis adds time |
| Historical scan (poor) | 12 | 1,036,000 | Angle classification needed |

For the named legal-tech workload (2,000 documents/day, ~24,000 pages), OCR consumes about 12 minutes of GPU time per day. That leaves the OCR card almost entirely free for re-embedding when the BGE model is upgraded, or for parallel reranking duty during peak query hours. Bulk ingest of the initial 1.4M-document, 16.8M-page corpus works out to roughly 93 GPU-hours at the 50 pages/s clean-PDF rate (just over a day of wall-clock if spread across all three cards).
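A minimal PaddleOCR sketch of the batched setup described above. Parameter names shift slightly between PaddleOCR releases, so treat the flags as indicative rather than exact:

```python
# Batch OCR over a directory of page images with PP-OCRv4 (English).
from pathlib import Path
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    ocr_version="PP-OCRv4",
    lang="en",
    use_angle_cls=False,   # enable only for skewed historical scans; it costs throughput
    rec_batch_num=8,       # 8-up recognition batching, as in the numbers above
)

for page in sorted(Path("scans/").glob("*.png")):
    result = ocr.ocr(str(page), cls=False)
    lines = result[0] or []                       # each line: (box, (text, confidence))
    text = "\n".join(line[1][0] for line in lines)
    confidences = [line[1][1] for line in lines]
    print(page.name, len(lines), "lines, min confidence", min(confidences, default=0.0))
```

Keeping the per-line confidences alongside the text pays off later; the OCR-confidence gotcha in the production notes relies on them.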

Embedding and indexing

BGE-base-en-v1.5 at 12,000 chunks/s on the 4090 means a million-page corpus (~8M chunks at 256 tokens average) embeds in roughly 11 minutes once OCR is feeding it. We typically run OCR and embedding as two co-resident processes on the same 4090, with VRAM budgeted at 1.6 + 0.8 = 2.4 GB; plenty of headroom for concurrent reranking work.
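A minimal embedding sketch with sentence-transformers. Batch size and the example chunk are illustrative; BGE-base-en-v1.5 produces 768-dimensional vectors:

```python
# Bulk-embed cleaned chunks with BGE-base-en-v1.5 on the ingest card.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

chunks = [
    "Section 7.2 Constructive trusts shall apply where the transferee had notice...",
    # ...the rest of the OCR'd, chunked corpus
]
embeddings = model.encode(
    chunks,
    batch_size=512,              # large batches keep the 4090 saturated
    normalize_embeddings=True,   # BGE vectors are used with cosine similarity
)
print(embeddings.shape)          # (n_chunks, 768)
```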

The vector store choice matters more than the embedding model for query latency. Qdrant on a 256GB-RAM CPU box returns top-100 in 18ms p50 / 38ms p99 against 8M vectors with HNSW M=32, ef=200. pgvector hits similar p50 but worse tail latency. For corpora over 50M vectors, sharded Qdrant or Weaviate becomes necessary.
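A sketch of the Qdrant side under those settings. The collection name and host are placeholders, and ef=200 is applied here both at build time (ef_construct) and at search time (hnsw_ef), which is an assumption:

```python
# Create the HNSW-indexed collection and run a top-100 query against it.
from qdrant_client import QdrantClient, models

client = QdrantClient(host="qdrant.internal", port=6333)   # the separate 256GB-RAM CPU box

client.create_collection(
    collection_name="legal_chunks",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=32, ef_construct=200),
)

query_embedding = [0.0] * 768   # in production: the BGE-base embedding of the user query
hits = client.search(
    collection_name="legal_chunks",
    query_vector=query_embedding,
    limit=100,                                          # top-100 candidates for the reranker
    search_params=models.SearchParams(hnsw_ef=200),
)
```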

Generation models compared

| Model | Use | VRAM | Decode b=1 | Active concurrent | MMLU |
|---|---|---|---|---|---|
| Llama 3.1 8B FP8 | Fast first-pass | 10 GB | 198 t/s | 30 | 69.4 |
| Qwen 2.5 14B AWQ | Default Q&A | 9.5 GB | 135 t/s | 16 | 79.7 |
| Qwen 2.5 32B AWQ | Hard reasoning | 18 GB | 65 t/s | 6 | 83.3 |
| Llama 3.1 70B AWQ | Citations / legal | 21 GB | 23 t/s | 3-4 | 86.0 |
| Mixtral 8x7B AWQ | Multilingual | 16.5 GB | 85 t/s | 10 | 70.6 |

Choose by tolerance for latency vs answer fidelity. Qwen 14B is the default for SaaS document QA — it sits at the sweet spot of cost, latency and quality. Llama 70B INT4 wins when verbatim citation accuracy and reasoning depth matter, at roughly one-sixth the throughput and one-fifth the concurrent capacity. The legal-tech workload defaults to Qwen 14B and routes ~8% of queries (those flagged as low-confidence by a calibrated logprob threshold) to Llama 70B.
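A sketch of that confidence-gated routing, assuming both models are exposed through OpenAI-compatible vLLM endpoints. The endpoint URLs, model IDs and the threshold value are illustrative, not the calibrated production numbers:

```python
# Answer with Qwen 14B first; escalate to Llama 70B when the mean token logprob is low.
from openai import OpenAI

qwen_client = OpenAI(base_url="http://gen-card:8000/v1", api_key="EMPTY")
llama_client = OpenAI(base_url="http://escalation-card:8000/v1", api_key="EMPTY")
ESCALATION_THRESHOLD = -1.1   # calibrated offline against reviewed answers; illustrative value

def answer(prompt: str) -> str:
    first = qwen_client.chat.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        logprobs=True,
    )
    logprobs = [t.logprob for t in first.choices[0].logprobs.content]
    mean_logprob = sum(logprobs) / max(len(logprobs), 1)
    if mean_logprob >= ESCALATION_THRESHOLD:
        return first.choices[0].message.content

    # Low confidence: re-answer with the 70B escalation model.
    second = llama_client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
    )
    return second.choices[0].message.content
```

Mean logprob is a crude confidence proxy; the point is simply that the default model handles ~92% of traffic and only demonstrably uncertain answers pay the 70B latency cost.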

End-to-end latency budget

For a typical query — “What does the 2019 Smith v. Jones judgement say about constructive trust?” — against the legal corpus, end-to-end with Qwen 14B AWQ generation:

| Stage | p50 | p95 | p99 |
|---|---|---|---|
| Embed query (BGE-base) | 2 ms | 4 ms | 8 ms |
| Vector store top-100 | 18 ms | 38 ms | 62 ms |
| Rerank 100 → top-6 | 42 ms | 68 ms | 95 ms |
| Build prompt (template + chunks) | 4 ms | 8 ms | 12 ms |
| LLM prefill (~3,000 tokens, Qwen 14B) | 340 ms | 510 ms | 720 ms |
| LLM decode (400 tokens at 135 t/s) | 2.96 s | 3.4 s | 3.9 s |
| Total streamed | ~3.4 s | ~4.0 s | ~4.8 s |
| TTFT (user sees first token) | ~410 ms | ~620 ms | ~890 ms |

TTFT around 410 ms is well within “feels responsive” thresholds. With Llama 3.1 8B FP8 instead of Qwen 14B, the total drops to ~2.1 seconds at the cost of measurably worse answer quality on legal reasoning — a trade-off the legal-tech rejected after a 2-week A/B test in which senior partners marked Llama 8B answers as “needs revision” 23% more often.
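To verify the serving leg of these numbers on your own deployment, a quick streaming probe against the vLLM endpoint gives TTFT and total decode time. The endpoint and model ID are assumed, and this measures the LLM stage only; the retrieval stages add their own ~60-70 ms before the prompt is even sent:

```python
# Measure time-to-first-token and total streamed time for the LLM stage only.
import time
from openai import OpenAI

client = OpenAI(base_url="http://gen-card:8000/v1", api_key="EMPTY")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    messages=[{"role": "user", "content": "What does the 2019 Smith v. Jones judgement say about constructive trust?"}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content and ttft is None:
        ttft = time.perf_counter() - start   # first visible token
total = time.perf_counter() - start
ttft = ttft if ttft is not None else total

print(f"TTFT {ttft * 1000:.0f} ms, total {total:.2f} s")
```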

Capacity and scaling triggers

| Tier | Per-4090 capacity | Notes |
|---|---|---|
| OCR ingest | 2.7M pages/day standard scans | One card sustained |
| Embedding throughput | 1B chunks/day | Mostly idle in steady state |
| Reranking | 200M pair-evaluations/day | Co-resident with embedding |
| Qwen 14B generation | ~16 active users at SLA | Production target |
| Llama 70B INT4 escalation | ~3 active users at SLA | Cold-loaded, ~10 s warm-up |
| Mixed query load | ~30,000 queries/day | Burst-protected at 800/hour |

Scaling triggers for the named legal-tech workload:

  • Add a generation card at 25,000 daily queries. One Qwen 14B card covers ~30k/day; beyond that p95 latency degrades.
  • Promote escalation tier from cold to warm at 15% routing rate. If more than 15% of queries hit Llama 70B, the 10-second cold-load on every burst becomes painful — make it permanently warm.
  • Move OCR off-cluster at 50,000+ pages/day sustained. Background ingest competes with generation if OCR runs at peak hours.
  • Step generation card to 5090 32GB when escalation rate exceeds 25%. Llama 70B AWQ + KV cache for several concurrent users wants more than 24GB.
  • Shard Qdrant at 50M+ vectors. Single-node HNSW p99 latency degrades visibly past this point.

For lighter-weight document QA footprints see the 5060 Ti RAG stack — works for under 5 concurrent users at SLA.

Production gotchas

  • OCR confidence is not a citation guarantee. PaddleOCR returns per-character confidence; a 0.94 average can hide a 0.2 token in the middle of a citation. Always store the original page image alongside text and serve both in the answer panel.
  • Chunk boundaries break on mid-sentence headers. Naive 256-token chunking splits “Section 7.2 Constructive trusts shall apply” mid-clause. Use semantic chunking (sentence/paragraph aware) for legal corpora; a minimal sketch follows this list.
  • Reranker biases toward verbatim matches. BGE reranker scores literal-string overlap heavily. For paraphrased queries, augment top-100 with HyDE (hypothetical document embedding) before rerank.
  • FP8 KV cache subtly reduces long-context recall. On 16k+ context with Qwen 14B FP8 KV, recall of facts in the middle of the prompt drops ~3pp. Use FP16 KV when prompt exceeds 12k.
  • Prefix caching gets stale. If you template the same system prompt with a date or session ID at the top, you bust the prefix cache every request. Move dynamic content to the bottom.
  • Qdrant payload writes block reads on default config. Enable WAL and set optimizer.indexing_threshold high during bulk ingest, low during query hours.
  • Llama 70B cold-load is 12-15 seconds. If you treat it as a fallback model loaded on demand, budget for the cold-start in your SLA. Better to keep it warm on a dedicated card.
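On the chunking point above, a minimal sentence-aware chunker looks like the sketch below. The tokeniser choice and the 256-token target are illustrative, and production code would swap the regex for a proper sentence segmenter:

```python
# Accumulate whole sentences into ~256-token chunks instead of cutting mid-clause.
import re
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")   # match the embedding model
MAX_TOKENS = 256

def chunk(text: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(tok.encode(sentence, add_special_tokens=False))
        if current and current_len + n > MAX_TOKENS:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```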

Verdict: when to pick a 4090 for document QA

Pick the RTX 4090 24GB for document QA when you need self-hosted answers over a real corpus and your daily query volume is in the thousands rather than tens of thousands. The named legal-tech runs three 4090s for 4,200 daily queries against 1.4M documents at a fraction of the cost of comparable SaaS retrieval APIs. Step down to a 5060 Ti RAG stack for prototypes or sub-5-concurrent workloads. Step up to multiple 4090s before considering a single H100 — three 4090s cost less and give you natural failover. For corpora where the 70B escalation rate is high, the 5090 32GB with native FP8 KV is the next move.

Self-hosted document Q&A on one card

OCR, embed, retrieve, rerank, answer. 30,000 queries per day per card, 2.7M pages of OCR ingest. UK GPU hosting.

Order the RTX 4090 24GB

See also: SaaS RAG stack, Qwen 14B benchmark, Llama 70B INT4, Qwen 32B benchmark, PaddleOCR benchmark, 5060 Ti RAG stack, AWQ guide, vLLM setup, 4090 spec breakdown.
