Enterprise search is mostly a permissions problem. A 5,000-employee company has millions of documents scattered across SharePoint, Google Drive, Confluence, Slack and Jira, each with its own ACL model. A self-hosted pipeline on the RTX 5060 Ti 16GB on our UK dedicated GPU hosting gives you the GPU horsepower to embed and rerank at enterprise scale (about 10,000 BGE-base embeddings per second) while keeping access-control enforcement inside your perimeter.
Sources
| Source | Connector | Permission model |
|---|---|---|
| SharePoint Online | Microsoft Graph /sites/*/drive | AAD groups, item-level |
| Google Drive / Workspace | Drive API with domain-wide delegation | User/group permissions |
| Confluence Cloud | REST API + space restrictions | Space + page restrictions |
| Slack | Events API + channel history scopes | Channel membership |
| Jira | REST /search?jql= | Project + issue security schemes |
| GitHub Enterprise | GraphQL + OAuth app | Repo visibility + team access |
Airbyte, Unstructured.io or Nuclia connectors handle most of this; for niche sources a custom extractor costs a few hundred lines of Python.
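A custom extractor's job is just to emit text chunks with their ACL attached at ingest time. A minimal sketch of that shape, with illustrative names (`Chunk`, `extract_wiki`, the `restricted_to` field) that are not a real connector API:

```python
# Minimal shape of a custom extractor: yield chunks with their ACL readers
# attached at ingest time. Names are illustrative, not a real connector API.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Chunk:
    doc_id: str
    text: str
    readers: list[str]  # group IDs allowed to see this chunk

def extract_wiki(pages: list[dict]) -> Iterator[Chunk]:
    """Turn raw pages from an internal wiki export into ACL-tagged chunks."""
    for page in pages:
        # Unrestricted pages fall back to a catch-all group.
        readers = page.get("restricted_to") or ["all-employees"]
        body = page["body"]
        # Naive fixed-size chunking; swap in a sentence-aware splitter in practice.
        for i in range(0, len(body), 1000):
            yield Chunk(f"{page['id']}#{i // 1000}", body[i:i + 1000], readers)
```

The essential discipline is that `readers` travels with every chunk from the connector onward, so the index never holds text it cannot gate.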
ACL-aware retrieval
Store each chunk with a readers field (list of group IDs that can see it). At query time, expand the user’s group memberships via AAD/Google and filter Qdrant results to readers IN user_groups. Check current permissions against the source system on the hot path for the top-5 results to avoid stale ACLs. Re-index nightly and on explicit permission-change webhooks where available.
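The two gates above can be sketched in plain Python: the first is the equivalent of the `readers IN user_groups` payload filter, the second is the hot-path recheck of the top-5. The `verify_fn` callback and the chunk dicts are illustrative stand-ins, not a specific client API:

```python
# Sketch of the ACL gate around retrieval: pre-filter by the readers payload,
# then live-check only the top-5 against the source system's current ACLs.
from typing import Callable

def acl_search(hits: list[dict], user_groups: set[str],
               verify_fn: Callable[[str, str], bool], user: str) -> list[dict]:
    # First gate: payload-filter equivalent (readers IN user_groups).
    visible = [h for h in hits if user_groups & set(h["readers"])]
    # Second gate: hot-path recheck of the top-5 against the source system,
    # so a permission revoked since the last re-index cannot leak.
    top, rest = visible[:5], visible[5:]
    return [h for h in top if verify_fn(user, h["doc_id"])] + rest
```

Checking only the top-5 keeps the live permission calls off the long tail, where the extra latency would not buy much: stale results below the fold are rarely rendered.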
Stack
- BGE-M3 multilingual embedder via TEI
- Qdrant with payload filtering on `readers` and `source`
- BGE reranker v2 for precision on top-50
- Mistral 7B FP8 or Llama 3.1 8B FP8 for optional AI-answered queries with citations
- OpenSearch alongside for BM25 lexical channel
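The stack names both a vector channel and a BM25 channel but not a fusion method; reciprocal rank fusion (RRF) is one common choice, sketched here as an illustration rather than the documented pipeline:

```python
# Merge ranked ID lists from the BM25 and vector channels with reciprocal
# rank fusion: each list contributes 1/(k + rank) per document.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between OpenSearch and Qdrant, which is why it is popular for hybrid retrieval; the fused top-50 then goes to the reranker.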
Scale
| Corpus size | Embedding time (5060 Ti) | Qdrant storage (int8) |
|---|---|---|
| 100k docs | ~20 s | ~100 MB |
| 1M docs | ~3.5 min | ~1 GB |
| 10M docs | ~35 min | ~10 GB |
| 100M chunks | ~5.6 h | ~100 GB |
Query serving: the same card sustains hundreds of searches per second. Typical end-to-end latency is 80-150 ms without an AI answer, 2-3 s with one.
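The table's timings are consistent with roughly 5,000 chunks/s sustained for BGE-M3 (its 1024-dim vectors at one byte per dimension also match the int8 storage column). Both figures should be benchmarked on your own hardware; a quick planning helper under those assumptions:

```python
# Capacity-planning arithmetic behind scale tables like the one above.
# Throughput and dimensions are parameters: benchmark your own numbers.
def embed_hours(n_chunks: int, chunks_per_sec: float) -> float:
    """Wall-clock hours to embed a corpus at a sustained throughput."""
    return n_chunks / chunks_per_sec / 3600

def storage_gb(n_chunks: int, dim: int = 1024, bytes_per_val: int = 1) -> float:
    """int8-quantized vectors: one byte per dimension, payload excluded."""
    return n_chunks * dim * bytes_per_val / 1e9
```

For example, `embed_hours(100_000_000, 5000)` gives about 5.6 hours, matching the last table row.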
Deployment checklist
- SSO via SAML or OIDC against the same IdP that owns your source systems
- Encrypt embeddings in transit and at rest (TLS for Qdrant traffic; disk encryption at the OS layer)
- Audit log every query with user, timestamp and returned IDs for compliance
- Rate limit per user to prevent scraping
- Redact PII during ingestion if the policy requires it
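The audit-log item in the checklist reduces to one structured record per query. A sketch emitting JSON lines, with illustrative field names; ship them to whatever log pipeline you already run:

```python
# One JSON line per query: user, timestamp, query text, and returned chunk
# IDs, as the compliance checklist requires. Field names are illustrative.
import json
from datetime import datetime, timezone

def audit_record(user: str, query: str, returned_ids: list[str]) -> str:
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "query": query,
        "returned": returned_ids,
    }, sort_keys=True)
```

Logging the returned chunk IDs, not just the query, is what lets you answer "who saw this document" when a permissions incident is investigated later.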
Private enterprise search: ACL-aware retrieval on Blackwell 16GB, UK dedicated hosting. Order the RTX 5060 Ti 16GB.

See also: embedding throughput, document Q&A, internal tooling, RAG stack install, SaaS RAG.