Slack AI is convenient, but it feeds every DM, channel message and wiki page into a third-party pipeline with uncertain residency guarantees. A Slack bot backed by a self-hosted Llama 3 8B on the RTX 5060 Ti 16GB at our UK dedicated GPU hosting does the same job without leaving your threat model. The Blackwell card delivers 4608 CUDA cores, 16 GB GDDR7 and native FP8, giving you 112 t/s single-stream and around 720 t/s aggregate across a team.
Contents
- Bolt architecture and socket mode
- Latency budget per message
- Team-size capacity
- Company RAG over wiki and drive
- Slack rate limits and streaming
Architecture
Use the Slack Bolt framework (Python or Node) in socket mode to avoid exposing a public webhook. The bot process is CPU-only and can sit next to your GPU box or on any VPS. It receives events over a websocket, forwards the message plus a system prompt to vLLM, streams the reply back in chunks, and posts to the originating thread using chat.postMessage with mrkdwn enabled.
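A minimal sketch of that loop in Python Bolt. The vLLM URL, model name and env-var names are illustrative assumptions; slack_bolt is imported lazily inside `main()` so the request-building helpers stay testable without a Slack workspace.

```python
import json
import os
import urllib.request

# Assumed endpoint: vLLM's OpenAI-compatible server on your GPU box.
VLLM_URL = os.environ.get("VLLM_URL", "http://gpu-box:8000/v1/chat/completions")
SYSTEM_PROMPT = "You are a concise internal assistant. Answer in Slack mrkdwn."

def build_payload(user_text: str) -> dict:
    """Assemble the chat request sent to vLLM's OpenAI-compatible API."""
    return {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 300,
    }

def ask_llm(user_text: str) -> str:
    """Blocking (non-streamed) call to vLLM; see the streaming section below."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def main():
    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=os.environ["SLACK_BOT_TOKEN"])

    @app.event("app_mention")
    def handle_mention(event, say):
        reply = ask_llm(event["text"])
        # Reply in the originating thread; mrkdwn is the Web API default.
        say(text=reply, thread_ts=event.get("thread_ts") or event["ts"])

    # Socket mode: outbound websocket, no public webhook to expose.
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()

if __name__ == "__main__" and os.environ.get("SLACK_APP_TOKEN"):
    main()  # only starts when Slack tokens are configured
```

Socket mode needs an app-level token (`xapp-…`) with the `connections:write` scope in addition to the bot token.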
Latency budget
| Stage | Typical time |
|---|---|
| Slack event delivery | 50-150 ms |
| RAG retrieval (optional) | 100-200 ms |
| Llama 3 8B FP8 first token | 120 ms |
| 300 output tokens at 112 t/s | 2.7 s |
| Slack post (streamed update) | 100 ms per chunk |
| Perceived total | ~0.5 s to first word, ~3 s to completion |
Stream with chat.update every 30-50 tokens so the user sees the reply grow in real time instead of waiting three seconds for a blob.
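That update loop can be sketched as below, assuming a slack_sdk `WebClient` and a token iterator coming off vLLM's streaming response; the batching helper is pure so it can be reused with any client.

```python
from typing import Iterable, Iterator

def chunk_tokens(tokens: Iterable[str], every: int = 40) -> Iterator[str]:
    """Yield the accumulated reply text every `every` tokens (and once at the
    end), so each yield corresponds to exactly one chat.update call."""
    buf: list[str] = []
    since_flush = 0
    for tok in tokens:
        buf.append(tok)
        since_flush += 1
        if since_flush >= every:
            yield "".join(buf)
            since_flush = 0
    if since_flush:
        yield "".join(buf)

def stream_reply(client, channel: str, thread_ts: str, tokens: Iterable[str]):
    """Post a placeholder, then grow it in place as tokens arrive.
    `client` is assumed to be a slack_sdk WebClient (or anything with the
    same chat_postMessage / chat_update methods)."""
    first = client.chat_postMessage(channel=channel, thread_ts=thread_ts, text="…")
    ts = first["ts"]  # message timestamp doubles as its id for chat.update
    for partial in chunk_tokens(tokens):
        client.chat_update(channel=channel, ts=ts, text=partial)
```

At 112 t/s, a 40-token flush interval means an update roughly every 360 ms, comfortably inside Slack's per-channel limits for a single reply.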
Team-size capacity
| Team size | Typical active chats/min | Headroom on one 5060 Ti |
|---|---|---|
| 50 people | 2-5 | Trivial |
| 500 people | 15-30 | Comfortable |
| 2,000 people | 60-100 | Fine with 16 concurrent streams, 300 t/s reserved |
| 5,000 people | 150-250 | Viable; add a second card at ~400 chats/min |
Company RAG
Index your Confluence space, Google Drive folders, GitHub wiki and selected Slack channels. Embed with BGE-M3 (5,000 docs/sec on the 5060 Ti), store in Qdrant, retrieve top-20, rerank with BGE cross-encoder, feed top-5 chunks to Llama 3 8B. The bot answers with citations so users can click through to the source. See our document Q&A guide and RAG stack install.
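The retrieve-20 / rerank / keep-5 stage can be sketched as a small pure function. The `search` and `rerank` callables stand in for a Qdrant vector query over BGE-M3 embeddings and a BGE cross-encoder respectively; both names and the citation format are illustrative assumptions.

```python
from typing import Callable, List

def rag_context(
    question: str,
    search: Callable[[str, int], List[str]],          # e.g. Qdrant top-k query
    rerank: Callable[[str, List[str]], List[float]],  # e.g. cross-encoder scores
    retrieve_k: int = 20,
    keep_k: int = 5,
) -> str:
    """Retrieve retrieve_k candidate chunks, rerank them against the question,
    keep the best keep_k, and format them as numbered, citable context."""
    candidates = search(question, retrieve_k)
    scores = rerank(question, candidates)
    ranked = sorted(zip(scores, candidates), reverse=True)[:keep_k]
    return "\n\n".join(f"[{i + 1}] {doc}" for i, (_, doc) in enumerate(ranked))
```

The numbered `[n]` prefixes are what let the model cite chunks in its answer; in practice each chunk would carry its source URL so the bot can render clickable citations.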
Slack rate limits
Slack limits chat.postMessage to roughly one message per second per channel (short bursts are tolerated), and chat.update falls under the Web API's tiered per-method limits. Queue updates and use exponential backoff on 429s, honouring the Retry-After header Slack sends with them. For some methods, Marketplace-listed apps get more generous tiers than unlisted ones. vLLM happily handles the backpressure with its built-in request queue.
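A backoff sketch, assuming the caller wraps the Web API call so a rate-limit surfaces as a dict like `{"status": 429, "retry_after": seconds}` rather than an exception (slack_sdk raises `SlackApiError` on 429, so a real wrapper would catch that and read the Retry-After header):

```python
import random
import time

def post_with_backoff(post, max_tries: int = 5, base: float = 1.0) -> dict:
    """Call `post()` until it stops returning 429. Prefer Slack's Retry-After
    hint; otherwise fall back to exponential backoff with jitter."""
    for attempt in range(max_tries):
        resp = post()
        if resp.get("status") != 429:
            return resp
        delay = resp.get("retry_after")
        if delay is None:
            delay = base * (2 ** attempt) + random.uniform(0, base)  # jitter
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_tries} tries")
```

Wrapping every chat.postMessage and chat.update in this keeps a chatty bot from hammering a channel that is already throttled.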
Private Slack AI, your keys, your rack
Llama 3 8B for team knowledge. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: chatbot backend, internal tooling, Llama 3 8B benchmark, SaaS RAG, FP8 Llama deployment.