Discord bots that wrap the OpenAI API rack up surprising monthly bills once a community gets active, and per-message policy questions ("is the server being used to train models?") keep resurfacing. Running a self-hosted Llama 3 8B on the RTX 5060 Ti 16GB at our UK dedicated GPU hosting removes both problems. 4608 Blackwell CUDA cores, 16 GB of GDDR7 and native FP8 deliver 112 t/s single-stream and around 720 t/s aggregate, enough to support a large, chatty community on one card.
Contents
- discord.py / discord.js architecture
- Model selection
- Community capacity
- Voice-channel ASR option
- Monthly cost vs SaaS
Architecture
A single Python process using discord.py (or Node with discord.js) connects to the Discord gateway, listens for on_message and slash-command events, and forwards requests to vLLM. Example skeleton:
```python
import discord
from openai import AsyncOpenAI

# llm points at the local vLLM server's OpenAI-compatible endpoint
llm = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

@bot.tree.command(name="ask")
async def ask(interaction: discord.Interaction, prompt: str):
    await interaction.response.defer(thinking=True)
    resp = await llm.chat.completions.create(
        model="llama-3.1-8b-fp8",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buf, last_len = "", 0
    async for chunk in resp:
        buf += chunk.choices[0].delta.content or ""
        # Edit at most every ~200 new characters to stay inside
        # Discord's message-edit rate limit
        if len(buf) - last_len >= 200:
            await interaction.edit_original_response(content=buf)
            last_len = len(buf)
    await interaction.edit_original_response(content=buf)
```
Model selection
| Use case | Model | Why |
|---|---|---|
| General chat | Llama 3.1 8B FP8 | 112 t/s, friendly tone, strong factuality |
| Coding servers | Qwen 2.5 Coder 7B | Strong Python/JS/Go completion |
| Fast one-liners | Phi-3 mini FP8 | 285 t/s for instant replies |
| /image commands | SDXL Lightning 4-step | ~2 s per 1024×1024 on the same card |
| Long context (50k+) | Qwen 2.5 14B AWQ | 70 t/s with 32k context |
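If you serve more than one of the models above, a simple pattern is one vLLM instance per model (VRAM permitting on a single 16 GB card; otherwise one model per box) and a routing table from slash command to backend. A minimal sketch, where the ports and model identifiers are assumptions to be matched to your own deployment:

```python
# Hypothetical routing table: one vLLM instance per model, each on its own port.
# Port numbers and model names are examples, not fixed values.
MODEL_BACKENDS = {
    "ask":   ("http://localhost:8000/v1", "llama-3.1-8b-fp8"),
    "code":  ("http://localhost:8001/v1", "qwen2.5-coder-7b"),
    "quick": ("http://localhost:8002/v1", "phi-3-mini-fp8"),
}

def backend_for(command: str) -> tuple[str, str]:
    """Return (base_url, model) for a slash command, defaulting to general chat."""
    return MODEL_BACKENDS.get(command, MODEL_BACKENDS["ask"])
```

Each slash-command handler then builds its client from `backend_for(...)` instead of hard-coding one endpoint.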
Capacity
| Server size | Typical msg/min to bot | 5060 Ti headroom |
|---|---|---|
| 1,000 members | 5-15 | Enormous |
| 10,000 members | 30-80 | Comfortable |
| 50,000 members | 150-300 | Tight; add second card at 400+ msg/min |
Discord imposes roughly 5 slash-command responses/sec per guild and 50 messages/sec globally per bot; vLLM's request queue comfortably sits within those ceilings.
Voice-channel ASR
Discord voice is Opus over UDP. Use discord-ext-voice-recv to capture PCM, feed into Whisper Turbo (roughly 1 minute of audio transcribed per second of GPU time on the 5060 Ti), and emit transcripts or summaries back to text channels. Optional diarisation with pyannote adds ~20 percent overhead. See our Whisper benchmark.
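Discord's voice stack decodes to 48 kHz 16-bit stereo PCM, while Whisper expects 16 kHz mono float samples, so the captured audio needs a small conversion step before transcription. A naive sketch (straight decimation; a production pipeline should low-pass filter before downsampling to avoid aliasing):

```python
import struct

def discord_pcm_to_whisper(pcm: bytes) -> list[float]:
    """Convert 48 kHz stereo s16le PCM (as captured from a Discord voice
    channel) into 16 kHz mono float samples in [-1, 1] for Whisper.

    Naive decimation: keep one frame in three, average the two channels.
    """
    n_frames = len(pcm) // 4              # 2 channels x 2 bytes per frame
    samples = struct.unpack(f"<{n_frames * 2}h", pcm[: n_frames * 4])
    out = []
    for i in range(0, n_frames, 3):       # 48 kHz -> 16 kHz
        left, right = samples[2 * i], samples[2 * i + 1]
        out.append(((left + right) / 2) / 32768.0)
    return out
```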
Cost
| Community profile / month | OpenAI GPT-4o-mini | Self-hosted 5060 Ti |
|---|---|---|
| 10k members, 2 interactions/user/day | ~£240 | Flat £300 |
| 50k members, 1 interaction/user/day | ~£600 | Flat £300 |
| Add /image (200/day) | ~£500 (DALL-E 3) | Same box |
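The table's API-side figures can be reproduced with back-of-envelope arithmetic. The sketch below assumes GPT-4o-mini pricing of roughly £0.12 per million input tokens and £0.48 per million output tokens, and ~500 input plus ~700 output tokens per interaction (system prompt and context included); all four numbers are assumptions chosen to roughly match the table:

```python
# Assumed pricing and token counts -- adjust to your own traffic.
IN_PRICE, OUT_PRICE = 0.12 / 1e6, 0.48 / 1e6   # GBP per token
IN_TOK, OUT_TOK = 500, 700                      # tokens per interaction

def monthly_api_cost(members: int, interactions_per_user_day: float) -> float:
    """Estimated monthly API bill in GBP for a community of this size."""
    calls = members * interactions_per_user_day * 30
    return calls * (IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE)

def break_even_calls_per_month(flat_cost: float = 300.0) -> float:
    """Monthly interactions above which a flat-rate box beats per-token billing."""
    return flat_cost / (IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE)
```

Under these assumptions the flat £300 box wins somewhere above ~750k interactions a month; past that point every extra message is free on the self-hosted card.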
Unlimited Discord bot replies
Blackwell 16GB for community AI. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: chatbot backend, internal tooling, Whisper benchmark, Llama 3 8B benchmark, webinar transcription.