What You’ll Connect
After this guide, your Telegram bot will answer messages using your own GPU-hosted LLM — delivering ChatGPT-like conversations inside Telegram with multi-turn memory, markdown formatting, and streaming-style progressive replies. The bot connects to your vLLM or Ollama endpoint on dedicated GPU hardware, giving your team or customers a private AI assistant accessible from any device where Telegram runs.
The integration uses the Telegram Bot API with webhooks for real-time message delivery. Your bot server receives messages, calls your OpenAI-compatible API, and sends the response back to the chat. Progressive replies use Telegram’s message editing to simulate streaming — updating the message as new tokens arrive.
Prerequisites
- A GigaGPU server running a self-hosted LLM (setup guide)
- A Telegram Bot Token from @BotFather
- Python 3.10+ with python-telegram-bot and httpx
- A public HTTPS endpoint for the webhook (your GPU server or a separate host)
Integration Steps
Create a Telegram bot via @BotFather and save the token. Set up a webhook URL pointing to your bot server, which can run on the same GPU server or a separate lightweight host. When a user sends a message, Telegram posts it to your webhook; your server calls the GPU endpoint and sends the response back to the chat.
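Registering the webhook is a single call to the Bot API's setWebhook method. A minimal sketch (the token and URL values are placeholders for your own):

```python
API_BASE = "https://api.telegram.org"

def set_webhook_endpoint(token: str) -> str:
    """Build the Bot API setWebhook endpoint for this bot token."""
    return f"{API_BASE}/bot{token}/setWebhook"

def register_webhook(token: str, webhook_url: str) -> dict:
    """Tell Telegram to POST updates to webhook_url (must be HTTPS)."""
    import httpx  # same HTTP client the bot already uses
    resp = httpx.post(set_webhook_endpoint(token), data={"url": webhook_url})
    resp.raise_for_status()
    return resp.json()
```

If you use python-telegram-bot end to end, its Application.run_webhook method can register the URL and start the listening server for you in one step.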
Implement conversation memory by storing chat history per Telegram user in Redis or a simple dictionary. Each incoming message is appended to the user's history, the full history is sent to the LLM for context-aware replies, and the model's response is appended in turn. Cap the history at a token limit so it fits within the model's context window.
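The token cap can be approximated without a tokenizer by estimating roughly four characters per token. A sketch (the trim_history helper and the 3000-token budget are illustrative, not part of the Bot API):

```python
def trim_history(history, max_tokens=3000, chars_per_token=4):
    """Drop the oldest non-system messages until a rough token estimate fits.

    Token counts are approximated as total characters / chars_per_token;
    swap in a real tokenizer (e.g. tiktoken) for an exact count.
    """
    def estimate(msgs):
        return sum(len(m["content"]) for m in msgs) // chars_per_token

    system, rest = history[:1], history[1:]  # always keep the system prompt
    while rest and estimate(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest exchange first
    return system + rest
```

Call it just before building the request body, so the messages array always fits the model's context window regardless of how long the conversation runs.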
Add progressive replies for long responses: send an initial “Thinking…” message, then edit it with accumulated tokens every few hundred milliseconds. This simulates streaming in Telegram, which does not support true server-sent events in messages.
Code Example
Telegram bot with conversation memory and progressive replies from your self-hosted LLM:
from telegram import Update
from telegram.ext import Application, MessageHandler, filters
import httpx
import json
import time

BOT_TOKEN = "your-telegram-bot-token"
GPU_URL = "https://your-gpu-server.gigagpu.com/v1/chat/completions"
GPU_KEY = "your-api-key"

chat_history = {}  # user_id -> list of messages

async def handle_message(update: Update, context):
    user_id = update.effective_user.id
    user_text = update.message.text
    if user_id not in chat_history:
        chat_history[user_id] = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]
    chat_history[user_id].append({"role": "user", "content": user_text})

    # Send an initial placeholder, then edit it as tokens arrive
    reply = await update.message.reply_text("Thinking...")
    full_response = ""
    last_edit = time.monotonic()

    async with httpx.AsyncClient(timeout=60) as client:
        async with client.stream("POST", GPU_URL, json={
            "model": "meta-llama/Llama-3-70b-chat-hf",
            "messages": chat_history[user_id][-20:],  # last 20 messages
            "stream": True, "max_tokens": 1024,
        }, headers={"Authorization": f"Bearer {GPU_KEY}"}) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                data = json.loads(line[6:])
                token = data["choices"][0]["delta"].get("content", "")
                if not token:
                    continue
                full_response += token
                # Edit at most about once per second to stay under
                # Telegram's message-editing rate limit
                if time.monotonic() - last_edit > 1.0:
                    await reply.edit_text(full_response)
                    last_edit = time.monotonic()

    try:
        await reply.edit_text(full_response, parse_mode="Markdown")
    except Exception:
        # Fall back to plain text if the model emitted invalid markdown
        await reply.edit_text(full_response)
    chat_history[user_id].append(
        {"role": "assistant", "content": full_response}
    )

app = Application.builder().token(BOT_TOKEN).build()
# Exclude commands such as /clear so they can get their own handlers
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
app.run_polling()  # simplest for testing; switch to run_webhook in production
Testing Your Integration
Start the bot server and send a test message in Telegram. Verify the bot responds with AI-generated text and that the message updates progressively during generation. Send follow-up messages to test conversation memory — the bot should reference previous exchanges. Test the /clear command (add a handler) to reset conversation history.
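A /clear handler only needs to delete the user's entry from the history store. A sketch, assuming the chat_history dictionary from the code example (clear_user_history is a hypothetical helper, not a library function):

```python
chat_history: dict = {}  # in the real bot, reuse the dict from the code example

def clear_user_history(history: dict, user_id: int) -> bool:
    """Drop a user's stored conversation; True if there was anything to drop."""
    return history.pop(user_id, None) is not None

async def clear_command(update, context):
    cleared = clear_user_history(chat_history, update.effective_user.id)
    await update.message.reply_text(
        "Conversation cleared." if cleared else "Nothing to clear.")

# Register it next to the message handler:
# from telegram.ext import CommandHandler
# app.add_handler(CommandHandler("clear", clear_command))
```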
Test with multiple users simultaneously to confirm chat histories are isolated per user. Test with long responses to verify Telegram’s message editing works smoothly. Check the Telegram rate limits — the bot can edit a message roughly once per second.
Production Tips
Move chat history from in-memory storage to Redis for persistence across server restarts and horizontal scaling. Add user authentication if the bot is for internal use — check the Telegram user ID against an allowlist. Implement a /model command that lets users switch between models hosted on your GPU server.
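Moving the history to Redis mostly means serializing each user's message list to JSON under a per-user key. A sketch written against any client with get/set methods, such as a redis.Redis instance (the history: key prefix is an arbitrary choice):

```python
import json

def load_history(kv, user_id, system_prompt="You are a helpful assistant."):
    """Fetch a user's history from a key-value store (e.g. redis.Redis);
    fall back to a fresh history containing only the system message."""
    raw = kv.get(f"history:{user_id}")
    if raw is None:
        return [{"role": "system", "content": system_prompt}]
    return json.loads(raw)

def save_history(kv, user_id, history):
    """Persist the full message list as JSON under the user's key."""
    kv.set(f"history:{user_id}", json.dumps(history))
```

In the handler, replace the in-memory dictionary lookups with load_history at the start and save_history after appending the assistant's reply; the rest of the code is unchanged.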
For group chats, configure the bot to respond only when mentioned (@botname) to avoid triggering on every message. Add rate limiting per user to prevent GPU abuse. Build a comprehensive AI chatbot with admin controls, usage analytics, and custom personas per group. Explore more tutorials or get started with GigaGPU to power your Telegram bot.
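The mention check for group chats can be a small guard at the top of the message handler. A sketch (should_respond is a hypothetical helper; the bot username comes from context.bot.username at runtime):

```python
def should_respond(text: str, bot_username: str, chat_type: str) -> bool:
    """Always respond in private chats; in groups, only when @mentioned."""
    if chat_type == "private":
        return True
    return f"@{bot_username}" in text

# At the top of handle_message:
# if not should_respond(update.message.text or "",
#                       context.bot.username,
#                       update.effective_chat.type):
#     return
```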