What You’ll Build
In about three hours, you will have a voice-powered AI scheduler that answers phone calls, understands appointment requests in natural speech, checks real-time calendar availability, negotiates suitable time slots, confirms bookings, and sends follow-up reminders via SMS or email. A single dedicated GPU server handles roughly 8 to 25 simultaneous calls, depending on the GPU, with natural-sounding voice interactions.
Missed calls cost businesses an estimated 20-30% of potential bookings. Hiring receptionists for after-hours and overflow calls adds significant payroll cost. A voice agent running on open-source models provides 24/7 scheduling capability without per-minute telephony AI charges, handling routine booking calls so staff can focus on in-person service.
Architecture Overview
The scheduler chains four components: three GPU-resident models (Whisper for speech-to-text, an LLM served through vLLM for conversational understanding and scheduling logic, and Coqui TTS for natural speech synthesis) plus a telephony bridge that connects to your phone system via SIP or a provider API. LangChain orchestrates the conversation flow with tool calling for calendar API access.
The LLM maintains conversation state including the caller’s preferences, available slots, and booking constraints. It accesses the calendar system through function calling to check availability and create appointments in real time. The voice pipeline operates in a streaming fashion: Whisper transcribes in chunks, the LLM generates response text, and TTS begins speaking before the full response is generated, keeping the conversational feel natural with minimal pauses.
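One way to let TTS start speaking before the LLM has finished is to cut the token stream into sentences and hand each sentence to the synthesizer as soon as it is complete. The sketch below shows only that chunking step in plain Python. `fake_llm_stream` is a hypothetical stand-in for a streaming LLM response, and the boundary rule (punctuation followed by whitespace) is an assumption that would need tuning for abbreviations such as "p.m.".

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Sure", "!", " I", " have", " 2", " pm", " or", " 4", " pm",
                ".", " Which", " works", "?"]


# Each item would be passed to the TTS engine as soon as it is yielded.
spoken = list(sentences_from_tokens(fake_llm_stream()))
```

Sentence-level chunks are a common compromise: smaller pieces lower the time to first audio, but very short fragments tend to sound choppy when synthesized independently.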
GPU Requirements
| Call Volume | Recommended GPU | VRAM | Concurrent Calls |
|---|---|---|---|
| Up to 50 calls/day | RTX 5090 | 32 GB | ~8 simultaneous |
| 50 – 200 calls/day | RTX 6000 Pro | 40 GB | ~15 simultaneous |
| 200+ calls/day | RTX 6000 Pro 96 GB | 96 GB | ~25 simultaneous |
All three models (Whisper, LLM, TTS) must reside in VRAM simultaneously for real-time voice interaction. Whisper small or medium suffices for telephony audio quality. A fast 8B LLM provides the response speed needed for natural conversation. See our self-hosted LLM guide for voice pipeline model sizing.
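A rough way to check that the three models fit together is to estimate weight memory as parameter count times bytes per parameter. The sketch below does that arithmetic at FP16 (2 bytes per parameter). The Whisper medium figure (about 0.769B parameters) is the published model size; the 0.5B TTS figure is an assumption for illustration, and the result ignores KV cache, activations, and framework overhead, all of which grow with the number of concurrent calls.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


# FP16 weights for each model in the voice stack.
budget = {
    "whisper_medium": weights_gb(0.769),
    "llm_8b": weights_gb(8.0),
    "tts": weights_gb(0.5),  # assumption: TTS model size is illustrative
}
weights_total = sum(budget.values())


def headroom_gb(vram_gb: float) -> float:
    """VRAM left for KV cache, activations, and audio buffers."""
    return vram_gb - weights_total


left_on_40gb = headroom_gb(40.0)
```

Under these assumptions the weights alone take about 18.5 GB, so the remaining headroom is what limits concurrent calls. Quantizing the LLM (for example to roughly 0.5 bytes per parameter at 4-bit) shifts the balance considerably.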
Step-by-Step Build
Set up your GPU server with Whisper, vLLM, and Coqui TTS. Configure the telephony bridge to route inbound calls to your server. Build the conversation manager that coordinates the speech pipeline and maintains call state with calendar integration.
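The conversation manager described above can be sketched as a small class that keeps per-call state and calls the three pipeline stages in order. This is a minimal, non-streaming sketch: `CallState`, `ConversationManager`, and the `transcribe`, `generate`, and `speak` callables are hypothetical names, and the lambdas at the end are stubs standing in for Whisper, the vLLM-served model, and Coqui TTS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CallState:
    """Per-call memory: who said what, in order."""
    call_id: str
    history: List[Tuple[str, str]] = field(default_factory=list)

    def add_turn(self, speaker: str, text: str) -> None:
        self.history.append((speaker, text))

    def history_text(self, max_turns: int = 6) -> str:
        """Render the most recent turns for a prompt's history slot."""
        recent = self.history[-max_turns:]
        return "\n".join(f"{speaker}: {text}" for speaker, text in recent)


class ConversationManager:
    """Coordinates speech-to-text, the LLM, and TTS for each active call."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],
        generate: Callable[[str, str], str],
        speak: Callable[[str], bytes],
    ) -> None:
        self.transcribe = transcribe
        self.generate = generate
        self.speak = speak
        self.calls: Dict[str, CallState] = {}

    def handle_audio(self, call_id: str, audio: bytes) -> bytes:
        """Process one caller utterance and return the agent's audio reply."""
        state = self.calls.setdefault(call_id, CallState(call_id))
        text = self.transcribe(audio)
        history = state.history_text()  # history before this turn
        state.add_turn("caller", text)
        reply = self.generate(text, history)
        state.add_turn("agent", reply)
        return self.speak(reply)

    def end_call(self, call_id: str) -> None:
        self.calls.pop(call_id, None)


# Stub stages so the sketch runs without any models loaded.
manager = ConversationManager(
    transcribe=lambda audio: audio.decode("utf-8"),
    generate=lambda text, history: f"You said: {text}",
    speak=lambda reply: reply.encode("utf-8"),
)
reply_audio = manager.handle_audio("call-1", b"I need a haircut")
```

Keeping state keyed by call ID lets one process serve many simultaneous calls, and capping the rendered history keeps the prompt short enough for fast responses.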
```python
# Voice scheduler conversation prompt
SCHEDULER_PROMPT = """You are a friendly appointment scheduler for {business_name}.
Available services: {services_list}
Business hours: {hours}
Current availability: {available_slots}
Caller said: {transcribed_text}
Conversation history: {history}
Instructions:
- Greet warmly and ask what they need
- Offer 2-3 available time slots
- Confirm: name, service, date, time, phone number
- If no slots work, suggest alternatives
- Keep responses under 30 words for natural phone conversation
Available tools:
- check_availability(date, service) -> list of slots
- create_booking(name, service, datetime, phone) -> confirmation
- send_reminder(booking_id, method) -> sent"""
```
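The three tools named in the prompt need concrete implementations behind them. The sketch below backs them with an in-memory dictionary so the logic can be tested without a real calendar. `OPEN_SLOTS`, `BOOKINGS`, the ISO-style `YYYY-MM-DDTHH:MM` format for `datetime`, and the returned dictionary shapes are all assumptions; a real deployment would replace the dictionary with calls to your calendar API and an SMS or email gateway.

```python
import itertools
from typing import Dict, List, Tuple

# Hypothetical in-memory calendar: (date, service) -> open start times.
OPEN_SLOTS: Dict[Tuple[str, str], List[str]] = {
    ("2025-03-10", "haircut"): ["09:00", "11:30", "15:00"],
}
BOOKINGS: Dict[str, dict] = {}
_booking_ids = itertools.count(1)


def check_availability(date: str, service: str) -> List[str]:
    """Return open start times for a service on a given date."""
    return list(OPEN_SLOTS.get((date, service), []))


def create_booking(name: str, service: str, datetime: str, phone: str) -> dict:
    """Book a slot; `datetime` is assumed to look like '2025-03-10T11:30'."""
    date, time = datetime.split("T")
    slots = OPEN_SLOTS.get((date, service), [])
    if time not in slots:
        return {"ok": False, "error": "slot_unavailable"}
    slots.remove(time)  # the slot is no longer offered to other callers
    booking_id = f"B{next(_booking_ids):04d}"
    BOOKINGS[booking_id] = {
        "name": name, "service": service, "datetime": datetime, "phone": phone,
    }
    return {"ok": True, "booking_id": booking_id}


def send_reminder(booking_id: str, method: str) -> bool:
    """Pretend to send a reminder; returns False for unknown input."""
    if booking_id not in BOOKINGS or method not in ("sms", "email"):
        return False
    # A real implementation would call an SMS or email gateway here.
    return True


first = create_booking("Ana", "haircut", "2025-03-10T11:30", "555-0100")
second = create_booking("Ben", "haircut", "2025-03-10T11:30", "555-0101")
```

Returning a structured failure instead of raising lets the LLM see why a booking did not go through and offer alternatives, which matches the "If no slots work, suggest alternatives" instruction in the prompt.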
The confirmation flow verifies details by reading them back to the caller and handling corrections. Post-call, the system sends an SMS or email confirmation with booking details and a link to reschedule. Follow the voice agent server guide for implementing the streaming audio pipeline.
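The read-back and correction steps can be sketched as three small helpers: one that finds fields still missing, one that builds the confirmation sentence, and one that applies a caller's correction. The field names follow the prompt's "Confirm" instruction; the function names and the exact wording of the read-back are assumptions.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("name", "service", "date", "time", "phone")


def missing_fields(booking: Dict[str, str]) -> List[str]:
    """List required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not booking.get(f)]


def read_back(booking: Dict[str, str]) -> str:
    """Build a short confirmation sentence to speak to the caller."""
    return (
        f"To confirm: {booking['name']}, {booking['service']} on "
        f"{booking['date']} at {booking['time']}. "
        f"Your number is {booking['phone']}. Is that correct?"
    )


def apply_correction(booking: Dict[str, str], field_name: str, value: str) -> Dict[str, str]:
    """Return a copy of the booking with one field corrected."""
    if field_name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown field: {field_name}")
    updated = dict(booking)
    updated[field_name] = value
    return updated


draft = {"name": "Ana", "service": "haircut", "date": "March 10",
         "time": "11:30", "phone": ""}
still_needed = missing_fields(draft)
draft = apply_correction(draft, "phone", "555-0100")
draft = apply_correction(draft, "time", "15:00")  # caller changed their mind
confirmation = read_back(draft)
```

The booking would only be written to the calendar once `missing_fields` returns an empty list and the caller has agreed to the read-back.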
Performance and Call Quality
On an RTX 6000 Pro running the full voice stack, end-to-end response latency from caller speech to agent speech averages 1.1 seconds, which feels natural in phone conversation. Whisper achieves 94% transcription accuracy on telephony-quality audio. Appointment booking success rate reaches 87% for straightforward single-service bookings and 72% for complex multi-service or rescheduling requests.
The system gracefully handles edge cases: unintelligible speech triggers a polite re-ask, requests outside business capabilities transfer to a human queue, and caller interruptions are handled through voice activity detection. Call recordings and transcripts are stored locally for quality monitoring and training improvement.
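The re-ask and transfer rules can be expressed as a small routing function that runs on every caller turn. In this sketch, `confidence` is a hypothetical score between 0 and 1 (for example, derived from the speech recognizer's log-probabilities), `in_scope` is assumed to come from the LLM or an intent classifier, and the thresholds are illustrative. Handing off to a human after repeated re-asks is an added assumption, not something the pipeline above specifies.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"   # proceed with normal scheduling dialogue
    REASK = "reask"         # politely ask the caller to repeat
    TRANSFER = "transfer"   # hand the call to a human queue


def route_turn(
    transcript: str,
    confidence: float,
    in_scope: bool,
    reasks_so_far: int,
    min_confidence: float = 0.6,
    max_reasks: int = 2,
) -> Action:
    """Decide how to handle one caller turn."""
    unintelligible = not transcript.strip() or confidence < min_confidence
    if unintelligible:
        # Assumption: stop re-asking after a few attempts and escalate.
        if reasks_so_far < max_reasks:
            return Action.REASK
        return Action.TRANSFER
    if not in_scope:
        return Action.TRANSFER
    return Action.CONTINUE


clear_request = route_turn("Book a haircut tomorrow", 0.92, True, 0)
mumbled = route_turn("uh the thing", 0.31, True, 0)
billing_dispute = route_turn("I want a refund for last month", 0.95, False, 0)
```

Interruption handling is left out here because it lives in the audio layer: voice activity detection would stop TTS playback when the caller starts speaking, before any transcript reaches this function.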
Deploy Your Voice Scheduler
A voice-enabled AI scheduler captures every potential booking call, 24 hours a day, without per-minute telephony AI fees. Keep call recordings and customer data on your own infrastructure for privacy compliance. Launch on GigaGPU dedicated GPU hosting and stop losing bookings to missed calls. Explore more use case build patterns in our library.