TL;DR
We built an AI receptionist for a dental clinic in Mexican Spanish that answers the phone, books appointments, checks real availability and confirms via WhatsApp. A fully controllable stack: Pipecat for voice orchestration (OSS alternative to Vapi/Retell), Deepgram for STT (with local Whisper.cpp as option), Cartesia for natural Spanish TTS (with local Coqui TTS as option), Claude Sonnet via LiteLLM for reasoning, Qdrant for business RAG (services, hours, FAQs), Langfuse for per-call traces and Redis for per-turn context. Twilio Media Streams connects to the PSTN. Per-call cost: ~$0.11 USD. First-response latency: 900-1200 ms. The full repo and compose are published.
Why not Vapi or Retell
Vapi and Retell are excellent SaaS — fast setup, a good default model, nice UI. Three limits that eventually matter:
- Mexican Spanish sounds Iberian with the default voice. Cartesia allows cloning and fine-tuning; ElevenLabs charges more for the same.
- Telephony integration is locked to their providers; if you already have Twilio or Vonage, you fall off the happy path.
- Sensitive data in healthcare and legal requires on-prem — Vapi stores audio and transcripts in their infrastructure.
Pipecat (OSS by Daily.co, BSD 2-Clause) is the framework that solves this: a declarative STT → LLM → TTS pipeline, interchangeable transports, and deployment anywhere.
Architecture
Phone call ──► Twilio SIP Trunk / Voice
                         │
                         ├── Media Streams (mu-law 8kHz audio)
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│  Pipecat pipeline (Python, container)           │
│                                                 │
│  Audio in → VAD → Deepgram STT                  │
│                        │                        │
│                        ▼                        │
│               Context Aggregator ◄─── Redis     │
│                        │                        │
│                        ▼                        │
│            LLM (Claude via LiteLLM)             │
│                        │                        │
│         Tool calls ──► [Qdrant RAG] [Calendar]  │
│                        │                        │
│                        ▼                        │
│                  Cartesia TTS                   │
│                        │                        │
│                        ▼                        │
│                    Audio out                    │
└─────────────────────────────────────────────────┘
                         │
                         ▼ traces
                      Langfuse
Minimal Pipecat pipeline
Python 3.11+. File agent.py:
import os

# Import paths and parameter names follow the Pipecat release this was written
# against; they have moved between versions, so check your installed release.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import BaseOpenAILLMService  # works with LiteLLM
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# load_clinic_tools() and system_prompt() are project helpers, not shown here.


async def run_agent(websocket, tenant_id: str):
    # Transport: Twilio Media Streams arriving over a FastAPI websocket.
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketTransport.InputParams(
            audio_sample_rate=8000,  # Twilio mu-law
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            serializer="twilio",
        ),
    )

    # Speech-to-text.
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_KEY"),
        language="es",
        model="nova-3",
    )

    # Text-to-speech.
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_KEY"),
        voice_id="mx-female-warm-v1",
        language="es",
        speed=1.0,
    )

    # LLM: LiteLLM exposes an OpenAI-compatible endpoint that proxies to Claude.
    llm = BaseOpenAILLMService(
        api_key=os.getenv("LITELLM_MASTER_KEY"),
        base_url="https://api.numoru.com/v1",
        model="claude-sonnet",
    )

    # Conversation state: per-tenant system prompt plus the tool schemas.
    tools = load_clinic_tools(tenant_id)
    context = OpenAILLMContext(
        messages=[{"role": "system", "content": system_prompt(tenant_id)}],
        tools=tools,
    )

    # Frames flow top to bottom: audio in, transcript, LLM reply, audio out.
    pipeline = Pipeline([
        transport.input(),
        stt,
        context.user_aggregator(),
        llm,
        tts,
        transport.output(),
        context.assistant_aggregator(),
    ])

    runner = PipelineRunner()
    task = PipelineTask(pipeline)
    await runner.run(task)
System prompt
Rule: short, specific, with examples. The model doesn't need 2k tokens of personality.
You are Rocío, receptionist of Numoru Dental Clinic in Querétaro.
You speak warm, professional Mexican Spanish. You reply briefly (max 2
sentences per turn unless the patient asks for detail).
Your single goal: help book, reschedule or cancel appointments, and
answer general clinic questions.
Hard rules:
- NEVER diagnose or recommend treatment.
- NEVER promise prices or durations not present in the RAG.
- If you don't know, offer to hand off to a human receptionist.
- When booking, confirm verbally and send a WhatsApp with detail.
Hours: Mon-Sat 9-19h. After-hours emergencies:
hand off to the on-call line (tool: transfer_to_emergency).
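The pipeline code calls system_prompt(tenant_id) without showing it. A minimal sketch of how such a helper could render a short per-tenant prompt follows; the TenantProfile fields, the TENANTS registry and the abbreviated template are assumptions for illustration, not the code in the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantProfile:
    """Hypothetical per-tenant settings; field names are illustrative."""
    agent_name: str
    clinic_name: str
    city: str
    hours: str


# Hypothetical in-memory registry; a real deployment would load this from config.
TENANTS = {
    "clinica-dental-123": TenantProfile(
        agent_name="Rocío",
        clinic_name="Numoru Dental Clinic",
        city="Querétaro",
        hours="Mon-Sat 9-19h",
    ),
}

# Abbreviated version of the prompt above, with the tenant-specific parts as fields.
PROMPT_TEMPLATE = (
    "You are {agent_name}, receptionist of {clinic_name} in {city}.\n"
    "You speak warm, professional Mexican Spanish. You reply briefly (max 2\n"
    "sentences per turn unless the patient asks for detail).\n"
    "Hours: {hours}."
)


def system_prompt(tenant_id: str) -> str:
    """Render the prompt for one tenant; raises KeyError for unknown tenants."""
    profile = TENANTS[tenant_id]
    return PROMPT_TEMPLATE.format(
        agent_name=profile.agent_name,
        clinic_name=profile.clinic_name,
        city=profile.city,
        hours=profile.hours,
    )
```

Keeping the hard rules in the shared template and only the names and hours per tenant is one way to stop the prompt from drifting between clinics.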
Tools exposed to the LLM
TOOLS = [
    {
        "name": "search_clinic_info",
        "description": "Search services, prices and FAQs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "find_available_slot",
        "description": "Find an available slot for a service and preference.",
        "parameters": {
            "type": "object",
            "properties": {
                "service_id": {"type": "string"},
                "from_date": {"type": "string", "format": "date"},
                "preferred_time": {"type": "string", "enum": ["morning", "afternoon", "any"]},
            },
            "required": ["service_id"],
        },
    },
    {
        "name": "book_appointment",
        "description": "Book a confirmed appointment.",
        "parameters": {
            "type": "object",
            "properties": {
                "patient_phone": {"type": "string"},
                "patient_name": {"type": "string"},
                "slot_id": {"type": "string"},
                "notes": {"type": "string"},
            },
            "required": ["patient_phone", "patient_name", "slot_id"],
        },
    },
    {
        "name": "transfer_to_human",
        "description": "Transfer the call to a human receptionist.",
        "parameters": {"type": "object", "properties": {"reason": {"type": "string"}}},
    },
]
search_clinic_info queries Qdrant. find_available_slot queries Google Calendar via our MCP. book_appointment writes to both + fires a WhatsApp template.
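How a tool call from the model reaches one of these handlers is not shown. A minimal dispatcher sketch, assuming the model returns the tool name plus a JSON string of arguments (the usual OpenAI-style shape), could look like this. REQUIRED_ARGS mirrors the "required" lists above; dispatch_tool and the error codes are hypothetical names.

```python
import json
from typing import Any, Callable

# Required arguments per tool, mirroring the "required" lists in TOOLS.
REQUIRED_ARGS = {
    "search_clinic_info": ["query"],
    "find_available_slot": ["service_id"],
    "book_appointment": ["patient_phone", "patient_name", "slot_id"],
    "transfer_to_human": [],
}


def dispatch_tool(
    name: str,
    raw_arguments: str,
    handlers: dict[str, Callable[..., Any]],
) -> dict:
    """Validate a tool call and run its handler, returning a structured result."""
    if name not in handlers:
        return {"ok": False, "error": "unknown_tool"}
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        return {"ok": False, "error": "invalid_json"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "invalid_json"}
    missing = [key for key in REQUIRED_ARGS.get(name, []) if key not in args]
    if missing:
        return {"ok": False, "error": "missing_arguments", "missing": missing}
    return {"ok": True, "result": handlers[name](**args)}
```

Returning a structured error instead of raising lets the LLM see what went wrong and either retry with corrected arguments or hand off.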
Business RAG with Qdrant
Before launch we load:
- Service catalog (100-300 items): name, approximate price, duration, description.
- FAQs (30-80 items): "do you accept X insurance?", "is there parking?".
- Policies (10-20 items): cancellation, late arrival, deposit.
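Chunking with Chonkie to respect sentence boundaries; embeddings with text-embedding-3-small via LiteLLM (768 dims, which assumes the dimensions parameter is set, since the model defaults to 1536); one Qdrant collection, filtered by the tenant_id payload field.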
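from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

# QDRANT_URL, QDRANT_API_KEY and embed() come from project config, not shown here.


def build_retriever(tenant_id: str):
    client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

    def retrieve(query: str, k: int = 4):
        hits = client.search(
            collection_name="clinic_kb",
            query_vector=embed(query),
            # Tenant isolation: only this clinic's points are searched.
            query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]),
            limit=k,
        )
        return [h.payload for h in hits]

    return retrieve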
When the model calls search_clinic_info, we retrieve top-4 and return only the useful text to context (no embeddings).
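Returning only the useful text can be sketched as a small formatter over the retrieved payloads. The payload keys ("title", "text") and the character budget are assumptions; the actual payload schema is not shown in this article.

```python
def format_hits(payloads: list[dict], max_chars: int = 1200) -> str:
    """Turn retrieved payloads into a compact bullet list for the LLM context.

    Only the hypothetical "title" and "text" fields are kept; anything else
    in the payload (ids, vectors, tenant_id) is dropped.
    """
    lines: list[str] = []
    used = 0
    for payload in payloads:
        title = str(payload.get("title", "")).strip()
        text = str(payload.get("text", "")).strip()
        if not text:
            continue
        line = f"- {title}: {text}" if title else f"- {text}"
        # Stop before exceeding the budget so no entry is cut mid-sentence.
        if used + len(line) + 1 > max_chars:
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)
```

A hard character budget keeps the tool result from inflating the prompt, which matters because the LLM is the slowest stage of each turn.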
Latency: where the milliseconds go
Across 50 test calls:
| Component | p50 (ms) | p95 (ms) |
|---|---|---|
| STT (Deepgram Nova-3) | 180 | 320 |
| LLM first token (Claude Sonnet) | 540 | 880 |
| TTS first audio (Cartesia) | 160 | 290 |
| Time-to-first-speech | 920 | 1380 |
| Full turn | 2200 | 3600 |
The pain is the LLM. Mitigations used:
- stream: true on LiteLLM to start TTS on the first chunk.
- <500 token prompt in most turns.
- Redis semantic cache for frequent FAQs (40% hit rate in production).
- Claude Haiku for initial classification, Sonnet only when intent needs reasoning.
Twilio: concrete integration
Twilio Console setup:
- SIP Trunk or Voice number with TwiML:
<Response>
  <Connect>
    <Stream url="wss://voz.numoru.com/ws/clinica-dental-123" />
  </Connect>
</Response>
The websocket on our server receives mu-law 8kHz packets and hands them to Pipecat. Outbound audio takes the same path in reverse.
Post-call WhatsApp integration
When a booking succeeds, the agent fires (as a tool side-effect) a WhatsApp Business Cloud template via our mcp-whatsapp:
Hi {{name}}, we've confirmed your appointment:
🦷 Service: {{service}}
📅 {{date}} at {{time}}
📍 {{address}}
Reply CHANGE to reschedule.
This reduces no-shows by 35-50% across our client base.
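For reference, a template message sent through the WhatsApp Business Cloud API is a JSON body with the template name, a language code and the body parameters. The sketch below builds that body with positional parameters; the template name "appointment_confirmation" and the function itself are assumptions, and how mcp-whatsapp wraps this call is not shown here.

```python
def build_confirmation_payload(
    to_phone: str,
    name: str,
    service: str,
    date: str,
    time: str,
    address: str,
    template_name: str = "appointment_confirmation",  # hypothetical template name
    language_code: str = "es_MX",
) -> dict:
    """Build the request body for a WhatsApp template message.

    Parameters are positional: their order must match the order of the
    placeholders in the approved template (name, service, date, time, address).
    """
    values = [name, service, date, time, address]
    return {
        "messaging_product": "whatsapp",
        "to": to_phone,
        "type": "template",
        "template": {
            "name": template_name,
            "language": {"code": language_code},
            "components": [
                {
                    "type": "body",
                    "parameters": [{"type": "text", "text": v} for v in values],
                }
            ],
        },
    }
```

Building the payload in a pure function keeps it easy to test without touching the network; the HTTP call and retry policy live elsewhere.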
Guardrails
Two layers:
1. LLM guardrails (NeMo)
A declarative policy blocks responses where the agent offers diagnoses or unverified prices.
define user ask diagnosis
  "do I have a cavity?"
  "what's wrong with me?"

define bot refuse diagnosis
  "I can't diagnose that. Would you like me to book you with the doctor?"

define flow
  user ask diagnosis
  bot refuse diagnosis
2. Deterministic guardrails
Python code validates that book_appointment never books:
- Outside clinic hours.
- In an already-taken slot.
- Without a valid phone number.
If it fails, the tool returns a structured error and the LLM retries or transfers to human.
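The three checks can be sketched as a pure function that runs before anything is written to the calendar. The Mon-Sat 9-19h hours come from the system prompt; the phone pattern (a loose E.164-style check), the function name and the error codes are assumptions.

```python
import re
from datetime import datetime, timedelta

OPEN_HOUR, CLOSE_HOUR = 9, 19                   # Mon-Sat 9-19h, per the system prompt
PHONE_RE = re.compile(r"^\+?[1-9]\d{9,14}$")    # loose E.164-style check (assumption)


def validate_booking(
    start: datetime,
    duration_min: int,
    phone: str,
    taken_starts: set[datetime],
) -> dict:
    """Return {"ok": True} or a structured error the LLM can act on."""
    end = start + timedelta(minutes=duration_min)
    closing = start.replace(hour=CLOSE_HOUR, minute=0, second=0, microsecond=0)
    # weekday() == 6 is Sunday; the appointment must also end by closing time.
    if start.weekday() == 6 or start.hour < OPEN_HOUR or end > closing:
        return {"ok": False, "error": "outside_hours"}
    if start in taken_starts:
        return {"ok": False, "error": "slot_taken"}
    digits = re.sub(r"[\s\-()]", "", phone)     # drop spaces, dashes, parentheses
    if not PHONE_RE.match(digits):
        return {"ok": False, "error": "invalid_phone"}
    return {"ok": True}
```

Because the function never trusts the model's arguments, a hallucinated slot or a mistranscribed phone number fails here rather than in the clinic's calendar.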
Langfuse traces
Each call generates a session with spans for STT, LLM and TTS. We enrich with metadata:
- tenant_id, phone_number_hashed (no plaintext PII).
- booked: bool, transferred: bool, duration_s.
- Hourly automated evaluations: "did the agent follow the script?", "did it contradict RAG?".
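One way to produce phone_number_hashed is a keyed hash, so the same caller maps to the same value across calls but the number cannot be recovered from the trace. This is a sketch under assumptions: HMAC-SHA256 with a server-side secret is my choice here, not necessarily what the stack uses, and call_metadata is a hypothetical helper whose output would be attached to the trace.

```python
import hashlib
import hmac


def hash_phone(phone: str, secret: bytes) -> str:
    """Keyed hash of the digits only, so formatting differences do not matter."""
    normalized = "".join(ch for ch in phone if ch.isdigit())
    return hmac.new(secret, normalized.encode(), hashlib.sha256).hexdigest()


def call_metadata(
    tenant_id: str,
    phone: str,
    secret: bytes,
    booked: bool,
    transferred: bool,
    duration_s: float,
) -> dict:
    """Build the per-call metadata dict; the plaintext number never leaves here."""
    return {
        "tenant_id": tenant_id,
        "phone_number_hashed": hash_phone(phone, secret),
        "booked": booked,
        "transferred": transferred,
        "duration_s": duration_s,
    }
```

A plain unsalted SHA-256 would be weaker: phone numbers have so little entropy that every possible number can be hashed and matched, which is why the sketch uses a secret key.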
The Langfuse dashboard lets us detect:
- Spikes in transferred calls → prompt or RAG failure.
- p95 latency > 2s → Cartesia or Deepgram trouble.
- Hallucinations per day → alert if >3 detected.
Deploying on Digital Ocean
On top of the OSS self-hosted stack:
services:
  voice-agent:
    image: numoru/voice-agent:latest
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      LITELLM_BASE_URL: http://litellm:4000
      DEEPGRAM_KEY: ${DEEPGRAM_KEY}
      CARTESIA_KEY: ${CARTESIA_KEY}
      LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379/2
    ports: ["8765:8765"]
    networks: [core]
Nginx routes wss://voz.numoru.com/ to voice-agent:8765. A single container handles ~40 concurrent calls on the 8 GB droplet.
Real per-call cost
Average duration: 2.5 min.
| Item | USD |
|---|---|
| Deepgram STT ($0.0043/min) | 0.011 |
| Cartesia TTS ($0.065/1k chars, ~800 chars output) | 0.052 |
| Claude Sonnet (4 turns × 2k tokens avg) | 0.012 |
| Twilio PSTN ($0.013/min) | 0.033 |
| Total per call | ~0.11 |
Volume for a typical 20-bed clinic: 2000 calls/month → $220/month variable + $46 base stack = ~$270/month. A part-time human receptionist costs 8-12× that.
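The arithmetic behind the table and the monthly figure can be reproduced in a few lines. The rates are the ones listed above; the Claude Sonnet line is taken as given, since its per-token rates are not broken out here.

```python
AVG_CALL_MIN = 2.5       # average call duration, minutes
CALLS_PER_MONTH = 2000
BASE_STACK_USD = 46      # fixed monthly cost of the self-hosted stack

line_items = {
    "deepgram_stt": 0.0043 * AVG_CALL_MIN,     # $/min
    "cartesia_tts": 0.065 * (800 / 1000),      # $/1k chars, ~800 chars per call
    "claude_sonnet": 0.012,                    # taken as given from the table
    "twilio_pstn": 0.013 * AVG_CALL_MIN,       # $/min
}

per_call = sum(line_items.values())
per_call_rounded = round(per_call, 2)

# Same method as the text: round the per-call cost first, then scale.
monthly = round(CALLS_PER_MONTH * per_call_rounded + BASE_STACK_USD, 2)

for item, usd in line_items.items():
    print(f"{item:15s} {usd:.3f}")
print(f"per call        {per_call_rounded:.2f}")
print(f"monthly         {monthly:.0f}")
```

The unrounded per-call sum is $0.10725, so the monthly total lands at roughly $260 to $266 depending on where the rounding happens; the text rounds that to ~$270.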
Dollar split across the managed vendors. Per the table above, Cartesia TTS is the biggest line and Twilio the second; swapping Twilio for a SIP provider like Telnyx can shave 20-30% off telephony.
Numoru production data, February 2026 client average.
Business & commercial impact
The two leaks the product closes
Dental clinics, legal practices and elective-care providers leak money in two ways that an AI receptionist addresses directly: missed calls after hours and over lunch (in dental, ~67% of patient calls happen outside 9am-6pm business hours per ADA practice-management data), and same-day no-shows (industry average 15-25% of bookings). Every missed call is a $150-450 lost appointment; every no-show is a $120 gap in the schedule that the clinic can't easily backfill.
Fraction of inbound calls by time window across 6 clients running the Numoru voice stack. Human-only clinics lose the grey bars.
Numoru client telemetry, Aug 2025 – Feb 2026. Baseline validated against ADA Health Policy Institute 2024 survey on patient contact patterns.
Industries and ticket ranges
Monthly pricing by vertical (Numoru 2026, USD)
Public case studies
Dental Intelligence — industry phone benchmark
Weave — appointment reminders impact
Deepgram + Cartesia — Spanish voice quality benchmark
Illustrative case — mid-size dental group
4-location dental group in Querétaro deploying Numoru voice agent
ROI calculator — replacing part-time phone staff
Single-location clinic: human receptionist vs Numoru voice agent (12 months)
| Item | USD |
|---|---|
| Numoru retainer (12 mo × $399) | −$4,788 |
| Per-call usage (24k × $0.11) | −$2,640 |
| Implementation (one-time) | −$3,500 |
| Recovered from missed calls (added bookings) | +$29,480 |
| Recovered from lower no-shows | +$28,800 |
| Human receptionist retained (not replaced) | $0 |
| Net year-1 gross contribution | +$47,352 |
The agent rarely replaces humans — it captures the volume humans can't. The highest-ROI configuration keeps the human receptionist and moves them to patient-facing work, while the agent handles phones.
Pricing tiers Numoru sells
- 1 location / 1 phone number
- Dental or legal vertical template
- Spanish (MX / CO / AR) voice
- WhatsApp confirmation
- Google Calendar integration
- Basic Langfuse dashboard
- 60-day launch SLA
- Up to 3 locations
- Dentrix / OpenDental / Abrera integration
- Vertical-tuned voice
- Custom system prompt
- Advanced guardrails (NeMo)
- Monthly evals via Promptfoo
- 24 / 5 human fallback SLA
- Unlimited locations
- Self-hosted option (on-prem)
- SAML / SSO, audit log
- Custom voice (Cartesia clone)
- Compliance addendum (HIPAA-equivalent)
- Dedicated CSM + eng PoC
- Migration from Vapi / Retell
Usage over the included bucket bills at $0.14 / call. Self-hosted option reduces per-call cost to ~$0.05 (electricity + amortization) at >15k calls / mo.
Fully-local options (no external APIs)
If the client requires zero external calls:
- STT: Whisper.cpp (medium-es) on a small GPU or quantized CPU.
- TTS: Coqui TTS with fine-tuned XTTS-v2.
- LLM: Llama 3.3 70B via vLLM or Qwen 2.5 32B.
Requires GPU (at least A10 or RTX 4090). Droplet cost rises to $500-800/month, but per-call drops to electricity + amortization. Break-even: ~15,000 calls/month.
Testing: evals with Promptfoo
50 synthetic scripts ("I want a Tuesday afternoon slot", "how much is whitening?") with expected answers. Suite runs in CI with Promptfoo + custom asserts. See Agent evals in CI/CD.
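A custom assert for that suite could look like the sketch below. It assumes Promptfoo's Python assertion hook, which loads a file and calls get_assert(output, context), and it checks two of the hard rules from the system prompt: replies of at most two sentences, and no diagnosis. The DIAGNOSIS_HINTS phrases are hypothetical examples, not the list used in the real suite.

```python
import re

# Hypothetical phrases that would indicate the agent is diagnosing.
DIAGNOSIS_HINTS = ("tienes caries", "necesitas una endodoncia", "es una infección")
MAX_SENTENCES = 2


def count_sentences(text: str) -> int:
    """Rough sentence count: non-empty chunks between ., ! or ? marks."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def get_assert(output: str, context) -> dict:
    """Return a grading result with pass/score/reason for one model reply."""
    lowered = output.lower()
    for hint in DIAGNOSIS_HINTS:
        if hint in lowered:
            return {"pass": False, "score": 0.0, "reason": f"diagnosis phrase: {hint}"}
    sentences = count_sentences(output)
    if sentences > MAX_SENTENCES:
        return {"pass": False, "score": 0.5, "reason": f"{sentences} sentences, max {MAX_SENTENCES}"}
    return {"pass": True, "score": 1.0, "reason": "ok"}
```

Keyword checks like this are cheap and deterministic, so they suit CI; fuzzier questions such as "did it contradict RAG?" are better left to model-graded evals.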
FAQ
Why not use OpenAI Realtime default voice? For Spanish it uses a single model with no fine style control for Mexican, plus vendor lock-in.
Does it handle dialects outside Mexico? Yes, by configuring Deepgram with language="es" and Cartesia with the appropriate voice. Tested in Colombia, Argentina and Chile with minor tuning.
What if the patient talks very fast or over the agent? Silero VAD detects the interruption; the pipeline cancels the in-progress TTS response and listens. Without this, it sounds like an interrupting robot.
What's the right first client? A dental clinic or mid-size law firm with 30-100 calls/day. Below that, ROI is slow; above, you need to scale.
Integration with calendars beyond Google? Microsoft 365 via mcp-calendar. Proprietary software (Dentrix, OpenDental) via a custom adapter, which takes 2-5 days of work.
Next steps
Repo at github.com/numoru-ia/voice-agent-es. The next piece covers how to orchestrate three agents (bookings, reminders, reviews) with LangGraph for a complete dental clinic, using this voice agent as the entry point.