Building a Spanish AI receptionist: Pipecat + Deepgram + Cartesia + Qdrant

Implementation proposal · github.com/numoru-ia/voice-agent-es

TL;DR

We built a Mexican-Spanish AI receptionist for a dental clinic: it answers the phone, books appointments, checks real availability and confirms via WhatsApp. The stack is fully controllable: Pipecat for voice orchestration (an OSS alternative to Vapi/Retell), Deepgram for STT (with local Whisper.cpp as an option), Cartesia for natural Spanish TTS (with local Coqui TTS as an option), Claude Sonnet via LiteLLM for reasoning, Qdrant for business RAG (services, hours, FAQs), Langfuse for per-call traces and Redis for per-turn context. Twilio Media Streams connects to the PSTN. Per-call cost: ~$0.11 USD. First-response latency: 900-1200 ms. The full repo and compose file are published.

Metric	Value	Detail
Cost per 2.5-min call	$0.11	Deepgram + Cartesia + Claude + Twilio
Time-to-first-speech (p50)	~920 ms	Streaming pipeline
No-show reduction	-35 to -50%	With WhatsApp confirmation
After-hours calls in dental	67%	Lost revenue if human-only

Why not Vapi or Retell

Vapi and Retell are excellent SaaS — fast setup, a good default model, nice UI. Three limits that eventually matter:

Mexican Spanish sounds Iberian in the default voice. Cartesia allows cloning and fine-tuning; ElevenLabs charges more for the same.
Telephony integration is locked to their providers; if you already have Twilio or Vonage, you fall off the happy path.
Sensitive data in healthcare and legal requires on-prem — Vapi stores audio and transcripts in their infrastructure.

Pipecat (OSS by Daily, BSD 2-Clause) is the framework that addresses all three: a declarative STT → LLM → TTS pipeline, interchangeable transports, and deployment anywhere.

Architecture

  Phone call ──► Twilio SIP Trunk / Voice
                   │
                   ├── Media Streams (mu-law 8kHz audio)
                   │
                   ▼
        ┌─────────────────────────────────────────────────┐
        │ Pipecat pipeline (Python, container)            │
        │                                                 │
        │  Audio in → VAD → Deepgram STT                  │
        │                     │                           │
        │                     ▼                           │
        │              Context Aggregator  ◄─── Redis     │
        │                     │                           │
        │                     ▼                           │
        │           LLM (Claude via LiteLLM)              │
        │                     │                           │
        │   Tool calls ──► [Qdrant RAG] [Calendar]        │
        │                     │                           │
        │                     ▼                           │
        │              Cartesia TTS                       │
        │                     │                           │
        │                     ▼                           │
        │                Audio out                        │
        └─────────────────────────────────────────────────┘
                              │
                              ▼ traces
                        Langfuse

Minimal Pipecat pipeline

Python 3.11+. File agent.py:

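import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService  # OpenAI-compatible, so it works with LiteLLM
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)
from pipecat.vad.silero import SileroVADAnalyzer

# Module paths and constructor arguments shift between Pipecat releases;
# pin the version and adjust the names below to match it.
# load_clinic_tools() and system_prompt() are project helpers, not shown here.


async def run_agent(websocket, tenant_id: str, stream_sid: str):
    # stream_sid arrives in the "start" message Twilio sends when the stream opens.
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            # The serializer handles Twilio's 8 kHz mu-law framing in both directions.
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )

    # Depending on the Pipecat version, language/model may need to be passed
    # through the service's options object instead of as direct keyword arguments.
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_KEY"),
        language="es",
        model="nova-3",
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_KEY"),
        voice_id="mx-female-warm-v1",
        language="es",
        speed=1.0,
    )

    llm = OpenAILLMService(
        api_key=os.getenv("LITELLM_MASTER_KEY"),
        base_url="https://api.numoru.com/v1",
        model="claude-sonnet",
    )

    tools = load_clinic_tools(tenant_id)
    context = OpenAILLMContext(
        messages=[{"role": "system", "content": system_prompt(tenant_id)}],
        tools=tools,
    )
    # The aggregator pair keeps user and assistant turns in the shared context.
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    runner = PipelineRunner()
    task = PipelineTask(pipeline)
    await runner.run(task)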

System prompt

Rule: short, specific, with examples. The model doesn't need 2k tokens of personality.

You are Rocío, receptionist of Numoru Dental Clinic in Querétaro.
You speak warm, professional Mexican Spanish. You reply briefly (max 2
sentences per turn unless the patient asks for detail).

Your single goal: help book, reschedule or cancel appointments, and
answer general clinic questions.

Hard rules:
- NEVER diagnose or recommend treatment.
- NEVER promise prices or durations not present in the RAG.
- If you don't know, offer to hand off to a human receptionist.
- When booking, confirm verbally and send a WhatsApp with detail.

Hours: Mon-Sat 9-19h. After-hours emergencies:
hand off to the on-call line (tool: transfer_to_emergency).

Tools exposed to the LLM

TOOLS = [
    {
        "name": "search_clinic_info",
        "description": "Search services, prices and FAQs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "find_available_slot",
        "description": "Find an available slot for a service and preference.",
        "parameters": {
            "type": "object",
            "properties": {
                "service_id": {"type": "string"},
                "from_date": {"type": "string", "format": "date"},
                "preferred_time": {"type": "string", "enum": ["morning", "afternoon", "any"]},
            },
            "required": ["service_id"],
        },
    },
    {
        "name": "book_appointment",
        "description": "Book a confirmed appointment.",
        "parameters": {
            "type": "object",
            "properties": {
                "patient_phone": {"type": "string"},
                "patient_name": {"type": "string"},
                "slot_id": {"type": "string"},
                "notes": {"type": "string"},
            },
            "required": ["patient_phone", "patient_name", "slot_id"],
        },
    },
    {
        "name": "transfer_to_human",
        "description": "Transfer the call to a human receptionist.",
        "parameters": {"type": "object", "properties": {"reason": {"type": "string"}}},
    },
]

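search_clinic_info queries Qdrant. find_available_slot queries Google Calendar via our MCP. book_appointment writes to both and fires a WhatsApp template. The system prompt also names transfer_to_emergency, which must be registered alongside these four for the after-hours hand-off to work.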
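The tool specs above are bare name/description/parameters dicts, while an OpenAI-compatible endpoint such as LiteLLM expects each one wrapped as a "function" tool. The following is a minimal sketch of that wrapping plus a dispatcher that returns structured errors; to_openai_tools, dispatch and the stub handler are hypothetical names for illustration, not code from the repo.

```python
import json
from typing import Any, Callable


def to_openai_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Wrap bare tool specs in the OpenAI function-calling envelope."""
    return [{"type": "function", "function": spec} for spec in tools]


def dispatch(handlers: dict[str, Callable[..., Any]], name: str, arguments: str) -> dict[str, Any]:
    """Run the handler for one tool call; unknown tools and bad JSON become structured errors."""
    handler = handlers.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        kwargs = json.loads(arguments or "{}")
    except json.JSONDecodeError:
        return {"ok": False, "error": "arguments are not valid JSON"}
    return {"ok": True, "result": handler(**kwargs)}


# Stand-in handler for illustration; the real one would query Qdrant.
example_tools = [{
    "name": "search_clinic_info",
    "description": "Search services, prices and FAQs.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]
wrapped = to_openai_tools(example_tools)
handlers = {"search_clinic_info": lambda query: [f"stub result for {query}"]}
reply = dispatch(handlers, "search_clinic_info", '{"query": "whitening"}')
```

Returning an error dict instead of raising keeps the failure inside the conversation, so the model can apologise, retry or hand off.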

Business RAG with Qdrant

Before launch we load:

Service catalog (100-300 items): name, approximate price, duration, description.
FAQs (30-80 items): "do you accept X insurance?", "is there parking?".
Policies (10-20 items): cancellation, late arrival, deposit.

Chunking with Chonkie to respect sentence boundaries; embedding with text-embedding-3-small via LiteLLM (768 dims, requested with the dimensions parameter; the model's default is 1536); one Qdrant collection with a payload filter on tenant_id.

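import os

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

QDRANT_URL = os.environ["QDRANT_URL"]
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

def build_retriever(tenant_id: str):
    client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
    # Restrict every query to this tenant's points.
    tenant_filter = Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))])

    def retrieve(query: str, k: int = 4):
        hits = client.search(
            collection_name="clinic_kb",
            query_vector=embed(query),  # embed(): LiteLLM embeddings helper, not shown
            query_filter=tenant_filter,
            limit=k,
        )
        return [h.payload for h in hits]  # payloads only, never the vectors
    return retrieve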

When the model calls search_clinic_info, we retrieve top-4 and return only the useful text to context (no embeddings).
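As a rough illustration of "only the useful text", the sketch below turns retrieved payloads into a short numbered block for the tool result. The "text" payload field, the character cap and the sample rows are assumptions made for the example; the actual payload schema is not shown in this article.

```python
from typing import Any


def format_hits(payloads: list[dict[str, Any]], max_chars: int = 1200) -> str:
    """Turn retrieved payloads into a compact numbered text block for the LLM."""
    lines: list[str] = []
    for payload in payloads:
        text = str(payload.get("text", "")).strip()  # "text" field name is an assumption
        if text:
            lines.append(f"[{len(lines) + 1}] {text}")
    # Hard cap so a long catalog entry cannot blow up the prompt.
    return "\n".join(lines)[:max_chars]


# Made-up payloads for illustration only.
sample = [
    {"tenant_id": "clinica-dental-123", "text": "Whitening takes about 60 minutes."},
    {"tenant_id": "clinica-dental-123", "text": ""},
    {"tenant_id": "clinica-dental-123", "text": "Free parking behind the building."},
]
context_block = format_hits(sample)
```

Numbering the snippets makes it easier to ask the model, in evals, which snippet supported an answer.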

Latency: where the milliseconds go

Across 50 test calls:

Component	p50 (ms)	p95 (ms)
STT (Deepgram Nova-3)	180	320
LLM first token (Claude Sonnet)	540	880
TTS first audio (Cartesia)	160	290
Time-to-first-speech	920	1380
Full turn	2200	3600

The pain is the LLM. Mitigations used:

stream: true on LiteLLM to start TTS on the first chunk.
<500 token prompt in most turns.
Redis semantic cache for frequent FAQs (40% hit rate in production).
Claude Haiku for initial classification, Sonnet only when intent needs reasoning.

Twilio: concrete integration

Twilio Console setup:

SIP Trunk or Voice number with TwiML:

<Response>
  <Connect>
    <Stream url="wss://voz.numoru.com/ws/clinica-dental-123" />
  </Connect>
</Response>

The websocket on our server receives mu-law 8 kHz packets and hands them to Pipecat. Outbound audio takes the same path in reverse.

Post-call WhatsApp integration

When a booking succeeds, the agent fires (as a tool side-effect) a WhatsApp Business Cloud template via our mcp-whatsapp:

Hi {{name}}, we've confirmed your appointment:
🦷 Service: {{service}}
📅 {{date}} at {{time}}
📍 {{address}}

Reply CHANGE to reschedule.

This reduces no-shows by 35-50% across our client base.
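The mcp-whatsapp interface is not shown here, so the following is only a sketch of the request body a WhatsApp Business Cloud template message typically carries, built with positional body parameters. The template name, the language code and the parameter order are assumptions; they must match whatever template was approved for the account.

```python
from typing import Any


def build_confirmation_payload(
    to_phone: str,
    name: str,
    service: str,
    date: str,
    time: str,
    address: str,
    template_name: str = "appointment_confirmation",  # hypothetical template name
    language_code: str = "es_MX",                      # assumed template language
) -> dict[str, Any]:
    """Build the JSON body for a template message with five body parameters."""
    values = [name, service, date, time, address]
    return {
        "messaging_product": "whatsapp",
        "to": to_phone,
        "type": "template",
        "template": {
            "name": template_name,
            "language": {"code": language_code},
            "components": [
                {
                    "type": "body",
                    "parameters": [{"type": "text", "text": v} for v in values],
                }
            ],
        },
    }


payload = build_confirmation_payload(
    "+524421234567", "Ana", "Limpieza", "2026-03-02", "10:00", "Av. Ejemplo 123"
)
```

Building the payload in a pure function keeps the side effect (the HTTP call) separate, which makes the booking tool easier to test.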

Guardrails

Two layers:

1. LLM guardrails (NeMo)

A declarative policy blocks responses where the agent offers diagnoses or unverified prices.

define user ask diagnosis
  "do I have a cavity?"
  "what's wrong with me?"

define bot refuse diagnosis
  "I can't diagnose that. Would you like me to book you with the doctor?"

define flow
  user ask diagnosis
  bot refuse diagnosis

2. Deterministic guardrails

Python code validates that book_appointment never books:

Outside clinic hours.
In an already-taken slot.
Without a valid phone number.

If validation fails, the tool returns a structured error and the LLM retries or transfers the call to a human.
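The three checks can be sketched as one pure function that runs before any calendar write. The clinic hours come from the system prompt (Mon-Sat 9-19h); the function name, the error codes and the E.164-style phone pattern are assumptions for illustration.

```python
import re
from datetime import datetime

OPEN_HOUR, CLOSE_HOUR = 9, 19               # Mon-Sat 9-19h, from the system prompt
PHONE_RE = re.compile(r"^\+\d{10,15}$")     # assumed E.164-style format


def validate_booking(patient_phone: str, slot_start: datetime, taken_slots: set[datetime]) -> dict:
    """Return {'ok': True} or a structured error the LLM can act on."""
    if not PHONE_RE.match(patient_phone):
        return {"ok": False, "error": "invalid_phone"}
    is_sunday = slot_start.weekday() == 6
    if is_sunday or not (OPEN_HOUR <= slot_start.hour < CLOSE_HOUR):
        return {"ok": False, "error": "outside_hours"}
    if slot_start in taken_slots:
        return {"ok": False, "error": "slot_taken"}
    return {"ok": True}


monday_10 = datetime(2026, 3, 2, 10, 0)
ok = validate_booking("+524421234567", monday_10, taken_slots=set())
taken = validate_booking("+524421234567", monday_10, taken_slots={monday_10})
```

Because the check is deterministic, a prompt regression cannot cause an out-of-hours booking; the worst case is an extra retry turn.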

Langfuse traces

Each call generates a session with spans for STT, LLM and TTS. We enrich with metadata:

tenant_id, phone_number_hashed (no plaintext PII).
booked: bool, transferred: bool, duration_s.
Hourly automated evaluations: "did the agent follow the script?", "did it contradict RAG?".
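One way to produce phone_number_hashed is a keyed hash, so the same caller maps to the same identifier without the number ever reaching the trace. This is a sketch under assumptions: the article does not say which hash is used, and hash_phone, call_metadata and the secret handling are illustrative.

```python
import hashlib
import hmac


def hash_phone(phone: str, secret: bytes) -> str:
    """Keyed SHA-256 of the digits-only phone number, hex encoded."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return hmac.new(secret, digits.encode(), hashlib.sha256).hexdigest()


def call_metadata(tenant_id: str, phone: str, secret: bytes, *,
                  booked: bool, transferred: bool, duration_s: float) -> dict:
    """Metadata dict to attach to the call's trace; contains no plaintext PII."""
    return {
        "tenant_id": tenant_id,
        "phone_number_hashed": hash_phone(phone, secret),
        "booked": booked,
        "transferred": transferred,
        "duration_s": duration_s,
    }


secret = b"load-this-from-a-secret-store"  # placeholder value
meta = call_metadata("clinica-dental-123", "+52 442 123 4567", secret,
                     booked=True, transferred=False, duration_s=148.0)
```

A keyed hash (HMAC) rather than a plain SHA-256 matters here because phone numbers have little entropy and an unkeyed hash can be reversed by enumeration.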

The Langfuse dashboard lets us detect:

Spikes in transferred calls → prompt or RAG failure.
p95 latency > 2s → Cartesia or Deepgram trouble.
Hallucinations per day → alert if >3 detected.

Deploying on DigitalOcean

On top of the OSS self-hosted stack:

services:
  voice-agent:
    image: numoru/voice-agent:latest
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      LITELLM_BASE_URL: http://litellm:4000
      DEEPGRAM_KEY: ${DEEPGRAM_KEY}
      CARTESIA_KEY: ${CARTESIA_KEY}
      LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379/2
    ports: ["8765:8765"]
    networks: [core]

Nginx routes wss://voz.numoru.com/ to voice-agent:8765. A single container handles ~40 concurrent calls on the 8 GB droplet.

Real per-call cost

Average duration: 2.5 min.

Item	USD
Deepgram STT ($0.0043/min)	0.011
Cartesia TTS ($0.065/1k chars, ~800 chars output)	0.052
Claude Sonnet (4 turns × 2k tokens avg)	0.012
Twilio PSTN ($0.013/min)	0.033
Total per call	~0.11

Volume for a typical 20-chair clinic: 2,000 calls/month → $220/month variable + $46 base stack = ~$270/month. A part-time human receptionist costs 8-12× that.
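The arithmetic behind the table and the monthly figure can be reproduced in a few lines. All unit prices are the ones quoted above; the Claude line is taken as given from the table rather than derived from token prices.

```python
CALL_MINUTES = 2.5
CALLS_PER_MONTH = 2000
BASE_STACK_USD = 46

items = {
    "deepgram_stt": 0.0043 * CALL_MINUTES,   # $/min x minutes
    "cartesia_tts": 0.065 * 800 / 1000,      # $/1k chars x ~800 chars
    "claude_sonnet": 0.012,                  # as stated in the table
    "twilio_pstn": 0.013 * CALL_MINUTES,     # $/min x minutes
}

per_call = sum(items.values())
per_call_rounded = round(per_call, 2)
monthly_total = CALLS_PER_MONTH * per_call_rounded + BASE_STACK_USD
largest_item = max(items, key=items.get)

for name, usd in items.items():
    print(f"{name:14s} {usd:.4f}")
print(f"per call       {per_call:.4f} (rounded {per_call_rounded:.2f})")
print(f"monthly        {monthly_total:.0f}")
```

Keeping the unit prices in one dict makes it easy to re-run the estimate when a vendor changes pricing or the average call length shifts.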

Figure: per-call cost breakdown (2.5-min average), dollar split across the managed vendors. By the table above, Cartesia TTS is the largest line and Twilio the second; swapping Twilio for a SIP provider like Telnyx can shave 20-30% off that line. Source: Numoru production data, February 2026 client average.

Business & commercial impact

The two leaks the product closes

Dental clinics, legal practices and elective-care providers leak money in two ways that an AI receptionist addresses directly. The first is missed calls after hours and at lunch: in dental, ~67% of patient calls happen outside 9am-6pm business hours, per ADA practice-management data. The second is same-day no-shows, with an industry average of 15-25% of bookings. Every missed call is a $150-450 lost appointment; every no-show is a $120 gap in the schedule that the clinic can't easily backfill.

Figure: missed-call distribution in a typical 20-chair dental practice. Fraction of inbound calls by time window across 6 clients running the Numoru voice stack; the grey bars are the calls human-only clinics lose. Source: Numoru client telemetry, Aug 2025 – Feb 2026, with the baseline validated against the ADA Health Policy Institute 2024 survey on patient contact patterns.

Industries and ticket ranges

Monthly pricing by vertical (Numoru 2026, USD)

Vertical	Scope	Price (USD / mo)	Term
Dental clinics	Booking + reschedule + WhatsApp confirmation + insurance eligibility FAQ.	$299 – 699	Per clinic · 12 mo
Law firms (family/immigration)	Intake, consultation scheduling, compliance-safe answers, fee quoting.	$399 – 899	12 mo · compliance addendum
Aesthetic / MedSpa	High-ticket elective services, deposit collection, up-sell upstream.	$499 – 1,200	12 mo
Veterinary	Booking, emergency triage hand-off, prescription refills.	$349 – 799	12 mo
Multi-location franchises (5+)	Central agent routing per location + analytics dashboard.	$2,400 – 5,500	Master contract
Call-center replacement (insurance, utilities)	Tier-1 triage, KYC light, hand-off to human on intent signals.	$4,000 – 12,000	Per-seat equivalent

Public case studies

Public case study · Dental practice management · USA · 2024

Dental Intelligence — industry phone benchmark

Challenge

Quantify how many inbound calls mid-size dental practices actually miss and how that translates to schedule gaps.

Solution

Aggregate telemetry from 12,000+ North American practices running their practice-management integrations, cross-referenced with appointment logs.

Results

Metric	Value	Note
Average missed call rate	31%	Across all surveyed practices
Revenue per converted call	$292	Median appointment value
Lost annual revenue	$101K	Per average 8-chair practice

Source: Dental Intelligence Industry Phone Benchmark, 2024

Public case study · Dental + med SaaS · USA · 2023

Weave — appointment reminders impact

Challenge

Test whether automated reminders (SMS + call) move the needle on no-show rates.

Solution

Controlled rollout of 2-way SMS + voice reminders across 4,000 Weave customer practices with 12-month post measurement.

Results

Metric	Value	Note
No-show reduction	-38%	On practices with confirmations ON
Answered-call uplift	+22%	On voice-enabled cohort
Patient NPS	+14 pts	Net promoter score

Source: Weave customer impact study, 2023

Public case study · Voice AI vendors · Global · 2025

Deepgram + Cartesia — Spanish voice quality benchmark

Challenge

Benchmark Spanish-language STT and TTS quality against other providers.

Solution

Deepgram published Word Error Rate data for Nova-3 on Spanish; Cartesia published MOS scores for Sonic in Mexican Spanish.

Results

Metric	Value	Note
Deepgram Nova-3 WER (es)	7.8%	Vs. 13-18% for rivals
Cartesia Sonic MOS (es-MX)	4.41 / 5	Human-preferred in blind A/B
Avg time-to-first-token	90 ms	Sonic streaming TTS

Source: Deepgram Nova-3 launch notes, Cartesia Sonic model card, 2024-2025

Illustrative case — mid-size dental group

Illustrative case · Healthcare / dental · 18 chairs · 22 staff · Mexico

4-location dental group in Querétaro deploying Numoru voice agent

Baseline

2 full-time receptionists ($2,100 / mo blended), 3,200 inbound calls / mo across locations. 28% missed-call rate outside office hours. No-show rate 19%. Average appointment value $180 USD.

Intervention

Numoru voice agent deployed per location, shared Qdrant KB, integrated with Dentrix via MCP adapter. Human receptionists moved to in-person patient service. WhatsApp confirmations active from day 1.

Projected outcome (12 mo)

Metric	Projected	Note
Calls answered	72% → 98%	Pickup rate
After-hours bookings	+$7,400 / mo	Previously $0
No-show rate	19% → 11%	~$5,100 / mo recovered
Platform cost	$1,196 / mo	4 × $299 tier
Net monthly contribution	+$11,304	Incremental vs. cost
Payback	3 weeks	Implementation one-time $4.5k

Uplift numbers anchored to Dental Intelligence 2024 and Weave 2023 cohort data. Cost assumptions calibrated to our own stack telemetry. Synthetic case — not a specific Numoru client.

ROI calculator — replacing part-time phone staff

Single-location clinic: human receptionist vs Numoru voice agent (12 months)

Payback: 2 months

Assumptions

Assumption	Value
Monthly inbound call volume	2,000
Average call duration	2.5 min
Average appointment value	$180
Booking conversion, answered	35%
Booking conversion, missed	7%
Missed-call rate (human-only)	28%
Missed-call rate (agent)	2%
No-show rate (human-only)	19%
No-show rate (agent + WhatsApp)	11%

Numoru retainer (12 mo × $399)	−$4,788
Per-call usage (24k × $0.11)	−$2,640
Implementation (one-time)	−$3,500
Recovered from missed calls (added bookings)	+$29,480
Recovered from lower no-shows	+$28,800
Human receptionist retained (not replaced)	$0
Net year-1 gross contribution	+$47,352
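As a quick check, the net figure is the plain sum of the line items in the table. In this sketch the two cost lines are recomputed from their stated formulas, while the two recovered-revenue lines are taken from the table as given and not re-derived from the assumptions.

```python
line_items = {
    "retainer": -12 * 399,                 # 12 mo x $399
    "per_call_usage": -24_000 * 0.11,      # 24k calls x $0.11
    "implementation": -3_500,              # one-time
    "recovered_missed_calls": 29_480,      # as stated in the table
    "recovered_no_shows": 28_800,          # as stated in the table
    "human_receptionist": 0,               # retained, not replaced
}

total_costs = -sum(v for v in line_items.values() if v < 0)
total_gains = sum(v for v in line_items.values() if v > 0)
net_year_1 = round(total_gains - total_costs)

print(f"costs {total_costs:,.0f}  gains {total_gains:,.0f}  net {net_year_1:,}")
```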

The agent rarely replaces humans — it captures the volume humans can't. The highest-ROI configuration keeps the human receptionist and moves them to patient-facing work, while the agent handles phones.

Pricing tiers Numoru sells

Pilot

$299 / month

Single location, 1,500 calls included.

1 location / 1 phone number
Dental or legal vertical template
Spanish (MX / CO / AR) voice
WhatsApp confirmation
Google Calendar integration
Basic Langfuse dashboard
60-day launch SLA

Practice

$699 / month

1-3 locations, 5,000 calls included.

Up to 3 locations
Dentrix / OpenDental / Abrera integration
Vertical-tuned voice
Custom system prompt
Advanced guardrails (NeMo)
Monthly evals via Promptfoo
24 / 5 human fallback SLA

Enterprise

$2,400+ / month

5+ locations or franchise.

Unlimited locations
Self-hosted option (on-prem)
SAML / SSO, audit log
Custom voice (Cartesia clone)
Compliance addendum (HIPAA-equivalent)
Dedicated CSM + eng PoC
Migration from Vapi / Retell

Usage over the included bucket bills at $0.14 / call. Self-hosted option reduces per-call cost to ~$0.05 (electricity + amortization) at >15k calls / mo.

Fully-local options (no external APIs)

If the client requires zero external calls:

STT: Whisper.cpp (medium-es) on a small GPU or quantized CPU.
TTS: Coqui TTS with fine-tuned XTTS-v2.
LLM: Llama 3.3 70B via vLLM or Qwen 2.5 32B.

Requires GPU (at least A10 or RTX 4090). Droplet cost rises to $500-800/month, but per-call drops to electricity + amortization. Break-even: ~15,000 calls/month.

Testing: evals with Promptfoo

50 synthetic scripts ("I want a Tuesday afternoon slot", "how much is whitening?") with expected answers. Suite runs in CI with Promptfoo + custom asserts. See Agent evals in CI/CD.
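A custom assert for the suite might look like the sketch below, which checks two of the system-prompt rules: at most two sentences per turn and no diagnosis. It assumes Promptfoo's file-based Python assertion convention (a get_assert(output, context) function returning a pass/score/reason dict); the banned phrases and the sentence splitter are crude illustrative heuristics, not the suite's actual asserts.

```python
import re

# Illustrative phrases that would count as diagnosing; a real list would be longer.
BANNED_PHRASES = ("tiene caries", "tienes caries", "you have a cavity")
MAX_SENTENCES = 2  # from the system prompt: max 2 sentences per turn


def count_sentences(text: str) -> int:
    """Rough sentence count: split on ., ! and ? and ignore empty pieces."""
    return len([part for part in re.split(r"[.!?]+", text) if part.strip()])


def get_assert(output: str, context: dict) -> dict:
    """Fail if the reply diagnoses or runs past the sentence limit."""
    lowered = output.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            return {"pass": False, "score": 0.0, "reason": f"diagnosis phrase: {phrase}"}
    sentences = count_sentences(output)
    if sentences > MAX_SENTENCES:
        return {"pass": False, "score": 0.0, "reason": f"{sentences} sentences, max {MAX_SENTENCES}"}
    return {"pass": True, "score": 1.0, "reason": "ok"}


good = get_assert("Claro. ¿Le agendo el martes por la tarde?", {})
bad = get_assert("Parece que tiene caries.", {})
```

Cheap string checks like these run on every script; the model-graded questions ("did it contradict RAG?") can be reserved for the cases that pass them.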

FAQ

Why not use the OpenAI Realtime default voice?
For Spanish it uses a single model with no fine-grained style control for Mexican Spanish, and it adds vendor lock-in.

Does it handle dialects outside Mexico?
Yes, by configuring Deepgram with language="es" and Cartesia with the appropriate voice. Tested in Colombia, Argentina and Chile with minor tuning.

What if the patient talks very fast or over the agent?
Silero VAD detects the interruption; the pipeline cancels the in-progress TTS response and listens. Without this, the agent sounds like a robot that talks over people.

What's the right first client?
A dental clinic or mid-size law firm with 30-100 calls/day. Below that, ROI is slow; above, you need to scale.

Integration with calendars beyond Google?
Microsoft 365 via mcp-calendar. Proprietary software (Dentrix, OpenDental) via a custom adapter, typically 2-5 days of work.

Next steps

Repo at github.com/numoru-ia/voice-agent-es. The next piece covers how to orchestrate three agents (bookings, reminders, reviews) with LangGraph for a complete dental clinic, using this voice agent as the entry point.
