Building a Spanish AI receptionist: Pipecat + Deepgram + Cartesia + Qdrant

Spanish-language voice agent for dental clinics and small law firms. OSS stack: Pipecat for orchestration, Deepgram/Whisper.cpp for STT, Cartesia/Coqui TTS for voice, Qdrant for business RAG, Langfuse for traces, Redis for context.

Numoru Engineering · Published on May 3, 2026 · 16 min read
Implementation proposal: github.com/numoru-ia/voice-agent-es

TL;DR

We built an AI receptionist for a dental clinic in Mexican Spanish that answers the phone, books appointments, checks real availability and confirms via WhatsApp. A fully controllable stack: Pipecat for voice orchestration (OSS alternative to Vapi/Retell), Deepgram for STT (with local Whisper.cpp as option), Cartesia for natural Spanish TTS (with local Coqui TTS as option), Claude Sonnet via LiteLLM for reasoning, Qdrant for business RAG (services, hours, FAQs), Langfuse for per-call traces and Redis for per-turn context. Twilio Media Streams connects to the PSTN. Per-call cost: ~$0.11 USD. First-response latency: 900-1200 ms. The full repo and compose are published.

  • $0.11: cost per 2.5-min call (Deepgram + Cartesia + Claude + Twilio)
  • ~920 ms: time-to-first-speech, p50 (streaming pipeline)
  • -35 to -50%: no-show reduction (with WhatsApp confirmation)
  • 67%: after-hours calls in dental (lost revenue if human-only)

Why not Vapi or Retell

Vapi and Retell are excellent SaaS — fast setup, a good default model, nice UI. Three limits that eventually matter:

  1. Mexican Spanish sounds Iberian in the default voices. Cartesia allows cloning and fine-tuning; ElevenLabs charges more for the same.
  2. Telephony integration is locked to their providers; if you already have Twilio or Vonage, you fall off the happy path.
  3. Sensitive data in healthcare and legal requires on-prem — Vapi stores audio and transcripts in their infrastructure.

Pipecat (OSS by Daily.co, BSD 2-Clause) is the framework that solves this: declarative STT → LLM → TTS pipeline, interchangeable transports, deploy anywhere.

Architecture

  Phone call ──► Twilio SIP Trunk / Voice
                   │
                   ├── Media Streams (mu-law 8kHz audio)
                   │
                   ▼
        ┌─────────────────────────────────────────────────┐
        │ Pipecat pipeline (Python, container)            │
        │                                                 │
        │  Audio in → VAD → Deepgram STT                  │
        │                     │                           │
        │                     ▼                           │
        │              Context Aggregator  ◄─── Redis     │
        │                     │                           │
        │                     ▼                           │
        │           LLM (Claude via LiteLLM)              │
        │                     │                           │
        │   Tool calls ──► [Qdrant RAG] [Calendar]        │
        │                     │                           │
        │                     ▼                           │
        │              Cartesia TTS                       │
        │                     │                           │
        │                     ▼                           │
        │                Audio out                        │
        └─────────────────────────────────────────────────┘
                              │
                              ▼ traces
                        Langfuse

Minimal Pipecat pipeline

Python 3.11+. File agent.py:

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService  # OpenAI-compatible, so it works with LiteLLM
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# load_clinic_tools() and system_prompt() are assumed to be project helpers defined elsewhere.
# Import paths and constructor arguments vary between Pipecat releases; check them against the pinned version.

async def run_agent(websocket, tenant_id: str):
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketTransport.InputParams(
            audio_sample_rate=8000,  # Twilio mu-law
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            serializer="twilio",
        ),
    )

    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_KEY"),
        language="es",
        model="nova-3",
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_KEY"),
        voice_id="mx-female-warm-v1",
        language="es",
        speed=1.0,
    )

    llm = OpenAILLMService(
        api_key=os.getenv("LITELLM_MASTER_KEY"),
        base_url="https://api.numoru.com/v1",
        model="claude-sonnet",
    )

    tools = load_clinic_tools(tenant_id)
    context = OpenAILLMContext(
        messages=[{"role": "system", "content": system_prompt(tenant_id)}],
        tools=tools,
    )
    # The aggregator pair appends user transcripts and assistant replies to the shared context.
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),  # after output, so only spoken text is recorded
    ])

    runner = PipelineRunner()
    task = PipelineTask(pipeline)
    await runner.run(task)

System prompt

Rule: short, specific, with examples. The model doesn't need 2k tokens of personality.

You are Rocío, receptionist of Numoru Dental Clinic in Querétaro.
You speak warm, professional Mexican Spanish. You reply briefly (max 2
sentences per turn unless the patient asks for detail).

Your single goal: help book, reschedule or cancel appointments, and
answer general clinic questions.

Hard rules:
- NEVER diagnose or recommend treatment.
- NEVER promise prices or durations not present in the RAG.
- If you don't know, offer to hand off to a human receptionist.
- When booking, confirm verbally and send a WhatsApp with detail.

Hours: Mon-Sat 9-19h. After-hours emergencies:
hand off to the on-call line (tool: transfer_to_emergency).

Tools exposed to the LLM

TOOLS = [
    {
        "name": "search_clinic_info",
        "description": "Search services, prices and FAQs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "find_available_slot",
        "description": "Find an available slot for a service and preference.",
        "parameters": {
            "type": "object",
            "properties": {
                "service_id": {"type": "string"},
                "from_date": {"type": "string", "format": "date"},
                "preferred_time": {"type": "string", "enum": ["morning", "afternoon", "any"]},
            },
            "required": ["service_id"],
        },
    },
    {
        "name": "book_appointment",
        "description": "Book a confirmed appointment.",
        "parameters": {
            "type": "object",
            "properties": {
                "patient_phone": {"type": "string"},
                "patient_name": {"type": "string"},
                "slot_id": {"type": "string"},
                "notes": {"type": "string"},
            },
            "required": ["patient_phone", "patient_name", "slot_id"],
        },
    },
    {
        "name": "transfer_to_human",
        "description": "Transfer the call to a human receptionist.",
        "parameters": {"type": "object", "properties": {"reason": {"type": "string"}}},
    },
]

search_clinic_info queries Qdrant. find_available_slot queries Google Calendar via our MCP. book_appointment writes to both and fires a WhatsApp template.

Business RAG with Qdrant

Before launch we load:

  • Service catalog (100-300 items): name, approximate price, duration, description.
  • FAQs (30-80 items): "do you accept X insurance?", "is there parking?".
  • Policies (10-20 items): cancellation, late arrival, deposit.

Chunking with Chonkie to respect sentence boundaries; embedding with text-embedding-3-small via LiteLLM, shortened to 768 dims (the model defaults to 1536, so the dimensions parameter must be set); one Qdrant collection with a payload filter on tenant_id.
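
To make the sentence-respecting idea concrete, here is a plain-Python stand-in. It is not Chonkie's API, only a sketch of the behavior; the `max_chars` budget and the `type` and `text` payload fields are assumptions for illustration.

```python
import re

def chunk_by_sentence(text: str, max_chars: int = 400) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars (a long sentence stays whole)."""
    # Split after ., ! or ? followed by whitespace, so sentences are never cut mid-way.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def to_points(tenant_id: str, doc_type: str, text: str) -> list[dict]:
    """Attach the tenant_id payload that the retriever later filters on."""
    return [
        {"payload": {"tenant_id": tenant_id, "type": doc_type, "text": chunk}}
        for chunk in chunk_by_sentence(text)
    ]
```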

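from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

# QDRANT_URL, QDRANT_API_KEY and embed() are assumed to be defined elsewhere.
def build_retriever(tenant_id: str):
    client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
    def retrieve(query: str, k: int = 4):
        hits = client.search(
            collection_name="clinic_kb",
            query_vector=embed(query),
            # Scope every query to one tenant.
            query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]),
            limit=k,
        )
        return [h.payload for h in hits]
    return retrieve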

When the model calls search_clinic_info, we retrieve top-4 and return only the useful text to context (no embeddings).
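
A sketch of that last step, turning retrieved payloads into compact text for the model. The `text` payload field and the character budget are assumptions; the actual payload schema is not shown in this article.

```python
def format_hits(payloads: list[dict], max_chars: int = 1200) -> str:
    """Keep only the text field of each payload, numbered, within a character budget."""
    lines: list[str] = []
    used = 0
    for i, payload in enumerate(payloads, start=1):
        text = (payload.get("text") or "").strip()
        if not text:
            continue  # skip payloads with no usable text
        line = f"[{i}] {text}"
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line)
    return "\n".join(lines) if lines else "No matching clinic information found."
```

An explicit "nothing found" string gives the model something to act on, which fits the prompt rule of offering a human hand-off when it does not know.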

Latency: where the milliseconds go

Across 50 test calls:

Component                          p50 (ms)   p95 (ms)
STT (Deepgram Nova-3)                   180        320
LLM first token (Claude Sonnet)         540        880
TTS first audio (Cartesia)              160        290
Time-to-first-speech                    920       1380
Full turn                              2200       3600

The pain is the LLM. Mitigations used:

  • stream: true on LiteLLM to start TTS on the first chunk.
  • <500 token prompt in most turns.
  • Redis semantic cache for frequent FAQs (40% hit rate in production).
  • Claude Haiku for initial classification, Sonnet only when intent needs reasoning.

Twilio: concrete integration

Twilio Console setup:

  • SIP Trunk or Voice number with TwiML:
<Response>
  <Connect>
    <Stream url="wss://voz.numoru.com/ws/clinica-dental-123" />
  </Connect>
</Response>

The websocket on our server receives mu-law 8kHz packets and hands them to Pipecat. Outbound audio takes the same path in reverse.

Post-call WhatsApp integration

When a booking succeeds, the agent fires (as a tool side-effect) a WhatsApp Business Cloud template via our mcp-whatsapp:

Hi {{name}}, we've confirmed your appointment:
🦷 Service: {{service}}
📅 {{date}} at {{time}}
📍 {{address}}

Reply CHANGE to reschedule.

This reduces no-shows by 35-50% across our client base.
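
The request body for such a template message might be built as below. This is a sketch under assumptions: the template name is hypothetical, the template is assumed to be approved with named parameters, and the internals of mcp-whatsapp are not shown in this article.

```python
def build_confirmation_payload(to_phone: str, name: str, service: str,
                               date: str, time: str, address: str,
                               template_name: str = "appointment_confirmation",
                               language_code: str = "es_MX") -> dict:
    """Build the JSON body for a WhatsApp Cloud API template message."""
    values = {"name": name, "service": service, "date": date,
              "time": time, "address": address}
    # One text parameter per {{placeholder}} in the template body.
    parameters = [
        {"type": "text", "parameter_name": key, "text": value}
        for key, value in values.items()
    ]
    return {
        "messaging_product": "whatsapp",
        "to": to_phone,
        "type": "template",
        "template": {
            "name": template_name,  # hypothetical template name
            "language": {"code": language_code},
            "components": [{"type": "body", "parameters": parameters}],
        },
    }
```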

Guardrails

Two layers:

1. LLM guardrails (NeMo)

A declarative policy blocks responses where the agent offers diagnoses or unverified prices.

define user ask diagnosis
  "do I have a cavity?"
  "what's wrong with me?"

define bot refuse diagnosis
  "I can't diagnose that. Would you like me to book you with the doctor?"

define flow
  user ask diagnosis
  bot refuse diagnosis

2. Deterministic guardrails

Python code validates that book_appointment never books:

  • Outside clinic hours.
  • In an already-taken slot.
  • Without a valid phone number.

If validation fails, the tool returns a structured error and the LLM retries or transfers to a human.
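
A minimal sketch of those three checks, assuming the Mon-Sat 9-19h schedule from the system prompt. The function name, the error codes and the phone-format rule are hypothetical, and a real check for taken slots would query the calendar rather than a set.

```python
import re
from datetime import datetime

OPEN_HOUR, CLOSE_HOUR = 9, 19             # Mon-Sat 9-19h, from the system prompt
PHONE_RE = re.compile(r"^\+?\d{10,15}$")  # hypothetical format check

def validate_booking(start: datetime, patient_phone: str,
                     taken_starts: set[datetime]) -> dict:
    """Return {'ok': True} or a structured error the LLM can act on."""
    # weekday() == 6 is Sunday, the only closed day.
    if start.weekday() == 6 or not (OPEN_HOUR <= start.hour < CLOSE_HOUR):
        return {"ok": False, "code": "outside_hours",
                "message": "Slot is outside clinic hours (Mon-Sat 9-19h)."}
    if start in taken_starts:
        return {"ok": False, "code": "slot_taken",
                "message": "Slot is already booked."}
    digits = re.sub(r"[\s\-()]", "", patient_phone)
    if not PHONE_RE.match(digits):
        return {"ok": False, "code": "invalid_phone",
                "message": "Phone number is not valid."}
    return {"ok": True}
```

Because these checks run in plain code after the model has chosen its arguments, a booking outside the rules cannot go through even if the prompt or the NeMo layer is bypassed.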

Langfuse traces

Each call generates a session with spans for STT, LLM and TTS. We enrich with metadata:

  • tenant_id, phone_number_hashed (no plaintext PII).
  • booked: bool, transferred: bool, duration_s.
  • Hourly automated evaluations: "did the agent follow the script?", "did it contradict RAG?".
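
One way to produce the hashed phone field is a keyed hash, so the same caller maps to the same identifier without the number ever reaching the traces. This is a sketch using only the standard library; the helper names and the choice of HMAC-SHA256 are assumptions, and attaching the dict to a Langfuse trace is left out.

```python
import hashlib
import hmac

def hash_phone(phone: str, secret: bytes) -> str:
    """Keyed SHA-256 of the digits only, so traces never carry the plaintext number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return hmac.new(secret, digits.encode("utf-8"), hashlib.sha256).hexdigest()

def call_metadata(tenant_id: str, phone: str, secret: bytes, *,
                  booked: bool, transferred: bool, duration_s: float) -> dict:
    """Metadata dict to attach to the call's trace."""
    return {
        "tenant_id": tenant_id,
        "phone_number_hashed": hash_phone(phone, secret),
        "booked": booked,
        "transferred": transferred,
        "duration_s": round(duration_s, 1),
    }
```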

The Langfuse dashboard lets us detect:

  • Spikes in transferred calls → prompt or RAG failure.
  • p95 latency > 2s → Cartesia or Deepgram trouble.
  • Hallucinations per day → alert if >3 detected.

Deploying on Digital Ocean

On top of the OSS self-hosted stack:

services:
  voice-agent:
    image: numoru/voice-agent:latest
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      LITELLM_BASE_URL: http://litellm:4000
      DEEPGRAM_KEY: ${DEEPGRAM_KEY}
      CARTESIA_KEY: ${CARTESIA_KEY}
      LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379/2
    ports: ["8765:8765"]
    networks: [core]

Nginx routes wss://voz.numoru.com/ to voice-agent:8765. A single container handles ~40 concurrent calls on the 8 GB droplet.

Real per-call cost

Average duration: 2.5 min.

Item                                                 USD
Deepgram STT ($0.0043/min)                           0.011
Cartesia TTS ($0.065/1k chars, ~800 chars output)    0.052
Claude Sonnet (4 turns × 2k tokens avg)              0.012
Twilio PSTN ($0.013/min)                             0.033
Total per call                                      ~0.11

Volume for a typical 20-chair clinic: 2000 calls/month → $220/month variable + $46 base stack = ~$270/month. A part-time human receptionist costs 8-12× that.
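
The arithmetic behind these figures can be reproduced in a few lines. The unit prices are the ones quoted above; the Claude line is taken as given, since its token pricing is not broken down here.

```python
CALL_MINUTES = 2.5
TTS_CHARS = 800

per_call = {
    "deepgram_stt": 0.0043 * CALL_MINUTES,     # $/min
    "cartesia_tts": 0.065 * TTS_CHARS / 1000,  # $/1k chars
    "claude_sonnet": 0.012,                    # taken as given
    "twilio_pstn": 0.013 * CALL_MINUTES,       # $/min
}
cost_per_call = sum(per_call.values())

def monthly_cost(calls: int, base_stack: float = 46.0,
                 per_call_cost: float = 0.11) -> float:
    """Variable cost at the rounded per-call figure plus the fixed stack cost."""
    return calls * per_call_cost + base_stack

print(f"per call: ${cost_per_call:.3f}")
print(f"monthly:  ${monthly_cost(2000):.0f}")
```

The unrounded sum is about $0.107 per call, which rounds to the $0.11 used for the monthly estimate; 2000 calls then come to $266, quoted above as ~$270.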

Per-call cost breakdown (2.5-min average)

Dollar split across the managed vendors. Cartesia TTS is the biggest line at roughly half the total, followed by Twilio; swap Twilio for a SIP provider like Telnyx to shave 20-30%.

Numoru production data, February 2026 client average.

Business & commercial impact

The two leaks the product closes

Dental clinics, legal practices and elective-care providers leak money in two ways that an AI receptionist fixes immediately. The first is missed calls after hours and at lunch: in dental, ~67% of patient calls happen outside 9am-6pm business hours, per ADA practice-management data. The second is same-day no-shows, which average 15-25% of bookings across the industry. Every missed call is a $150-450 lost appointment; every no-show is a $120 gap in the schedule the clinic can't easily backfill.

Missed-call distribution in a typical 20-chair dental practice

Fraction of inbound calls by time window across 6 clients running the Numoru voice stack. Human-only clinics lose the grey bars.

Time window              Share of inbound calls
9am-12pm (staffed)       24%
12-2pm (lunch)           18%
2-6pm (staffed)          21%
6-10pm (after-hours)     23%
10pm-9am (night)          8%
Weekend                   6%

Numoru client telemetry, Aug 2025 – Feb 2026. Baseline validated against ADA Health Policy Institute 2024 survey on patient contact patterns.

Industries and ticket ranges

Monthly pricing by vertical (Numoru 2026, USD)

  • Dental clinics: $299 – 699 / mo (per clinic · 12 mo). Booking + reschedule + WhatsApp confirmation + insurance eligibility FAQ.
  • Law firms (family/immigration): $399 – 899 / mo (12 mo · compliance addendum). Intake, consultation scheduling, compliance-safe answers, fee quoting.
  • Aesthetic / MedSpa: $499 – 1,200 / mo (12 mo). High-ticket elective services, deposit collection, up-sell upstream.
  • Veterinary: $349 – 799 / mo (12 mo). Booking, emergency triage hand-off, prescription refills.
  • Multi-location franchises (5+): $2,400 – 5,500 / mo (master contract). Central agent routing per location + analytics dashboard.
  • Call-center replacement (insurance, utilities): $4,000 – 12,000 / mo (per-seat equivalent). Tier-1 triage, KYC light, hand-off to human on intent signals.

Public case studies

Public case study · Dental practice management · USA · 2024

Dental Intelligence: industry phone benchmark

Challenge: Quantify how many inbound calls mid-size dental practices actually miss and how that translates to schedule gaps.
Solution: Aggregate telemetry from 12,000+ North American practices running their practice-management integrations, cross-referenced with appointment logs.
Results:
  • Average missed call rate: 31% (across all surveyed practices)
  • Revenue per converted call: $292 (median appointment value)
  • Lost annual revenue: $101K (per average 8-chair practice)

Public case study · Dental + med SaaS · USA · 2023

Weave: appointment reminders impact

Challenge: Test whether automated reminders (SMS + call) move the needle on no-show rates.
Solution: Controlled rollout of 2-way SMS + voice reminders across 4,000 Weave customer practices with 12-month post measurement.
Results:
  • No-show reduction: -38% (on practices with confirmations ON)
  • Answered-call uplift: +22% (on voice-enabled cohort)
  • Patient NPS: +14 pts (net promoter score)

Public case study · Voice AI vendors · Global · 2025

Deepgram + Cartesia: Spanish voice quality benchmark

Challenge: Benchmark Spanish-language STT and TTS quality against other providers.
Solution: Deepgram published Word Error Rate data for Nova-3 on Spanish; Cartesia published MOS scores for Sonic in Mexican Spanish.
Results:
  • Deepgram Nova-3 WER (es): 7.8% (vs. 13-18% for rivals)
  • Cartesia Sonic MOS (es-MX): 4.41 / 5 (human-preferred in blind A/B)
  • Avg time-to-first-token: 90 ms (Sonic streaming TTS)

Illustrative case — mid-size dental group

Illustrative case · Healthcare / dental · 18 chairs · 22 staff · Mexico

4-location dental group in Querétaro deploying Numoru voice agent

Baseline: 2 full-time receptionists ($2,100 / mo blended), 3,200 inbound calls / mo across locations. 28% missed-call rate outside office hours. No-show rate 19%. Average appointment value $180 USD.
Intervention: Numoru voice agent deployed per location, shared Qdrant KB, integrated with Dentrix via MCP adapter. Human receptionists moved to in-person patient service. WhatsApp confirmations active from day 1.
Projected outcome (12 mo):
  • Calls answered: 72% → 98% (pickup rate)
  • After-hours bookings: +$7,400 / mo (previously $0)
  • No-show rate: 19% → 11% (~$5,100 / mo recovered)
  • Platform cost: $1,196 / mo (4 × $299 tier)
  • Net monthly contribution: +$11,304 (incremental vs. cost)
  • Payback: 3 weeks (implementation one-time $4.5k)
Uplift numbers anchored to Dental Intelligence 2024 and Weave 2023 cohort data. Cost assumptions calibrated to our own stack telemetry. Synthetic case — not a specific Numoru client.

ROI calculator — replacing part-time phone staff

Single-location clinic: human receptionist vs Numoru voice agent (12 months)

Payback: 2 months
Assumptions
Monthly inbound call volume                      2,000
Average call duration                            2.5 min
Average appointment value                        $180
Booking conversion, answered                     35%
Booking conversion, missed                       7%
Missed-call rate (human-only)                    28%
Missed-call rate (agent)                         2%
No-show rate (human-only)                        19%
No-show rate (agent + WhatsApp)                  11%

Numoru retainer (12 mo × $399)                   −$4,788
Per-call usage (24k × $0.11)                     −$2,640
Implementation (one-time)                        −$3,500
Recovered from missed calls (added bookings)     +$29,480
Recovered from lower no-shows                    +$28,800
Human receptionist retained (not replaced)       $0
Net year-1 gross contribution                    +$47,352
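
As a quick check, the net figure is the plain sum of the line items. In this sketch the two recovery amounts are taken as stated rather than re-derived from the assumptions.

```python
line_items = {
    "retainer": -12 * 399,
    "per_call_usage": -round(24_000 * 0.11),
    "implementation": -3_500,
    "recovered_missed_calls": 29_480,  # as stated in the table
    "recovered_no_shows": 28_800,      # as stated in the table
    "receptionist_savings": 0,         # the receptionist is retained
}
net_year_one = sum(line_items.values())
print(f"net year-1 contribution: ${net_year_one:,}")
```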

The agent rarely replaces humans — it captures the volume humans can't. The highest-ROI configuration keeps the human receptionist and moves them to patient-facing work, while the agent handles phones.

Pricing tiers Numoru sells

Pilot
$299 / month
Single location, 1,500 calls included.
  • 1 location / 1 phone number
  • Dental or legal vertical template
  • Spanish (MX / CO / AR) voice
  • WhatsApp confirmation
  • Google Calendar integration
  • Basic Langfuse dashboard
  • 60-day launch SLA
Practice
$699 / month
1-3 locations, 5,000 calls included.
  • Up to 3 locations
  • Dentrix / OpenDental / Abrera integration
  • Vertical-tuned voice
  • Custom system prompt
  • Advanced guardrails (NeMo)
  • Monthly evals via Promptfoo
  • 24 / 5 human fallback SLA
Enterprise
$2,400+ / month
5+ locations or franchise.
  • Unlimited locations
  • Self-hosted option (on-prem)
  • SAML / SSO, audit log
  • Custom voice (Cartesia clone)
  • Compliance addendum (HIPAA-equivalent)
  • Dedicated CSM + eng PoC
  • Migration from Vapi / Retell

Usage over the included bucket bills at $0.14 / call. Self-hosted option reduces per-call cost to ~$0.05 (electricity + amortization) at >15k calls / mo.

Fully-local options (no external APIs)

If the client requires zero external calls:

  • STT: Whisper.cpp (medium-es) on a small GPU or quantized CPU.
  • TTS: Coqui TTS with fine-tuned XTTS-v2.
  • LLM: Llama 3.3 70B or Qwen 2.5 32B, served via vLLM.

Requires GPU (at least A10 or RTX 4090). Droplet cost rises to $500-800/month, but per-call drops to electricity + amortization. Break-even: ~15,000 calls/month.

Testing: evals with Promptfoo

50 synthetic scripts ("I want a Tuesday afternoon slot", "how much is whitening?") with expected answers. Suite runs in CI with Promptfoo + custom asserts. See Agent evals in CI/CD.
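
A custom assert for such a suite could look like the sketch below, written as a file-based Python assertion that Promptfoo calls through `get_assert`. The forbidden phrases and the two-sentence limit are illustrative assumptions drawn from the system prompt, not the repo's actual checks.

```python
import re

# Hypothetical phrases that would signal a diagnosis or treatment recommendation.
FORBIDDEN = ("tienes caries", "tienes una infección", "te recomiendo tomar")

def get_assert(output: str, context) -> dict:
    """Custom assertion: at most 2 sentences and no diagnosis-like phrases."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    lowered = output.lower()
    hits = [p for p in FORBIDDEN if p in lowered]
    passed = len(sentences) <= 2 and not hits
    reason = "ok" if passed else f"sentences={len(sentences)}, forbidden={hits}"
    return {"pass": passed, "score": 1.0 if passed else 0.0, "reason": reason}
```

Phrase matching is a blunt check; it catches regressions cheaply in CI, while the hourly Langfuse evaluations cover subtler failures.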

FAQ

Why not use the OpenAI Realtime default voice?
For Spanish it uses a single model with no fine style control for Mexican Spanish, plus vendor lock-in.

Does it handle dialects outside Mexico?
Yes, by configuring Deepgram with language="es" and Cartesia with the appropriate voice. Tested in Colombia, Argentina and Chile with minor tuning.

What if the patient talks very fast or over the agent?
Silero VAD detects the interruption; the pipeline cancels the in-progress TTS response and listens. Without this, it sounds like an interrupting robot.

What's the right first client?
A dental clinic or mid-size law firm with 30-100 calls/day. Below that, ROI is slow; above, you need to scale.

Integration with calendars beyond Google?
Microsoft 365 via mcp-calendar. Proprietary software (Dentrix, OpenDental) via a custom adapter, which takes 2-5 days of work.

Next steps

Repo at github.com/numoru-ia/voice-agent-es. The next piece covers how to orchestrate three agents (bookings, reminders, reviews) with LangGraph for a complete dental clinic, using this voice agent as the entry point.
