Voice agents on real phone lines: what breaks first
Twilio, speech-to-text, LLMs, and streaming text-to-speech sound magical in a slide deck. In production the pain is barge-in, latency, and what the agent says when something fails. Notes from building voice automation.
Voice is unforgiving because silence is data. Callers notice half a second of dead air in a way they never notice a slow REST call behind a spinner.
When wiring telephony → STT → model → TTS, the failure modes are rarely “the model was dumb.” They are turn-taking, echo, codec surprises, and over-chatty prompts that blow your latency budget before the first token even matters.
Latency is the product
If your pipeline cannot stream partial understanding—if you wait for complete sentences where users interrupt themselves—you lose the conversational illusion. I start from a budget in milliseconds and work backward: where can we overlap work, where must we serialize, where do we need a fast “filler phrase” while the heavy call catches up?
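A rough way to put numbers on that budget, as a sketch rather than a measurement tool. The stage names and millisecond figures below are invented for illustration, and the model is deliberately crude: a streaming pipeline pays each stage's time to first partial output, while a blocking one waits for every upstream stage to finish before the next one starts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    first_ms: int  # time until the stage emits its first usable partial output
    full_ms: int   # time until the stage has finished completely

# Illustrative numbers only, not measurements from any provider.
PIPELINE = [
    Stage("stt", first_ms=150, full_ms=600),
    Stage("llm", first_ms=300, full_ms=1500),
    Stage("tts", first_ms=120, full_ms=900),
]

def time_to_first_audio(stages, streaming):
    """Estimate how long the caller waits before hearing anything."""
    if streaming:
        # Each stage starts on the previous stage's first partial output.
        return sum(s.first_ms for s in stages)
    # Each stage waits for the previous one to finish; the caller still
    # hears audio as soon as the last stage emits its first chunk.
    *upstream, last = stages
    return sum(s.full_ms for s in upstream) + last.first_ms

def needs_filler(stages, budget_ms, streaming=True):
    """True when the estimate exceeds the budget and a filler phrase is worth it."""
    return time_to_first_audio(stages, streaming) > budget_ms

if __name__ == "__main__":
    print(time_to_first_audio(PIPELINE, streaming=True))
    print(time_to_first_audio(PIPELINE, streaming=False))
    print(needs_filler(PIPELINE, budget_ms=500))
```

Swapping in your own measured numbers per stage makes it obvious which stage to overlap first and when a filler phrase has to cover the gap.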
Barge-in is not a feature flag on day thirty
Handling “the user started talking while the bot was talking” is table stakes for natural calls. Deferring it creates demos that feel fine until the first impatient customer. I prototype interruption paths early, even ugly, because retrofitting them into a happy-path-first codebase hurts.
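An ugly early prototype of that interruption path can be as small as the sketch below. It assumes an asyncio-based service where some voice-activity detector sets an event when the caller starts talking; `speak`, `play_with_barge_in`, and the sleep standing in for audio playout are all hypothetical, not any telephony provider's API.

```python
import asyncio

CHUNK_SECONDS = 0.05  # stand-in for the playout time of one audio chunk

async def speak(chunks, played):
    """Pretend to play TTS chunks, recording each one once it has 'played'."""
    for chunk in chunks:
        await asyncio.sleep(CHUNK_SECONDS)
        played.append(chunk)

async def play_with_barge_in(chunks, caller_speaking):
    """Play chunks until finished or until caller_speaking is set.

    Returns (played_chunks, interrupted).
    """
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done
    # Cancel whichever task is still pending, then let both settle.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted

async def demo():
    caller_speaking = asyncio.Event()
    # Simulate the caller starting to talk shortly after the first chunk.
    asyncio.get_running_loop().call_later(0.06, caller_speaking.set)
    prompt = ["Your", "balance", "is", "forty", "dollars"]
    return await play_with_barge_in(prompt, caller_speaking)

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning the chunks that actually played is a design choice worth keeping even in a prototype: it records how much the caller heard before cutting in, which the next turn may need.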
Give the model an off-ramp
Not every turn needs the biggest model. Classify intent, handle the cheap cases cheaply, escalate when confidence drops. Also: write the fallback script for when the ASR returns garbage—what does the agent say without gaslighting the caller?
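One possible shape for that off-ramp, sketched with made-up thresholds, intents, and wording. The keyword `classify` function is a placeholder for whatever cheap classifier you actually use, and the two floor values are assumptions to tune against real calls.

```python
from dataclasses import dataclass

ASR_FLOOR = 0.5     # assumed: below this, treat the transcript as garbage
INTENT_FLOOR = 0.8  # assumed: below this, do not trust the cheap classifier

CANNED = {
    "hours": "We are open nine to five, Monday through Friday.",
}
FALLBACK = "Sorry, I did not catch that. Could you say it again?"

@dataclass(frozen=True)
class Transcript:
    text: str
    asr_confidence: float

def classify(text):
    """Hypothetical cheap classifier: keyword match returning (intent, confidence)."""
    words = set(text.lower().split())
    if {"hours", "open"} & words:
        return "hours", 0.9
    return "unknown", 0.2

def route(transcript):
    """Return (path, reply). A reply of None means: hand the turn to the big model."""
    if not transcript.text.strip() or transcript.asr_confidence < ASR_FLOOR:
        # Own the failure instead of guessing at what the caller meant.
        return "fallback", FALLBACK
    intent, confidence = classify(transcript.text)
    if intent in CANNED and confidence >= INTENT_FLOOR:
        return "canned", CANNED[intent]
    return "escalate", None

if __name__ == "__main__":
    print(route(Transcript("what are your hours", 0.92)))
    print(route(Transcript("uh the thing with my um", 0.30)))
    print(route(Transcript("I want to dispute a charge", 0.95)))
```

The fallback line admits the agent did not hear properly rather than pretending the caller said something else, which is the point of writing it ahead of time.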
Logging that respects humans
Record enough to debug: session IDs, timing, and provider error codes. Leave out content you would not want read aloud in a retrospective. Redaction and retention policies belong next to the feature spec, not after the incident.
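As a minimal sketch of what such a log record could look like: the field names here are hypothetical, and the only real idea is that the transcript's length gets logged while its text never does.

```python
import json

def turn_record(session_id, stage, latency_ms, provider_error=None, transcript=None):
    """Build a log record that carries timing and error codes but no caller content."""
    return {
        "session_id": session_id,
        "stage": stage,
        "latency_ms": latency_ms,
        "provider_error": provider_error,
        # Keep the shape of the content, not the content itself.
        "transcript_chars": len(transcript) if transcript is not None else None,
    }

if __name__ == "__main__":
    record = turn_record("s-1", "stt", 182, transcript="my card number is 4111")
    print(json.dumps(record))
```

A record like this can still answer "which stage was slow and which provider failed" without anyone having to decide later whether the logs are safe to share.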
Voice work rewards people who enjoy tight feedback loops and uncomfortable test calls on real handsets. If you are in that headspace and want to compare architecture notes, use the contact links on this site.