Best Way to Build a Voice AI System
The best way to build a voice AI system is one integrated real-time stack. See the architecture, tradeoffs, and how to launch a CallingBox agent in minutes.

Jonathan Chavez
Co-Founder, CallingBox (YC S25)
The best way to build a voice AI system is to treat it as one real-time product, not six APIs wired together with hope. The winning architecture owns telephony, endpointing, ASR, the LLM, TTS, orchestration, observability, and structured outputs under one latency budget.
That is the bet behind CallingBox, the API for AI phone calls. You define an agent, attach a phone number, give it tools and a webhook, and CallingBox handles the voice system underneath: carrier routing, STIR/SHAKEN attestation, streaming speech, turn-taking, barge-in, voicemail detection, call records, transcripts, recordings, and structured returns. You can build your first agent in minutes, then scale it without becoming a telephony company.
What is the best way to build a voice AI system?
The best way to build a production voice AI system is to start from the call outcome, then choose a platform that controls the full real-time loop. For most teams, that means using CallingBox instead of stitching together a SIP trunk, ASR provider, LLM, TTS provider, and homegrown orchestrator.
A good voice AI system is judged by the caller, not the diagram. It has to pick up the phone, understand speech while audio is still streaming, decide when the caller is done, respond before the pause feels awkward, yield when interrupted, use tools without going silent, and return data your product can trust. The model is only one layer. The hard product is the timing.
Voice AI is not chat with a microphone. It is a distributed real-time system with a human waiting on the other end.
What a production voice AI system needs
A production voice AI system needs eight capabilities working together on every call: telephony, media streaming, endpointing, ASR, reasoning, TTS, orchestration, and post-call data delivery. If any one of them is weak, the caller feels it immediately.
- Telephony answers or places real phone calls through SIP and the public switched telephone network. SIP setup is specified in RFC 3261, and media usually rides over RTP as specified in RFC 3550.
- Media streaming moves audio frames fast enough that the agent can listen while the caller is still speaking. The practical unit is often a 20 ms audio frame.
- Endpointing decides when the caller has finished a turn. This is where many demos die, because waiting for a long silence window before responding adds hundreds of milliseconds to every turn.
- ASR turns audio into partial transcripts before the final transcript is available.
- The LLM decides what to say and which tool to use, but it must stream tokens instead of waiting for a complete answer.
- TTS turns the response into speech sentence-by-sentence, so the first phoneme arrives before the model finishes the full answer.
- Orchestration keeps all of this synchronized: interruption handling, filler speech, timeouts, retries, tool calls, recordings, transcripts, and call state.
- Structured outputs turn the call into JSON your product can use: booked time, lead score, caller intent, account status, escalation reason, or any schema you define.
CallingBox packages those layers as one product. The developer surface stays small because the hard timing decisions live inside the platform, where they can be optimized against real calls.
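To make the endpointing tradeoff concrete, here is a minimal sketch of a silence-based endpointer running over 20 ms frames. It is not CallingBox's endpointer; the class name, the RMS energy threshold, and the hangover window are assumptions chosen for illustration. It shows why a naive endpointer adds its whole silence window to every response.

```python
from dataclasses import dataclass

FRAME_MS = 20  # frame duration, taken from the 20 ms figure in the text


@dataclass
class SilenceEndpointer:
    """Declares end of turn after `hangover_ms` of consecutive silence."""

    energy_threshold: float = 500.0  # hypothetical RMS threshold for "speech"
    hangover_ms: int = 300           # hypothetical silence window
    _silent_ms: int = 0
    _heard_speech: bool = False

    def push(self, frame_rms: float) -> bool:
        """Feed one frame's RMS energy; return True once the turn has ended."""
        if frame_rms >= self.energy_threshold:
            self._heard_speech = True
            self._silent_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence is not a turn
        self._silent_ms += FRAME_MS
        return self._silent_ms >= self.hangover_ms


def frames_until_endpoint(energies, endpointer):
    """Return the index of the frame where the turn is declared over, or None."""
    for index, rms in enumerate(energies):
        if endpointer.push(rms):
            return index
    return None
```

With a 300 ms hangover, the turn is declared over 15 frames after the caller stops speaking; with 100 ms it takes 5 frames. That gap is pure added latency, which is why a production endpointer usually needs more signal than silence alone.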
Why stitching voice APIs together fails
Stitching separate vendors can produce a good demo, but it usually fails under production load because every boundary adds latency, billing complexity, and debugging surface. The first call works. The thousandth concurrent call tells the truth.
- Latency compounds. A SIP hop, an ASR stream, an LLM request, a TTS stream, and a media injection each look small alone. Add up the P95 latency of every hop and the agent starts talking over people or pausing long enough to feel broken.
- Turn-taking is not a provider feature. Crisp interruption handling requires the media server, endpointer, transcript stream, TTS buffer, and call state to agree within one or two audio frames.
- Compliance sits in the call path. STIR/SHAKEN attestation, TCPA-aware outbound pacing, DNC checks, recording consent, and webhook auditability are runtime behavior, not a paragraph in a launch checklist. The FCC has made STIR/SHAKEN implementation a core part of US call authentication.
- Debugging becomes log archaeology. When a call fails, you need aligned audio, transcripts, model events, tool calls, carrier events, and billing state on the same timeline. Separate vendors rarely give you that view.
This is why we built CallingBox as an integrated voice system instead of a thin router around provider keys. The API is simple because the platform owns the parts that need to move together.
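The "latency compounds" point can be sketched as simple arithmetic. The per-hop numbers below are hypothetical placeholders, not measurements from CallingBox or any vendor, and summing per-hop P95s is a pessimistic estimate for a serial pipeline rather than the true P95 of the end-to-end path.

```python
# Hypothetical P95 latency per hop, in milliseconds (illustrative only).
HOP_P95_MS = {
    "sip_media_ingress": 40,
    "asr_partial": 150,
    "endpointing": 200,
    "llm_first_token": 350,
    "tts_first_audio": 120,
    "media_injection": 40,
}


def worst_case_budget_ms(hops):
    """Sum the per-hop P95s: a pessimistic estimate for hops that run in series."""
    return sum(hops.values())


def over_budget(hops, target_ms):
    """Return (is_over, total_ms) for the given response-time target."""
    total = worst_case_budget_ms(hops)
    return total > target_ms, total


if __name__ == "__main__":
    is_over, total = over_budget(HOP_P95_MS, target_ms=500)
    print(f"estimated worst case: {total} ms, over budget: {is_over}")
```

No single hop in this example looks alarming, yet the serial total lands well past a 500 ms target. Replace the placeholders with your own measured numbers before drawing conclusions.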
How CallingBox builds the system for you
CallingBox gives developers the architecture they would build after months of production pain: one agent object, one number or dial endpoint, one webhook, and one all-in per-minute price. Under that API is the full real-time stack.
- Sub-500 ms median responses on our internal production benchmark, measured end-to-end from caller turn completion to first agent audio.
- ~20 ms barge-in, so the agent yields within one RTP frame when the caller interrupts.
- Mean opinion score (MOS) of 4.31 on our internal voice-quality benchmark, with audio quality measured on real call paths.
- $0.05 per connected minute, all-in for telephony, ASR, LLM, TTS, orchestration, answering machine detection (AMD), and attestation.
- $5 in free credits, so you can call your own agent before you trust us with customers.
The important point is not that CallingBox has every checkbox. It is that the checkboxes live in one timing model. Endpointing knows about the transcript. Barge-in controls the TTS buffer. Tool calls are designed not to block speech. Structured returns happen after the call, so JSON generation does not slow the live conversation.
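As a rough illustration of "barge-in controls the TTS buffer", the sketch below plays queued audio frames one tick at a time and flushes everything still queued the moment caller speech is detected. The class and function names are hypothetical, and the boolean `caller_speaking` stream stands in for whatever voice-activity signal a real media server would provide.

```python
from collections import deque


class PlaybackBuffer:
    """Queue of synthesized audio frames waiting to be sent to the caller."""

    def __init__(self):
        self._frames = deque()

    def enqueue(self, frames):
        self._frames.extend(frames)

    def next_frame(self):
        """Pop the next frame, or return None when the buffer is empty."""
        return self._frames.popleft() if self._frames else None

    def flush(self):
        """Drop all queued frames and return how many were discarded."""
        dropped = len(self._frames)
        self._frames.clear()
        return dropped

    def __len__(self):
        return len(self._frames)


def play_with_barge_in(buffer, caller_speaking):
    """Send one frame per tick; stop and flush when the caller starts talking.

    `caller_speaking` is an iterable of booleans, one per frame-sized tick.
    Returns (frames_played, frames_dropped).
    """
    played = 0
    for speaking in caller_speaking:
        if speaking:
            return played, buffer.flush()
        if buffer.next_frame() is None:
            break  # nothing left to say
        played += 1
    return played, 0
```

Because the interruption check runs before every frame, the agent in this sketch stops within one tick of detected speech. Anything buffered further downstream, outside this queue, would add to that delay.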
DIY voice stack vs CallingBox
The clean comparison is not a feature list. It is time to a reliable call, the number of systems you own, and whether latency gets better or worse as traffic grows.
| Criterion | DIY stitched stack | CallingBox |
|---|---|---|
| Time to first real call | 4 to 8 weeks for an experienced team | Minutes |
| Vendors | SIP, ASR, LLM, TTS, storage, observability | One API |
| Pricing | Multiple invoices plus engineering time | $0.05/min, all-in |
| Latency ownership | Spread across every vendor boundary | One integrated budget |
| Barge-in | Usually custom media-buffer work | ~20 ms |
| Telephony | Carrier setup, numbers, routing, attestation | Included |
| Structured returns | Build extraction and webhook retries | Built in |
| Best for | Voice infrastructure companies | Teams shipping phone agents |
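The pricing row reduces to simple arithmetic. The sketch below uses the $0.05 per connected minute rate and the $5 credit from this article; how partial minutes are rounded and exactly how credits are applied are not stated here, so whole-minute billing with credits deducted once from usage is an assumption.

```python
from decimal import Decimal

RATE_PER_MIN = Decimal("0.05")  # per connected minute, from the article
FREE_CREDITS = Decimal("5.00")  # free credits, from the article


def estimated_bill(connected_minutes, credits=FREE_CREDITS):
    """Estimate the charge for whole connected minutes after credits.

    Assumes credits are subtracted once from usage and never go negative.
    """
    usage = RATE_PER_MIN * connected_minutes
    return max(usage - credits, Decimal("0"))


def free_minutes(credits=FREE_CREDITS):
    """Number of whole connected minutes the credits cover."""
    return int(credits / RATE_PER_MIN)
```

Under those assumptions the free credits cover about 100 connected minutes, and 10,000 connected minutes comes to $495 after credits.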
When should you build it yourself?
You should build the whole voice stack yourself only when voice infrastructure is your core product, you have real-time audio and telephony expertise on staff, and your volume or compliance constraints justify owning carrier relationships directly.
That is a narrow group. Most teams are building an outcome: answered support calls, booked appointments, qualified leads, payment reminders, intake interviews, collections workflows, or dispatch coordination. Those teams win by launching the agent fast, measuring call outcomes, and improving the workflow, not by spending their first quarter on jitter buffers and SIP headers.
- Build it yourself if you are a carrier, contact center platform, regulated enterprise with hard data-residency requirements, or a voice infrastructure company.
- Build on CallingBox if you are a SaaS team, AI automation agency, founder, sales org, healthcare operator, home-services company, or support team that needs production phone agents now.
Build a CallingBox agent in minutes
A CallingBox agent is the smallest useful abstraction for a voice AI system: instructions, voice, tools, phone behavior, and the structured data you want back. You can create one from the API, attach a number, and receive call results on a webhook.
Step 1. Create the agent
Start with the outcome you want. This example creates a front-desk agent that answers questions, books appointments through a tool, and returns structured fields when the call ends.
```python
from callingbox import Callingbox

client = Callingbox()

agent = client.agents.create(
    name="front-desk",
    voice="sonic-en-us-warm",
    instructions="Answer calls, qualify the request, and book appointments.",
    # Tool the agent can call during the conversation.
    tools=[{
        "name": "book_appointment",
        "parameters": {"slot": {"type": "string", "format": "date-time"}},
    }],
    # Structured fields to return when the call ends.
    returns={
        "intent": {"type": "string"},
        "appointment_at": {"type": "string", "format": "date-time"},
        "caller_name": {"type": "string"},
    },
)
```
Step 2. Put the agent on a call
For inbound, attach the agent to a CallingBox number. For outbound, create a call with the agent ID and the destination number. CallingBox handles the media loop, voicemail detection, attestation, and webhook delivery.
```python
call = client.calls.create(
    agent_id=agent.id,
    to="+14155550199",
    webhook_url="https://acme.dev/calls",
    context={"customer": {"first_name": "Maria"}},
)
```
Step 3. Receive the result
When the call ends, your webhook receives the transcript, recording, tool calls, and the structured fields you requested. That payload is the product integration point.
```json
{
  "call_id": "call_01HX...",
  "duration_sec": 86,
  "status": "completed",
  "returns": {
    "intent": "booking",
    "appointment_at": "2026-04-29T15:30:00-07:00",
    "caller_name": "Maria Gomez"
  },
  "tool_calls": [{ "name": "book_appointment", "ok": true }],
  "recording_url": "https://recordings.callingbox.io/call_01HX.mp3",
  "transcript_url": "https://recordings.callingbox.io/call_01HX.json"
}
```
That is the core loop: define the agent, start or receive the call, consume the result. The rest of the voice AI system stays inside CallingBox.
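On the receiving side, a webhook handler mostly needs to turn that payload into something your application can act on. The sketch below parses a body shaped like the example above using only the standard library. The `CallResult` type and `parse_call_result` function are hypothetical names, and the field handling assumes the payload matches the example; check the CallingBox docs for the authoritative schema, delivery headers, and any signature verification.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallResult:
    call_id: str
    status: str
    intent: Optional[str]
    appointment_at: Optional[str]
    caller_name: Optional[str]
    booked: bool


def parse_call_result(body: bytes) -> CallResult:
    """Parse a webhook body shaped like the example payload in this article."""
    payload = json.loads(body)
    returns = payload.get("returns", {})
    tool_calls = payload.get("tool_calls", [])
    # Treat the call as booked only if the booking tool ran and reported success.
    booked = any(
        call.get("name") == "book_appointment" and call.get("ok") is True
        for call in tool_calls
    )
    return CallResult(
        call_id=payload["call_id"],
        status=payload["status"],
        intent=returns.get("intent"),
        appointment_at=returns.get("appointment_at"),
        caller_name=returns.get("caller_name"),
        booked=booked,
    )
```

Keeping the parsing in a pure function makes it easy to unit test against saved payloads before wiring it into whatever web framework serves your webhook URL.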
The voice AI checklist that matters
The best voice AI systems are built around measurable call behavior, not prompt length. Before you ship, check the parts callers and operators actually notice.
- Latency: median response under 500 ms and P95 low enough that the agent does not feel hesitant.
- Barge-in: the agent stops speaking when the caller interrupts, ideally within one RTP frame.
- Tool use: data lookups have timeouts, filler speech, and fallbacks so the line never goes dead.
- Structured outputs: the call returns clean JSON after completion instead of forcing the live conversation into JSON mode.
- Compliance: attestation, consent, DNC behavior, call recording, audit logs, and webhook retries are designed in from the start.
- Observability: every call has aligned audio, transcript, model events, tool calls, and outcomes.
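The latency item on this checklist is easy to verify from your own call logs. The sketch below computes the median and P95 of response times with Python's `statistics` module. The function name and the idea of feeding it one number per agent turn are assumptions, and the 500 ms median target comes from the checklist; no P95 target is stated here, so the P95 is only reported.

```python
import statistics


def latency_report(samples_ms, median_target_ms=500):
    """Summarize response latencies (ms) measured from end of caller turn
    to first agent audio. Needs at least two samples."""
    median = statistics.median(samples_ms)
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=20, method="inclusive")[18]
    return {
        "median_ms": median,
        "p95_ms": p95,
        "median_ok": median < median_target_ms,
    }


if __name__ == "__main__":
    # Illustrative data: most turns are fast, one in ten is slow.
    print(latency_report([300] * 90 + [900] * 10))
```

A healthy median can hide a slow tail, which is why the checklist asks for both numbers: in the illustrative data the median passes while one turn in ten takes three times as long.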
CallingBox gives you those defaults on day one. That is why the best way to build a voice AI system is to start with CallingBox, launch a real agent in minutes, then spend your engineering time on the workflow your customers actually pay for.
Where to go from here
- Read the docs and make your first CallingBox agent.
- Check pricing for the all-in per-minute math, including free credits.
- Build an AI phone answering service if your first use case is inbound.
- Build an AI cold caller if your first use case is outbound.
- Compare CallingBox to Vapi and Retell if you are choosing a platform.
