Voice AI · 9 min read

Best Way to Build a Voice AI System

The best way to build a voice AI system is one integrated real-time stack. See the architecture, tradeoffs, and how to launch a CallingBox agent in minutes.

Jonathan Chavez

Co-Founder, CallingBox (YC S25)

Apr 24, 2026

The best way to build a voice AI system is to treat it as one real-time product, not six APIs wired together with hope. The winning architecture owns telephony, endpointing, ASR, the LLM, TTS, orchestration, observability, and structured outputs under one latency budget.

That is the bet behind CallingBox, the API for AI phone calls. You define an agent, attach a phone number, give it tools and a webhook, and CallingBox handles the voice system underneath: carrier routing, STIR/SHAKEN attestation, streaming speech, turn-taking, barge-in, voicemail detection, call records, transcripts, recordings, and structured returns. You can build your first agent in minutes, then scale it without becoming a telephony company.

What is the best way to build a voice AI system?

The best way to build a production voice AI system is to start from the call outcome, then choose a platform that controls the full real-time loop. For most teams, that means using CallingBox instead of stitching together a SIP trunk, ASR provider, LLM, TTS provider, and homegrown orchestrator.

A good voice AI system is judged by the caller, not the diagram. It has to pick up the phone, understand speech while audio is still streaming, decide when the caller is done, respond before the pause feels awkward, yield when interrupted, use tools without going silent, and return data your product can trust. The model is only one layer. The hard product is the timing.

The architecture rule: voice AI is not chat with a microphone. It is a distributed real-time system with a human waiting on the other end.

What a production voice AI system needs

A production voice AI system needs eight capabilities working together on every call: telephony, media streaming, endpointing, ASR, reasoning, TTS, orchestration, and post-call data delivery. If any one of them is weak, the caller feels it immediately.

[Figure: median wall-clock time per turn, US-domestic inbound calls]

Telephony answers or places real phone calls through SIP and the public switched telephone network. SIP setup is specified in RFC 3261, and media usually rides over RTP as specified in RFC 3550.
Media streaming moves audio frames fast enough that the agent can listen while the caller is still speaking. The practical unit is often a 20 ms audio frame.
Endpointing decides when the caller has finished a turn. This is where many demos die, because waiting for long silence adds hundreds of milliseconds.
ASR turns audio into partial transcripts before the final transcript is available.
The LLM decides what to say and which tool to use, but it must stream tokens instead of waiting for a complete answer.
TTS turns the response into speech sentence-by-sentence, so the first phoneme arrives before the model finishes the full answer.
Orchestration keeps all of this synchronized: interruption handling, filler speech, timeouts, retries, tool calls, recordings, transcripts, and call state.
Structured outputs turn the call into JSON your product can use: booked time, lead score, caller intent, account status, escalation reason, or any schema you define.
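To make the endpointing point concrete, here is a minimal sketch of a silence-timeout endpointer working on 20 ms frames. It is an illustration, not how CallingBox endpoints internally: the 8 kHz sample rate, the `SilenceEndpointer` class, and the per-frame speech flags (which a voice activity detector would supply) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

FRAME_MS = 20          # frame duration mentioned in the text
SAMPLE_RATE_HZ = 8000  # assumption: narrowband telephony audio


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of audio samples carried by one frame."""
    return sample_rate_hz * frame_ms // 1000


@dataclass
class SilenceEndpointer:
    """Hypothetical endpointer: ends the turn after `hangover_ms` of silence."""
    hangover_ms: int
    frame_ms: int = FRAME_MS
    _silent_ms: int = field(default=0, init=False)
    _heard_speech: bool = field(default=False, init=False)

    def push(self, is_speech):
        """Feed one frame's speech flag; return True when the turn has ended."""
        if is_speech:
            self._heard_speech = True
            self._silent_ms = 0
            return False
        if not self._heard_speech:
            return False  # silence before any speech never ends a turn
        self._silent_ms += self.frame_ms
        return self._silent_ms >= self.hangover_ms


def frames_until_endpoint(flags, hangover_ms):
    """Return the 1-based frame index at which the turn ends, or None."""
    endpointer = SilenceEndpointer(hangover_ms=hangover_ms)
    for index, is_speech in enumerate(flags, start=1):
        if endpointer.push(is_speech):
            return index
    return None


# 100 ms of leading silence, 500 ms of speech, then the caller stops talking.
flags = [False] * 5 + [True] * 25 + [False] * 50
last_speech_frame = 30

for hangover_ms in (700, 200):
    end_frame = frames_until_endpoint(flags, hangover_ms)
    added_ms = (end_frame - last_speech_frame) * FRAME_MS
    print(f"hangover {hangover_ms} ms: turn ends {added_ms} ms after speech stops")
```

The whole hangover is dead air the caller hears before anything else in the pipeline starts. A pure silence timer trades speed against cutting people off mid-sentence, which is the tradeoff the text points at.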

CallingBox packages those layers as one product. The developer surface stays small because the hard timing decisions live inside the platform, where they can be optimized against real calls.

Why stitching voice APIs together fails

Stitching separate vendors can produce a good demo, but it usually fails under production load because every boundary adds latency, billing complexity, and debugging surface. The first call works. The thousandth concurrent call tells the truth.

Latency compounds. A SIP hop, an ASR stream, an LLM request, a TTS stream, and a media injection each look small alone. Add the P95 of every hop and the agent starts talking over people or pausing long enough to feel broken.
Turn-taking is not a provider feature. Crisp interruption handling requires the media server, endpointer, transcript stream, TTS buffer, and call state to agree within one or two audio frames.
Compliance sits in the call path. STIR/SHAKEN attestation, TCPA-aware outbound pacing, DNC checks, recording consent, and webhook auditability are runtime behavior, not a paragraph in a launch checklist. The FCC has made STIR/SHAKEN implementation a core part of US call authentication.
Debugging becomes log archaeology. When a call fails, you need aligned audio, transcripts, model events, tool calls, carrier events, and billing state on the same timeline. Separate vendors rarely give you that view.
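The compounding argument can be sketched as simple arithmetic. Every number below is hypothetical and chosen only to illustrate the shape of the problem; none of them is a measurement of any vendor. Adding per-hop percentiles is also only a rough estimate, because the true end-to-end distribution has to be measured on real calls.

```python
# Hypothetical per-hop latencies in milliseconds: (median, P95).
# Illustrative values only, not measurements of any provider.
HOPS = {
    "sip_media_in": (20, 60),
    "asr_partial": (120, 250),
    "llm_first_token": (180, 450),
    "tts_first_audio": (90, 200),
    "media_out": (20, 60),
}

BUDGET_MS = 500  # the response target used in this article


def stacked(hops, index):
    """Sum one column (0 = median, 1 = P95) across every hop."""
    return sum(values[index] for values in hops.values())


def within_budget(total_ms, budget_ms=BUDGET_MS):
    return total_ms <= budget_ms


median_total = stacked(HOPS, 0)
p95_total = stacked(HOPS, 1)

print(f"stacked medians: {median_total} ms, within budget: {within_budget(median_total)}")
print(f"stacked P95s:    {p95_total} ms, within budget: {within_budget(p95_total)}")
```

With these made-up numbers the typical turn fits the budget while the tail is roughly double it, which is the gap between a demo and the thousandth concurrent call.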

This is why we built CallingBox as an integrated voice system instead of a thin router around provider keys. The API is simple because the platform owns the parts that need to move together.

How CallingBox builds the system for you

CallingBox gives developers the architecture they would build after months of production pain: one agent object, one number or dial endpoint, one webhook, and one all-in per-minute price. Under that API is the full real-time stack.

Sub-500 ms median responses on our internal production benchmark, measured end-to-end from caller turn completion to first agent audio.
~20 ms barge-in, so the agent yields within one RTP frame when the caller interrupts.
MOS 4.31 on our internal voice-quality benchmark, with audio quality measured on real call paths.
$0.05 per connected minute, all-in for telephony, ASR, LLM, TTS, orchestration, AMD, and attestation.
$5 in free credits, so you can call your own agent before you trust us with customers.
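If you want to check a latency claim like the one above against your own calls, one way is to log two timestamps per turn, caller turn completion and first agent audio, and summarize the difference. This is a sketch with made-up sample data; the function names and the input shape are assumptions, not part of the CallingBox API.

```python
import statistics


def response_latencies_ms(turns):
    """turns: iterable of (caller_turn_end_s, first_agent_audio_s) pairs."""
    return [(first_audio - turn_end) * 1000.0 for turn_end, first_audio in turns]


def summarize(latencies_ms):
    """Return the median and the 95th percentile of the samples."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=20, method="inclusive")
    return {"median_ms": statistics.median(latencies_ms), "p95_ms": cuts[18]}


# Hypothetical samples: four quick turns and one slow outlier.
sample_latencies = [400, 420, 450, 480, 900]
summary = summarize(sample_latencies)
print(summary)
```

The median hides the outlier entirely. Tracking P95 alongside it is what shows whether some callers are still hearing long pauses.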

The important point is not that CallingBox has every checkbox. It is that the checkboxes live in one timing model. Endpointing knows about the transcript. Barge-in controls the TTS buffer. Tool calls are designed not to block speech. Structured returns happen after the call, so JSON generation does not slow the live conversation.
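"Barge-in controls the TTS buffer" can be pictured with a toy playout loop. This is a simplified sketch under assumptions: `PlayoutBuffer` and `run_playout` are hypothetical names, and the per-tick `caller_speaking` flags stand in for whatever signal a real media server uses. It is not CallingBox's implementation.

```python
from collections import deque

FRAME_MS = 20  # one frame is sent to the caller per tick


class PlayoutBuffer:
    """Hypothetical queue of synthesized frames waiting to be played."""

    def __init__(self):
        self._frames = deque()

    def enqueue(self, frames):
        self._frames.extend(frames)

    def next_frame(self):
        return self._frames.popleft() if self._frames else None

    def flush(self):
        """Drop everything still queued; return how many frames were dropped."""
        dropped = len(self._frames)
        self._frames.clear()
        return dropped


def run_playout(buffer, caller_speaking):
    """Play one frame per tick; flush as soon as the caller starts speaking.

    Returns (frames_sent, frames_dropped).
    """
    sent = 0
    for speaking in caller_speaking:
        if speaking:
            return sent, buffer.flush()
        if buffer.next_frame() is None:
            break  # nothing left to play
        sent += 1
    return sent, 0


buffer = PlayoutBuffer()
buffer.enqueue(f"frame-{i}" for i in range(50))  # 1 second of queued speech

# The caller interrupts on the 11th tick, 200 ms into the agent's sentence.
ticks = [False] * 10 + [True] * 5
sent, dropped = run_playout(buffer, ticks)
print(f"sent {sent * FRAME_MS} ms of audio, dropped {dropped * FRAME_MS} ms")
```

Because the check happens before each frame is sent, the agent stops within one frame of the interruption being detected. In a stitched stack, the queued audio often sits inside a different vendor's stream where you cannot flush it.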

$0.05 per connected minute on CallingBox, covering telephony, ASR, LLM, TTS, orchestration, AMD, STIR/SHAKEN attestation, recordings, transcripts, and structured webhooks.

DIY voice stack vs CallingBox

The useful comparison is not a feature list. It is time to a reliable call, the number of systems you own, and whether latency gets better or worse as traffic grows.

Criterion	DIY stitched stack	CallingBox
Time to first real call	4 to 8 weeks for an experienced team	Minutes
Vendors	SIP, ASR, LLM, TTS, storage, observability	One API
Pricing	Multiple invoices plus engineering time	$0.05/min, all-in
Latency ownership	Spread across every vendor boundary	One integrated budget
Barge-in	Usually custom media-buffer work	~20 ms
Telephony	Carrier setup, numbers, routing, attestation	Included
Structured returns	Build extraction and webhook retries	Built in
Best for	Voice infrastructure companies	Teams shipping phone agents

Production voice AI comparison. CallingBox numbers are from internal production benchmarks and public pricing.

When should you build it yourself?

You should build the whole voice stack yourself only when voice infrastructure is your core product, you have real-time audio and telephony expertise on staff, and your volume or compliance constraints justify owning carrier relationships directly.

That is a narrow group. Most teams are building an outcome: answered support calls, booked appointments, qualified leads, payment reminders, intake interviews, collections workflows, or dispatch coordination. Those teams win by launching the agent fast, measuring call outcomes, and improving the workflow, not by spending their first quarter on jitter buffers and SIP headers.

Build it yourself if you are a carrier, contact center platform, regulated enterprise with hard data-residency requirements, or a voice infrastructure company.
Build on CallingBox if you are a SaaS team, AI automation agency, founder, sales org, healthcare operator, home-services company, or support team that needs production phone agents now.

Build a CallingBox agent in minutes

A CallingBox agent is the smallest useful abstraction for a voice AI system: instructions, voice, tools, phone behavior, and the structured data you want back. You can create one from the API, attach a number, and receive call results on a webhook.

Step 1. Create the agent

Start with the outcome you want. This example creates a front-desk agent that answers questions, books appointments through a tool, and returns structured fields when the call ends.

from callingbox import Callingbox

client = Callingbox()

# Define the agent: voice, instructions, tools, and the fields to return.
agent = client.agents.create(
    name="front-desk",
    voice="sonic-en-us-warm",
    instructions="Answer calls, qualify the request, and book appointments.",
    # Tool the agent can call during the conversation to book a slot.
    tools=[{
        "name": "book_appointment",
        "parameters": {"slot": {"type": "string", "format": "date-time"}},
    }],
    # Structured fields delivered to your webhook when the call ends.
    returns={
        "intent": {"type": "string"},
        "appointment_at": {"type": "string", "format": "date-time"},
        "caller_name": {"type": "string"},
    },
)

POST /v1/agents. Trimmed for clarity; use the docs for the full schema.

Step 2. Put the agent on a call

For inbound, attach the agent to a CallingBox number. For outbound, create a call with the agent ID and the destination number. CallingBox handles the media loop, voicemail detection, attestation, and webhook delivery.

# Outbound call: reuses `client` and `agent` from Step 1.
call = client.calls.create(
    agent_id=agent.id,
    to="+14155550199",
    webhook_url="https://acme.dev/calls",  # receives the post-call payload
    context={"customer": {"first_name": "Maria"}},
)

POST /v1/calls. One call request starts the voice AI system.

Step 3. Receive the result

When the call ends, your webhook receives the transcript, recording, tool calls, and the structured fields you requested. That payload is the product integration point.

{
  "call_id": "call_01HX...",
  "duration_sec": 86,
  "status": "completed",
  "returns": {
    "intent": "booking",
    "appointment_at": "2026-04-29T15:30:00-07:00",
    "caller_name": "Maria Gomez"
  },
  "tool_calls": [{ "name": "book_appointment", "ok": true }],
  "recording_url": "https://recordings.callingbox.io/call_01HX.mp3",
  "transcript_url": "https://recordings.callingbox.io/call_01HX.json"
}

The webhook payload CallingBox sends after the call.
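On the receiving side, a small parser keeps the rest of your code away from raw dictionaries. The sketch below uses only the field names shown in the sample payload above. The `CallResult` type, the choice of which fields are required, and the `failed_tools` helper are assumptions for illustration; signature checks, retries, and the HTTP framework are left out.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class CallResult:
    """Hypothetical typed view of the post-call webhook payload."""
    call_id: str
    status: str
    duration_sec: int
    returns: Dict[str, Any]
    failed_tools: List[str]
    recording_url: Optional[str]
    transcript_url: Optional[str]


def parse_call_result(body):
    """Parse a webhook body (str or bytes) into a CallResult.

    Assumption: call_id, status, and duration_sec are always present;
    everything else is treated as optional.
    """
    payload = json.loads(body)
    tool_calls = payload.get("tool_calls", [])
    return CallResult(
        call_id=payload["call_id"],
        status=payload["status"],
        duration_sec=int(payload["duration_sec"]),
        returns=payload.get("returns", {}),
        failed_tools=[t["name"] for t in tool_calls if not t.get("ok")],
        recording_url=payload.get("recording_url"),
        transcript_url=payload.get("transcript_url"),
    )


# The sample payload from this article, as it would arrive in the request body.
sample_body = """
{
  "call_id": "call_01HX...",
  "duration_sec": 86,
  "status": "completed",
  "returns": {
    "intent": "booking",
    "appointment_at": "2026-04-29T15:30:00-07:00",
    "caller_name": "Maria Gomez"
  },
  "tool_calls": [{ "name": "book_appointment", "ok": true }],
  "recording_url": "https://recordings.callingbox.io/call_01HX.mp3",
  "transcript_url": "https://recordings.callingbox.io/call_01HX.json"
}
"""

result = parse_call_result(sample_body)
if result.status == "completed" and not result.failed_tools:
    print(result.returns["caller_name"], "booked", result.returns["appointment_at"])
```

Checking `failed_tools` before trusting `returns` is a design choice worth keeping: a call can complete politely while the booking tool behind it failed.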

That is the core loop: define the agent, start or receive the call, consume the result. The rest of the voice AI system stays inside CallingBox.

The voice AI checklist that matters

The best voice AI systems are built around measurable call behavior, not prompt length. Before you ship, check the parts callers and operators actually notice.

Latency: median response under 500 ms and P95 low enough that the agent does not feel hesitant.
Barge-in: the agent stops speaking when the caller interrupts, ideally within one RTP frame.
Tool use: data dips have timeouts, filler speech, and fallbacks so the line never goes dead.
Structured outputs: the call returns clean JSON after completion instead of forcing the live conversation into JSON mode.
Compliance: attestation, consent, DNC behavior, call recording, audit logs, and webhook retries are designed in from the start.
Observability: every call has aligned audio, transcript, model events, tool calls, and outcomes.
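The tool-use item is the easiest one to get wrong in a homegrown orchestrator, so here is a minimal sketch of the pattern with `asyncio`: speak a filler line if the lookup is slow, and a fallback if it never returns. The function name, the phrases, and the `say` callback are hypothetical. A real system would stream the filler as audio rather than append text to a list.

```python
import asyncio


async def call_tool_with_filler(tool, say, *, filler_after_s, timeout_s,
                                filler_text="One moment while I check that.",
                                fallback_text="I could not pull that up right now."):
    """Run `tool()`; say a filler if it is slow, a fallback if it times out."""
    task = asyncio.create_task(tool())
    try:
        # shield() keeps the short first timeout from cancelling the tool call.
        return await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        await say(filler_text)
    try:
        # Second wait is not shielded: a timeout here cancels the tool call.
        return await asyncio.wait_for(task, timeout=timeout_s - filler_after_s)
    except asyncio.TimeoutError:
        await say(fallback_text)
        return None


async def demo():
    spoken = []

    async def say(text):
        spoken.append(text)

    def tool_taking(seconds, result):
        async def tool():
            await asyncio.sleep(seconds)
            return result
        return tool

    limits = {"filler_after_s": 0.05, "timeout_s": 0.3}
    fast = await call_tool_with_filler(tool_taking(0.0, "fast"), say, **limits)
    slow = await call_tool_with_filler(tool_taking(0.15, "slow"), say, **limits)
    stuck = await call_tool_with_filler(tool_taking(5.0, "never"), say, **limits)
    return fast, slow, stuck, spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The fast lookup is silent, the slow one gets a filler and still returns its result, and the stuck one gets both the filler and the fallback. The thresholds here are shortened so the demo runs quickly and are not recommendations.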

CallingBox gives you those defaults on day one. That is why the best way to build a voice AI system is to start with CallingBox, launch a real agent in minutes, then spend your engineering time on the workflow your customers actually pay for.

Where to go from here

Read the docs and make your first CallingBox agent.
Check pricing for the all-in per-minute math, including free credits.
Build an AI phone answering service if your first use case is inbound.
Build an AI cold caller if your first use case is outbound.
Compare CallingBox to Vapi and Retell if you are choosing a platform.

