How to build an AI phone answering service in 2026
Two paths to an AI phone answering service in 2026. The DIY stack: six layers, four to six vendors, weeks of carrier paperwork. The CallingBox path: one agent, one number, one webhook.

Jonathan Chavez
Co-Founder, CallingBox (YC S25)
An AI phone answering service is six systems pretending to be one: a SIP trunk, voice-activity detection, streaming speech-to-text, an LLM, streaming text-to-speech, and an orchestrator stitching them together in real time. If any one of them stalls for more than ~150 ms, the caller starts talking again, and the conversation collapses into cross-talk.
There are two honest ways to ship one in 2026. You can assemble the six layers yourself from a handful of vendors, eat the carrier paperwork, and write the orchestrator that holds the 500 ms budget under load. Or you can hit one API. This post walks through both paths end to end: what the DIY stack actually contains, what it costs per minute, the pitfalls nobody warns you about. Then it shows the same product on CallingBox in two API calls.
What is an AI phone answering service?
It is an inbound voice agent that answers a real phone number, holds a natural conversation with the caller, executes tools mid-call (look up an account, check availability, book an appointment), and returns structured data when the call ends.
Concretely, a working AI phone answering service does five things on every call:
- Receives the inbound SIP INVITE from the carrier and accepts the call.
- Listens to the caller and decides when they have finished speaking.
- Responds in fluent, low-latency speech with context from the conversation so far.
- Acts by calling external tools (CRMs, booking APIs, internal databases) when the conversation requires it.
- Returns a structured JSON payload (caller intent, collected fields, outcome) to your webhook when the call ends.
The first three are the conversation layer. The last two are why anyone is paying for this in the first place: an inbound agent that only chats is a demo; an inbound agent that captures structured outcomes is a product.
What's actually inside one
Every production AI phone answering service is the same six layers in the same order, streaming into one another so no stage waits for the previous one to finish.
The handoffs between layers are where the latency budget is actually won. The recognition layer streams partial transcripts straight into the reasoning layer, and the reasoning layer streams tokens straight into synthesis, so the caller hears the first phoneme of the answer before the model has finished thinking. The outbound counterpart, where answering machine detection and TCPA-compliant pacing sit in front of these same six layers, lives in how to build an AI cold caller that books meetings.
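To make the streaming handoff concrete, here is a minimal sketch using plain Python generators. The three stage functions are toy stand-ins (no real ASR, LLM, or TTS is called) and every name is hypothetical; the only point is the ordering of events, where synthesis starts on the first reply token instead of waiting for the full reply.

```python
from typing import Iterable, Iterator

Event = tuple[str, str]

def stream_asr(frames: Iterable[str], events: list[Event]) -> Iterator[str]:
    """Toy recognizer: emits a growing partial transcript per audio frame."""
    heard: list[str] = []
    for frame in frames:
        heard.append(frame)
        partial = " ".join(heard)
        events.append(("asr_partial", partial))
        yield partial

def stream_llm(partials: Iterable[str], events: list[Event]) -> Iterator[str]:
    """Toy reasoner: consumes partials as they arrive, then streams reply tokens."""
    transcript = ""
    for transcript in partials:
        pass  # a real system could start speculative work on each partial
    for token in f"You said: {transcript}".split():
        events.append(("llm_token", token))
        yield token

def stream_tts(tokens: Iterable[str], events: list[Event]) -> Iterator[bytes]:
    """Toy synthesizer: turns each token into an audio chunk as soon as it arrives."""
    for token in tokens:
        events.append(("tts_chunk", token))
        yield token.encode()

def run_pipeline(frames: Iterable[str]) -> list[Event]:
    """Chain the three stages lazily and return the order in which work happened."""
    events: list[Event] = []
    for _chunk in stream_tts(stream_llm(stream_asr(frames, events), events), events):
        pass  # a real orchestrator would write each chunk to the call's media stream
    return events

if __name__ == "__main__":
    for kind, value in run_pipeline(["book", "a", "slot"]):
        print(kind, value)
```

In the printed event log, the first `tts_chunk` appears right after the first `llm_token`, well before the reply is complete. That interleaving is the property the orchestrator has to preserve under load.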
The DIY path: what it actually takes
Six layers means six integrations: a telephony provider for the carrier hop, a streaming ASR vendor, a low-latency LLM, a streaming TTS, an orchestrator that holds the timing, and the recording / observability plumbing around all of it. Each one is a contract, an SLA, an on-call rotation, and a line item on your bill. Below is what that looks like at the per-minute level, and the five problems that show up in production.
What it costs per connected minute
A realistic per-minute cost breakdown for an inbound AI phone agent assembled from best-of-breed providers in 2026. Numbers are list price, no committed-volume discounts, US-domestic inbound.
Per-minute cost · DIY stack (USD, 2026 list price)

| Layer | Per connected minute |
|---|---|
| Telephony (inbound + media stream) | $0.0140 |
| Streaming speech-to-text | $0.0100 |
| LLM reasoning (~4 turns / min) | $0.0180 |
| Streaming text-to-speech | $0.0200 |
| Orchestration compute + observability | $0.0120 |
| Voicemail detection + recording storage | $0.0060 |

| Option | What it covers | Per connected minute |
|---|---|---|
| DIY total | Per connected inbound minute, list price | $0.0800 |
| Other voice-agent platforms | Managed alternatives, all-in list price range | $0.12 – $0.15 |
| CallingBox | All of the above, one bill, $5 in free credits | $0.0500 |
The middle row is where managed voice-agent platforms typically land once you add their orchestration fee on top of the same six layers. They charge more than DIY because they package the stack and take a margin; we charge less because we own it end to end on infrastructure we built.
Two things this table does not include. First, the phone number rental itself, typically $1 to $5 per month per US local number, which is rounding error at any real volume. Second, and this is the line item that actually decides build versus buy: engineering time.
We've watched a working DIY stack take 4 to 8 weeks for a senior engineer who has shipped real-time audio before, and 3 to 6 months for a team that hasn't. Add another month for carrier attestation onboarding, voicemail-detection accuracy work, and the long tail of edge cases that only show up in production. At a fully-loaded engineering cost of $200 / hour, the first 100,000 connected minutes routinely cost more in salary than in per-minute fees.
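The arithmetic behind that last claim can be sketched as follows. The per-minute figures and the $200 hourly rate come from this post; the 40-hour week is an assumption added for the illustration, not a number from the post.

```python
# Per-minute list prices from the cost table (USD).
DIY_LAYERS = {
    "telephony": 0.0140,
    "speech_to_text": 0.0100,
    "llm": 0.0180,
    "text_to_speech": 0.0200,
    "orchestration": 0.0120,
    "voicemail_and_recording": 0.0060,
}

HOURLY_RATE = 200     # fully-loaded engineering cost, from the post
HOURS_PER_WEEK = 40   # assumption: one engineer, full time

def diy_rate() -> float:
    """Sum of the six line items, per connected minute."""
    return round(sum(DIY_LAYERS.values()), 4)

def usage_fees(minutes: int) -> float:
    """Per-minute vendor fees for a given number of connected minutes."""
    return round(minutes * diy_rate(), 2)

def build_salary(weeks: int) -> int:
    """Salary cost of the initial build for one engineer."""
    return weeks * HOURS_PER_WEEK * HOURLY_RATE

if __name__ == "__main__":
    print(f"DIY rate: ${diy_rate():.4f} per minute")
    print(f"Fees for 100,000 minutes: ${usage_fees(100_000):,.0f}")
    print(f"Build salary, 4 to 8 weeks: ${build_salary(4):,} to ${build_salary(8):,}")
```

Under these assumptions the first 100,000 minutes cost $8,000 in fees against $32,000 to $64,000 in salary for the best-case senior-engineer build.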
The five pitfalls nobody warns you about
The cost table assumes everything works. In practice, five problems eat the first month of every DIY voice-agent project we've watched ship. Each one looks small until it lands in front of a paying customer.
01. Endpointing eats your latency budget
Acoustic VAD waits for ~700 ms of silence before declaring a turn over. That alone destroys the conversational budget. The fix is a dual-signal endpointer: acoustic VAD running per RTP frame, plus a small semantic-completion classifier that scores whether the partial ASR transcript looks like a finished utterance. Get this wrong and callers get cut off mid-sentence on every pause for breath.
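A dual-signal endpointer can be sketched roughly like this. The 700 ms fallback and the 20 ms frame come from this post; the 200 ms early-commit window and the 0.8 threshold are hypothetical values chosen for illustration, and the semantic score is passed in as a number because the classifier itself is out of scope here.

```python
from dataclasses import dataclass

FRAME_MS = 20  # one RTP frame

@dataclass
class Endpointer:
    """Ends a turn early when silence AND a finished-looking transcript agree."""
    short_silence_ms: int = 200        # hypothetical early-commit window
    long_silence_ms: int = 700         # acoustic-only fallback
    completion_threshold: float = 0.8  # hypothetical classifier cutoff
    silence_ms: int = 0

    def update(self, frame_is_speech: bool, completion_score: float) -> bool:
        """Feed one frame; returns True when the caller's turn is over."""
        if frame_is_speech:
            self.silence_ms = 0
            return False
        self.silence_ms += FRAME_MS
        if self.silence_ms >= self.long_silence_ms:
            return True  # silence alone is long enough
        return (
            self.silence_ms >= self.short_silence_ms
            and completion_score >= self.completion_threshold
        )

def frames_until_turn_end(completion_score: float) -> int:
    """Count silent frames needed to end the turn at a fixed semantic score."""
    endpointer = Endpointer()
    frames = 0
    while True:
        frames += 1
        if endpointer.update(frame_is_speech=False, completion_score=completion_score):
            return frames

if __name__ == "__main__":
    print(frames_until_turn_end(0.9) * FRAME_MS, "ms when the utterance looks finished")
    print(frames_until_turn_end(0.1) * FRAME_MS, "ms when it looks unfinished")
```

With these example settings a finished-looking utterance ends the turn after 200 ms of silence, while an unfinished one (a pause for breath) still gets the full 700 ms before the agent takes the floor.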
02. Barge-in has to be near-instant
When a caller interrupts the agent mid-sentence, the agent must yield the floor within one RTP frame, ~20 ms. Every homegrown orchestrator we've benchmarked takes 200 to 400 ms because it drains the TTS buffer first, which makes the agent feel rude and cuts retention in half on anything longer than a one-shot Q&A.
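The difference between draining and dropping the buffer can be illustrated with a small asyncio sketch. Everything here is a stand-in: `play` appends to a list instead of writing RTP, and a timer plays the role of the VAD detecting caller speech. The pattern it shows is cancelling the playback task outright so queued audio is discarded rather than played out.

```python
import asyncio

FRAME_S = 0.02  # one 20 ms RTP frame

async def play(frames: list[bytes], sent: list[bytes]) -> None:
    """Stand-in for paced playback: one frame every 20 ms."""
    for frame in frames:
        sent.append(frame)  # a real system would write the frame to the media stream
        await asyncio.sleep(FRAME_S)

async def speak_until_interrupted(frames: list[bytes], interrupt_after_s: float) -> list[bytes]:
    """Start playback, then cancel it the moment the caller barges in."""
    sent: list[bytes] = []
    playback = asyncio.create_task(play(frames, sent))
    await asyncio.sleep(interrupt_after_s)  # stand-in for VAD firing on caller speech
    playback.cancel()  # drop the remaining buffer instead of draining it
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected: playback was cut short
    return sent

if __name__ == "__main__":
    reply = [b"\x00" * 160] * 100  # roughly 2 s of queued agent audio
    sent = asyncio.run(speak_until_interrupted(reply, interrupt_after_s=0.05))
    print(f"played {len(sent)} of {len(reply)} frames before yielding")
```

Because cancellation lands at the next `await`, the agent stops within one frame interval of the interrupt in this sketch; an implementation that finishes the loop first would play the whole queued reply.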
03. Voicemail detection is a classifier, not a heuristic
On inbound calls this matters less than on outbound, but you still need to detect when a forwarded line lands you in someone's voicemail. Naïve approaches (silence detection, "BEEP" keyword spotting) missed ~20% of voicemails in our internal eval set of 12,000 US calls. A real voicemail detector is a small classifier trained on the first 2 to 3 seconds of greeting audio and runs in parallel with ASR.
04. STIR/SHAKEN attestation is non-negotiable
US carriers and the analytics engines they rely on are far more likely to label an unsigned or partially attested call as "Spam Likely" on the recipient's screen. For inbound this matters for transfers and callbacks: the moment your agent dials out from the same number without A-level attestation, that call is at risk of showing up as spam. A-level onboarding takes 2 to 6 weeks per US carrier and requires a verified business identity.
05. Structured returns must not block speech
The last bug everyone discovers in production: forcing the LLM into JSON-mode generation mid-conversation adds ~400 ms to every turn because the model has to commit to the full JSON shape before streaming begins. The right pattern is to keep the live conversation in plain streaming mode and run a single non-blocking extraction pass at the end of the call to produce the structured payload your webhook needs.
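As a rough sketch of that pattern: the live loop streams plain-text tokens and only appends to a transcript, and a single extraction call runs after the call ends. Both model calls below are stubs with hypothetical names; in particular `extract_returns` uses a keyword check where a real system would make one structured-output LLM call.

```python
import asyncio
from typing import AsyncIterator

async def stream_reply(caller_text: str) -> AsyncIterator[str]:
    """Stand-in for a plain (non-JSON) streaming LLM call."""
    for token in f"Got it: {caller_text}".split():
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token

async def extract_returns(transcript: list[str]) -> dict:
    """Stand-in for ONE structured-output pass over the finished transcript."""
    text = " ".join(transcript).lower()
    return {"intent": "booking" if "book" in text else "other"}

async def handle_call(caller_turns: list[str]) -> dict:
    """Keep the live loop in streaming mode; extract structure only at the end."""
    transcript: list[str] = []
    for turn in caller_turns:
        transcript.append(f"caller: {turn}")
        spoken: list[str] = []
        async for token in stream_reply(turn):
            spoken.append(token)  # a real system would hand each token to TTS here
        transcript.append("agent: " + " ".join(spoken))
    # The call has ended: nothing is waiting on speech any more.
    return await extract_returns(transcript)

if __name__ == "__main__":
    print(asyncio.run(handle_call(["Hi", "I want to book Tuesday at 3"])))
```

The design choice is that no turn ever pays for schema-constrained generation; the cost of producing JSON is moved to a point where latency no longer matters to the caller.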
The CallingBox path: one agent, one webhook
The point of CallingBox is that none of the surface area above is yours. Telephony, endpointing, ASR, LLM, TTS, orchestration, carrier attestation, voicemail detection, and structured returns ship as one product on a single per-minute price. The integration is two API calls.
$0.05 per connected minute, all-in.
Step 1. Create the inbound agent
The agent is a reusable configuration: persona, voice, instructions, the numbers it answers on, the tools it can call mid-conversation, and the structured-returns schema you want back when the call ends.
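```python
from callingbox import Callingbox

client = Callingbox()

# One reusable agent configuration: persona, voice, instructions, numbers, tools, returns.
agent = client.agents.create(
    name="front-desk",
    type="inbound",
    voice="sonic-en-us-warm",
    instructions="Greet callers and book appointments via book_appointment.",
    number_ids=["num_4155550199"],  # numbers the agent answers on
    webhook_url="https://acme.dev/calls",  # receives the payload when the call ends
    # Tools the agent can call mid-conversation.
    tools=[{"name": "book_appointment", "parameters": {"slot": "date-time"}}],
    # Structured-returns schema delivered to the webhook.
    returns={"intent": "string", "appointment_at": "date-time", "caller_name": "string"},
)
```

Step 2. Receive structured data on the webhook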
Every inbound call to the attached number now routes through the agent. When the call ends, your webhook receives the structured payload you defined in returns, alongside the transcript, the recording URL, and any tool calls the agent made.
```json
{
  "call_id": "call_01HX...",
  "from": "+14085550144",
  "duration_sec": 86,
  "returns": {
    "intent": "booking",
    "appointment_at": "2026-04-29T15:30:00-07:00",
    "caller_name": "Maria Gomez"
  },
  "tool_calls": [{ "name": "book_appointment", "ok": true }],
  "recording_url": "https://recordings.callingbox.io/call_01HX....mp3"
}
```

That is the entire integration. No SIP trunk to provision, no ASR vendor to integrate, no carrier attestation paperwork, no voicemail-detection classifier to train, no orchestrator to keep within the 500 ms budget under load.
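On the receiving side, a webhook handler for this payload could look something like the sketch below. It uses only the Python standard library and the field names from the sample payload above. The handler class, the helper name, and the 204/400 status codes are assumptions for illustration; check the docs for the expected response, retry behaviour, and any request signing before relying on this.

```python
import json
from datetime import datetime
from http.server import BaseHTTPRequestHandler

def parse_call(body: bytes) -> dict:
    """Pull the fields a booking flow needs out of the end-of-call payload."""
    payload = json.loads(body)
    returns = payload.get("returns", {})
    appointment = returns.get("appointment_at")
    return {
        "call_id": payload["call_id"],
        "caller": payload["from"],
        "intent": returns.get("intent"),
        "appointment_at": datetime.fromisoformat(appointment) if appointment else None,
        "tools_ok": all(call.get("ok") for call in payload.get("tool_calls", [])),
    }

class CallWebhook(BaseHTTPRequestHandler):
    """Hypothetical receiver for the webhook_url configured on the agent."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            call = parse_call(self.rfile.read(length))
        except (ValueError, KeyError):
            self.send_response(400)  # malformed JSON or missing required field
            self.end_headers()
            return
        print(call["call_id"], call["intent"], call["appointment_at"])
        self.send_response(204)  # assumed acknowledgement code
        self.end_headers()

# To try it locally (blocks until interrupted):
#   from http.server import HTTPServer
#   HTTPServer(("", 8000), CallWebhook).serve_forever()
```

Keeping the parsing in a plain function separate from the HTTP handler makes it easy to test against a saved payload without running a server.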
DIY vs CallingBox, side by side
The same product, the same call, two different surface areas. Below is the honest comparison on the criteria that actually decide whether an inbound voice agent ships.
| Criterion | Build it yourself | Build on CallingBox |
|---|---|---|
| Time to first call | 4 to 8 weeks (senior eng) | 60 seconds |
| Per connected minute | ~$0.08 + engineering time | $0.05, all-in |
| Vendors to integrate | 4 to 6, separate bills | 1, one bill |
| Carrier attestation | 2 to 6 weeks per US carrier | Pre-attested, A-level |
| Endpointing | Build a dual-signal classifier | Built in |
| Barge-in latency | 200 to 400 ms typical | ~20 ms (one RTP frame) |
| Voicemail detection | Train and maintain a classifier | Built in |
| Structured returns | Hand-roll a non-blocking pass | One returns schema field |
| Observability | Your own tracing + recordings | Dashboard + transcripts included |
Build vs buy: when DIY actually wins
Buying an API wins on time-to-first-call for almost every team. DIY wins in two narrow situations: you have a compliance or data-residency requirement that rules out a managed provider, or your call volume is large enough that a cent of margin per minute funds a full-time platform team.
The cleanest way to decide:
- Build it yourself if you have on-staff real-time audio expertise, a compliance constraint that blocks managed providers, or a forecast of more than ~3 million connected minutes per year.
- Buy an API if you are an AI automation agency shipping inbound bots for clients, a SaaS adding voice as a feature, or a founder validating a voice-first product. The per-minute price is the smaller line item; the engineering and carrier-relations time is the larger one.
- Buy then build if you have product-market fit on a managed API and the math on bringing the stack in-house clears a 12-month payback at your real volume.
Most agencies and founders fall into the second bucket. You are selling outcomes (booked appointments, qualified leads, answered FAQs), not infrastructure, and your customer does not care which carrier the call came in on.
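For the "buy then build" case, the payback test can be written down as a small function. The managed rate and the 3 million minutes per year come from this post; the build cost reuses the 8-week, $200-per-hour figure with an assumed 40-hour week, and the in-house per-minute rate is a purely hypothetical input that you would replace with your own negotiated vendor pricing.

```python
from typing import Optional

def payback_months(
    build_cost: float,
    monthly_minutes: float,
    managed_rate: float,
    in_house_rate: float,
) -> Optional[float]:
    """Months until per-minute savings repay the build; None if they never do."""
    monthly_savings = monthly_minutes * (managed_rate - in_house_rate)
    if monthly_savings <= 0:
        return None  # in-house is not cheaper per minute, so there is no payback
    return build_cost / monthly_savings

if __name__ == "__main__":
    months = payback_months(
        build_cost=8 * 40 * 200,        # 8 weeks x 40 h (assumed) x $200 / hour
        monthly_minutes=3_000_000 / 12,  # ~3 million connected minutes per year
        managed_rate=0.05,               # managed per-minute price from the post
        in_house_rate=0.025,             # hypothetical negotiated in-house rate
    )
    if months is None:
        print("never pays back at these rates")
    else:
        print(f"payback in {months:.1f} months; clears 12-month bar: {months <= 12}")
```

The function ignores ongoing maintenance and on-call cost, which in practice push the payback later, so treat the result as a lower bound.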
Where to go from here
- Best way to build a voice AI system: the full architecture and build-vs-buy framework.
- Pricing: the per-minute math at your volume.
- Docs: the full agent, tool, and webhook reference.
- How to build an AI cold caller that books meetings: the outbound counterpart, where answering machine detection and TCPA-compliant pacing sit in front of the same six layers.

Skip the build
Your first AI phone call in 60 seconds. Built so you don't have to.
Telephony, ASR, LLM, TTS, and structured returns: one API, $0.05 per connected minute, all-in. Outbound and inbound on the same agent. $5 in free credits, no card.