Voice AI · 11 min read

How to build an AI phone answering service in 2026

Two paths to an AI phone answering service in 2026. The DIY stack: six layers, four to six vendors, weeks of carrier paperwork. The CallingBox path: one agent, one number, one webhook.

Jonathan Chavez

Co-Founder, CallingBox (YC S25)

An AI phone answering service is six systems pretending to be one: a SIP trunk, voice-activity detection, streaming speech-to-text, an LLM, streaming text-to-speech, and an orchestrator stitching them together in real time. If any one of them stalls for more than ~150 ms, the caller starts talking again, and the conversation collapses into cross-talk.

There are two honest ways to ship one in 2026. You can assemble the six layers yourself from a handful of vendors, eat the carrier paperwork, and write the orchestrator that holds the 500 ms budget under load. Or you can hit one API. This post walks through both paths end to end: what the DIY stack actually contains, what it costs per minute, the pitfalls nobody warns you about. Then it shows the same product on CallingBox in two API calls.

What is an AI phone answering service?

It is an inbound voice agent that answers a real phone number, holds a natural conversation with the caller, executes tools mid-call (look up an account, check availability, book an appointment), and returns structured data when the call ends.

Concretely, a working AI phone answering service does five things on every call:

  1. Receives the inbound SIP INVITE from the carrier and accepts the call.
  2. Listens to the caller and decides when they have finished speaking.
  3. Responds in fluent, low-latency speech with context from the conversation so far.
  4. Acts by calling external tools (CRMs, booking APIs, internal databases) when the conversation requires it.
  5. Returns a structured JSON payload (caller intent, collected fields, outcome) to your webhook when the call ends.

The first three are the conversation layer. The last two are why anyone is paying for this in the first place: an inbound agent that only chats is a demo; an inbound agent that captures structured outcomes is a product.

What's actually inside one

Every production AI phone answering service is the same six layers in the same order, streaming into one another so no stage waits for the previous one to finish.

The handoffs between layers are where the latency budget is actually won. The recognition layer streams partial transcripts straight into the reasoning layer, and the reasoning layer streams tokens straight into synthesis, so the caller hears the first phoneme of the answer before the model has finished thinking. The outbound counterpart, where answering machine detection and TCPA-compliant pacing sit in front of these same six layers, is covered in the companion post, "How to build an AI cold caller that books meetings".
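To make the streaming handoff concrete, here is a minimal sketch using Python generators. The canned reply, the function names, and the event log are all illustrative stand-ins, not any vendor's API. The point is only the ordering: synthesis starts on the first token while the model is still producing later ones.

```python
from typing import Iterable, Iterator, List


def llm_tokens(prompt: str, log: List[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields a canned reply one token at a time."""
    for token in ["Sure,", "I", "can", "book", "that."]:
        log.append(f"llm:{token}")
        yield token


def tts_chunks(tokens: Iterable[str], log: List[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: consumes each token as soon as it arrives."""
    for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode("utf-8")  # placeholder for synthesized audio


def run_pipeline(prompt: str) -> List[str]:
    """Chain the two stages lazily and return the order in which work happened."""
    log: List[str] = []
    for _audio in tts_chunks(llm_tokens(prompt, log), log):
        pass  # a real orchestrator would write each chunk to the RTP stream
    return log


if __name__ == "__main__":
    for event in run_pipeline("Can I book for 3 pm?"):
        print(event)
```

Because both stages are lazy, the log interleaves `llm:` and `tts:` events instead of listing every `llm:` event first. That interleaving is the property the six-layer pipeline depends on.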

The DIY path: what it actually takes

Six layers means six integrations: a telephony provider for the carrier hop, a streaming ASR vendor, a low-latency LLM, a streaming TTS, an orchestrator that holds the timing, and the recording / observability plumbing around all of it. Each one is a contract, an SLA, an on-call rotation, and a line item on your bill. Below is what that looks like at the per-minute level, and the five problems that show up in production.

What it costs per connected minute

A realistic per-minute cost breakdown for an inbound AI phone agent assembled from best-of-breed providers in 2026. Numbers are list price, no committed-volume discounts, US-domestic inbound.

Per-minute cost · DIY stack (USD, 2026 list price)

  • Telephony (inbound + media stream): $0.0140
  • Streaming speech-to-text: $0.0100
  • LLM reasoning (~4 turns / min): $0.0180
  • Streaming text-to-speech: $0.0200
  • Orchestration compute + observability: $0.0120
  • Voicemail detection + recording storage: $0.0060

Totals, per connected inbound minute:

  • DIY total (list price): $0.0800
  • Other voice-agent platforms (managed alternatives, all-in list price range): $0.12 – $0.15
  • CallingBox (all of the above, one bill, $5 in free credits): $0.0500

Engineering time, carrier paperwork, and the long tail of edge cases are not included in any row.

The middle row is where managed voice-agent platforms typically land once you add their orchestration fee on top of the same six layers. They charge more than DIY because they package the stack and take a margin; we charge less because we own it end to end on infrastructure we built.

Two things this table does not include. First, the phone number rental itself, typically $1 to $5 per month per US local number, which is rounding error at any real volume. Second, and this is the line item that actually decides build versus buy: engineering time.

We've watched a working DIY stack take 4 to 8 weeks for a senior engineer who has shipped real-time audio before, and 3 to 6 months for a team that hasn't. Add another month for carrier attestation onboarding, voicemail-detection accuracy work, and the long tail of edge cases that only show up in production. At a fully-loaded engineering cost of $200 / hour, the first 100,000 connected minutes routinely cost more in salary than in per-minute fees.
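The arithmetic behind that claim is short enough to check directly. The sketch below uses the figures quoted in this post ($200 / hour, 4 to 8 weeks, $0.08 per DIY minute) plus one assumption that is not in the text: a 40-hour engineering week.

```python
HOURLY_RATE_USD = 200.0      # fully-loaded rate quoted above
HOURS_PER_WEEK = 40          # assumption: not stated in the post
DIY_PER_MINUTE_USD = 0.08    # DIY list-price total from the cost breakdown
CONNECTED_MINUTES = 100_000


def build_cost_usd(weeks: int) -> float:
    """Salary cost of one engineer working full time for `weeks` weeks."""
    return weeks * HOURS_PER_WEEK * HOURLY_RATE_USD


per_minute_fees = CONNECTED_MINUTES * DIY_PER_MINUTE_USD
low, high = build_cost_usd(4), build_cost_usd(8)

if __name__ == "__main__":
    print(f"Per-minute fees for {CONNECTED_MINUTES:,} minutes: ${per_minute_fees:,.0f}")
    print(f"Engineering cost, 4 to 8 weeks: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions the vendor fees for the first 100,000 minutes come to about $8,000, against $32,000 to $64,000 in salary for the senior-engineer case, before the extra month of attestation and edge-case work.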

The five pitfalls nobody warns you about

The cost table assumes everything works. In practice, five problems eat the first month of every DIY voice-agent project we've watched ship. Each one looks small until it lands in front of a paying customer.

01. Endpointing eats your latency budget

Acoustic VAD waits for ~700 ms of silence before declaring a turn over. That alone destroys the conversational budget. The fix is a dual-signal endpointer: acoustic VAD running per RTP frame, plus a small semantic-completion classifier that scores whether the partial ASR transcript looks like a finished utterance. Get this wrong and callers get cut off mid-sentence on every pause for breath.
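A minimal sketch of the dual-signal idea follows. The 20 ms frame and the 700 ms acoustic fallback come from this post. The 200 ms short-silence window, the 0.8 threshold, and the `completion_score` input (the output of a hypothetical semantic-completion classifier) are illustrative assumptions, not tuned values.

```python
from dataclasses import dataclass

FRAME_MS = 20  # one RTP frame


@dataclass
class Endpointer:
    short_silence_ms: int = 200        # assumed: enough when the transcript looks finished
    long_silence_ms: int = 700         # acoustic-only fallback
    completion_threshold: float = 0.8  # assumed classifier cut-off
    silence_ms: int = 0

    def on_frame(self, is_speech: bool, completion_score: float) -> bool:
        """Return True when the caller's turn should be treated as finished."""
        if is_speech:
            self.silence_ms = 0  # any speech resets the silence timer
            return False
        self.silence_ms += FRAME_MS
        if self.silence_ms >= self.long_silence_ms:
            return True  # long silence ends the turn regardless of semantics
        return (
            self.silence_ms >= self.short_silence_ms
            and completion_score >= self.completion_threshold
        )
```

With these numbers, an utterance the classifier scores as complete ends after 200 ms of silence, while a mid-sentence pause for breath (low score) is held open until the 700 ms fallback.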

02. Barge-in has to be near-instant

When a caller interrupts the agent mid-sentence, the agent must yield the floor within one RTP frame, ~20 ms. Every homegrown orchestrator we've benchmarked takes 200 to 400 ms because it drains the TTS buffer first, which makes the agent feel rude and cuts retention in half on anything longer than a one-shot Q&A.
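The difference between draining and flushing can be sketched with a plain queue. The class and method names here are hypothetical. A real orchestrator would also have to cancel in-flight TTS and LLM requests, which this sketch leaves out.

```python
from collections import deque
from typing import Deque, Optional

FRAME_MS = 20  # one RTP frame of audio


class PlaybackQueue:
    """Outbound audio frames waiting to be sent to the caller."""

    def __init__(self) -> None:
        self._frames: Deque[bytes] = deque()

    def enqueue(self, frame: bytes) -> None:
        self._frames.append(frame)

    def pending_ms(self) -> int:
        """Audio still queued, i.e. how long a drain-first design keeps talking."""
        return len(self._frames) * FRAME_MS

    def next_frame(self) -> Optional[bytes]:
        """Frame to send on the next RTP tick, or None if the agent is silent."""
        return self._frames.popleft() if self._frames else None

    def barge_in(self) -> int:
        """Drop everything queued and return how many ms of speech were discarded."""
        dropped_ms = self.pending_ms()
        self._frames.clear()
        return dropped_ms
```

With 15 frames buffered, a drain-first design keeps speaking for 300 ms after the interruption. Calling `barge_in()` makes the very next `next_frame()` return `None`, so the agent is silent by the following frame.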

03. Voicemail detection is a classifier, not a heuristic

On inbound calls this matters less than on outbound, but you still need to detect when a forwarded line lands you in someone's voicemail. Naïve approaches (silence detection, "BEEP" keyword spotting) missed ~20% of voicemails in our internal eval set of 12,000 US calls. A real voicemail detector is a small classifier trained on the first 2 to 3 seconds of greeting audio and runs in parallel with ASR.
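As a rough sketch of "runs in parallel with ASR", the coroutines below score the opening window of audio while transcription proceeds concurrently. Everything here is a placeholder: `classify` stands in for a trained model, `transcribe` for a streaming ASR client, and the 2.5 s window and 0.5 threshold are assumed values inside the 2 to 3 second range mentioned above.

```python
import asyncio
from typing import Callable, List, Tuple

FRAME_MS = 20
WINDOW_MS = 2500  # assumed: inside the 2 to 3 second range


async def detect_voicemail(
    frames: List[bytes], classify: Callable[[bytes], float], threshold: float = 0.5
) -> bool:
    """Score only the first WINDOW_MS of audio with the supplied classifier."""
    window = b"".join(frames[: WINDOW_MS // FRAME_MS])
    await asyncio.sleep(0)  # yield control; a real model would run in an executor
    return classify(window) >= threshold


async def transcribe(frames: List[bytes]) -> str:
    """Placeholder for a streaming ASR call."""
    await asyncio.sleep(0)
    return f"{len(frames)} frames transcribed"


async def handle_call_start(
    frames: List[bytes], classify: Callable[[bytes], float]
) -> Tuple[bool, str]:
    """Run detection and transcription concurrently rather than back to back."""
    is_voicemail, transcript = await asyncio.gather(
        detect_voicemail(frames, classify), transcribe(frames)
    )
    return is_voicemail, transcript
```

The design point is that the detector never sits in the critical path: ASR starts on the first frame either way, and the voicemail verdict arrives alongside it.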

04. STIR/SHAKEN attestation is non-negotiable

US carriers and their call-analytics systems are far more likely to label an unsigned or partially attested call as "Spam Likely" on the recipient's screen. For inbound this matters for transfers and callbacks: the moment your agent dials out from the same number without A-level attestation, that call risks showing up as spam. A-level onboarding takes 2 to 6 weeks per US carrier and requires a verified business identity.
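For debugging, it helps to see which attestation level a call actually carried. In SHAKEN, the level travels as the `attest` claim ("A", "B", or "C") inside the PASSporT token in the SIP `Identity` header. The sketch below only decodes that claim for inspection. It does not verify the signature or the certificate chain, so it is not a substitute for a real verification service, and the function name is made up for this example.

```python
import base64
import json
from typing import Optional


def attestation_level(identity_header: str) -> Optional[str]:
    """Return the SHAKEN `attest` claim from an Identity header value, unverified."""
    # The header value is "<header>.<payload>.<signature>;info=<...>;alg=...;ppt=..."
    token = identity_header.split(";", 1)[0].strip()
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload_b64 = parts[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims.get("attest")
```

A call your agent places on a properly onboarded number should show `"A"` here. Anything else on an outbound leg is a signal to check the carrier setup before customers start seeing spam labels.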

05. Structured returns must not block speech

The last bug everyone discovers in production: forcing the LLM into JSON-mode generation mid-conversation adds ~400 ms to every turn because the model has to commit to the full JSON shape before streaming begins. The right pattern is to keep the live conversation in plain streaming mode and run a single non-blocking extraction pass at the end of the call to produce the structured payload your webhook needs.
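One way to picture that pattern: the turn loop only ever streams plain tokens to speech, and structure is extracted once, after the last turn. The sketch below uses placeholders throughout. `speak_turn` stands in for the TTS handoff, and `extract_returns` uses a trivial keyword check where a real system would make a single LLM call with a JSON schema.

```python
import asyncio
from typing import Any, Dict, List, Tuple


async def speak_turn(reply_tokens: List[str], spoken: List[str]) -> None:
    """Stream plain-text tokens to speech as they arrive (no JSON mode here)."""
    for token in reply_tokens:
        spoken.append(token)    # placeholder for handing the token to TTS
        await asyncio.sleep(0)  # yield so audio can flow between tokens


async def extract_returns(transcript: List[str]) -> Dict[str, str]:
    """Placeholder for the single end-of-call extraction pass."""
    await asyncio.sleep(0)
    text = " ".join(transcript).lower()
    return {"intent": "booking" if "book" in text else "other"}


async def run_call(turns: List[Tuple[str, List[str]]]) -> Dict[str, Any]:
    spoken: List[str] = []
    transcript: List[str] = []
    for caller_text, reply_tokens in turns:
        transcript.append(caller_text)
        await speak_turn(reply_tokens, spoken)
        transcript.append(" ".join(reply_tokens))
    # The caller has hung up: extraction latency no longer affects any turn.
    returns = await extract_returns(transcript)
    return {"spoken": spoken, "returns": returns}
```

The extraction cost is paid once per call instead of once per turn, and it is paid when nobody is waiting on the line.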

The CallingBox path: one agent, one webhook

The point of CallingBox is that none of the surface area above is yours. Telephony, endpointing, ASR, LLM, TTS, orchestration, carrier attestation, voicemail detection, and structured returns ship as one product on a single per-minute price. The integration is two API calls.

Per connected minute, all-in: $0.05
Telephony, ASR, LLM, TTS, orchestration, carrier attestation, voicemail detection, and structured returns. One bill, $5 in free credits.

Step 1. Create the inbound agent

The agent is a reusable configuration: persona, voice, instructions, the numbers it answers on, the tools it can call mid-conversation, and the structured-returns schema you want back when the call ends.

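from callingbox import Callingbox

# Create the API client.
client = Callingbox()

# One call creates the agent, attaches the number, and binds the webhook.
agent = client.agents.create(
    name="front-desk",
    type="inbound",
    voice="sonic-en-us-warm",
    instructions="Greet callers and book appointments via book_appointment.",
    number_ids=["num_4155550199"],       # numbers this agent answers on
    webhook_url="https://acme.dev/calls",  # receives the payload when a call ends
    # Tools the agent can call mid-conversation.
    tools=[{"name": "book_appointment", "parameters": {"slot": "date-time"}}],
    # Structured-returns schema delivered to the webhook.
    returns={"intent": "string", "appointment_at": "date-time", "caller_name": "string"},
)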
POST /v1/agents: one call to create the agent, attach the number, and bind the webhook. Trimmed for clarity; full schema in the docs.

Step 2. Receive structured data on the webhook

Every inbound call to the attached number now routes through the agent. When the call ends, your webhook receives the structured payload you defined in returns, alongside the transcript, the recording URL, and any tool calls the agent made.

{
  "call_id":       "call_01HX...",
  "from":          "+14085550144",
  "duration_sec":  86,
  "returns": {
    "intent":         "booking",
    "appointment_at": "2026-04-29T15:30:00-07:00",
    "caller_name":    "Maria Gomez"
  },
  "tool_calls":    [{ "name": "book_appointment", "ok": true }],
  "recording_url": "https://recordings.callingbox.io/call_01HX....mp3"
}
POST https://acme.dev/calls: the call summary CallingBox delivers to your webhook when a call ends.
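On the receiving side, the handler mostly needs to pull a few fields out of that payload. The sketch below is framework-agnostic and uses only the field names shown in the example above. The function name and the shape of the returned summary are illustrative choices, and it makes no attempt at authentication or signature checking, which a production endpoint would need.

```python
import json
from typing import Any, Dict


def summarize_call(raw_body: bytes) -> Dict[str, Any]:
    """Parse a call-ended webhook body into the fields a booking system needs."""
    payload = json.loads(raw_body)
    returns = payload.get("returns", {})
    tool_calls = payload.get("tool_calls", [])
    return {
        "call_id": payload.get("call_id"),
        "caller": payload.get("from"),
        "intent": returns.get("intent"),
        "appointment_at": returns.get("appointment_at"),
        # True only if every tool call the agent made reported success.
        "all_tools_ok": all(call.get("ok") is True for call in tool_calls),
    }
```

Using `.get()` throughout keeps the handler tolerant of calls where the agent collected nothing, for example a caller who hangs up during the greeting.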

That is the entire integration. No SIP trunk to provision, no ASR vendor to contract with, no carrier attestation paperwork, no voicemail-detection classifier to train, no orchestrator to keep within the 500 ms budget under load.

DIY vs CallingBox, side by side

The same product, the same call, two different surface areas. Below is the honest comparison on the criteria that actually decide whether an inbound voice agent ships.

| Criterion | Build it yourself | Build on CallingBox |
| --- | --- | --- |
| Time to first call | 4 to 8 weeks (senior eng) | 60 seconds |
| Per connected minute | ~$0.08 + engineering time | $0.05, all-in |
| Vendors to integrate | 4 to 6, separate bills | 1, one bill |
| Carrier attestation | 2 to 6 weeks per US carrier | Pre-attested, A-level |
| Endpointing | Build a dual-signal classifier | Built in |
| Barge-in latency | 200 to 400 ms typical | ~20 ms (one RTP frame) |
| Voicemail detection | Train and maintain a classifier | Built in |
| Structured returns | Hand-roll a non-blocking pass | One returns schema field |
| Observability | Your own tracing + recordings | Dashboard + transcripts included |
Same product, same call · DIY assumes a senior engineer who has shipped real-time audio before

Build vs buy: when DIY actually wins

Buying an API wins on time-to-first-call for almost every team. DIY wins in two narrow situations: you have a compliance or data-residency requirement that rules out a managed provider, or your call volume is large enough that a cent of margin per minute funds a full-time platform team.

The cleanest way to decide:

  • Build it yourself if you have on-staff real-time audio expertise, a compliance constraint that blocks managed providers, or are forecasting more than ~3 million connected minutes per year.
  • Buy an API if you are an AI automation agency shipping inbound bots for clients, a SaaS adding voice as a feature, or a founder validating a voice-first product. The per-minute price is the smaller line item; the engineering and carrier-relations time is the larger one.
  • Buy then build if you have product-market fit on a managed API and the math on bringing the stack in-house clears a 12-month payback at your real volume.
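The 12-month payback test in the last bullet is a one-line formula: upfront build cost divided by monthly savings. The sketch below shows it with hypothetical inputs. The $64,000 build cost is the 8-week senior-engineer figure at $200 / hour, assuming 40-hour weeks. The 500,000 minutes per month and the $0.04 in-house rate (a cent under the managed price) are invented for illustration.

```python
from typing import Optional


def payback_months(
    build_cost_usd: float,
    monthly_minutes: float,
    managed_per_min_usd: float,
    inhouse_per_min_usd: float,
) -> Optional[float]:
    """Months until per-minute savings repay the build; None if there are no savings."""
    monthly_saving = monthly_minutes * (managed_per_min_usd - inhouse_per_min_usd)
    if monthly_saving <= 0:
        return None  # in-house is not cheaper per minute, so it never pays back
    return build_cost_usd / monthly_saving


if __name__ == "__main__":
    # Hypothetical: one cent of margin per minute at 500,000 minutes a month.
    print(payback_months(64_000, 500_000, 0.05, 0.04))
    # At the DIY list price of $0.08 there is no per-minute saving to recover.
    print(payback_months(64_000, 500_000, 0.05, 0.08))
```

With those made-up inputs the payback lands at roughly 12.8 months, just outside a 12-month bar. At list prices it never pays back, so the case for building rests on negotiated volume discounts or on non-price constraints such as compliance.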

Most agencies and founders fall into the second bucket. You are selling outcomes (booked appointments, qualified leads, answered FAQs), not infrastructure, and your customer does not care which carrier the call came in on.

