Voice AI · 10 min read

How to build an AI cold caller that books meetings

Building an AI cold caller is a system architecture problem, not a prompt problem. The four hard layers (answering machine detection, dialing strategy, latency, and CRM data dips), plus a build-vs-buy framework.

Jonathan Chavez

Co-Founder, CallingBox (YC S25)

Apr 24, 2026

An AI cold caller that books meetings is a system architecture problem, not a prompt problem. The model decides what to say. The architecture decides whether the call ever reaches a human, whether it stays on the line when it does, and whether a meeting actually lands on a calendar at the end.

Below, we walk through the four architectural layers that decide whether an AI cold caller books meetings or just makes noise: answering machine detection, dialing strategy, outbound latency, and live CRM data dips. We wrote this for sales orgs and developers building outbound lead-gen bots, and it reflects what we ship every day on CallingBox.

What is an AI cold caller?

An AI cold caller is an outbound voice agent that dials a list of numbers, qualifies the person who picks up, books a meeting on a live calendar, and returns structured CRM data when the call ends. The hard part is everything between "dial" and "book a meeting": most dialed numbers go to voicemail, most pickups are wrong-number or gatekeeper, and the small fraction that reach the intended human do so under the worst latency conditions in voice AI.

Concretely, a working AI cold caller does five things every dial:

Dials the next number in the list at a pace the carrier will accept and TCPA will allow.
Detects whether the answer is a human or a machine within the first 1.5 seconds of audio.
Talks at conversational latency to the human, or leaves a pre-recorded voicemail to the machine.
Acts mid-call by reading and writing to the CRM, checking calendar availability, and booking the slot.
Returns a structured payload (intent, booked time, CRM updates, recording) to your webhook when the call ends.

Why prompt engineering won't book meetings

Most teams approaching outbound voice for the first time spend their first month tuning the system prompt. It is the wrong dial. A perfect prompt with a misclassifying AMD model leaves voicemails for humans and reads scripts to answering machines. A perfect prompt with a naive dialer abandons calls until the carrier blocks the trunk. A perfect prompt that awaits a synchronous CRM lookup mid-sentence dies of awkward silence.

Below are the four architectural layers that decide whether an AI cold caller actually performs. Each one looks like an infrastructure footnote until it lands in front of a paying customer.

Layer 1. Answering machine detection

Answering machine detection (AMD) is the hardest classifier in voice AI, and the one that decides whether the rest of your stack ever gets to talk to a human. A miss in either direction is directly tied to revenue.

|                | Predicted human    | Predicted machine   |
|----------------|--------------------|---------------------|
| Actual human   | 96.3%              | 3.7% (lost meeting) |
| Actual machine | 1.4% (wasted turn) | 98.6%               |

AMD precision and recall, internal eval set · 12,000 US outbound dials · 2026

A false negative (we predict machine, the line is actually a human) costs a booked meeting outright: the agent silently hangs up or starts leaving a voicemail to a confused person. A false positive (we predict human, the line is actually a machine) is cheaper but compounds: the agent talks at a beep, the dialer wastes a slot, and your effective dials-per-hour falls.
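To make those two error costs concrete, here is a small worked calculation. The four rates come from the matrix above; the share of answered calls that are live humans (`human_share`) is a hypothetical input, not a figure from our eval set, and the function names are illustrative only.

```python
# Row-normalized rates from the AMD confusion matrix above.
P_HUMAN_GIVEN_HUMAN = 0.963      # human correctly routed to the agent
P_MACHINE_GIVEN_HUMAN = 0.037    # human misrouted as machine: lost meeting
P_HUMAN_GIVEN_MACHINE = 0.014    # machine misrouted as human: wasted turn
P_MACHINE_GIVEN_MACHINE = 0.986  # machine correctly sent to voicemail


def amd_outcomes(answered_calls: int, human_share: float) -> dict:
    """Split answered calls into the four cells of the confusion matrix.

    human_share is an assumed fraction of answers that are live humans.
    """
    humans = answered_calls * human_share
    machines = answered_calls - humans
    return {
        "humans_reached": humans * P_HUMAN_GIVEN_HUMAN,
        "lost_humans": humans * P_MACHINE_GIVEN_HUMAN,
        "wasted_turns": machines * P_HUMAN_GIVEN_MACHINE,
        "voicemails_left": machines * P_MACHINE_GIVEN_MACHINE,
    }


def human_precision(outcomes: dict) -> float:
    """Of the calls routed to the agent, the fraction that were really human."""
    routed_to_agent = outcomes["humans_reached"] + outcomes["wasted_turns"]
    return outcomes["humans_reached"] / routed_to_agent


if __name__ == "__main__":
    result = amd_outcomes(answered_calls=1000, human_share=0.25)
    for name, value in result.items():
        print(f"{name}: {value:.2f}")
    print(f"precision on 'human': {human_precision(result):.3f}")
```

With an assumed 25% human share, 1,000 answered calls yield roughly 9 humans misrouted to voicemail and about 10 agent turns spent on machines. Precision on the "human" label depends on that mix, which is why the per-row rates are the more portable numbers.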

The naive approaches (silence detection, "BEEP" keyword spotting, fixed-window heuristics) are not in the same ZIP code as the numbers above. We run a small classifier trained on the first 2 to 3 seconds of greeting audio, in parallel with ASR, and treat the AMD output as a routing decision rather than a side-effect. Until that classifier ships, nothing else in the stack matters.
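One way to picture "AMD as a routing decision" is a function that turns the classifier's output into an explicit route. This is a minimal sketch, not our production logic: the `AmdVerdict` shape, the thresholds, and the 3-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AGENT = "agent"                    # hand the call to the voice agent
    VOICEMAIL = "voicemail"            # play the pre-recorded voicemail
    KEEP_LISTENING = "keep_listening"  # not enough evidence yet


@dataclass(frozen=True)
class AmdVerdict:
    p_human: float  # classifier probability that a human answered
    audio_ms: int   # greeting audio the classifier has seen so far


def route_call(
    verdict: AmdVerdict,
    human_threshold: float = 0.85,   # hypothetical values, tune on real data
    machine_threshold: float = 0.10,
    max_audio_ms: int = 3000,
) -> Route:
    """Map an AMD verdict to a route, resolving ambiguity toward the agent."""
    if verdict.p_human >= human_threshold:
        return Route.AGENT
    if verdict.p_human <= machine_threshold:
        return Route.VOICEMAIL
    if verdict.audio_ms < max_audio_ms:
        return Route.KEEP_LISTENING
    # Still ambiguous after the full window: misrouting a human costs a
    # meeting, misrouting a machine only costs a turn, so prefer the agent.
    return Route.AGENT


if __name__ == "__main__":
    print(route_call(AmdVerdict(p_human=0.95, audio_ms=800)))
    print(route_call(AmdVerdict(p_human=0.40, audio_ms=1200)))
    print(route_call(AmdVerdict(p_human=0.40, audio_ms=3000)))
```

The asymmetric tie-break reflects the cost argument above: a lost meeting is more expensive than a wasted turn.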

Layer 2. Dialing strategy that respects TCPA

Outbound dialing is a queueing problem with regulatory teeth. You are choosing how many calls to dial in parallel for each available agent (a virtual one in our case), how aggressively to abandon unanswered rings, and how to pace the dialer so the carrier does not block the trunk for spam-like behavior.

Power dialing (1:1 dial-to-agent) wastes most of the agent's time waiting for ring, voicemail, and gatekeepers. Easy to operate, terrible throughput.
Progressive dialing (n:1 with a low ratio, typically 1.5–2:1) doubles throughput at the cost of occasional early-hangup events when two calls connect at once. Reasonable starting point.
Predictive dialing (n:1 with a model-driven ratio, often 3–5:1) maxes throughput by predicting when an agent will free up. The classic mode for human call centers, and the one with the most regulatory exposure.

Under the US Telephone Consumer Protection Act, the abandonment rate (calls answered but dropped because no agent was free within 2 seconds of the called party's greeting) is capped at 3% of live answers, measured per 30-day campaign. The catch with AI agents is that the abandonment math doesn't magically vanish: virtual agents have finite capacity (orchestrator slots, model concurrency, audio I/O budget), and oversubscribing them creates the same dropped-call pattern. The dialer has to know the real concurrency ceiling and back off when it's near the limit.
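A pacing loop along those lines might look like the sketch below. It uses additive-increase/multiplicative-decrease, a technique borrowed from network congestion control rather than anything this article prescribes. The `PacingState` fields, the 50% safety margin, the 90% utilization trigger, and the step sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PacingState:
    live_answers: int  # calls answered by a live person in the 30-day window
    abandoned: int     # live answers dropped because no agent slot was free
    busy_slots: int    # virtual-agent slots currently in a conversation
    total_slots: int   # the real concurrency ceiling


def abandonment_rate(state: PacingState) -> float:
    """Abandoned calls as a fraction of live answers (0.0 if none yet)."""
    if state.live_answers == 0:
        return 0.0
    return state.abandoned / state.live_answers


def next_dial_ratio(
    state: PacingState,
    current_ratio: float,
    cap: float = 0.03,            # the 3% abandonment cap
    safety_margin: float = 0.5,   # assumed: back off at half the cap
    max_utilization: float = 0.9, # assumed: back off near the slot ceiling
    max_ratio: float = 3.0,
) -> float:
    """Return the dial-to-agent ratio to use for the next pacing interval."""
    if state.total_slots <= 0:
        raise ValueError("total_slots must be positive")
    utilization = state.busy_slots / state.total_slots
    near_cap = abandonment_rate(state) >= cap * safety_margin
    if near_cap or utilization >= max_utilization:
        # Multiplicative decrease, never below one dial per free agent.
        return max(1.0, current_ratio * 0.5)
    # Additive increase while there is headroom.
    return min(max_ratio, current_ratio + 0.1)


if __name__ == "__main__":
    healthy = PacingState(live_answers=1000, abandoned=5, busy_slots=2, total_slots=10)
    risky = PacingState(live_answers=1000, abandoned=20, busy_slots=2, total_slots=10)
    print(next_dial_ratio(healthy, current_ratio=2.0))
    print(next_dial_ratio(risky, current_ratio=2.0))
```

Backing off well before the cap is a design choice: the rate is measured over a 30-day window, so a dialer that only reacts at 3% has already spent its budget.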

Layer 3. Latency on outbound is harder than inbound

The 500 ms median end-to-end response budget that defines a natural-feeling AI phone call is harder to hit on outbound than on inbound, and most teams discover this only in production. Two reasons.

Codec gymnastics. An outbound call originated by your carrier traverses one or more transit hops before it reaches the recipient's carrier. Each hop is a chance for transcoding (G.711 ↔ G.722 ↔ Opus, sometimes round-trip), which adds 10 to 40 ms of buffering and a hit to perceptual quality. Inbound calls from a recipient's carrier into your SBC usually have one fewer hop and one fewer transcode.

Different jitter profile. The recipient's last-mile carrier owns the jitter buffer that decides what your agent sounds like on their phone. Mobile last-mile is more jittery than landline last-mile, and on outbound you do not get to pick which the recipient is on. The same 490 ms wall-clock budget that feels snappy on inbound feels closer to 600 ms on a congested mobile last-mile.

Inbound latency is what you measure. Outbound latency is what the recipient's carrier hands you.

The outbound rule

The mitigation is the same shape as the inbound playbook (stream every stage, never block speech on JSON mode, sentence-incremental TTS) but tuned harder. We co-locate the SBC, media server, and inference pods in the same region; we peer directly with the major US carriers to keep transit hops to one; and we run a per-call jitter probe that adapts the playout buffer in the first second of speech.
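As an illustration of what a per-call jitter probe could compute, the sketch below uses the interarrival jitter estimator from RFC 3550 (the RTP specification) as a stand-in and sizes a playout buffer from it. It does not describe our actual probe. The multiplier, floor, and ceiling are hypothetical, and the timestamps are assumed to be in milliseconds.

```python
def interarrival_jitter_ms(send_ms: list, recv_ms: list) -> float:
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    For consecutive packets, D is the change in transit time, and the
    estimate is smoothed with a gain of 1/16: J += (|D| - J) / 16.
    """
    if len(send_ms) != len(recv_ms):
        raise ValueError("send_ms and recv_ms must have the same length")
    jitter = 0.0
    for i in range(1, len(send_ms)):
        transit_delta = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(transit_delta) - jitter) / 16.0
    return jitter


def playout_buffer_ms(
    jitter_ms: float,
    multiplier: float = 3.0,   # assumed headroom over the jitter estimate
    floor_ms: float = 20.0,    # assumed minimum: one 20 ms audio frame
    ceiling_ms: float = 120.0, # assumed maximum before latency hurts
) -> float:
    """Clamp a jitter-proportional buffer between a floor and a ceiling."""
    return min(ceiling_ms, max(floor_ms, multiplier * jitter_ms))


if __name__ == "__main__":
    # Packets sent every 20 ms; the third one arrives 10 ms late.
    sent = [0.0, 20.0, 40.0]
    received = [50.0, 70.0, 100.0]
    jitter = interarrival_jitter_ms(sent, received)
    print(f"jitter estimate: {jitter:.3f} ms")
    print(f"playout buffer: {playout_buffer_ms(jitter):.1f} ms")
```

The ceiling matters as much as the estimate: every millisecond of playout buffer is spent out of the same response budget discussed above.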

Layer 4. CRM data dips without blocking speech

The bug every team ships in week one: a tool call wired as a synchronous await in the middle of a turn. The agent says "Let me pull up your account", the tool call takes 900 ms to round-trip Salesforce, and the recipient hears 900 ms of dead air before the next sentence starts. Two of those in a row and the call is over.

The architecture for live data dips is three rules:

Filler is part of the protocol. The moment the agent commits to a tool call, it speaks a short, model-generated filler ("One sec while I look that up") so the line never goes silent. The filler is a first-class part of the tool schema, not a happy accident of the prompt.
Tool calls have hard timeouts and fallbacks. Every external dip (CRM lookup, calendar availability, DNC check) has a deadline measured in milliseconds, not seconds. On timeout, the agent has a pre-written fallback ("I am having trouble pulling that up. Can I call you back in 5 minutes?") that books the next action without lying about state.
Writes are idempotent and async. Booking a calendar slot, updating a CRM stage, or logging a disposition must not block the close of the call. Queue the write, return the structured payload, and let the webhook reconcile.
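The three rules can be sketched with plain `asyncio`. Everything here is a stand-in: `speak` records lines instead of streaming TTS, the CRM coroutines just sleep, and the in-memory queue and `seen` set are placeholders for whatever durable queue and idempotency store you actually use.

```python
import asyncio

FILLER = "One sec while I look that up."
FALLBACK = "I am having trouble pulling that up. Can I call you back in 5 minutes?"


async def speak(line: str, spoken: list) -> None:
    """Stand-in for streaming TTS: record what the caller would hear."""
    spoken.append(line)


async def dip_with_filler(tool, filler: str, timeout_ms: int, spoken: list):
    """Rules 1 and 2: start the dip, talk over it, enforce a hard deadline."""
    lookup = asyncio.create_task(tool())  # the lookup starts immediately
    await speak(filler, spoken)           # filler overlaps the round trip
    try:
        return await asyncio.wait_for(lookup, timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        # wait_for cancels the lookup; fall back without lying about state.
        await speak(FALLBACK, spoken)
        return None


async def enqueue_write(queue: asyncio.Queue, seen: set, key: str, payload: dict) -> bool:
    """Rule 3: queue the write once per idempotency key; never block the call."""
    if key in seen:
        return False
    seen.add(key)
    await queue.put((key, payload))
    return True


async def demo() -> dict:
    spoken, seen, queue = [], set(), asyncio.Queue()

    async def fast_crm():
        await asyncio.sleep(0.01)
        return {"account": "Acme"}

    async def slow_crm():
        await asyncio.sleep(1.0)
        return {"account": "Acme"}

    found = await dip_with_filler(fast_crm, FILLER, timeout_ms=300, spoken=spoken)
    missed = await dip_with_filler(slow_crm, FILLER, timeout_ms=50, spoken=spoken)
    first = await enqueue_write(queue, seen, "call-1:book_meeting", {"slot": "tbd"})
    retry = await enqueue_write(queue, seen, "call-1:book_meeting", {"slot": "tbd"})
    return {"found": found, "missed": missed, "spoken": spoken,
            "first": first, "retry": retry, "queued": queue.qsize()}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Starting the lookup before speaking the filler is the point of the design: the filler hides the round trip instead of adding to it.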

These rules are unglamorous, and they are also the difference between a demo that works on stage and a fleet of cold callers that books meetings every weekday.

Per dialed minute, all-in

$0.05

Telephony, AMD, ASR, LLM, TTS, orchestration, CRM dips, and the structured webhook ship as one product on CallingBox. STIR/SHAKEN A-attestation included.

Build vs buy: a decision framework

For an outbound AI cold caller, the wall is not the LLM. The wall is AMD accuracy, carrier relations, and TCPA-compliant pacing. Those are 8 to 12 weeks of work for a senior engineer who has shipped real-time voice and outbound telephony before, and 4 to 6 months for a team that has not.

Build it yourself if you have on-staff telephony expertise, a compliance constraint that blocks managed providers, or are forecasting more than ~5 million dialed minutes per year.
Buy an API if you are a sales org, an AI automation agency shipping outbound bots for clients, or a founder validating an outbound product. The per-minute price is the smaller line item; the AMD model, carrier paperwork, and dialer-pacing math are the larger ones.
Buy then build if you have product-market fit on a managed API and the math on bringing the stack in-house clears a 12-month payback at your real volume.
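The payback test in the third option is simple arithmetic. In the sketch below, only the $0.05 managed price comes from this article; the build cost, in-house per-minute cost, maintenance figure, and volume are hypothetical placeholders to replace with your own numbers.

```python
from typing import Optional


def payback_months(
    build_cost: float,
    monthly_minutes: float,
    managed_price_per_min: float,
    inhouse_cost_per_min: float,
    monthly_maintenance: float,
) -> Optional[float]:
    """Months until in-house savings repay the build cost.

    Returns None when running in-house is not cheaper per month,
    in which case the build never pays back.
    """
    per_minute_saving = managed_price_per_min - inhouse_cost_per_min
    monthly_savings = monthly_minutes * per_minute_saving - monthly_maintenance
    if monthly_savings <= 0:
        return None
    return build_cost / monthly_savings


if __name__ == "__main__":
    months = payback_months(
        build_cost=150_000,          # hypothetical engineering cost
        monthly_minutes=500_000,     # hypothetical dialed minutes per month
        managed_price_per_min=0.05,  # managed price quoted in this article
        inhouse_cost_per_min=0.02,   # hypothetical carrier + inference cost
        monthly_maintenance=5_000,   # hypothetical on-call and compliance upkeep
    )
    verdict = "clears" if months is not None and months <= 12 else "does not clear"
    print(f"payback: {months:.1f} months, {verdict} the 12-month bar")
```

With these placeholder inputs the payback lands at about 15 months, so the build would not clear the 12-month bar even at 6 million dialed minutes a year.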

Most sales orgs and AI automation agencies fall into the second bucket. You are selling outcomes (booked meetings, qualified leads, closed-loop CRM data), not infrastructure. Your customer does not care which SIP trunk you used.

The 60-second outbound version with CallingBox

On CallingBox, an AI cold caller is one agent, one dial endpoint, one webhook. The same agent definition that answers inbound calls dispatches outbound calls; you choose by calling /v1/calls with a target number.

One dispatch · AMD is the only fork

Step 1. Configure the agent for outbound

The agent is the same primitive as inbound, with two outbound additions: an opener (the first sentence the agent speaks when the call connects) and a voicemail_script (what the agent leaves when AMD returns "machine").

from callingbox import Callingbox

client = Callingbox()

agent = client.agents.create(
    name="ae-outbound",
    voice="sonic-en-us-warm",
    opener="Hi, is this {{lead.first_name}}? Ava from Acme.",
    voicemail_script="Hi {{lead.first_name}}, Ava from Acme. Try you tomorrow.",
    tools=[{"name": "book_meeting", "parameters": {"slot": {"type": "string", "format": "date-time"}}}],
    returns={"outcome": {"type": "string"}, "meeting_at": {"type": "string", "format": "date-time"}},
)

POST /v1/agents: same agent shape as inbound, plus the two outbound additions. Trimmed for clarity; full schema in the docs.

Step 2. Dispatch an outbound call

Each POST /v1/calls dispatches one dial. CallingBox runs AMD on the answer, routes to the agent on a human pickup, and leaves the voicemail script on a machine. TCPA-aware pacing, DNC checks, and STIR/SHAKEN attestation are handled by default.

call = client.calls.create(
    agent_id="agt_ae_outbound",
    to="+14155550199",
    webhook_url="https://acme.dev/calls",
    context={"lead": {"first_name": "Maria"}},
)

POST /v1/calls: one dial, one webhook, with lead context the agent can read mid-call.

Step 3. Receive structured outcomes

When the call ends, CallingBox posts the structured payload you defined in returns, alongside the AMD verdict, the transcript, and the recording. Pipe it straight into your CRM.

{
  "call_id":      "call_01HX…",
  "duration_sec": 142,
  "amd":          "human",
  "returns": {
    "outcome":    "meeting_booked",
    "meeting_at": "2026-04-29T15:30:00-07:00"
  },
  "tool_calls":     [{ "name": "book_meeting", "ok": true }],
  "recording_url":  "https://recordings.callingbox.io/call_01HX….mp3",
  "transcript_url": "https://recordings.callingbox.io/call_01HX….json"
}

The call summary CallingBox delivers to your webhook on completion.
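A minimal sketch of the receiving side, assuming the payload shape shown above. The `to_crm_update` helper and the keys of the dictionary it returns are hypothetical, and the sample `call_id` is a made-up placeholder because the real one is elided in the example.

```python
import json
from datetime import datetime


def to_crm_update(raw_body: str) -> dict:
    """Turn a call-summary webhook body into a flat record for a CRM."""
    payload = json.loads(raw_body)
    returns = payload.get("returns", {})
    meeting_at = returns.get("meeting_at")
    return {
        "call_id": payload["call_id"],
        "reached_human": payload.get("amd") == "human",
        "outcome": returns.get("outcome", "unknown"),
        # Keep the UTC offset so the booking lands in the right time zone.
        "meeting_at": datetime.fromisoformat(meeting_at) if meeting_at else None,
        # True only if every mid-call tool call reported success.
        "tools_ok": all(t.get("ok", False) for t in payload.get("tool_calls", [])),
        "recording_url": payload.get("recording_url"),
    }


SAMPLE_BODY = json.dumps({
    "call_id": "call_example",  # placeholder id, not a real call
    "duration_sec": 142,
    "amd": "human",
    "returns": {
        "outcome": "meeting_booked",
        "meeting_at": "2026-04-29T15:30:00-07:00",
    },
    "tool_calls": [{"name": "book_meeting", "ok": True}],
})

if __name__ == "__main__":
    update = to_crm_update(SAMPLE_BODY)
    print(update["outcome"], update["meeting_at"].isoformat())
```

If your webhook endpoint can receive the same summary more than once, keying the CRM write on `call_id` keeps the update idempotent.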

That is the entire outbound integration. No AMD model to train, no dialer to pace, no STIR/SHAKEN paperwork, no DNC scrubbing cron. Telephony, AMD, ASR, LLM, TTS, orchestration, and compliance ship as one product at $0.05 per dialed minute, with $5 in free credits to dial yourself before you trust us.

Where to go from here

Best way to build a voice AI system for the full production architecture behind inbound and outbound agents.
How to build an AI phone answering service in 2026. The inbound counterpart to this guide.
Pricing for the per-minute math at your dial volume.
Docs for the full agent, dialer, and webhook reference.
