
How AI Phone Agents Navigate IVR Phone Trees (Without Pressing the Wrong Button)

An AI phone agent navigates an IVR phone tree by listening to the menu prompt with real-time speech-to-text, matching the spoken options against the goal of the call, then sending DTMF tones or spoken responses with human-like timing: typically waiting for the prompt to finish, pausing a beat, then pressing one digit at a time. It repeats this loop at each menu level, waits silently on hold when transferred, and drops the navigation state machine the moment a human picks up so the conversation starts cleanly. This guide walks through that workflow end-to-end, the failure modes that trip up naive implementations, and a worked example of an AI agent rescheduling a dermatology appointment through a multi-level phone tree.


What an IVR phone tree actually is (and why it is hard for AI)

An IVR — Interactive Voice Response — is the menu system you hear when you call a bank, an airline, a doctor's office, or a utility. It plays a recorded prompt ("Press 1 for billing, press 2 for appointments") and listens for either DTMF tones from your keypad or spoken commands. Traditional IVRs are decision trees: each input narrows the path until you reach a queue, a self-service flow, or eventually a live agent. The trees are deeper and weirder than they look. A typical airline IVR has five to seven levels, branches on whether you have a confirmation number, and routes differently if your flight is within 24 hours. A pharmacy IVR may demand a date of birth before it will even tell you the hours. The structure is not published, the prompts change without notice, and the timing windows for accepting input are inconsistent across vendors. For an AI phone agent, three things make this hard. First, the menus play as audio, so the agent needs real-time speech-to-text accurate enough to catch option numbers spoken at speed over a compressed phone line. Second, many IVRs ignore digits sent while the prompt is still playing, and others drop digits sent faster than a human thumb could press them — so timing matters as much as accuracy. Third, the agent has to know when to stop being an IVR navigator and start being a conversationalist, because somewhere in the tree a human will eventually pick up. Getting any of these wrong sends you to the wrong department, or worse, dumps you into a voicemail box you cannot escape from.
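To make the decision-tree structure concrete, here is a minimal Python sketch. The menu layout, the `MenuNode` class, and the `walk` helper are all hypothetical and invented for illustration; real IVR trees are unpublished and differ from vendor to vendor.

```python
from dataclasses import dataclass, field


@dataclass
class MenuNode:
    """One IVR menu level: a spoken prompt plus digit -> next-node options."""
    prompt: str
    options: dict[str, "MenuNode"] = field(default_factory=dict)

    @property
    def is_terminal(self) -> bool:
        # A node with no options stands for a queue, a self-service flow,
        # or a live agent.
        return not self.options


# Hypothetical two-level tree for a clinic.
tree = MenuNode(
    prompt="Press 1 for billing, press 2 for appointments.",
    options={
        "1": MenuNode(prompt="Connecting you to billing."),
        "2": MenuNode(
            prompt="Press 1 for new appointments, press 2 for existing appointments.",
            options={
                "1": MenuNode(prompt="Please hold for scheduling."),
                "2": MenuNode(prompt="Please hold for the next available representative."),
            },
        ),
    },
)


def walk(root: MenuNode, digits: str) -> MenuNode:
    """Follow a sequence of keypresses; raises KeyError on an invalid option."""
    node = root
    for digit in digits:
        node = node.options[digit]
    return node
```

A caller who already knew the tree could reach the representative queue with `walk(tree, "22")`. The agent does not have that luxury, because the tree is not published: it has to discover each level by listening.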

The end-to-end workflow an AI agent runs at each menu level

When an AI phone agent reaches an IVR, it runs a tight loop that mirrors what a careful human caller would do.

Step one is to listen. The agent streams the call audio into a low-latency speech-to-text model and waits for the prompt to finish speaking. It does not interrupt: talking over the prompt or sending tones early is the single most common reason naive agents get routed to the wrong place.

Step two is to match. The transcript of the prompt is passed to a language model along with the goal of the call ("reschedule a dermatology appointment for May 14"). The model picks the option whose described outcome best fits the goal, not necessarily the literal keyword, since IVRs often phrase the same intent five different ways ("existing appointments," "manage your visit," "appointment services").

Step three is to act. The agent calls a dedicated DTMF tool to send the chosen digit, then pauses a beat before sending the next one if a sequence like an account number is required.

Step four is to advance. The agent listens for the next prompt and either repeats the loop at the next menu level, recognises that it has been put into a hold queue, or recognises that a human has picked up.

The whole loop runs in seconds, and across deep menus the agent will execute it five or six times before reaching a human or completing the task entirely inside the IVR. Each iteration is independent, so a misroute at level three does not corrupt levels one and two: the agent can detect the wrong department by listening to the next prompt and either backtrack or escalate.
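The listen, match, act, advance loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ClawCall's implementation: the `Event` labels, `navigate`, and `keyword_choice` are hypothetical names, the audio stream is replaced by a list of already-transcribed prompts, and a toy keyword matcher stands in for the language model.

```python
import re
from enum import Enum, auto
from typing import Callable, Iterable, Optional


class Event(Enum):
    MENU = auto()   # a menu prompt that expects a keypress
    HOLD = auto()   # hold music or a queue announcement
    HUMAN = auto()  # a live person has picked up


def keyword_choice(transcript: str, goal: str) -> str:
    """Toy stand-in for the language-model match step (assumption)."""
    goal_words = [w for w in goal.lower().split() if len(w) > 3]
    for digit, desc in re.findall(r"press (\d) for ([a-z ]+)", transcript.lower()):
        if any(word in desc for word in goal_words):
            return digit
    raise ValueError("no option matches the goal")


def navigate(
    prompts: Iterable[tuple[Event, str]],
    choose_digit: Callable[[str, str], str],
    goal: str,
) -> tuple[list[str], Optional[Event]]:
    """Run the loop until a human answers or the prompts run out.

    Each item in `prompts` is one fully finished prompt, so the
    "wait for the prompt to end" rule is already satisfied here.
    Returns the digits pressed and the last event seen.
    """
    pressed: list[str] = []
    last: Optional[Event] = None
    for event, transcript in prompts:            # listen
        last = event
        if event is Event.HUMAN:
            break                                # hand off to conversation
        if event is Event.HOLD:
            continue                             # wait silently
        digit = choose_digit(transcript, goal)   # match
        pressed.append(digit)                    # act (real code sends DTMF)
    return pressed, last                         # advance happens per iteration
```

The keyword matcher is deliberately naive: given "press 1 for new appointments, press 2 for existing appointments" it would pick the first option that mentions appointments, which is exactly why the matching step described above uses a language model that weighs the described outcome rather than a literal keyword.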

DTMF timing: the subtle problem that breaks most naive agents

DTMF stands for Dual-Tone Multi-Frequency, the two-tone audio signal generated when you press a key on a phone. Sending DTMF programmatically over a call sounds trivial: pick a digit, emit the tone, done. In practice, IVRs treat DTMF input with surprising strictness. Many enterprise IVRs disable barge-in, the ability to interrupt a prompt with input: if a tone arrives while the menu prompt is still playing, the IVR either ignores it, restarts the menu, or in the worst case interprets it as a request to skip to a default option. Other IVRs enforce an inter-digit timeout, the maximum gap allowed between digits in a sequence, that is shorter than the default digit pacing of most voice AI platforms, meaning a slowly paced account number can be split into two partial sequences that both fail validation. Digits fired faster than a human could press them fail in the opposite way: the IVR drops some of them. The Retell AI team has written publicly that getting this timing wrong sends agents to the wrong department roughly a third of the time on common enterprise IVRs, which matches what production logs consistently show. A well-built AI phone agent waits for prompt audio to fully drain, adds a configurable settle delay of a few hundred milliseconds, sends the first digit, and then paces subsequent digits at roughly human keypress cadence: fast enough not to time out, slow enough not to overflow the IVR's input buffer. This is unglamorous work, but it is the difference between a 95% success rate and a 65% success rate on real-world phone trees. The fix is also the kind of thing that gets quietly tuned over hundreds of calls against different vendors, because Cisco, Avaya, Genesys, and homegrown PBX systems each have their own quirks around DTMF acceptance windows.
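The pacing rule (wait for the prompt to end, add a settle delay, then space the digits evenly) can be expressed as a send schedule. The sketch below is illustrative only: `dtmf_schedule` is a hypothetical helper, and the 300 ms settle delay and 250 ms gap are assumed defaults that would need tuning per IVR vendor. Tone duration is ignored for simplicity.

```python
VALID_KEYS = set("0123456789*#")


def dtmf_schedule(
    digits: str,
    settle_ms: int = 300,  # assumed delay after the prompt finishes
    gap_ms: int = 250,     # assumed spacing between consecutive digits
) -> list[tuple[int, str]]:
    """Return (offset_ms, digit) pairs, measured from the end of the prompt."""
    if not digits or any(d not in VALID_KEYS for d in digits):
        raise ValueError(f"not a valid DTMF sequence: {digits!r}")
    return [(settle_ms + i * gap_ms, d) for i, d in enumerate(digits)]
```

With these defaults, an eight-digit sequence starts 300 ms after the prompt ends and the last digit goes out at 2,050 ms. Raising `gap_ms` protects against dropped digits but moves closer to the IVR's inter-digit timeout, which is the trade-off the tuning is about.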

What happens when the agent gets put on hold

Once the AI has navigated past the menu, it almost always lands in a hold queue. This is the part of the workflow that consumer users care about most — the whole reason "have an AI wait on hold for me" is a category at all. The agent's job here is to do nothing for a long time, intelligently. Concretely, it keeps the call open, lets the audio stream play silently in the background, and runs a lightweight classifier on the inbound audio to distinguish three states: hold music (continue waiting), recorded announcements (continue waiting, but note any timing information the IVR offers like "your estimated wait is 14 minutes"), and a live human voice (immediately transition into conversation mode). A well-built agent will also notice the failure cases: a hangup tone, a request to leave a voicemail, or a recorded message saying the office is closed for the day. ClawCall's voicemail behavior is instruction-controlled: when it detects a voicemail prompt, it leaves the approved message or reports the outcome rather than guessing about a message you did not approve. The waiting itself can be long. Real-world hold times for things like state DMV lines, the IRS, or insurance providers regularly exceed an hour. Because keeping an idle call open costs almost nothing on flat-rate pricing, hold-elimination has become the modal consumer use case for AI phone agents — the user goes back to work and gets pinged the moment a human is on the line.
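A minimal sketch of the decision the agent makes for each classified chunk of hold audio follows. It assumes an upstream classifier has already produced the label; the `Audio` labels, the action strings, and `hold_step` are hypothetical names chosen for illustration.

```python
import re
from enum import Enum
from typing import Optional


class Audio(Enum):
    HOLD_MUSIC = "hold_music"
    ANNOUNCEMENT = "announcement"
    HUMAN = "human"
    VOICEMAIL = "voicemail"
    HANGUP = "hangup"
    CLOSED = "closed"


# Pulls the number out of phrases like "your estimated wait is 14 minutes".
WAIT_RE = re.compile(r"estimated wait(?: time)? is (\d+) minutes?", re.IGNORECASE)


def hold_step(label: Audio, transcript: str = "") -> tuple[str, Optional[int]]:
    """Map one classified audio segment to (action, estimated_wait_minutes)."""
    if label is Audio.HUMAN:
        return "start_conversation", None
    if label is Audio.ANNOUNCEMENT:
        match = WAIT_RE.search(transcript)
        return "wait", int(match.group(1)) if match else None
    if label is Audio.HOLD_MUSIC:
        return "wait", None
    if label is Audio.VOICEMAIL:
        # Leave only a message the user approved, or report the outcome.
        return "follow_voicemail_instructions", None
    # Hangup tone or an after-hours message: nothing left to wait for.
    return "report_outcome", None
```

The point of keeping this step so small is cost: it runs on every audio segment for the whole hold, which can be an hour or more, so anything heavier than a label lookup and a regex belongs in the classifier, not here.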

The handoff: dropping the IVR state machine when a human answers

The single moment most teams underbuild is the transition from navigation to conversation. While the AI is in the phone tree, it is essentially running a state machine: matching prompts against options, tracking which level of the menu it is on, holding the original goal in working memory. The instant a human picks up, all of that has to go away. The human does not want to hear the agent finish reciting menu options, and they definitely do not want a confused agent that interprets their greeting as another IVR prompt. A clean handoff sounds like: "Hi, I'm calling on behalf of Sarah Mitchell about appointment confirmation 74821 — do you have a moment to help?" The agent has dropped the tree-navigation context, swapped in a fresh conversational prompt, and led with the identifying details the human needs to look up the account. The signal a well-built agent watches for is a short, conversational utterance — "hi, this is Maria, how can I help?" — that does not match the rhythm or vocabulary of a prerecorded prompt. When that signal arrives, any pending DTMF tool calls are suppressed and the conversation prompt takes over. There is also a disclosure question worth getting right here: if the human asks whether they are talking to a person, the honest answer is no. Being upfront about being an AI matters because some receptionists will hang up the moment they suspect a robot — and they have every right to. Disclosing before they have to ask builds the trust that gets the request actually completed.
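The handoff rule, where pending DTMF is suppressed and the navigation state is discarded once a human is detected, can be sketched as a small session object. `CallSession` and its fields are hypothetical, written only to show the shape of the transition.

```python
from dataclasses import dataclass, field


@dataclass
class CallSession:
    goal: str
    mode: str = "ivr"          # "ivr" while navigating, then "conversation"
    menu_level: int = 0
    pending_dtmf: list[str] = field(default_factory=list)

    def queue_digits(self, digits: str) -> bool:
        """Queue keypresses; refuse once a human is on the line."""
        if self.mode != "ivr":
            return False
        self.pending_dtmf.extend(digits)
        return True

    def on_human_detected(self) -> None:
        """Drop the IVR state machine so the conversation starts cleanly."""
        self.pending_dtmf.clear()   # suppress tones that have not been sent
        self.menu_level = 0
        self.mode = "conversation"  # the goal survives; the tree context does not
```

Note that `goal` is the only field left untouched by `on_human_detected`: the agent still needs to know why it called, but nothing about which menu level it was on.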

A worked example: rescheduling a dermatology appointment

Here is what the workflow looks like end-to-end with ClawCall handling a realistic task. The user opens the web app — or sends a message to its SMS interface, or fires a POST /call from their own AI agent through the REST API at api.clawcall.dev — and provides the dermatologist's phone number plus the goal: "Reschedule my May 7 appointment with Dr. Patel to any morning slot the week of May 19; my date of birth is March 3, 1991; confirmation number A4720." The system acquires a number from its outbound pool, dials, and within two seconds the IVR picks up. The agent transcribes "For existing appointments, press 2" and waits for the prompt to finish, then sends a single 2. The next prompt asks for date of birth — the agent sends 03031991 with a 250ms inter-digit pause so the eight digits do not get split into two failed sequences. The IVR says "Please hold for the next available representative," hold music starts, and the agent waits silently. Nine minutes in, a human picks up: "Patel's office, this is Maria." The agent drops the IVR state, leads with "Hi Maria, I'm an AI assistant calling on behalf of Sarah to reschedule appointment A4720 — Sarah is hoping to move it to the morning of May 19, 20, or 21." Maria offers May 20 at 9:15am, the agent confirms, and the call ends. The user sees the new appointment time, a full transcript, and a recording in the dashboard. Total user attention required: about 30 seconds at the start.
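The date-of-birth step in this example reduces to a small piece of arithmetic, sketched below. The helper names are hypothetical, and the MMDDYYYY layout is an assumption: the example's 03031991 reads the same in day-first order, so a real agent would take the expected format from the IVR prompt.

```python
from datetime import date


def dob_to_dtmf(dob: date) -> str:
    """Format a date of birth as eight keypad digits, MMDDYYYY (assumed layout)."""
    return f"{dob.month:02d}{dob.day:02d}{dob.year:04d}"


def keying_time_ms(digits: str, gap_ms: int = 250) -> int:
    """Time from the first digit to the last, ignoring tone duration."""
    return max(len(digits) - 1, 0) * gap_ms


digits = dob_to_dtmf(date(1991, 3, 3))
```

For March 3, 1991 this yields the eight digits from the example, and at a 250 ms pause the seven gaps between them add up to 1,750 ms of keying time.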

How ClawCall differs from the developer voice platforms in this space

If you have searched for AI phone agents recently, you have probably seen Retell AI, Bland, Vapi, Synthflow, Vocode, and Air.ai in the results. These are all real, capable platforms, but they are infrastructure to build voice products, not finished products you point at a phone number. Retell exposes a press_digit function and a thoughtful navigation model, and is genuinely strong if you are a contact-center team building your own IVR-replacement agent. Bland and Vapi let developers spin up custom voice agents priced per minute, which works well when an engineering team wants full control of the prompt, tools, and call-flow logic. Synthflow and Vocode target similar build-it-yourself audiences with different SDK trade-offs. Air.ai and Regal are oriented around outbound sales rather than consumer task-completion. On the consumer side, Jarvis.cx, CallFluent, HoldForMe.ai, ClawTalk, ClawdTalk, CallBuddy, PollyReach, Chirp AI, and AgentPhone overlap more directly: most solve the hold-time problem, and the differences are in pricing model, disclosure defaults, voicemail behaviour, and whether they expose a developer interface at all. Inbound-receptionist tools like Goodcall, Rosie, Numa, and Replicant solve a different problem entirely: they answer your business line rather than making outbound calls on your behalf. ClawCall is built for the reader who wants the call made without building anything: a free trial of 30 calls and 30 minutes, whichever lasts longer, with no credit card; flat $4.99/mo Unlimited or $8.99/mo Unlimited Reserve pricing instead of per-minute billing; hard rules that the agent always discloses it is an AI and leaves voicemail only when instructed; and a drop-in skill for Claude Code, Cursor, ClawHub, and OpenClaw that gives an AI agent a working phone number in seconds. For most readers of this article, that is the right fit.

When phone-tree navigation will fail (and what to do about it)

Even a well-built AI phone agent does not navigate every IVR successfully on every call. The honest failure modes are worth knowing. First, IVRs that require out-of-band verification (a passcode sent by text, a verification link emailed mid-call, or a CAPTCHA-style audio puzzle) cannot be completed by the agent alone. ClawCall handles this with the loop_in_user tool: when the agent hits a verification step it cannot solve, it patches the live call to the user's own phone, the user completes the verification, and the agent rejoins to finish the rest. Second, IVRs whose prompts are so badly mastered that even humans mishear them will sometimes route the agent to the wrong department. The fix is the same fix a human would use: hang up and try again, or escalate by asking the human who eventually answers to transfer the call internally. Third, some businesses block calls from VoIP numbers entirely. There is nothing the agent can do about this, because the call never connects in the first place, and the user will see a clear failure reason in the dashboard. Fourth, the deliberate constraints: the agent will not impersonate the user, will not pretend to be human, and will leave voicemail only when instructed. These are design choices, not bugs, and they exist because the alternative, an AI that lies about being a person and drops unsupervised messages, is the version of this technology that ends up regulated out of existence. The constraints are the reason an AI phone agent is safe to point at a doctor's office or a utility company without supervising every call.
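The first three failure modes map onto a simple recovery table, sketched here in Python. The `Failure` labels, the action strings, and the retry limit are hypothetical; only the loop_in_user name comes from the description above, and how that tool is actually invoked is not shown.

```python
from enum import Enum


class Failure(Enum):
    VERIFICATION_REQUIRED = "verification_required"  # passcode, emailed link, puzzle
    MISROUTED = "misrouted"                          # landed in the wrong department
    VOIP_BLOCKED = "voip_blocked"                    # call never connected


RECOVERY = {
    Failure.VERIFICATION_REQUIRED: "loop_in_user",
    Failure.MISROUTED: "retry_or_ask_for_transfer",
    Failure.VOIP_BLOCKED: "report_failure",
}


def recover(failure: Failure, attempts: int = 0, max_attempts: int = 2) -> str:
    """Pick a recovery action; stop retrying misroutes after max_attempts (assumed)."""
    if failure is Failure.MISROUTED and attempts >= max_attempts:
        return "report_failure"
    return RECOVERY[failure]
```

The fourth category, the deliberate constraints, is intentionally absent from the table: impersonation and undisclosed calls are not failures to recover from but actions the agent never takes.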

Frequently asked

How does an AI phone agent know which IVR option to press?
The agent transcribes the menu prompt with real-time speech-to-text, then passes the transcript to a language model along with the goal of the call. The model picks the option whose described outcome best matches the goal — not necessarily by literal keyword match, since IVRs phrase the same intent in many different ways ("billing," "account services," "payment options"). The chosen digit is then sent as a DTMF tone after the prompt finishes playing, with a small settle delay to avoid being ignored by the IVR's barge-in protection. The loop repeats at each menu level until the agent reaches a human, a hold queue, or completes the task inside the IVR.
Why do AI agents wait a moment before pressing a digit?
Many enterprise IVRs ignore DTMF input that arrives while the menu prompt is still playing, because barge-in is disabled on that prompt. Others enforce an inter-digit timeout shorter than the digit pacing most voice platforms default to, meaning a slowly paced sequence can be split into two partial sequences that both fail, while digits fired too quickly can be dropped. A well-built AI phone agent waits for the prompt audio to fully finish, adds a configurable settle delay of a few hundred milliseconds, sends the first digit, and paces subsequent digits at roughly human keypress cadence. Getting this timing right is the difference between a 95% routing success rate and roughly 65%, which is what naive timing achieves on common enterprise IVRs when about a third of calls land in the wrong department.
What happens if the AI gets put on hold for an hour?
It waits. The agent keeps the call open and runs a lightweight classifier on the inbound audio to distinguish three states: hold music (keep waiting), recorded announcements (keep waiting, note any wait-time info), and a live human voice (transition immediately into conversation mode). It also watches for failure cases like hangup tones, voicemail prompts, or after-hours messages. With flat-rate pricing like ClawCall's $4.99/mo Unlimited plan, long holds cost nothing extra — which is part of why hold-elimination has become the most popular consumer use case for AI phone agents. The user goes back to work and gets notified the moment a human is on the line.
Will the AI agent pretend to be me when the human answers?
ClawCall will not. The agent leads with a clean handoff like "Hi, I'm an AI assistant calling on behalf of Sarah about appointment 74821" — it identifies itself as an AI, names the person it is calling for, and states the purpose of the call. If the human directly asks whether it is a person, the agent always discloses that it is an AI. This is a hard rule, not a configurable behaviour. It exists because some receptionists will hang up the moment they suspect a robot — and they have every right to — and leading with honesty is what gets the request actually completed instead of stonewalled. Other platforms vary in their disclosure defaults; this is worth checking before pointing one at a sensitive call.
What if the IVR requires a verification code sent to my phone?
ClawCall handles this with a feature called loop_in_user. When the AI hits a step it cannot complete on its own — a one-time passcode sent by SMS to your phone, a verification link emailed mid-call, a security question only you know — the agent patches the live call to your own phone using a second outbound number from the pool. You join the call, complete the verification step in your own voice, and the agent rejoins or hands off to finish the rest of the task. Bridge calls consume two numbers from your concurrent-call quota, but the experience is seamless: the human on the other end stays on the line throughout, and the call continues without restarting from the top of the IVR.
How does this compare to Apple's Hold Assist or Google's Hold for Me features?
Those are built-in phone features that wait on hold for you and then alert you to take the call yourself once a human comes on the line. They do not navigate the IVR for you, they do not state the purpose of the call, and they do not negotiate any outcome; you still have to handle the entire conversation manually once a human picks up. An AI phone agent does the whole job: navigates the menu, waits on hold, talks to the human, and reports back with a transcript and recording. That is the right tool when the call has a concrete goal (reschedule, dispute, cancel, confirm) rather than just "I need to talk to someone eventually." For pure queue-skipping with no other work to do, the built-in features are fine.
