When an AI voice agent navigates an IVR menu, it rings a supplier, council, or bank and works its way to a human. It listens to the recorded options, presses the right one, waits on hold, then talks to whoever picks up. Your team does not sit there.
Picture a builder ringing a merchant to chase a delivery. He gets a recorded menu, presses two, waits nine minutes, then a transfer drops him back to the start. That is twenty minutes gone before a word is exchanged. Our agent eats that wait so your people do not.
This is a how-it-works piece. It covers the outbound side, where the agent calls out and meets someone else's phone tree. If you want the bigger picture, start with our overview of what these agents do across inbound and outbound.
The agent meets someone else's phone tree, picks the right branch, and waits out the hold so your team does not.
What is an IVR, and why do AI agents have to deal with them?
An IVR is the recorded phone menu you hit when you ring a big organisation. Press one for accounts, press two for deliveries, please hold. AI agents meet these on outbound calls because the supplier, council, or bank on the other end almost always sits behind one.
The acronym stands for interactive voice response. You have heard a thousand of them. A robotic voice reads a list, you press a key or say a word, and the system routes you.
When our agent rings out on your behalf, it is the caller now. It does not control the menu. It has to listen, decide, and act exactly like a person would, just faster and without the sigh.
That is the whole job on these calls. Get past the machine, reach the human, handle the reason for the call. The agent treats the menu as an obstacle, not the destination.
How does an AI voice agent navigate someone else's phone menu?
The agent listens to the spoken menu in real time and matches what it hears to the goal of the call. It chooses the option that moves it forward. If the call is about a late delivery, it picks the deliveries branch. It does this by understanding meaning, not by following a fixed script.
Most older auto-diallers fail here. They were built to play a recording at whoever answers. They cannot cope with a menu that says press three for trade accounts, because they are not listening. They are talking.
Our agent works the other way. It hears the options, holds the call's purpose in mind, and maps one to the other. We dig into how that decision logic works in our piece on routing a call by what the caller means.
The agent decides based on what it hears, not on a button a human pre-programmed. So when the menu reads out five options, it picks the one that fits the task. That is the difference between a tool that breaks on the first menu and one that gets through.
The agent maps the call's purpose onto the menu it hears, then presses the branch that fits.
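As a rough illustration of mapping a call's purpose onto a heard menu, here is a minimal Python sketch. It assumes the menu has already been transcribed to text, and it scores options by simple keyword overlap. The names `parse_menu` and `choose_option`, the digit table, and the goal keywords are all hypothetical and are not a description of how our agent is built. A real agent works from meaning rather than counting shared words.

```python
import re
from typing import Dict, Optional, Set

# Spoken digits as they might appear in a transcript (assumption).
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Matches phrases like "press two for deliveries".
OPTION_PATTERN = re.compile(r"press (\w+) for ([^,.]+)", re.IGNORECASE)


def parse_menu(transcript: str) -> Dict[str, str]:
    """Return {key: label} for each 'press X for Y' phrase in the transcript."""
    options = {}
    for word, label in OPTION_PATTERN.findall(transcript):
        key = DIGIT_WORDS.get(word.lower(), word)
        options[key] = label.strip().lower()
    return options


def choose_option(options: Dict[str, str], goal_keywords: Set[str]) -> Optional[str]:
    """Pick the key whose label shares the most words with the call's goal."""
    best_key, best_score = None, 0
    for key, label in options.items():
        score = len(goal_keywords & set(label.split()))
        if score > best_score:
            best_key, best_score = key, score
    # None means nothing matched, so the task should go back to a person.
    return best_key


menu = "Press one for accounts, press two for deliveries, press three for trade accounts."
options = parse_menu(menu)
choice = choose_option(options, {"delivery", "deliveries", "late"})
print(choice)
```

For a late-delivery call, the sketch selects the deliveries branch. When no option shares a word with the goal it returns `None`, which is the point at which a task would be handed back rather than guessed at.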
How does it handle press-one menus and hold music?
For press-one menus the agent sends the keypad tone for the right option, the same beep your thumb makes. For hold music it simply waits, listening the whole time, ready the second a human says hello. It does not hang up, get bored, or wander off.
Hold is where people leak time. A nine-minute wait on a council line is normal. A staff member on thirty dollars an hour standing there costs roughly four dollars fifty in wages, plus the job they are not doing.
The agent waits at about eighty cents a minute, billed by the second. Nine minutes on hold is about seven dollars twenty in call cost, and zero minutes of your team's day. It listens through the music so it never misses the pickup.
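The arithmetic in this section can be checked with a few lines of Python. The figures are the ones quoted above (thirty dollars an hour for staff, about eighty cents a minute for the agent, billed by the second). The function names are illustrative only.

```python
def staff_hold_cost(minutes: float, hourly_wage: float) -> float:
    """Wages paid while a staff member waits on hold."""
    return round(minutes * hourly_wage / 60, 2)


def agent_hold_cost(seconds: int, rate_per_minute: float = 0.80) -> float:
    """Call cost for the agent, billed by the second."""
    return round(seconds * rate_per_minute / 60, 2)


wages = staff_hold_cost(9, 30.0)      # nine minutes at thirty dollars an hour
call_cost = agent_hold_cost(9 * 60)   # nine minutes at eighty cents a minute
print(wages, call_cost)
```

The wage figure covers only the minutes spent waiting. It leaves out the work that person is not doing in the meantime, which is the part the agent hands back to your team.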
The moment a human comes on, the agent switches from waiting to talking. It discloses it is an AI, states the reason for the call, and gets on with it. No awkward dead air while someone realises the line is live.
When the human picks up, the call becomes a normal conversation. If the matter needs a person on your side, the agent can hand the live call across with full context, so nobody repeats themselves.
Stop paying people to listen to hold music.
The dull supplier and council calls are exactly what our outbound voice agents are built to take off your desk.
What happens when the menu changes or loops?
When a menu changes or a transfer dumps the agent back to the start, it re-listens and re-chooses. It does not blindly repeat its last move. If it gets stuck in a loop or hits a dead end, it stops, logs what happened, and hands the task back to your team.
Phone trees are not stable. A council updates its menu. A bank adds a security step. A supplier reroutes trade calls to a new number at 5pm on a Friday.
A brittle system pressing the same key every time would loop forever. Our agent treats each menu fresh. It listens again, finds the branch that fits, and presses accordingly.
There is a hard limit on patience by design. If the agent cycles through the same menu twice with no progress, it gives up gracefully. It does not burn ten dollars looping. It marks the call for a human and moves on.
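A minimal Python sketch of that patience limit, assuming each menu the agent hears can be reduced to a transcript string. The class name `LoopGuard` and the `max_cycles` parameter are hypothetical. The only detail taken from the text is that the agent stops once the same menu has come around twice with no progress. Here a cycle is read as a return to a menu already heard, which is one possible interpretation.

```python
from collections import Counter


class LoopGuard:
    """Tracks how often each menu is heard and signals when to give up."""

    def __init__(self, max_cycles: int = 2):
        self.max_cycles = max_cycles
        self.heard = Counter()

    def should_give_up(self, menu_transcript: str) -> bool:
        # Normalise case and spacing so small differences do not hide a repeat.
        key = " ".join(menu_transcript.lower().split())
        self.heard[key] += 1
        # The first hearing is not a cycle; each return to the same menu is one.
        cycles = self.heard[key] - 1
        return cycles >= self.max_cycles


guard = LoopGuard()
main_menu = "Press one for accounts, press two for deliveries."
results = [guard.should_give_up(main_menu) for _ in range(3)]
print(results)
```

A menu the guard has not heard before counts as progress in this sketch, so it only trips when the call keeps landing on the same prompt.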
When should it use keypad tones instead of speaking?
The agent uses keypad tones when the menu asks for a number, an account, or a press-one choice. It speaks when the menu is built to take voice commands, like say accounts or say the name of the person you want. It reads which mode the menu expects and matches it.
Old menus want presses. Press one, enter your nine-digit account number, press hash. The agent sends clean tones for those, which is more reliable than speaking digits down a noisy line.
Newer menus say tell me why you are calling. There the agent talks, in plain words, because that is what the system listens for. Picking the wrong mode is how calls stall, so the agent reads the cue first.
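One way to picture reading the cue is a small classifier over the prompt text. The Python sketch below is an illustration under loud assumptions: the cue phrases and the function name `expected_input_mode` are made up for this example, and a real agent would rely on understanding the prompt rather than on a phrase list.

```python
import re

# Hypothetical cue phrases; real menus vary widely.
TONE_CUES = re.compile(r"\b(press|enter|key in)\b", re.IGNORECASE)
VOICE_CUES = re.compile(r"\b(say|tell me|speak)\b", re.IGNORECASE)


def expected_input_mode(prompt: str) -> str:
    """Return 'tones', 'voice', or 'unclear' for a menu prompt."""
    wants_tones = bool(TONE_CUES.search(prompt))
    wants_voice = bool(VOICE_CUES.search(prompt))
    if wants_tones and not wants_voice:
        return "tones"
    if wants_voice and not wants_tones:
        return "voice"
    # Mixed or missing cues: keep listening rather than guess.
    return "unclear"


print(expected_input_mode("Press one, enter your account number, press hash"))
print(expected_input_mode("Tell me why you are calling"))
```

Returning "unclear" for mixed or missing cues reflects the point above: picking the wrong mode is how calls stall, so a cautious design waits for a clearer prompt.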
Whichever mode it uses, the deciding factor is the first half second after the line opens, not the cost. On any live call the make-or-break moment is the instant a human says hello. A sub-second response at that point is what keeps the conversation natural.
Tones for press-one menus, plain words for voice menus: the agent reads the cue and picks the mode.
What kinds of calls is this useful for in NZ and AU?
This helps on any routine outbound call where a New Zealand or Australian business waits on hold to reach a supplier, council, utility, or bank. Chasing a delivery. Confirming a balance. Checking a consent. Following up an invoice. The dull, hold-heavy calls nobody on the team wants.
Think of a property manager ringing the council to check a building consent status. Or an accounts clerk calling three suppliers to chase missing invoices. Each call is two minutes of talking wrapped in fifteen minutes of menus and hold.
A 200-dial outbound campaign of these runs about one hundred dollars in call cost. Compare that to a part-time staffer working a phone all morning. They cost twenty-eight to thirty-five dollars an hour, before KiwiSaver or super, ACC, and holiday pay.
The agent shines on volume and tedium. We have seen this pattern on the sales side too. A Sydney sales agent produced 141 vendor leads in 90 days at thirty-two dollars seventy-four per seller. The utility-call use is the same engine pointed at admin instead of selling. For the full picture of where these agents fit, see our main voice agents page.
What are the limits, and when does it hand back to a person?
The agent hands back to a person when a call needs a judgement call, a negotiation, or a decision it is not authorised to make. It is built for getting through menus and handling defined tasks. It will not argue a disputed bill or agree new terms. When it hits that edge, it stops and routes the job to your team.
These agents are not a replacement for human judgement on the hard calls. They are brilliant at the repetitive 80 percent and they know to fold on the tricky 20. We are honest about that line in our piece on why voice agents still need people.
There are practical limits too. Some menus demand information the agent was not given, like a one-time code texted to a person's phone. Some lines route to a queue with no end. The agent logs these, flags them, and a person finishes the job.
On data, here is the honest split. Your call records, transcripts, and structured notes sit on our Sydney servers. The live audio is processed offshore while the call is happening. We do not pretend every byte stays onshore, and the agent discloses it is an AI on every call it makes.
Hand over the calls your team dreads.
See how our AI voice agents ring out, work the menu, and reach a human while your people do real work.
Frequently Asked Questions
Can an AI voice agent really press the right menu option on its own?
Yes. The agent listens to the spoken menu, matches the option to the purpose of the call, then sends the keypad tone or the spoken command the menu expects. It is choosing based on what it hears in the moment, not following a fixed sequence a person set up in advance.
How much does it cost to have an agent wait on hold?
Hold time bills at about eighty cents a minute in NZD or AUD, by the second. A nine-minute wait is roughly seven dollars twenty in call cost. That replaces a staff member at twenty-eight to thirty-five dollars an hour standing by the phone doing nothing else.
Does the agent tell the other side it is an AI?
Yes. On every outbound call it makes, the agent discloses that the person is speaking with an AI. That holds whether it is navigating a menu or talking to a human at a supplier, council, or bank. Disclosure is built into the call, not optional.
What happens if the menu sends it in a loop?
The agent listens again each time the menu changes and re-picks the right branch. If it cycles through the same menu twice with no progress, it stops by design rather than looping forever. It logs the call, flags it, and hands the task back to a person on your team to finish.
Is this the same as the agent answering my inbound calls?
No. This piece is about outbound calls, where the agent rings out and meets someone else's phone tree. Answering your own callers is a separate setup. Our inbound answering service handles incoming calls, while this navigation skill is about reaching a human at the other end of a supplier or council line.
Where does my call data end up?
Your structured call records, transcripts, and notes sit on our Sydney servers. The live audio is processed offshore while the call is live. We are upfront about that split rather than claiming everything stays onshore, and every call the agent makes opens with an AI disclosure.
Leonardo Garcia-Curtis
Founder & CEO at Waboom AI. Building voice AI agents that convert.