A caller rings your tradie business at 6pm on a Friday. They speak. Less than a second later a calm voice answers, books the job, and texts you the details. That first second is the whole game. This is how an AI voice agent works on a real call, step by step, from the audio that hits the line to the spoken word that comes back.

Three-stage AI voice agent pipeline turning caller audio into a spoken reply in under one second

The whole call lives inside a sub-second loop: hear, think, speak.

What actually happens in the second after a caller speaks?

In the second after a caller speaks, the agent turns their voice into text, decides what to say, and turns that decision back into speech. All three happen in a pipeline, often before the caller has finished their next breath. The target is sub-second from end of speech to first spoken word.

Picture the clinic phone at 8.55am. A patient asks to move their appointment. The agent has to hear "move my appointment", understand it, check the calendar, and start talking back. If that takes three seconds, the patient thinks the line dropped. If it takes 800 milliseconds, the call feels human.

We disclose on every call that the caller is speaking with an AI. That honesty does not slow the call down. The speed is decided by the pipeline, not by the greeting. For the plain-English version of what this thing even is, read our breakdown of what an AI voice agent actually is. This piece is the mechanism.

What are the steps in the voice pipeline from audio to spoken reply?

The pipeline has three stages. First, speech-to-text converts the caller's audio into words. Second, the language model reads those words plus the conversation so far and writes a reply. Third, text-to-speech turns that reply into a natural voice. Each stage adds a few hundred milliseconds at most.

The trick is that the stages overlap. The agent does not wait for the caller to stop, then transcribe everything, then think, then speak. It streams. Words flow into the model as they arrive. The reply starts generating before the full thought is written.

We build the agent to start speaking on the first chunk of its reply, not the last. The caller hears the opening word while the rest is still being produced. That is how we hold the budget under a second. If you want to see how this powers a working line, look at what our voice agents do on a live call.
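A minimal sketch of why speaking on the first chunk matters. It assumes every reply chunk takes a made-up 120 ms to generate; the function names and the timing constant are illustrative, not measurements from a live line. The only point is the gap between waiting for the whole reply and starting on the first piece.

```python
from typing import Iterator, List, Tuple

# Assumption: every reply chunk takes the same time to generate.
# 120 ms is an illustrative figure, not a measured one.
CHUNK_COST_MS = 120


def generate_reply(chunks: List[str]) -> Iterator[Tuple[int, str]]:
    """Stand-in for a streaming language model: yields (elapsed_ms, chunk)."""
    elapsed = 0
    for chunk in chunks:
        elapsed += CHUNK_COST_MS
        yield elapsed, chunk


def first_audio_batch_ms(chunks: List[str]) -> int:
    """Wait for the full reply before speaking: audio starts after the last chunk."""
    return list(generate_reply(chunks))[-1][0]


def first_audio_streaming_ms(chunks: List[str]) -> int:
    """Speak on the first chunk: audio starts as soon as one chunk exists."""
    elapsed, _ = next(generate_reply(chunks))
    return elapsed


reply = ["Sure,", "I can", "move that", "to Tuesday."]
print(first_audio_batch_ms(reply))      # 480
print(first_audio_streaming_ms(reply))  # 120
```

In this toy model the streaming open stays at one chunk's cost no matter how long the reply runs, while the batch open grows with every extra chunk.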

Latency budget bar showing speech-to-text, language model and text-to-speech stages summing under one second

Every stage gets a slice of the budget, and the open is the slice that counts.
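The budget can be pictured as simple addition. The slices below are example values chosen to land on 800 ms, not measured figures, and the `network` entry is a hypothetical extra delay from a poor mobile link. A rough sketch:

```python
from typing import Dict

BUDGET_MS = 1000  # the sub-second target from end of speech to first spoken word

# Assumption: example slices only. Real numbers vary by call and connection.
stages: Dict[str, int] = {
    "speech_to_text": 200,
    "language_model": 350,
    "text_to_speech": 250,
}


def headroom_ms(slices: Dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return how many milliseconds are left once every stage has taken its slice."""
    return budget_ms - sum(slices.values())


print(sum(stages.values()))  # 800
print(headroom_ms(stages))   # 200

# Assumption: a patchy mobile link adds a delay the pipeline cannot remove.
with_network = {**stages, "network": 250}
print(headroom_ms(with_network))  # -50, over budget
```

The negative headroom in the last line is the reason every spare millisecond inside the pipeline matters: it is the only part of the budget we control.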

Why does first-token latency decide whether a cold call survives?

First-token latency is the gap between the caller finishing and the agent starting to speak. On a cold call it is make or break. A seller who picks up an unknown number gives you about two seconds of patience. Dead air past that, and they hang up before you say a word.

We saw this in a Sydney sales campaign. The agent produced 141 vendor leads in 90 days at $32.74 per seller. None of that happens if the open feels robotic. The first word has to land fast and sound warm.

This is why we do not over-index on per-minute cost when we explain how it works. Yes, calls run about 80 cents a minute in NZD or AUD, billed by the second. An answered call averages around 30 seconds. But the money is not the mechanism. The mechanism is the open.

For the deep version, we wrote a full guide to keeping voice AI fast under load. We also wrote a separate piece on how we hunt down where the delay hides. Latency is not one number. It is a stack of small delays, each found and trimmed on its own.

Want to hear the open for yourself?

The first word is the thing you cannot fake on a cold call. Listen to how our voice agents answer in under a second and judge the speed on your own ear.

How does the agent know when the caller has finished talking?

The agent listens for the end of a turn. It watches the audio for a pause, the falling tone of a finished sentence, and the shape of the words. When it is confident the caller has stopped, it starts its reply. Too eager and it talks over them. Too slow and the call drags.

This is harder than it sounds. People pause mid-sentence. A builder reading a job number off a docket will stop for two seconds, then keep going. If the agent jumps in there, it sounds rude. We tune the endpointing so the agent waits through natural pauses but never leaves real dead air.

We test this against real recordings, not clean studio audio. The goal is an agent that feels like it is actually listening, not counting down a timer. Knowing what answer to expect next helps here too, because the agent reads the turn in context.
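The crudest baseline for endpointing is a plain silence timer, and it shows why tuning matters. This sketch assumes the audio is already split into 20 ms frames flagged as speech or silence. It ignores tone and wording, which a real agent also weighs. The function name, frame size, and thresholds are all illustrative.

```python
from typing import Optional, Sequence


def end_of_turn_ms(
    speech_frames: Sequence[bool],
    frame_ms: int = 20,
    min_silence_ms: int = 700,
) -> Optional[int]:
    """Return the time (ms) at which the turn is judged finished, or None."""
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run += frame_ms
            if silent_run >= min_silence_ms:
                return (i + 1) * frame_ms
    return None


# 500 ms of speech, a 400 ms mid-sentence pause, 500 ms more speech, then silence.
frames = [True] * 25 + [False] * 20 + [True] * 25 + [False] * 40

print(end_of_turn_ms(frames, min_silence_ms=300))  # 800: cuts in during the pause
print(end_of_turn_ms(frames, min_silence_ms=700))  # 2100: waits for the real end
```

A fixed threshold is always a trade: 300 ms talks over the caller, 700 ms adds delay to every turn. Reading the turn in context is what lets the wait be short when the sentence is clearly finished and longer when a number is half read out.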

What happens when the caller interrupts the agent mid-sentence?

The agent stops talking the instant the caller speaks. This is called barge-in. The caller says "actually, make it Tuesday" halfway through the agent's sentence, and the agent shuts up and listens. No talking over the top. No waiting for the agent to finish its scripted line.

This matters more than people expect. Humans interrupt constantly. A receptionist who kept reading her script while you tried to correct her would drive you spare. So the agent has to detect the new audio, cut its own speech, throw away the half-spoken reply, and re-plan from what you just said.

We wrote a whole piece on handling interruptions cleanly because getting it wrong is the fastest way to sound fake. An agent that yields the moment you speak feels far more human than one that bulldozes through.
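The detect, cut, and discard steps can be sketched as a small state machine. This is a simplified illustration, not our implementation: the `Speaker` class and its `tick` method are hypothetical names, and re-planning from the caller's new words is left out.

```python
from typing import List


class Speaker:
    """Plays a reply chunk by chunk and yields the floor on barge-in."""

    def __init__(self, reply_chunks: List[str]) -> None:
        self.pending = list(reply_chunks)  # not yet spoken
        self.spoken: List[str] = []        # already played to the caller

    def tick(self, caller_is_speaking: bool) -> str:
        """Advance one step and return the agent's state after it."""
        if caller_is_speaking:
            self.pending.clear()  # throw away the half-spoken reply
            return "listening"
        if self.pending:
            self.spoken.append(self.pending.pop(0))
            return "speaking"
        return "idle"


agent = Speaker(["I have you", "booked for", "Monday at", "nine."])
states = [agent.tick(caller_is_speaking=flag) for flag in (False, False, True, False)]
print(states)        # ['speaking', 'speaking', 'listening', 'idle']
print(agent.spoken)  # ['I have you', 'booked for']
```

The key design choice is that the unspoken chunks are dropped, not paused. After "actually, make it Tuesday", finishing the old sentence would be wrong, so the next reply has to be built from scratch.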

Timeline showing a caller interrupting an AI voice agent mid-sentence and the agent stopping instantly to listen

Barge-in: the agent cuts its own speech the moment you talk over it.

How does the agent stay accurate and avoid making things up?

The agent stays accurate because we ground it in your real data and we constrain what it can say. It does not guess your opening hours or invent a price. It reads from the facts you give it. When it does not know, it says so and offers to take a message or book a callback.

This is the difference between a demo toy and a tool you put on your main line. A made-up answer on a real call costs you a customer and your reputation. We pull from your booking system, your service list, and your rules, so the answer is your answer.
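In its simplest form, grounding means the agent answers only from facts it was handed and admits the gap otherwise. The sketch below uses made-up business facts and a hypothetical `grounded_answer` function. A real setup reads from a booking system, not a hard-coded dictionary.

```python
from typing import Dict

FALLBACK = "I don't have that in front of me. Can I take a message or book a callback?"

# Assumption: invented example facts, standing in for your real data.
facts: Dict[str, str] = {
    "opening_hours": "We are open 7am to 5pm, Monday to Friday.",
    "callout_fee": "The callout fee is $90.",
}


def grounded_answer(topic: str, known: Dict[str, str]) -> str:
    """Answer only from supplied facts; otherwise admit the gap and offer a next step."""
    return known.get(topic, FALLBACK)


print(grounded_answer("opening_hours", facts))
print(grounded_answer("weekend_rates", facts))  # no fact supplied, so the fallback
```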

We wrote a dedicated guide on keeping the agent from inventing answers because this is the question every serious buyer asks. On security and data, here is the honest split. Your portal, transcripts, and structured call records sit on our Sydney servers. Live audio is processed offshore while the call is happening. We do not pretend everything stays in one country.

Why does a demo sound perfect but a real mobile call is harder?

A demo runs on clean audio over a stable connection. A real mobile call runs over a patchy network, with wind, road noise, and a caller in a ute on the motorway. The pipeline is the same, but the input is messier and the connection adds delay you cannot remove.

This is the honest part most vendors skip. A polished demo proves the agent can talk. It does not prove the agent can hear a frustrated customer through a bad signal at 5pm. We test on the hard calls because that is where the money is. A Christchurch developer booked viewings at $7.12 each on real calls, not demo calls.

The fix is not magic. It is tighter endpointing, better noise handling, and shaving every spare millisecond out of the open so the delay a bad network adds still fits inside the budget. We also make the agent sound human under pressure. The work is in the messy calls, not the showroom.

What does a 200-dial campaign actually cost?

Outbound is cheaper than people guess. A 200-dial campaign runs about $100 NZD. Connect rate sits between 47 and 65 percent. Around 20 to 25 percent of those dials turn into a real conversation longer than a minute. The rest are voicemails and no-answers, billed by the second so you pay almost nothing for them.
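The arithmetic behind those figures is easy to check. This sketch uses only the rate and totals quoted in this article (about 80 cents a minute, a 30-second answered call, $100 for 200 dials). The function names are illustrative, and the per-dial average is derived from those numbers, not a separately reported statistic.

```python
RATE_CENTS_PER_MINUTE = 80  # about 80 cents a minute, billed by the second


def call_cost_cents(seconds: float) -> float:
    """Cost of one call under per-second billing."""
    return seconds * RATE_CENTS_PER_MINUTE / 60


def billed_seconds_for(total_cents: float) -> float:
    """Talk time that a given spend buys at the per-minute rate."""
    return total_cents * 60 / RATE_CENTS_PER_MINUTE


print(call_cost_cents(30))                     # 40.0 cents for a 30-second call

total_seconds = billed_seconds_for(100 * 100)  # a $100 campaign, in cents
print(total_seconds / 60)                      # 125.0 billed minutes
print(total_seconds / 200)                     # 37.5 seconds per dial on average
```

So $100 across 200 dials works out to about 125 billed minutes in total, with the short voicemails and no-answers pulling the average down and the real conversations pulling it up.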

Compare that to a part-time receptionist at $28 to $35 an hour before KiwiSaver, ACC, and holiday pay. The agent does not get tired at dial 180. It opens call 200 with the same warm first word it used on call one. That consistency is the real product.

Ready to hear it on your own line?

See how our AI voice agents handle the open, the interruption, and the booking on a real call. Book a demo and listen to the first word.

Frequently Asked Questions

How fast does the agent reply after I stop talking?

The target is under one second from when you stop speaking to the agent's first word. On a good connection we hold the open near 800 milliseconds. That speed makes the call feel human rather than robotic, especially on a cold outbound call where a stranger's patience runs out in about two seconds.

Can the agent be interrupted like a real person?

Yes. The agent stops talking the instant you speak over it, throws away its half-finished reply, and listens to what you just said. This is barge-in. It is one of the biggest tells between an agent that feels human and one that bulldozes through its script while you try to correct it.

Will the agent make up answers it does not know?

No, if it is set up properly. We ground the agent in your real booking system, hours, and prices, and we constrain what it can say. When it does not know, it says so and offers to take a message or book a callback rather than guessing and costing you a customer.

How much does a call actually cost?

Calls run about 80 cents a minute in NZD or AUD, billed by the second. An answered call averages around 30 seconds, so about 40 cents. A one to two minute booking costs roughly 80 cents to $1.60. A 200-dial outbound campaign comes in around $100 NZD total.

Where does my call data live?

Your portal, transcripts, and structured call records sit on our Sydney servers. The live audio is processed offshore while the call is happening. We are honest about that split rather than claiming everything stays in one country, because you should know exactly where your records sit.

Does the caller know they are talking to an AI?

Yes. We disclose on every single call that the caller is speaking with an AI. It is the right thing to do and it does not slow the call down or hurt the booking rate. The greeting is honest, the speed comes from the pipeline, and callers book anyway when the agent is fast and accurate.
