6 min read · Operator notes from the call floor · Last updated 11 May 2026

Last week we put three voice AI engines on a public page. Same agent, same script, same Michelle. Three different engines running her under the hood. The whole point: feel the difference, then pick the one that fits how your business actually runs.

If you have spoken to Michelle already, you know the demo. If you have not, the live version is at our voice demo page. Hit any of the three Talk to Michelle buttons and you will hear it.

Here is the operator take on which engine to pick for what kind of work. No bench-pressing the spec sheet. Just what we have seen across cold calls, inbound bookings, and recall outreach for NZ and Australian clients.

When does GPT 4.1 default make sense?

The default engine is the slowest of the three. Median time-to-first-audio is 1.3 seconds in the web demo. Add about 200 milliseconds for the carrier path on a real phone call.

Same reasoning model as 4.1 Fast, but the routing layer does not give it priority. Fine for low-volume FAQ flows, internal IVR, after-hours overflow. Anywhere snap matters less than cost-per-minute.

Not for cold calls. 1.3 seconds is fine on a booking, marginal when a stranger picked up out of curiosity and is deciding whether to stay on the line.

Three voice AI engines compared: GPT 4.1 default, GPT 4.1 Fast, GPT Realtime 1.5 — three soundwaves at different speeds

Why is GPT 4.1 Fast the outbound workhorse?

This is what we use most for outbound campaigns. Same model as the default, but it sits on the platform's priority routing layer. Median time-to-first-audio drops to 1.2 seconds. The last-turn measurement on the live demo is 1.0 second flat.

Why that matters on a cold call: the person who picked up did not initiate the call. They are deciding in the first three seconds whether to stay or hang up. A response inside one second feels like a person who was already there waiting. Above 1.5 seconds the conversation feels like a machine. Nielsen Norman Group's response-time research, which puts the limit for an uninterrupted flow of thought at about one second, backs the one-second mark.
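Those thresholds can be written down as a small lookup. This is a rough Python sketch, not something the platform exposes: the function name and band labels are ours, and the middle band (between 1.0 and 1.5 seconds) is an assumption borrowed from the "marginal" wording used for 4.1 default above. The inputs are the web-demo figures quoted so far.

```python
# Thresholds from the text: inside 1.0 s feels like a person,
# above 1.5 s feels machine. The middle band label is an assumption.
HUMAN_MAX_MS = 1000
MACHINE_MIN_MS = 1500


def cold_call_feel(ttfa_ms: int) -> str:
    """Classify a time-to-first-audio figure (milliseconds) for a cold call."""
    if ttfa_ms <= HUMAN_MAX_MS:
        return "feels present"
    if ttfa_ms <= MACHINE_MIN_MS:
        return "marginal"
    return "feels machine"


# Web-demo figures quoted in this post so far (milliseconds).
demo_figures = {
    "4.1 default, median": 1300,
    "4.1 Fast, median": 1200,
    "4.1 Fast, last turn": 1000,
}

for label, ms in demo_figures.items():
    print(f"{label}: {ms} ms -> {cold_call_feel(ms)}")
```

On these numbers only the 4.1 Fast last-turn figure lands in the top band. Both medians sit in the marginal zone, which is why the engine choice matters most on cold calls.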

The Sydney agent campaign generated 141 vendor leads from 1,997 conversations in 90 days. 7.1% warm-transfer-from-conversation. $32.74 AUD per warm-transferred seller. Most of those conversations ran on 4.1 Fast.
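For anyone checking the arithmetic, the campaign figures above work out as follows. The rate comes straight from the two reported counts. The total spend is only implied by multiplying the reported per-seller cost by the number of warm transfers; it is not a figure we reported, and what the per-seller cost covers is not broken down here.

```python
# Reported campaign figures from the paragraph above.
conversations = 1997
warm_transfers = 141
cost_per_transfer_aud = 32.74

# Share of conversations that ended in a warm transfer.
transfer_rate = warm_transfers / conversations

# Derived, not reported: total implied by cost-per-transfer x transfers.
implied_spend_aud = warm_transfers * cost_per_transfer_aud

print(f"warm-transfer rate: {transfer_rate:.1%}")          # 7.1%
print(f"implied spend: ${implied_spend_aud:,.2f} AUD")     # $4,616.34 AUD
```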

4.1 Fast does not try to be deep. It qualifies the caller against the script, books the meeting straight into your calendar, and hands off to a licensed broker, adviser, or salesperson the moment a question crosses the line.

Cold-call snap: the first-second window where the caller decides whether to stay on the line

What is the trade-off with GPT Realtime 1.5?

Realtime 1.5 is the most natural sounding of the three. The live demo hits 862 milliseconds time-to-first-audio on the last turn, with a median around 956 milliseconds. Sub-second on a typical turn.

But there is one trade-off to weigh before picking this engine: the voice library.

The OpenAI Realtime API ships around 10 voices in the library: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar. The vast majority are American-style. Ballad is the one British English voice. None are native Kiwi or Australian. Some sound good. Some feel slightly robotic. None give you the local accent your callers actually expect.

Realtime 1.5 is a speech-to-speech architecture: the model generates audio directly instead of routing through a separate text-to-speech layer. That makes it feel natural. It also means the text-to-speech control layer we have built is not in the conversation. So we cannot help it pronounce Te Reo place names, Aboriginal names, or quirky NZ and AU street names the way a native speaker would.

If your caller will hear "Whakatane" or "Mooloolaba" once a day, that matters. We have written about why a Kiwi accent moves caller acceptance. If your callers are CBD professionals talking about boardroom topics, accent matters less and conversational flow matters more.

Trade-off between native NZ and Aussie accent control and an international voice library

What is the function-call latency gotcha?

This is the one most people miss when they pick a voice model.

When the agent runs a function call (booking a meeting in your CRM, querying a policy admin system, extracting a quote spec from the conversation), GPT 4.1 default and 4.1 Fast both pause for a moment. The pause is small but it is there: the model is thinking about the function output, deciding what to say about it, then speaking.

You can see this in the live demo. Each card shows the latency turns at the bottom as coloured blocks. Green blocks are normal turns. Yellow blocks are turns where the agent ran a function call, and the height of the yellow block shows the latency penalty in milliseconds for that turn. The 4.1 default card has a yellow block mid-conversation. The 4.1 Fast card has one at the start. The Realtime 1.5 card has none.

Honest operator take from running the demo: the gap on a function call is shorter than the numbers suggest. I noticed it once on a 4.1 Fast turn where the agent had to look something up, but I was hyper-aware, actively listening for it. A normal caller almost certainly would not notice. The gap matters most in workflows where every call hits a function: live availability checks, payment processing, policy lookups. For FAQ flows, qualifying flows, and simple message capture, it is a non-issue.

Realtime 1.5 does not have this pause at all. Because the model is generating audio while reasoning rather than after, function calls happen in the background while the agent keeps the conversation moving. OpenAI's Realtime API documentation explains the speech-to-speech architecture in detail.
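To make the difference concrete, here is a toy Python sketch of the two turn shapes. It models the behaviour described above, not how either engine is actually built. The function names, the 50 millisecond lookup delay, and the filler line are all made up for illustration.

```python
import asyncio


async def lookup_availability(log: list) -> str:
    # Stand-in for a CRM or calendar call; the delay is illustrative.
    await asyncio.sleep(0.05)
    log.append("function result ready")
    return "Tuesday 10am"


async def speak(log: list, text: str) -> None:
    # Stand-in for audio output; it only records what was said.
    log.append(f"say: {text}")


async def pipeline_turn(log: list) -> None:
    # Text-LLM plus TTS shape: wait for the function result, then speak.
    slot = await lookup_availability(log)
    await speak(log, f"I have {slot} free.")


async def speech_to_speech_turn(log: list) -> None:
    # Speech-to-speech shape: the lookup runs in the background
    # while the agent keeps talking.
    task = asyncio.create_task(lookup_availability(log))
    await speak(log, "Let me check the calendar for you.")
    slot = await task
    await speak(log, f"I have {slot} free.")


if __name__ == "__main__":
    for turn in (pipeline_turn, speech_to_speech_turn):
        events: list = []
        asyncio.run(turn(events))
        print(turn.__name__, events)
```

In the first shape nothing is said until the lookup returns, which is the pause you hear. In the second the caller hears something straight away, and the result arrives mid-conversation.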

Function-call latency comparison: speech-to-speech architecture vs text-LLM plus TTS

Want to feel the difference yourself?

Three minutes on the live demo will tell you more than any spec sheet.

Talk to Michelle →

Which engine fits your call type?

The decision framework, written by the call floor:

Cold calling, outbound campaigns, NZ or AU accent matters: 4.1 Fast.

Inbound, conversational depth, accent does not matter to your callers, function calls are frequent: Realtime 1.5.

Internal IVR, low-volume FAQ, cost-sensitive flows: 4.1 default.

The clients we serve in NZ and Australia mostly land on 4.1 Fast because the accent layer matters. Real estate vendor outreach in Auckland or Sydney sounds wrong in an American voice. So does a building inspection booking in Christchurch.

For inbound healthcare answering where the call usually involves the agent booking into a practice management system, Realtime 1.5 can land better when accent matters less than smooth function-call handoffs. Same trade-off, opposite weighting.
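The framework above fits in a few lines of Python. Treat this as a sketch of the rule of thumb, not a product feature: the `CallProfile` fields, the `pick_engine` name, and the order in which the checks run are our assumptions for illustration. How often the agent hits a function call is a weighting in the text, and this sketch does not model it separately.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    outbound: bool               # cold calling or outbound campaign
    local_accent_matters: bool   # callers expect an NZ or AU voice
    low_volume_internal: bool = False  # internal IVR, low-volume FAQ, cost-sensitive


def pick_engine(profile: CallProfile) -> str:
    """Return the engine the decision framework points to for this call type."""
    if profile.low_volume_internal:
        return "4.1 default"
    if profile.outbound or profile.local_accent_matters:
        return "4.1 Fast"
    # Inbound, accent is not a factor: conversational flow wins.
    return "Realtime 1.5"


# Vendor outreach in Sydney: outbound, local accent matters.
print(pick_engine(CallProfile(outbound=True, local_accent_matters=True)))
# Inbound healthcare booking where accent matters less.
print(pick_engine(CallProfile(outbound=False, local_accent_matters=False)))
# After-hours internal IVR.
print(pick_engine(CallProfile(outbound=False, local_accent_matters=False,
                              low_volume_internal=True)))
```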

The Waboom live demo at /ai-voice-demo lets you try each one before you commit. Spend three minutes on it. The difference is obvious in the first five seconds. You can also see the full voices page for samples, and the pricing page for the per-minute economics across plans.

FAQ

What is the practical latency difference on a real phone call?

The web demo numbers (1.3 seconds median for 4.1 default, 1.0 second on the last turn for 4.1 Fast, 862 milliseconds on the last turn for Realtime 1.5) assume a clean browser-to-server connection. On a real PSTN phone call add roughly 200 milliseconds for the carrier path. The relative gap between engines stays the same. The absolute numbers move up a touch.
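As a quick worked example, here is that adjustment in Python. The 200 millisecond overhead is the rough figure from the answer above, applied as a flat constant, which is a simplification. The variable names are ours.

```python
CARRIER_OVERHEAD_MS = 200  # "roughly 200 milliseconds" per the text

# Web-demo figures quoted in this post (milliseconds).
web_demo_ms = {
    "4.1 default (median)": 1300,
    "4.1 Fast (last turn)": 1000,
    "Realtime 1.5 (last turn)": 862,
}

# Same flat overhead on every engine, so the gaps between them do not change.
phone_ms = {engine: ms + CARRIER_OVERHEAD_MS for engine, ms in web_demo_ms.items()}

for engine, ms in phone_ms.items():
    print(f"{engine}: about {ms} ms on a phone call")
```

That gives about 1,500, 1,200, and 1,062 milliseconds respectively, with the gaps between engines unchanged.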

Why does 4.1 Fast cost the same as 4.1 default?

Same underlying model, same per-minute price on Waboom. The Fast variant routes through a priority queue, which costs Waboom slightly more on the inference side, but the price on your invoice stays the same. We absorb the difference because the user experience is materially better for outbound.

Do you charge more for Realtime 1.5?

No. All three engines cost the same $0.80 per minute on the standard NZD and AUD plans. You pick the engine that fits the call, not the price. Full pricing at /ai-voice-agents/pricing.

Can I switch engines per campaign?

Yes. Each Waboom agent has an engine setting. You can run your inbound on Realtime 1.5 and your outbound on 4.1 Fast, on the same account, same number routing. The portal handles it.

Does the engine affect compliance?

No. NZ Privacy Act 2020 and Australian Privacy Principles handling is identical across engines. The platform layer (recording disclosure, zero-data-retention option, encrypted webhooks) does not change based on which model is running the conversation.

Which engine is Waboom on by default for new accounts?

It depends on the call type. 4.1 Fast for outbound. Realtime 1.5 for inbound where pronunciation does not matter. 4.1 default for low-volume internal flows. We tune the engine per agent during onboarding.

Pick the right engine for your call type.

Spend three minutes on the live demo, then book a 20-minute strategy call when you know which one fits your flow.

Book a Strategy Call →