We deployed a voice agent for a logistics company in Palmerston North. 200+ calls a day, booking courier pickups. Launch day went perfectly. Week one, no issues.
Week three, their pickup completion rate dropped from 78% to 61%. Nobody complained. The dashboard looked fine.
Average call duration, normal. Transfer rate, normal.
The problem? A prompt change we'd made to handle a new service zone introduced a subtle redirect. Callers asking about rural deliveries got looped back to the main menu.
They weren't hanging up angry. They were just quietly giving up.
We only caught it because a human reviewed the transcripts. No automated metric flagged it. Your agent won't tell you it's broken. You have to watch.
Why "Set and Forget" Fails
You've heard this pitch: "Deploy your AI agent and let it run." It's tempting. And it works — until it doesn't.
Voice agents degrade in ways no dashboard catches. A new competitor name confuses your intent detection. A seasonal product change makes your knowledge base answers wrong. A caller accent your agent handled fine in testing fails at scale.
Scale doesn't forgive sloppiness. It amplifies it. A small issue repeated across 500 calls a day becomes your brand's reputation problem by Friday.

Smarter every week — if humans watch.
What Your Dashboard Shows (and What It Misses)
Retell's analytics give you the basics: call duration, transfer rate, sentiment. These matter. Track them from day one.
But dashboards show you averages. And averages lie.
Your P50 call duration looks great at 2 minutes. Your P90 is 8 minutes. 10% of your callers are trapped in loops, repeating themselves, getting nowhere. The dashboard says "all clear." The callers say something very different.
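You can see this yourself with a few lines of script. A minimal sketch, assuming you can export per-call durations in seconds from your call logs; the numbers below are illustrative:

```python
def duration_percentiles(durations_sec: list[float]) -> dict[str, float]:
    """Return P50 and P90 call durations using a simple nearest-rank percentile."""
    ordered = sorted(durations_sec)

    def pct(p: float) -> float:
        idx = max(0, min(len(ordered) - 1, round(p * (len(ordered) - 1))))
        return ordered[idx]

    return {"p50": pct(0.50), "p90": pct(0.90)}

# Example week: the median looks healthy, the tail is trapped in loops.
durations = [110, 115, 118, 120, 122, 125, 130, 140, 470, 490]  # seconds
print(duration_percentiles(durations))  # p50 ≈ 2 minutes, p90 ≈ 8 minutes
```

If your P90 sits several multiples above your P50, go read those long-call transcripts before you trust anything else on the dashboard.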
The Metrics That Actually Predict Failure
We've managed 50+ live voice agents. These are the signals we watch that most teams ignore:
Repeat callers within 24 hours. If the same number calls back within a day, your agent failed on the first attempt. Track this. We've seen agents with 85% resolution rates that were actually 70% — because 15% of callers just tried again.
Conversation depth vs outcome. A caller who reaches turn 12 of your conversation flow and then transfers to a human didn't have a good experience. Long conversations with negative outcomes are your worst-case scenario.
Silence gaps exceeding 3 seconds. Your caller said something your agent didn't understand. It paused. Your caller repeated themselves.
These gaps kill trust — and they don't show up in latency metrics because your LLM responded. It just responded with confusion.
Knowledge base miss rate. How often does your agent retrieve irrelevant chunks from your knowledge base? A 20% miss rate means 1 in 5 answers is wrong. Your callers notice before you do.
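Two of these signals are easy to script once you can export call records. A minimal sketch, assuming each record carries the caller's number, a start time, and per-turn timestamps; the field names are ours, not Retell's export format:

```python
from datetime import datetime, timedelta

def repeat_callers_within_24h(calls: list[dict]) -> set[str]:
    """Flag numbers that called back within 24 hours of a previous call.

    Each call dict is assumed to have 'from_number' and 'start_time' (datetime).
    """
    flagged: set[str] = set()
    history: dict[str, list[datetime]] = {}
    for call in sorted(calls, key=lambda c: c["start_time"]):
        number = call["from_number"]
        previous = history.setdefault(number, [])
        if previous and call["start_time"] - previous[-1] <= timedelta(hours=24):
            flagged.add(number)
        previous.append(call["start_time"])
    return flagged

def silence_gaps(turns: list[dict], threshold_sec: float = 3.0) -> list[tuple[float, float]]:
    """Return (gap_start, gap_length) pairs where the caller was left hanging.

    Each turn dict is assumed to have 'start' and 'end' offsets in seconds.
    """
    gaps = []
    for prev, nxt in zip(turns, turns[1:]):
        gap = nxt["start"] - prev["end"]
        if gap > threshold_sec:
            gaps.append((prev["end"], gap))
    return gaps
```

Run both over a week of calls and you have two numbers your dashboard never shows you.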
The Weekly Optimisation Loop
Every agent we manage at Waboom AI goes through a weekly cycle:
Monday: Data Review. Pull the week's numbers. Flag anomalies — duration spikes, transfer rate changes, sentiment drops, repeat callers. Compare against your baseline from the previous 4 weeks.
Tuesday-Wednesday: Transcript Review. Read the worst 10% of calls. Not the summaries — the actual transcripts. Find the exact turn where your conversation broke.
Was it a prompt issue? A knowledge base gap? A missing conversation path?
Thursday: Fix and Test. Make surgical changes. Not rewrites. Adjust the specific node, prompt, or condition that caused the failure. Run batch simulation tests to verify your fix doesn't break other paths.
Friday: Deploy and Report. Push the changes to production. Send your client a report showing what changed, why, and the expected impact.
This loop runs every week. Not monthly. Not quarterly. Weekly.
Your agent improves 15% per quarter when you maintain this rhythm.
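The Monday check doesn't need anything fancy. A minimal sketch of the baseline comparison, assuming you can pull weekly aggregates into simple dicts; the metric names and the 15% tolerance are illustrative:

```python
def flag_anomalies(
    this_week: dict[str, float],
    previous_weeks: list[dict[str, float]],
    tolerance: float = 0.15,  # flag anything that moves more than 15% off baseline
) -> dict[str, str]:
    """Compare this week's metrics against a 4-week baseline and return what drifted."""
    flags = {}
    for metric, value in this_week.items():
        history = [week[metric] for week in previous_weeks if metric in week]
        if not history:
            continue
        baseline = sum(history) / len(history)
        if baseline == 0:
            continue
        drift = (value - baseline) / baseline
        if abs(drift) > tolerance:
            flags[metric] = f"{drift:+.0%} vs 4-week baseline"
    return flags

# Example: completion rate quietly sliding while duration and transfers look normal.
baseline_weeks = [
    {"completion_rate": 0.78, "transfer_rate": 0.12, "avg_duration_sec": 130},
    {"completion_rate": 0.77, "transfer_rate": 0.11, "avg_duration_sec": 128},
    {"completion_rate": 0.79, "transfer_rate": 0.12, "avg_duration_sec": 132},
    {"completion_rate": 0.78, "transfer_rate": 0.13, "avg_duration_sec": 129},
]
this_week = {"completion_rate": 0.61, "transfer_rate": 0.12, "avg_duration_sec": 131}
print(flag_anomalies(this_week, baseline_weeks))
# {'completion_rate': '-22% vs 4-week baseline'}
```

That quiet slide in completion rate is exactly what happened to the Palmerston North agent. The averages stayed flat; a baseline comparison would have flagged it on Monday.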
Surgical Optimisation: The Examples
Here's what a typical fix looks like. Your agent's greeting was:
Before: "Thank you for calling. I understand you may have a question about your account. I'd be happy to help you with that today. Could you please tell me what you're calling about so I can direct you to the right information?"
After: "Hi, this is the account team. What can I help with?"
Same intent. 80% fewer tokens. Your caller gets to the point 4 seconds faster.
Another example. Your agent kept transferring callers who asked about refunds — even when the answer was in your knowledge base.
The issue? The transfer trigger was too broad. "Caller mentions money" caught refund queries alongside legitimate payment questions.
Fix: narrow the transfer condition to "caller requests a refund AND the refund amount exceeds $500." Everything under that threshold, your agent handles.
Transfer rate dropped 23% in one week.
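In code terms, the change looks something like this. A sketch, not the actual flow config: the function names and the money-word list are hypothetical, and the threshold mirrors the refund rule above:

```python
def should_transfer_broad(utterance: str) -> bool:
    """Before: any mention of money sends the caller to a human."""
    money_words = ("refund", "payment", "charge", "invoice", "price")
    return any(word in utterance.lower() for word in money_words)

def should_transfer_narrow(intent: str, refund_amount: float | None) -> bool:
    """After: only transfer a refund request above the threshold."""
    REFUND_THRESHOLD = 500.0
    return intent == "refund_request" and (refund_amount or 0) > REFUND_THRESHOLD
```

The broad version catches every payment question. The narrow version catches only the calls a human actually needs to take.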
When to Escalate to Humans
Your agent should handle the routine. Humans should handle the exceptions. Here's where we draw the line:
Your agent handles: the routine. Bookings, straightforward payment questions, anything it can answer from your knowledge base.
Your humans handle: the exceptions. Refunds above your threshold, questions the knowledge base can't answer, callers the agent can't move forward.
The key is transferring with full context. When your agent hands off to a human, that human should know everything the caller already said.
No repetition. No "can you start from the beginning?"
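Here's what "full context" means in practice: a sketch of the handoff payload we'd pass to the human, with illustrative field names rather than any specific Retell transfer API:

```python
from dataclasses import dataclass

@dataclass
class HandoffContext:
    """Everything the human should see before they say a word to the caller."""
    caller_number: str
    caller_name: str | None
    intent: str                        # what the caller is trying to do
    details_collected: dict[str, str]  # slots the agent already filled
    transcript: list[str]              # full turn-by-turn conversation so far
    transfer_reason: str               # why the agent escalated

def build_handoff(call_state: dict) -> HandoffContext:
    """Assemble the context from the agent's call state before transferring."""
    return HandoffContext(
        caller_number=call_state["from_number"],
        caller_name=call_state.get("caller_name"),
        intent=call_state["intent"],
        details_collected=call_state.get("slots", {}),
        transcript=call_state.get("transcript", []),
        transfer_reason=call_state.get("transfer_reason", "escalation threshold met"),
    )
```

If the human answering the transfer can't see all of this on one screen, the caller ends up repeating themselves anyway.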
The Cost of Not Watching
We track the impact across our client base. The numbers tell you everything:
Agents with weekly optimisation: improve roughly 15% per quarter.
Agents left unmanaged: degrade roughly 8% per quarter.
Your agent doesn't get better on its own. It drifts. Slowly, invisibly, until something breaks publicly.
Your AI agent needs a human. That's us.
Frequently Asked Questions
How often should voice agents be reviewed and optimised?
Weekly. We run a Monday-to-Friday cycle: data review, transcript analysis, targeted fixes, batch testing, and deployment. Agents that get weekly attention improve 15% per quarter.
Agents left unmanaged degrade at roughly 8% per quarter. The difference compounds fast.
What metrics matter most for voice agent performance?
Beyond the basics (call duration, transfer rate, sentiment), track repeat callers within 24 hours and silence gaps over 3 seconds. Monitor conversation depth vs outcome and knowledge base miss rate too.
These secondary metrics predict failures before your standard dashboard catches them.
Can automated monitoring replace human review?
No. Automated alerts catch the obvious — duration spikes, transfer rate jumps, sentiment drops.
But subtle failures like conversational loops, incorrect knowledge base answers, and prompt drift require a human reading actual transcripts. We use automation for detection and humans for diagnosis.
What does a typical optimisation fix look like?
Most fixes are surgical, not wholesale. Shortening a greeting from 40 words to 10. Narrowing a transfer trigger that's too broad. Adding a missing path for a common edge case.
Each fix is tested with batch simulation before deployment. The goal is small, targeted changes every week — not big rewrites every quarter.
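For the curious, a minimal sketch of what that batch simulation can look like: scripted scenarios run against the agent's routing logic, with expected outcomes checked before anything ships. The route_call function is a stand-in for your own decision logic, not a Retell testing API:

```python
# Hypothetical scenario table: caller state in, expected outcome out.
SCENARIOS = [
    ({"intent": "refund_request", "refund_amount": 80.0}, "agent_handles"),
    ({"intent": "refund_request", "refund_amount": 750.0}, "transfer_to_human"),
    ({"intent": "payment_question", "refund_amount": None}, "agent_handles"),
    ({"intent": "rural_delivery_query", "refund_amount": None}, "agent_handles"),
]

def route_call(state: dict) -> str:
    """Stand-in for the agent's routing logic after the narrowed transfer fix."""
    if state["intent"] == "refund_request" and (state["refund_amount"] or 0) > 500:
        return "transfer_to_human"
    return "agent_handles"

def run_batch(scenarios) -> list[str]:
    """Run every scenario and report mismatches instead of stopping at the first."""
    failures = []
    for state, expected in scenarios:
        actual = route_call(state)
        if actual != expected:
            failures.append(f"{state} -> {actual}, expected {expected}")
    return failures

if __name__ == "__main__":
    problems = run_batch(SCENARIOS)
    print("all scenarios passed" if not problems else "\n".join(problems))
```

The point isn't the harness. The point is that every Thursday fix runs against the paths you didn't touch before Friday's deploy.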
Leonardo Garcia-Curtis
Founder & CEO at Waboom AI. Building voice AI agents that convert.
Ready to Build Your AI Voice Agent?
Let's discuss how Waboom AI can help automate your customer conversations.
Book a Free Demo