ElevenLabs v3 vs Flash v2.5: when each one wins

Leonardo Garcia-Curtis05/05/2026

TL;DR

Should you use ElevenLabs v3 for your voice agent? Probably not for live phone calls. v3 is the most expressive AI voice model on the market, but it is not real-time. ElevenLabs themselves tell developers to use Flash v2.5 (about 75ms first audio) for live voice agents. Use v3 where the audio is pre-rendered and played many times: audiobooks, marketing voiceovers, IVR greetings, dubbing. Use Flash v2.5 for outbound sales, inbound support and anything that has to answer a real human in real time. Same voice IDs work on both, so you keep your voice and just pick the engine.

In February 2026, ElevenLabs shipped v3 to general availability. They called it the most emotionally expressive AI voice model on the market. In the same documentation, they told developers not to use it for real-time voice agents.

That tension is the whole story.

v3 is the rendering engine. Flash v2.5 is the call-floor engine. Both run the same voices, so you keep your chosen voice and just pick the engine that fits the job.

Here is when each one wins, with the numbers, and the catches the launch coverage glossed over.

In this article

1. The 800ms cliff
2. Where v3 actually wins
3. Where Flash v2.5 wins
4. Same voice, different engine
5. The tradeoffs of v3 nobody warns you about
6. How to pick
7. Frequently asked questions

Latency

The 800ms cliff

Live conversation has a beat. Cross-linguistic studies put the gap between turns at roughly 100ms, with most languages clustering between 0 and 300ms (Stivers et al, PNAS 2009). Your ear has been tuning itself to that rhythm since you could speak.

For voice agents, anything under 500ms still feels like talking to a human. Past 800ms the wheels start coming off. Past one second, the caller wonders if the line dropped. Past three seconds, they have hung up.

Conversational latency timeline showing markers at 100ms human turn-taking, 500ms still natural, 800ms the cliff, and 1000ms plus they hang up

ElevenLabs Flash v2.5 ships first audio at roughly 75ms. Fast enough to live underneath the rest of your stack (speech-to-text, model thinking, tool calling) and still leave room before the cliff.

v3 has not published a number. Reviewers describe it as not real-time. Architecturally it is a larger model with a higher-fidelity codec. Both of those trade speed for sound.

Which is why ElevenLabs themselves list Flash v2.5, Flash v2, or Multilingual v2 as the recommended models for the Agents Platform. v3 is not on that list.

If your agent already feels slow, the voice model is rarely the only thing to look at. We wrote the full teardown on where voice agent latency actually hides.

Where v3 Wins

Where v3 actually wins

Pull v3 out of the live-call use case and it is genuinely impressive. Anything rendered once and played many times sits in its lane.

The trick is inline audio tags. Drop [laughs], [whispering], [sarcastic], [curious], or [breathy] inside the script and v3 performs them. Reviewers describe the output as indistinguishable from real speech in tone and flow.

It also handles multi-speaker dialogue (unlimited speakers in one render) and 70+ languages, covering roughly 90% of the global population. The numbers and symbols accuracy is 68% better than the alpha that landed in June 2025.

Studio condenser microphone surrounded by floating audio tag chips reading laughs, whispering, sarcastic, curious, and breathy

The use cases v3 was built for, in plain English:

→Audiobooks and long-form narration. Performance carries the listener for hours. v3 holds it.
→Film, TV and game dubbing. Multi-speaker mode plus 70+ languages.
→Podcast generation and character work. Two voices in one render.
→Marketing voiceovers and brand spots. Render once, run on every platform.
→IVR prompts pre-rendered overnight. Greeting, on-hold copy, closure messages, voicemail. Generate them in v3, ship them as static files.
→Training content and accessibility (AAC). Anywhere a human listens to the same line many times.

Where Flash Wins

Where Flash v2.5 wins

Anything live. The list is short and obvious.

→Outbound sales agents. Cold calls, vendor outreach, arrears chase.
→Inbound support agents. Reception, triage, after-hours cover.
→IVR prompts generated on the fly. The ones that quote your caller's appointment time back at them.
→Anything where the caller is on the line right now and the agent has to respond.

For those jobs, the fact that v3 sounds slightly warmer in a side-by-side does not matter. The 800ms beat is the gate. Cross it and the conversation breaks.

A useful rule of thumb: if you can render the audio file the night before, use v3. If the script is composed at speaking time, use Flash v2.5.

75ms first audio

That is the number Flash v2.5 ships at, and that is why ElevenLabs themselves told developers to use it instead of v3 on live agents. Quality in the wrong dimension costs you the call.

Voice Library

Same voice, different engine

The wrinkle that surprises people: voice IDs work on both engines. The "model" is the rendering engine. The voice ID is the speaker identity.

In practice, that means a single curated voice (a polished NZ presenter, an Australian inside-sales voice, a UK call agent) can be used at the studio engine for hold music and a campaign launch video, and at the call-floor engine for the actual outbound prospecting. Same person. Same character. Different engine for different work.

The voice is your brand. The engine is your job. Pick the engine for the job.

Hear it in action

32 curated voices, all running on the ElevenLabs stack

Pick a region. Hit play. Each one works under v3 (pre-recorded marketing) and Flash v2.5 (live calls).

Luke

New Zealand

Clear New Zealand accent, deep and steady. Great for professional customer service.

Harry

New Zealand

Soft-spoken and warm Kiwi voice, perfect for friendly conversations.

Sam

New Zealand

Playful and warm Kiwi voice, great for friendly customer conversations.

Kim

New Zealand

Friendly and descriptive Kiwi voice, perfect for informative calls.

Melissa

New Zealand

Calm and steady Kiwi narrator, ideal for professional communications.

Torey

New Zealand

Confident Kiwi voice with natural conversational flow.

Jessie

New Zealand

Friendly educator voice, warm and encouraging Kiwi accent.

Want the full library? See all 32 voices on the voices page.

The Catches

The tradeoffs of v3 nobody warns you about

v3 is genuinely good. It is also genuinely fragile. Six things will bite you if you run it without care.

Catch	What it means
Voice mismatch	If the underlying voice was recorded calm and you tag it [shouting], v3 often won't deliver. The base voice has to sit close to the target emotion or the tag goes ignored.
Stacking artefacts	Stack too many [break] tags and the output speeds up, garbles, or reads the tag aloud. One or two breaks per paragraph is the sane limit.
Library inconsistency	Voices from the public Voice Library produce less consistent v3 output than they did under v2.5 or v2. Tighter prompts and a base voice closer to your target tone help.
Premium pricing	On Retell, ElevenLabs TTS sits at $0.04 per minute after the March 2026 reset. Retell Platform Voices, Minimax, Fish Audio, Cartesia and OpenAI all start at $0.015. v3 is the higher tier.
5,000 character cap	Per generation. Long monologues need chunking. Flash v2.5 takes 40,000 in one shot.
Prompt sensitivity	v3 cares more about how you structure your writing. Natural sentences, emotional context, performable phrasing. Tight rigid script gets you mediocre output.

None of these kill v3. They just mean you have to use it for the work it was designed for.

The Decision

How to pick

Two questions. Answer them and the engine picks itself.

One. Is the audio rendered or live?

Rendered? v3.

Live? Flash v2.5.

Two. Is the script composed by your team or by the model in the moment?

Composed by your team and polished before the listener hears it? v3.

Composed by the model from caller context, on the line, right now? Flash v2.5.

That is the decision. Benchmarks do not take phone calls.

Want to hear properly picked voice tech on a live call?

Hear our voice agents on a real outbound or inbound flow. We pick the engine for the job, every time.

AI voice agents · Book a demo · Best LLM for voice agents

Frequently asked questions

What is ElevenLabs v3?

ElevenLabs v3 is the most expressive text-to-speech model in the ElevenLabs lineup, generally available since 2 February 2026 after an alpha that opened in June 2025. It supports 70+ languages, multi-speaker dialogue, and inline audio tags like [laughs], [whispering], and [sarcastic] that the model performs as part of the read. It is not a real-time model.

Can I use ElevenLabs v3 for voice agents?

ElevenLabs themselves recommend Flash v2.5, Flash v2, or Multilingual v2 for their Agents Platform. v3 is excluded because of its higher first-token latency. For pre-rendered IVR prompts, audiobooks, marketing voiceovers, and any audio rendered once and played many times, v3 is excellent. For live calls, Flash v2.5 is the right model.

What is the latency of ElevenLabs Flash v2.5?

Flash v2.5 ships first audio at approximately 75ms. That sits well inside the 500ms natural-conversation budget and well under the 800ms threshold beyond which a phone caller starts to feel the silence.

Does using v3 mean I have to change my voice?

No. Voice IDs are independent of the rendering model. The same voice you use under Flash v2.5 for live calls can also be rendered through v3 for marketing voiceovers and pre-recorded IVR. The voice is the speaker. The model is the engine.

How much does ElevenLabs v3 cost on Retell?

After the 23 March 2026 pricing reset, ElevenLabs TTS (which includes v3) sits at $0.04 per minute on Retell. Retell Platform Voices, Minimax, Fish Audio, Cartesia and OpenAI all start at $0.015 per minute. v3 is the premium tier, not the default.

When should I not use v3?

Anywhere the script is composed at speaking time and a human is waiting on the line. Outbound sales agents. Inbound support. On-the-fly IVR. Anywhere the 800ms beat matters.

What is the difference between ElevenLabs v3 and ElevenLabs Agents?

They are different products. v3 is a text-to-speech model. ElevenLabs Agents (sometimes called ElevenAgents) is a real-time voice agent platform with its own low-latency stack, interruption detection and action triggering. The Agents Platform uses Flash v2.5 by default, not v3. When you read about "ElevenLabs voice agents", check which one the article is talking about.

ElevenLabs v3 vs Flash v2.5: when each one wins

Leonardo Garcia-Curtis05/05/2026

TL;DR

Catch

What it means

Voice mismatch

If the underlying voice was recorded calm and you tag it [shouting], v3 often won't deliver. The base voice has to sit close to the target emotion or the tag goes ignored.

Stacking artefacts

Stack too many [break] tags and the output speeds up, garbles, or reads the tag aloud. One or two breaks per paragraph is the sane limit.

Library inconsistency

Voices from the public Voice Library produce less consistent v3 output than they did under v2.5 or v2. Tighter prompts and a base voice closer to your target tone help.

Premium pricing

On Retell, ElevenLabs TTS sits at $0.04 per minute after the March 2026 reset. Retell Platform Voices, Minimax, Fish Audio, Cartesia and OpenAI all start at $0.015. v3 is the higher tier.

5,000 character cap

Per generation. Long monologues need chunking. Flash v2.5 takes 40,000 in one shot.

Prompt sensitivity

v3 cares more about how you structure your writing. Natural sentences, emotional context, performable phrasing. Tight rigid script gets you mediocre output.

ElevenLabs v3 vs Flash v2.5: when each one wins

The 800ms cliff

Where v3 actually wins

Where Flash v2.5 wins

Same voice, different engine

32 curated voices, all running on the ElevenLabs stack