Level 3 — Lesson 5 of 5 — Go beyond usability to create voice experiences that sound genuinely good.
Once the agent works and is easy to use, the final layer is polish: voice quality, natural filler, turn-taking, and personalisation.

The layers of a good voice experience

  1. It works — speech recognition transcribes correctly, APIs respond, and the task can be completed.
  2. It's easy to use — the interaction is efficient, intuitive, and follows the design principles.
  3. It sounds good — copywriting, voice quality, turn-taking, and personalisation make the experience enjoyable. This is the focus of this lesson.

Voice selection and quality

Pick a voice that sounds good in practice, not just in samples. If you need to regenerate 50 times to find one good take, that voice won’t produce consistent quality in a live deployment. After selecting a voice:
  • Listen to the most common things the agent says: greeting, “how can I help”, “anything else”, and the main flow prompts
  • The LLM often generates similar phrasing for repeated scenarios — these get cached, so make sure they sound good
  • Regenerate cached audio until it sounds right
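The caching step above can be sketched as a thin wrapper around whatever TTS call you use (the `synthesize` callable here is a stand-in, not a real API): normalise the phrase so trivial variations hit the same cached take, and regenerate on demand when a take sounds off.

```python
import hashlib

def cache_key(phrase: str) -> str:
    # Normalise casing and whitespace so "How can I help?" and
    # "how  can I help?" map to the same cached audio.
    normalised = " ".join(phrase.lower().split())
    return hashlib.sha256(normalised.encode()).hexdigest()

class AudioCache:
    def __init__(self, synthesize):
        self._synthesize = synthesize  # your TTS call (hypothetical)
        self._store = {}

    def get(self, phrase: str) -> bytes:
        key = cache_key(phrase)
        if key not in self._store:
            self._store[key] = self._synthesize(phrase)
        return self._store[key]

    def regenerate(self, phrase: str) -> bytes:
        # Force a fresh take when the cached rendering sounds wrong.
        key = cache_key(phrase)
        self._store[key] = self._synthesize(phrase)
        return self._store[key]
```

Because the greeting, "how can I help", and "anything else" dominate what callers actually hear, regenerating just these cached takes gives a disproportionate quality win.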
Written copy always looks more informal than it sounds. Don’t let written reviews make you over-formalise. When in doubt, build a short audio prototype and share that with the client instead of written text.

Natural filler and hesitation

Real humans pause, say “um”, and hesitate — especially when they’re thinking. Adding small amounts of this to agent speech makes it sound more natural. In linguistics this is called disfluency, and it includes filled pauses (“um”, “uh”), slight repetitions, and drawn-out sounds.

When to use it

Context | What to add | Example
--- | --- | ---
API call / lookup | Filler phrase | "Um, let me just have a look at what space we have…"
Complex instructions | Slight hesitation | "So what you'll want to do is, uh, go to settings and then…"
After a misunderstanding | Drawn-out sound, regrouping | "Hmm, what was it I can do for you?"

Why it works

  • During API calls: filler sounds like someone checking another screen — it matches what the user expects is happening
  • After misunderstandings: hesitation sounds like someone regrouping after a miscommunication, which is exactly what’s happening
  • In general: small pauses signal that the agent is “thinking”, which makes silence less awkward
Keep it subtle. Too much filler makes the agent sound confused rather than natural. Use it situationally, not on every turn.
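One way to keep filler situational rather than constant is to post-process agent text before it reaches TTS. A minimal sketch, where the context labels and filler lists mirror the table above and the rate cap is an illustrative choice:

```python
import random

# Illustrative filler options per situation (see the table above).
FILLERS = {
    "lookup": ["Um, ", "Okay, ", "Let me see… "],
    "instructions": ["So, ", "Right, "],
    "misunderstanding": ["Hmm, ", "Sorry, "],
}

def add_disfluency(text: str, context: str, rate: float = 0.4, rng=random) -> str:
    """Prepend a filled pause some of the time; skip most turns to stay subtle."""
    options = FILLERS.get(context)
    if not options or rng.random() > rate:
        return text
    filler = rng.choice(options)
    return filler + text[0].lower() + text[1:]
```

Capping the rate well below 1.0 is the point: an "um" on every turn sounds confused, while an occasional one during a lookup sounds like someone checking a screen.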

Turn-taking

Turn-taking — how the agent and user take turns speaking — is one of the most impactful aspects of voice experience, and one of the hardest to control at the project level. Three common problems:
  • Too much latency — the agent takes too long to respond after the user finishes speaking. Users disengage.
  • Interruptions — the agent starts speaking before the user has finished. Users get frustrated.
  • No barge-in — the user cannot interrupt the agent, even when the agent is saying something wrong or irrelevant.
Many turn-taking issues need platform-level improvements rather than project-level fixes. Your role is to identify and document these issues with specific examples so the engineering team can prioritise improvements.

What you can control

  • Response length — shorter responses reduce the chance of the agent and user talking over each other
  • Interaction style settings — adjust latency thresholds in audio management
  • Barge-in configuration — enable or disable based on the interaction type
  • Front-load key information — put the important part first, so even if the user interrupts, they’ve heard what matters
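The project-level levers above can be sketched roughly as follows; the setting names are hypothetical placeholders, not a specific platform's API:

```python
# Illustrative project-level settings; parameter names are made up
# for this sketch, not taken from a real SDK.
turn_taking = {
    "endpointing_silence_ms": 600,  # how long to wait after user speech before replying
    "barge_in_enabled": True,       # let the caller interrupt long agent turns
}

def compose_response(key_info: str, extras: list[str], max_sentences: int = 2) -> str:
    # Front-load the key information so a caller who interrupts
    # has already heard what matters.
    parts = [key_info] + extras
    return " ".join(parts[:max_sentences])
```

Capping response length and putting the key fact first work together: even if barge-in cuts the agent off, the caller leaves the turn with the answer.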

Personalisation

Personalisation uses information about the user to tailor the experience. It works at three levels:

From the current conversation

If the user gives their name, you can use it — but not on every turn. LLMs tend to overuse names, which sounds scripted. Use sparingly for warmth.
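Name usage can be rationed mechanically rather than left to the LLM. A minimal sketch, assuming you post-process replies and track a turn counter (the every-fifth-turn cadence is an arbitrary illustration):

```python
def maybe_use_name(reply: str, name: str, turn_index: int, every_n: int = 5) -> str:
    # Use the caller's name sparingly; LLMs overuse it otherwise,
    # which sounds scripted rather than warm.
    if name and turn_index % every_n == 0:
        return f"{name}, {reply[0].lower()}{reply[1:]}"
    return reply
```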

From API data

If you can see a user’s recent activity, use it to shortcut the conversation:
“I can see you just canceled a flight. Is that what you’re calling about?”
This proves competence immediately and shortens the interaction.

From previous calls

If the user called before and was sent an SMS for self-service, and they’re calling back:
“I see you were calling about this earlier. Was that text not working for you?”
This kind of continuity across calls makes the system feel like it remembers and cares.
Personalisation can feel intrusive if overdone. Use it when it clearly helps the user reach their goal faster. Avoid making users feel surveilled.

Matching the user’s style

People naturally adjust how they speak depending on who they’re talking to. In voice agents, this happens partially through the LLM (which adjusts vocabulary and formality based on user input). For now, focus on:
  • Word choice — if the user uses informal language, the agent should match
  • Pacing — if the user speaks slowly, don’t rush them with rapid-fire responses
  • Formality — match the user’s level of formality
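A crude but workable way to steer the LLM toward the caller's register is to detect informal markers in the transcript and adjust the system prompt. The marker list and prompt wording here are illustrative assumptions:

```python
# Hypothetical marker list; tune it for your domain and locale.
INFORMAL_MARKERS = {"hey", "yeah", "gonna", "wanna", "cool", "thanks"}

def detect_informal(utterance: str) -> bool:
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    return bool(words & INFORMAL_MARKERS)

def style_instruction(user_utterance: str) -> str:
    # Feed this into the LLM system prompt so wording matches the caller.
    if detect_informal(user_utterance):
        return "Match the caller's casual tone; keep replies relaxed and brief."
    return "Keep a neutral, polite register."
```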

Try it yourself

Challenge: Design the experience around an API lookup

A user asks to track their order. The flow collects the tracking number and then makes an API call that takes 2-3 seconds. Design:
  1. What does the agent say while the API call runs?
  2. How do you handle a successful lookup?
  3. How do you handle a failed lookup?
For each, consider: filler, tone, brevity, and what information to say first.
During API call:
“Okay, let me just pull that up for you…” (Subtle filler — sounds like checking a screen)
Successful lookup:
“Got it — your order’s been shipped and should arrive Thursday. Want me to send you the tracking link?” (Brief, key info first, natural offer for follow-up)
Failed lookup:
“Hmm, I’m not finding anything for that number. Could you double-check it and try again?” (Hesitation signals regrouping, blames the number not the user)
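The three answers above can be wired together as a single turn handler; `lookup` stands in for whatever tracking API you actually call:

```python
def track_order_turn(tracking_number: str, lookup) -> list[str]:
    """Order-tracking turn from the exercise above. `lookup` is a
    hypothetical API call returning a dict on success, None on failure."""
    # Filler while the 2-3 second API call runs: sounds like checking a screen.
    lines = ["Okay, let me just pull that up for you…"]
    result = lookup(tracking_number)
    if result:
        # Key information first, then the follow-up offer.
        lines.append(
            f"Got it — your order's been shipped and should arrive {result['eta']}. "
            "Want me to send you the tracking link?"
        )
    else:
        # Hesitation signals regrouping; blame the number, not the user.
        lines.append(
            "Hmm, I'm not finding anything for that number. "
            "Could you double-check it and try again?"
        )
    return lines
```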
Last modified on March 26, 2026