Level 3 — Lesson 5 of 5 — Go beyond usability to create voice experiences that sound genuinely good.
The layers of a good voice experience
It's easy to use
The interaction is efficient, intuitive, and follows the design principles covered in earlier lessons.
Voice selection and quality
Pick a voice that sounds good in practice, not just in samples. If you need to regenerate 50 times to find one good take, that voice won’t produce consistent quality in a live deployment. After selecting a voice:
- Listen to the most common things the agent says: greeting, “how can I help”, “anything else”, and the main flow prompts
- The LLM often generates similar phrasing for repeated scenarios — these get cached, so make sure they sound good
- Regenerate cached audio until it sounds right
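The regenerate-until-right workflow above can be sketched as a small caching layer. This is a minimal sketch, assuming a hypothetical `synthesize` function standing in for your TTS provider’s API:

```python
# Sketch: pre-generate and cache the agent's most common lines so each take
# can be reviewed (and regenerated) before deployment. `synthesize` is a
# hypothetical stand-in for a real TTS call.
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

COMMON_LINES = [
    "Hi, thanks for calling!",
    "How can I help you today?",
    "Is there anything else I can do for you?",
]

def synthesize(text: str) -> bytes:
    """Placeholder for a real TTS call; returns raw audio bytes."""
    return text.encode("utf-8")  # stand-in so the sketch runs end to end

def cached_audio(text: str, regenerate: bool = False) -> Path:
    """Return the cached audio file for `text`, generating it if needed.

    Pass regenerate=True to force a fresh take when the cached one
    doesn't sound right.
    """
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.audio"
    if regenerate or not path.exists():
        path.write_bytes(synthesize(text))
    return path

for line in COMMON_LINES:
    print(line, "->", cached_audio(line).name)
```

The same text always maps to the same cache file, so a regenerated take transparently replaces the old one everywhere it is played.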
Natural filler and hesitation
Real humans pause, say “um”, and hesitate — especially when they’re thinking. Adding small amounts of this to agent speech makes it sound more natural. In linguistics this is called disfluency, and it includes filled pauses (“um”, “uh”), slight repetitions, and drawn-out sounds.
When to use it
| Context | What to add | Example |
|---|---|---|
| API call / lookup | Filler phrase | “Um, let me just have a look at what space we have…” |
| Complex instructions | Slight hesitation | “So what you’ll want to do is, uh, go to settings and then…” |
| After a misunderstanding | Drawn-out sound, regrouping | “Hmm, what was it I can do for you?” |
Why it works
- During API calls: filler sounds like someone checking another screen — it matches what the user expects is happening
- After misunderstandings: hesitation sounds like someone regrouping after a miscommunication, which is exactly what’s happening
- In general: small pauses signal that the agent is “thinking”, which makes silence less awkward
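One way to apply the table above is to key filler phrases to the conversational context, so the agent reaches for a disfluency that matches what is actually happening. A small sketch, where the context names and phrases are illustrative:

```python
import random

# Illustrative mapping from conversational context to candidate fillers,
# following the "when to use it" table: lookups get a filler phrase,
# instructions get a slight hesitation, misunderstandings get regrouping.
FILLERS = {
    "api_call": ["Um, let me just have a look…", "Okay, one moment…"],
    "complex_instructions": ["So what you'll want to do is, uh,"],
    "after_misunderstanding": ["Hmm,", "Right, so…"],
}

def filler_for(context: str) -> str:
    """Return a filler phrase suited to the context, or "" if none fits."""
    options = FILLERS.get(context)
    return random.choice(options) if options else ""

print(filler_for("api_call"))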
Turn-taking
Turn-taking — how the agent and user take turns speaking — is one of the most impactful aspects of voice experience, and one of the hardest to control at the project level. Three common problems:
- Too much latency — the agent takes too long to respond after the user finishes speaking. Users disengage.
- Interruptions — the agent starts speaking before the user has finished. Users get frustrated.
- No barge-in — the user cannot interrupt the agent, even when the agent is saying something wrong or irrelevant.
Many turn-taking issues need platform-level improvements rather than project-level fixes. Your role is to identify and document these issues with specific examples so the engineering team can prioritise improvements.
What you can control
- Response length — shorter responses reduce the chance of the agent and user talking over each other
- Interaction style settings — adjust latency thresholds in audio management
- Barge-in configuration — enable or disable based on the interaction type
- Front-load key information — put the important part first, so even if the user interrupts, they’ve heard what matters
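These levers often surface as platform settings. As a hypothetical configuration sketch (the field names are illustrative, not any specific platform’s API):

```python
from dataclasses import dataclass

@dataclass
class TurnTakingConfig:
    """Illustrative project-level turn-taking settings."""
    max_response_sentences: int = 2    # shorter responses reduce talk-over
    end_of_turn_silence_ms: int = 700  # silence before the agent replies
    barge_in_enabled: bool = True      # let users interrupt long readouts

# Example: an information-dense readout flow might allow barge-in but
# keep responses short so interruptions lose little.
readout = TurnTakingConfig(max_response_sentences=1, barge_in_enabled=True)
print(readout)
```

Front-loading key information is a prompt-design choice rather than a setting, which is why it does not appear here.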
Personalisation
Personalisation uses information about the user to tailor the experience. It works at three levels:
From the current conversation
If the user gives their name, you can use it — but not on every turn. LLMs tend to overuse names, which sounds scripted. Use sparingly for warmth.
From API data
If you can see a user’s recent activity, use it to shortcut the conversation: “I can see you just canceled a flight. Is that what you’re calling about?” This proves competence immediately and shortens the interaction.
From previous calls
If the user called before and was sent an SMS for self-service, and they’re calling back: “I see you were calling about this earlier. Was that text not working for you?” This kind of continuity across calls makes the system feel like it remembers and cares.
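The three levels above can be sketched as a fallback chain over whatever user context is available, reaching for the most specific opener first. The field names here are illustrative:

```python
def opening_line(user: dict) -> str:
    """Pick the most specific opening the available context supports."""
    if user.get("sms_selfservice_sent"):   # from previous calls
        return ("I see you were calling about this earlier. "
                "Was that text not working for you?")
    if user.get("recent_cancellation"):    # from API data
        return ("I can see you just canceled a flight. "
                "Is that what you're calling about?")
    return "How can I help you today?"     # no context: generic greeting

print(opening_line({"recent_cancellation": True}))
```

The ordering encodes a judgment call: continuity from a previous call is usually more relevant than recent account activity, so it wins when both are present.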
Matching the user’s style
People naturally adjust how they speak depending on who they’re talking to. In voice agents, this happens partially through the LLM (which adjusts vocabulary and formality based on user input). For now, focus on:
- Word choice — if the user uses informal language, the agent should match
- Pacing — if the user speaks slowly, don’t rush them with rapid-fire responses
- Formality — match the user’s level of formality
Try it yourself
Challenge: Design the experience around an API lookup
A user asks to track their order. The flow collects the tracking number and then makes an API call that takes 2-3 seconds.
Design:
- What does the agent say while the API call runs?
- How do you handle a successful lookup?
- How do you handle a failed lookup?
Example solution
During API call:
“Okay, let me just pull that up for you…” (Subtle filler — sounds like checking a screen)
Successful lookup:
“Got it — your order’s been shipped and should arrive Thursday. Want me to send you the tracking link?” (Brief, key info first, natural offer for follow-up)
Failed lookup:
“Hmm, I’m not finding anything for that number. Could you double-check it and try again?” (Hesitation signals regrouping, blames the number not the user)

