Level 2 — Lesson 3 of 6 — Understand and manage the audio cache for optimal performance.
Understanding the audio cache
What caching is
The audio cache stores previously generated TTS audio so it can be replayed instantly, which reduces latency and keeps repeated phrases consistent.
Cache requirements
Audio is only cached if the same utterance is generated at least twice within a 24-hour window.
One-off utterances will not persist in cache by default.
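The eligibility rule above can be sketched in a few lines of Python. This is only an illustration of the "at least twice within 24 hours" condition, not the product's actual implementation: the function name `is_cache_eligible`, the constant `CACHE_WINDOW`, and the choice to treat exactly 24 hours as still inside the window are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed constant for illustration: the 24-hour window described above.
CACHE_WINDOW = timedelta(hours=24)


def is_cache_eligible(generated_at, window=CACHE_WINDOW):
    """Return True if the same utterance was generated at least twice
    within `window` (hypothetical helper, not a product API).

    `generated_at` is a list of datetimes, one per generation of the utterance.
    """
    times = sorted(generated_at)
    # After sorting, the closest pair is always a pair of neighbours,
    # so checking consecutive timestamps is sufficient.
    return any(later - earlier <= window
               for earlier, later in zip(times, times[1:]))


first = datetime(2024, 1, 1, 9, 0)
print(is_cache_eligible([first]))                              # one-off utterance
print(is_cache_eligible([first, first + timedelta(hours=3)]))  # repeated same day
```

A one-off utterance fails the check, which matches the note that such utterances do not persist in cache by default.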
Managing cached audio
Review cached utterances
Check the list of cached utterances:
- Greeting
- Transfer / handoff language
- SMS offer phrasing
- Closings and confirmations
Adjust individual utterances
For any high-frequency utterance:
- Open it and review how often it has been used
- Adjust stability and clarity for that utterance only if needed
- Use the play button to preview changes
Interaction style (response latency)
Interaction style controls how quickly the agent responds after detecting user speech. This directly affects interruption rate and perceived naturalness.
- Turbo: ~400ms latency. Extremely fast, higher interruption risk.
- Swift
- Balanced
- Precise
Barge-in
What is barge-in?
Barge-in determines whether callers can interrupt the agent mid-speech.
When to use it
- Useful for fast modes (Turbo/Swift)
- Can feel chaotic if enabled without careful phrasing and latency tuning
Pronunciations
Ensure domain-specific terms are spoken clearly and correctly in Call.
When to use pronunciations
- Brand names: product names that are mispronounced
- Proper nouns: locations, people, departments
- Numbers or IDs: structured read-back requirements
- Pacing: phrases where pacing matters for comprehension
How pronunciations work
Matching is done using regular expressions. Replacements can be:
- IPA (International Phonetic Alphabet): for precise pronunciation control
- SSML
- Regex capture groups
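To make the match-and-replace behavior concrete, here is a minimal Python sketch that applies pronunciation rules to a line of agent text with the standard `re` module. It is an approximation for experimenting with patterns offline, not the product's engine: the `PronunciationRule` class, the `apply_rules` function, and the assumption that rules run in order are all hypothetical. The two rules use a brand-name IPA replacement and a phone-number pattern with SSML pauses and capture groups.

```python
import re
from dataclasses import dataclass


@dataclass
class PronunciationRule:
    """Hypothetical container for one pronunciation entry."""
    pattern: str                  # regular expression to match in the text
    replacement: str              # IPA, SSML, or text with \1, \2, ... groups
    case_sensitive: bool = False


def apply_rules(text, rules):
    """Apply each rule in order (assumed ordering) and return the new text."""
    for rule in rules:
        flags = 0 if rule.case_sensitive else re.IGNORECASE
        text = re.sub(rule.pattern, rule.replacement, text, flags=flags)
    return text


rules = [
    # IPA replacement for a brand name, matched case-insensitively.
    PronunciationRule(r"\bLouvre\b", "/ˈluːvrə/"),
    # Capture groups split a phone number; SSML breaks add pauses.
    PronunciationRule(
        r"(\d{3})[ -]?(\d{3})[ -]?(\d{4})",
        r'\1 <break time="0.5s" /> \2 <break time="0.5s" /> \3',
    ),
]

print(apply_rules("Call the louvre desk at 555-123-4567.", rules))
```

Running candidate patterns through a check like this before adding them can catch over-broad matches early, but the audio itself still has to be verified in Call.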
Examples
IPA correction
Regex: \bLouvre\b
Replacement: /ˈluːvrə/
Case sensitive: FALSE
Phone number formatting with pauses
Regex: (\d{3})[ -]?(\d{3})[ -]?(\d{4})
Replacement: \1 <break time="0.5s" /> \2 <break time="0.5s" /> \3
Best practices
- Incremental: add pronunciations one at a time
- Test thoroughly: test each change in Call before adding more
- Keep it simple: prefer clarity over cleverness, since overly complex regex is hard to maintain
Verification checklist
After any voice or phrasing change:
- Start a new call session
- Confirm you are hearing updated audio, not a cached version
- Validate that turn-taking still feels natural after changing latency or barge-in
- Check that mispronounced terms are corrected consistently
- Check that pauses improve comprehension rather than slowing the call excessively

