Tutorial: Audio management and the cache

Level 2 – Lesson 5 of 8 – Understand and manage the audio cache for optimal performance. The Audio library controls how TTS output is generated, cached, and replayed. Without understanding caching, teams often think changes “didn’t apply” when they are actually hearing old, cached audio.

Understanding the audio cache

What caching is

Cached audio stores previously generated TTS so it can be replayed instantly, reducing latency and keeping repeated phrases consistent.

Cache requirements

Audio is only cached if the same utterance is generated at least twice in a 24-hour window.

One-off utterances will not persist in cache by default.
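
The rule above can be modelled in a few lines. This is only an illustrative sketch, not the platform's implementation: the function name, the list of generation timestamps, and the reading of "24-hour window" as a rolling window are all assumptions.

```python
from datetime import datetime, timedelta

# Assumption: the "24-hour window" is treated as a rolling window.
CACHE_WINDOW = timedelta(hours=24)
MIN_GENERATIONS = 2  # "at least twice"


def is_cache_eligible(generated_at: list[datetime]) -> bool:
    """Return True if at least MIN_GENERATIONS generations of the same
    utterance fall within a single CACHE_WINDOW (hypothetical helper)."""
    times = sorted(generated_at)
    # With sorted timestamps it is enough to compare each generation
    # with the one MIN_GENERATIONS - 1 positions after it.
    for i in range(len(times) - MIN_GENERATIONS + 1):
        if times[i + MIN_GENERATIONS - 1] - times[i] <= CACHE_WINDOW:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    print(is_cache_eligible([start, start + timedelta(hours=3)]))  # True
    print(is_cache_eligible([start]))                              # False
```

Under these assumptions, a greeting generated twice in one morning qualifies, while a one-off utterance does not.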

Managing cached audio

Open Audio Management

Navigate to Voice > Audio library in the platform.

Review cached utterances

Check the list of cached utterances for frequently repeated phrases, such as:

Greeting
Transfer / handoff language
SMS offer phrasing
Closings and confirmations

Adjust individual utterances

For any high-frequency utterance:

Open it and review how often it has been used
Adjust stability and clarity for that utterance only if needed
Use the play button to preview changes

Ensure stability for critical phrases

If an utterance must remain stable:

Generate it multiple times in 24 hours, or
Upload a static audio file to overwrite the cached version

Interaction style (response latency)

Interaction style controls how quickly the agent responds after detecting user speech. This directly affects interruption rate and perceived naturalness.

The available styles are Turbo, Swift, Balanced, and Precise.

Turbo: ~400ms latency. Extremely fast, higher interruption risk.

Barge-in

What is barge-in?

Barge-in determines whether callers can interrupt the agent mid-speech.

When to use it

Useful for Turbo mode
Can feel chaotic if enabled without careful phrasing and latency tuning

Pronunciations

Ensure domain-specific terms are spoken clearly and correctly in Call. Pronunciations are defined in the Pronunciations tab under Voice > Advanced settings and applied globally. They modify how text is converted to speech, without changing the underlying text.

When to use pronunciations

Brand names

Product names that are mispronounced

Proper nouns

Locations, people, departments

Numbers or IDs

Structured read-back requirements

Pacing

Phrases where pacing matters for comprehension

How pronunciations work

Matching is done using regular expressions. Replacements can be:

IPA (International Phonetic Alphabet): for precise pronunciation control
SSML (Speech Synthesis Markup Language): such as <break> for pauses
Regex capture groups (pattern matching): \1, \2, etc. for reformatting

Examples

IPA correction

Regex: \bLouvre\b
Replacement: /ˈluːvrə/
Case sensitive: FALSE

Phone number formatting with pauses

Regex: (\d{3})[ -]?(\d{3})[ -]?(\d{4})
Replacement: \1 <break time="0.5s" /> \2 <break time="0.5s" /> \3
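
To see what these two rules do to a piece of text, they can be reproduced offline. The sketch below uses Python's re module as a stand-in. The platform's regex engine and its handling of the Case sensitive setting may differ, and apply_rule is a hypothetical helper, so treat the output as an approximation and confirm the result by listening in Call.

```python
import re


def apply_rule(text: str, pattern: str, replacement: str,
               case_sensitive: bool = True) -> str:
    """Apply one pronunciation rule (hypothetical helper, not a platform API)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.sub(pattern, replacement, text, flags=flags)


# Rule 1: IPA correction, case-insensitive as in the example above.
print(apply_rule("Welcome to the louvre.", r"\bLouvre\b", "/ˈluːvrə/",
                 case_sensitive=False))
# Welcome to the /ˈluːvrə/.

# Rule 2: split a 10-digit number into groups separated by SSML pauses.
PHONE = r"(\d{3})[ -]?(\d{3})[ -]?(\d{4})"
PAUSED = r'\1 <break time="0.5s" /> \2 <break time="0.5s" /> \3'
print(apply_rule("Call 555-123-4567", PHONE, PAUSED))
# Call 555 <break time="0.5s" /> 123 <break time="0.5s" /> 4567
```

Note that the phone pattern has no boundaries around it, so in Python it also matches the first ten digits of a longer digit run, such as an order ID. It is worth testing inputs like that before relying on the rule.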

Best practices

Incremental

Add pronunciations one at a time

Test thoroughly

Test each change in Call before adding more

Keep it simple

Prefer clarity over complexity – overly complex regex is hard to maintain
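
One way to follow these practices is to keep a small table of sample phrases alongside the rules and re-run it each time a rule is added. The sketch below is an offline approximation using Python's re module. The rule list, the sample phrases, and the helper names are hypothetical, and it does not replace listening to the result in Call.

```python
import re

# Hypothetical rule list: (pattern, replacement, case_sensitive), applied in order.
RULES = [
    (r"\bLouvre\b", "/ˈluːvrə/", False),
]

# Sample phrases with the text each should become, including one
# that must stay unchanged to catch over-matching.
CASES = [
    ("Visit the Louvre today", "Visit the /ˈluːvrə/ today"),
    ("Visit the LOUVRE today", "Visit the /ˈluːvrə/ today"),
    ("Louvres let in air", "Louvres let in air"),
]


def apply_rules(text: str) -> str:
    """Run every rule over the text in order and return the result."""
    for pattern, replacement, case_sensitive in RULES:
        flags = 0 if case_sensitive else re.IGNORECASE
        text = re.sub(pattern, replacement, text, flags=flags)
    return text


def failing_cases() -> list[tuple[str, str, str]]:
    """Return (input, expected, actual) for every sample that does not match."""
    return [(src, want, apply_rules(src))
            for src, want in CASES
            if apply_rules(src) != want]


print(failing_cases())  # an empty list means every sample passed
```

Adding one rule and a few samples at a time makes it easier to see which rule caused a failure.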

Verification checklist

After any voice or phrasing change:

Start a new call session
Confirm you are hearing updated audio, not a cached version
Validate that turn-taking still feels natural after changing latency or barge-in
Check that mispronounced terms are corrected consistently
Check that pauses improve comprehension rather than slowing the call excessively

Try it yourself

Challenge: Fix a mispronounced brand name

Your agent says “Hopper”, but the name is consistently mispronounced (it sounds like “Hooper”). You also want phone numbers read back with a natural pause between each segment.

Write both pronunciation configurations:

IPA correction for “Hopper”
Phone number formatting with 0.5s pauses

Hint

For the IPA, write out what “Hopper” sounds like phonetically. For the phone number, use regex capture groups to split the digits and insert SSML <break> tags.

Example solution

Brand name correction:

Regex: \bHopper\b
Replacement: /ˈhɒpər/
Case sensitive: FALSE

Phone number with pauses:

Regex: (\d{3})[ -]?(\d{3})[ -]?(\d{4})
Replacement: \1 <break time="0.5s" /> \2 <break time="0.5s" /> \3
