Level 2 — Lesson 3 of 6 — Understand and manage the audio cache for optimal performance.
Audio Management controls how Text-to-Speech (TTS) output is generated, cached, and replayed. Without understanding this layer, teams often think changes “didn’t apply” when they are actually hearing cached audio.

Understanding the audio cache

What caching is

Cached audio stores previously generated TTS so it can be replayed instantly, reducing latency and keeping repeated phrases consistent.

Cache requirements

Audio is cached only if the same utterance is generated at least twice within a 24-hour window.
One-off utterances do not persist in the cache by default.
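
The caching rule above can be modeled in a few lines. This is a minimal sketch of the stated rule only, not the platform's implementation: the function name `is_cache_eligible` and both constants are hypothetical, and treating a gap of exactly 24 hours as "within the window" is an assumption.

```python
from datetime import datetime, timedelta
from typing import Sequence

CACHE_WINDOW = timedelta(hours=24)  # window stated in the lesson
MIN_GENERATIONS = 2                 # "at least twice"


def is_cache_eligible(generated_at: Sequence[datetime]) -> bool:
    """Return True if MIN_GENERATIONS generations fall inside one window.

    Hypothetical helper: models the documented rule, nothing more.
    """
    times = sorted(generated_at)
    # Compare each timestamp with the one MIN_GENERATIONS - 1 positions later.
    # If that pair fits inside the window, so does everything between them.
    pairs = zip(times, times[MIN_GENERATIONS - 1:])
    return any(later - earlier <= CACHE_WINDOW for earlier, later in pairs)


t0 = datetime(2024, 1, 1, 9, 0)
print(is_cache_eligible([t0]))                          # one-off utterance
print(is_cache_eligible([t0, t0 + timedelta(hours=3)])) # repeated same day
print(is_cache_eligible([t0, t0 + timedelta(hours=30)]))# repeated too late
```

Under these assumptions, only the second call reports the utterance as eligible, which matches the note that one-off utterances do not persist.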

Managing cached audio

1. Open Audio Management

Navigate to Audio Management in the platform.

2. Review cached utterances

Check the list of cached utterances:
  • Greeting
  • Transfer / handoff language
  • SMS offer phrasing
  • Closings and confirmations

3. Adjust individual utterances

For any high-frequency utterance:
  • Open it and review how often it has been used
  • Adjust stability and clarity for that utterance only if needed
  • Use the play button to preview changes

4. Ensure stability for critical phrases

If an utterance must remain stable:
  • Generate it multiple times within 24 hours, or
  • Upload a static audio file to overwrite the cached version

Interaction style (response latency)

Interaction style controls how quickly the agent responds after detecting user speech. This directly affects interruption rate and perceived naturalness.
~400 ms latency: extremely fast, higher interruption risk.

Barge-in

Barge-in determines whether callers can interrupt the agent mid-speech.
  • Useful for fast modes (Turbo/Swift)
  • Can feel chaotic if enabled without careful phrasing and latency tuning

Pronunciations

Ensure domain-specific terms are spoken clearly and correctly in Call.
Pronunciations are defined in Rules and applied globally. They modify how text is converted to speech, without changing the underlying text.

When to use pronunciations

Brand names

Product names that are mispronounced

Proper nouns

Locations, people, departments

Numbers or IDs

Structured read-back requirements

Pacing

Phrases where pacing matters for comprehension

How pronunciations work

Matching is done using regular expressions. Replacements can be:
  • International Phonetic Alphabet (IPA): for precise pronunciation control

Examples

Regex: \bLouvre\b
Replacement: /ˈluːvrə/
Case sensitive: FALSE

Regex: (\d{3})[ -]?(\d{3})[ -]?(\d{4})
Replacement: \1 <break time="0.5s" /> \2 <break time="0.5s" /> \3
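
To see what these two rules do to a sentence before it reaches TTS, they can be tried locally. The sketch below uses Python's `re` module as a stand-in and assumes the platform's regex engine treats these patterns and `\1`-style backreferences the same way. The `RULES` list and `apply_pronunciations` are hypothetical names, and the case-sensitivity setting for the phone-number rule is not given in the example, so it is assumed here (it makes no difference for digits).

```python
import re

# (pattern, replacement, case_sensitive), copied from the examples above.
RULES = [
    (r"\bLouvre\b", "/ˈluːvrə/", False),
    (
        r"(\d{3})[ -]?(\d{3})[ -]?(\d{4})",
        r'\1 <break time="0.5s" /> \2 <break time="0.5s" /> \3',
        True,  # assumed; not stated in the source example
    ),
]


def apply_pronunciations(text, rules=RULES):
    """Return the text as it would be handed to TTS; the input is not modified."""
    spoken = text
    for pattern, replacement, case_sensitive in rules:
        flags = 0 if case_sensitive else re.IGNORECASE
        spoken = re.sub(pattern, replacement, spoken, flags=flags)
    return spoken


print(apply_pronunciations("Tickets for the louvre: call 555-123-4567"))
# Tickets for the /ˈluːvrə/: call 555 <break time="0.5s" /> 123 <break time="0.5s" /> 4567
```

The function returns a new string, which mirrors the point that pronunciations change what is spoken without changing the underlying text.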

Best practices

Incremental

Add pronunciations one at a time

Test thoroughly

Test each change in Call before adding more

Keep it simple

Prefer clarity over cleverness—overly complex regex is hard to maintain
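
One way to follow these practices is to check each new rule against a handful of sentences before adding the next one, including sentences that should stay untouched. The sketch below is an offline aid only and does not replace testing in Call. `check_rule` is a hypothetical helper, and Python's `re` module is assumed to behave like the platform's regex engine for simple patterns.

```python
import re


def check_rule(pattern, replacement, cases, case_sensitive=True):
    """Return the (input, expected, actual) triples where the rule misbehaved."""
    flags = 0 if case_sensitive else re.IGNORECASE
    failures = []
    for text, expected in cases:
        actual = re.sub(pattern, replacement, text, flags=flags)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


louvre_cases = [
    ("The Louvre opens at nine", "The /ˈluːvrə/ opens at nine"),
    ("the louvre is closed", "the /ˈluːvrə/ is closed"),
    ("Adjust the louvres", "Adjust the louvres"),  # must stay unchanged
]
failures = check_rule(r"\bLouvre\b", "/ˈluːvrə/", louvre_cases, case_sensitive=False)
print(failures or "all cases pass")
```

Keeping the unchanged-sentence cases alongside the positive ones makes it easier to notice when a broader pattern starts matching text it should not.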

Verification checklist

After any voice or phrasing change:
  • Start a new call session
  • Confirm you are hearing updated audio, not a cached version
  • Validate that turn-taking still feels natural after changing latency or barge-in
  • Check that mispronounced terms are now corrected consistently
  • Check that pauses improve comprehension rather than slowing the call excessively