The PolyAI platform supports flexible voice selection for external providers such as ElevenLabs, AWS Polly, and Microsoft Azure TTS.

Provider classes

To pick a model, adjust stability, or access a third-party provider, use the provider-specific TTSVoice classes. You can also set the optional clarity and latency_mode parameters to match the agent’s interaction style.

Example: ElevenLabs

from polyai.voice import ElevenLabsVoice

conv.set_voice(
    ElevenLabsVoice(
        provider_voice_id="gDnGxUcsitTxRiGHr904",
        model_id="eleven_flash_v2",
        stability=0.5,
        similarity_boost=0.7,
        clarity=0.8,            # Optional: controls crispness of enunciation
        latency_mode="swift",   # Optional: aligns with interaction-style settings
    )
)

Example: AWS Polly

from polyai.voice import PollyVoice

conv.set_voice(
    PollyVoice(
        provider_voice_id="Joanna",
        engine="neural",
        clarity=0.6,
    )
)

Example: Microsoft Azure TTS

from polyai.voice import AzureVoice

conv.set_voice(
    AzureVoice(
        provider_voice_id="en-US-JennyNeural",
        style="cheerful",
        role="customer-service-rep",
        clarity=0.9,
    )
)

Example: Cartesia

from polyai.voice import CartesiaVoice, Emotion, EmotionKind, EmotionIntensity

conv.set_voice(
    CartesiaVoice(
        provider_voice_id="a1b2c3d4",
        speed=0.0,  # -1.0 (slowest) to 1.0 (fastest)
        emotions=[
            Emotion(EmotionKind.POSITIVITY, EmotionIntensity.HIGH)
        ],
        model_id="sonic"  # or "sonic-preview"
    )
)
Emotion options:
  • EmotionKind: ANGER, POSITIVITY, SURPRISE
  • EmotionIntensity: LOWEST, LOW, HIGH, HIGHEST

Example: Rime

from polyai.voice import RimeVoice

conv.set_voice(
    RimeVoice(
        provider_voice_id="voice_id",
        speech_alpha=1.0,  # <1.0 faster, >1.0 slower
        model_id="mistv2"  # or "mist"
    )
)

Example: Minimax

from polyai.voice import MinimaxVoice

conv.set_voice(
    MinimaxVoice(
        model_id="speech-02-hd",  # or speech-02-turbo, speech-01-hd, speech-01-turbo
        voice_id="voice_id",
        speed=1.0,      # 0.5-2.0
        vol=1.0,        # 0-10
        pitch=0,        # -12 to 12
        emotion="happy" # happy, sad, angry, fearful, disgusted, surprised, neutral
    )
)
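
The ranges in the comments above can be checked before the voice is set. The following is a minimal sketch, not part of the PolyAI SDK: `check_minimax_params` and `MINIMAX_EMOTIONS` are hypothetical names, and treating the range endpoints as inclusive is an assumption.

```python
# Hypothetical helper, not a PolyAI API. The ranges are copied from the
# comments in the Minimax example; inclusive endpoints are an assumption.
MINIMAX_EMOTIONS = {
    "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral",
}


def check_minimax_params(speed: float, vol: float, pitch: int, emotion: str) -> list[str]:
    """Return a list of readable problems; an empty list means all values are in range."""
    problems = []
    if not 0.5 <= speed <= 2.0:
        problems.append(f"speed {speed} is outside 0.5-2.0")
    if not 0 <= vol <= 10:
        problems.append(f"vol {vol} is outside 0-10")
    if not -12 <= pitch <= 12:
        problems.append(f"pitch {pitch} is outside -12 to 12")
    if emotion not in MINIMAX_EMOTIONS:
        problems.append(f"emotion {emotion!r} is not a listed option")
    return problems


# The values from the example above pass; an out-of-range speed and an
# unlisted emotion are both reported.
print(check_minimax_params(speed=1.0, vol=1.0, pitch=0, emotion="happy"))
print(check_minimax_params(speed=3.0, vol=1.0, pitch=0, emotion="calm"))
```

Returning a list of problems rather than raising on the first one makes it easier to report every out-of-range value at once.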

Example: Hume

from polyai.voice import HumeVoice

conv.set_voice(
    HumeVoice(
        provider_voice_id="voice_uuid_or_name",
        voice_description="patient, empathetic counselor",  # Optional
        version="2",        # "1" for octave-1, "2" for octave-2
        instant_mode=False, # Ultra-low latency mode
        provider="HUME_AI"  # "CUSTOM_VOICE" or "HUME_AI"
    )
)

Example: Google TTS

from polyai.voice import GoogleVoice

conv.set_voice(
    GoogleVoice(
        provider_voice_id="ja-JP-Neural2-B",
        gender="male"  # "male", "female", or "neutral"
    )
)

Example: Custom provider

from polyai.voice import CustomVoice

conv.set_voice(
    CustomVoice(
        provider="MY_PROVIDER",
        provider_voice_id="voice_id",
        custom_param="value"  # Any additional kwargs
    )
)

Voice randomization

Use VoiceWeighting to randomly select a voice based on weighted probabilities:
from polyai.voice import VoiceWeighting, ElevenLabsVoice

conv.randomize_voice([
    VoiceWeighting(
        voice=ElevenLabsVoice(provider_voice_id="voice1"),
        weight=0.7
    ),
    VoiceWeighting(
        voice=ElevenLabsVoice(provider_voice_id="voice2"),
        weight=0.3
    ),
])
  • Weights must sum to 1.0.
  • Voices without explicit weights share the remaining probability (1.0 minus the sum of the explicit weights) equally.
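
The two rules above can be illustrated with plain Python. This is a sketch of the arithmetic only, under the assumption that a missing weight is represented as `None`; `resolve_weights` and `pick_voice` are hypothetical helpers and do not describe how `randomize_voice()` is implemented.

```python
import random
from typing import Optional, Sequence


def resolve_weights(weights: Sequence[Optional[float]]) -> list[float]:
    """Fill in missing weights so that the full list sums to 1.0."""
    explicit_total = sum(w for w in weights if w is not None)
    if explicit_total > 1.0 + 1e-9:
        raise ValueError("explicit weights exceed 1.0")
    missing = sum(1 for w in weights if w is None)
    if missing == 0:
        # Every voice has a weight, so the weights must already sum to 1.0.
        if abs(explicit_total - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1.0")
        return list(weights)
    # Unweighted voices split whatever probability is left over.
    share = (1.0 - explicit_total) / missing
    return [share if w is None else w for w in weights]


def pick_voice(voice_ids: Sequence[str],
               weights: Sequence[Optional[float]],
               rng: random.Random) -> str:
    """Draw one voice ID according to the resolved weights."""
    resolved = resolve_weights(weights)
    return rng.choices(list(voice_ids), weights=resolved, k=1)[0]


# One explicit weight of 0.5; the other two voices get 0.25 each.
print(resolve_weights([0.5, None, None]))
print(pick_voice(["voice1", "voice2", "voice3"], [0.5, None, None], random.Random(0)))
```

Passing a seeded `random.Random` keeps the draw reproducible in tests while leaving production code free to use an unseeded generator.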

Cache behavior

  • Changing model_id does not automatically invalidate cached audio.
  • To reset cached audio:
    • Go to Audio → Cache and delete existing entries.
    • Or, create a new voice entry with a different voice_id.
    • You can prepend the model ID to the voice ID (e.g. eleven_flash_v2/a1b2c3...) if you want to isolate caches across models.
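
As a sketch of the prefixing convention in the last bullet, the helper below builds a voice ID of the form model ID, slash, voice ID. The function name `cache_isolated_voice_id` is hypothetical, and whether a given provider accepts the combined string as a voice ID is not confirmed here.

```python
def cache_isolated_voice_id(model_id: str, voice_id: str) -> str:
    """Prefix the voice ID with the model ID so each model gets its own cache entries."""
    if not model_id or not voice_id:
        raise ValueError("model_id and voice_id must be non-empty")
    return f"{model_id}/{voice_id}"


# "voice_id" is a placeholder, not a real voice.
print(cache_isolated_voice_id("eleven_flash_v2", "voice_id"))
```

Because the model ID is part of the resulting string, switching models yields a different voice ID and therefore does not reuse audio cached under the previous model.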

Additional options

  • clarity – fine-tunes articulation sharpness per utterance (0.0–1.0).
  • latency_mode – chooses a response profile (“swift”, “balanced”, “precise”, “turbo”) consistent with Interaction style.
  • stability – controls tone variability across runs.
  • randomize_voice() – supports external providers for weighted selection.