This page provides a high-level overview of how PolyAI’s conversational AI system works. Understanding this architecture helps you design more effective agents and troubleshoot issues.

How conversations flow

When a caller connects to your PolyAI agent, the conversation passes through several key stages:

1. Telephony layer

The telephony layer handles the phone connection between the caller and your agent. PolyAI supports multiple telephony providers including Twilio, Amazon Connect, and SIP-based systems.

2. Speech recognition (ASR)

The caller’s speech is converted to text using automatic speech recognition (ASR). PolyAI uses advanced models optimized for conversational accuracy, with support for:
  • Multiple languages and accents
  • Industry-specific vocabulary
  • Real-time transcription
  • ASR biasing and keyphrase boosting for domain-specific terms
See also: ASR, ASR biasing
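To illustrate what biasing achieves, here is a toy re-scoring pass that favors recognizer hypotheses containing domain keyphrases. This is a conceptual sketch only: real ASR biasing happens inside the recognizer itself, and the function, weights, and scores below are illustrative, not PolyAI's API.

```python
# Illustrative only: boost ASR hypotheses that contain domain keyphrases.
# Real biasing is applied inside the recognizer; this toy version just
# re-scores a list of (text, score) candidates after the fact.

def rescore(hypotheses, keyphrases, boost=0.1):
    """Return the hypothesis text with the best boosted score.

    hypotheses: list of (text, score) pairs from the recognizer.
    keyphrases: domain terms to favor (e.g. menu items, product names).
    """
    def boosted(item):
        text, score = item
        hits = sum(1 for phrase in keyphrases if phrase in text.lower())
        return score + boost * hits
    return max(hypotheses, key=boosted)[0]

# Without biasing, the acoustically likelier (but wrong) hypothesis wins:
hyps = [("book a table for two", 0.62), ("book a cable for two", 0.64)]
print(rescore(hyps, ["table", "reservation"]))  # → book a table for two
```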

3. Agent service

The agent service is the core of the system. It receives the transcribed user input and coordinates:
  • Language understanding (NLU): Interprets what the user said, identifies their intent, and extracts entities
  • Decision making (Policy engine): Determines the appropriate response based on your configured Managed Topics, flows, and rules by executing nodes in priority order
  • Action execution: Triggers any necessary function calls or API integrations
  • Context management: Maintains dialogue context and turn history throughout the conversation
See also: NLU, Policy engine, Node

4. Response generation

Based on the policy engine’s output, the system generates an appropriate response using your agent’s configured voice, tone, and knowledge. This may involve:
  • Retrieving relevant information using RAG (Retrieval-Augmented Generation)
  • Applying global rules and response control filters
  • Generating contextually appropriate responses via the LLM
See also: RAG, LLM, Response control
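The RAG step can be sketched as retrieval followed by prompt assembly. The scoring and prompt format below are illustrative assumptions (real systems typically use embedding similarity rather than word overlap), not PolyAI's actual retrieval behavior.

```python
# Illustrative retrieval-augmented generation: score knowledge snippets
# against the user query, then build an LLM prompt from the top matches.
# Word overlap stands in for real embedding-based similarity here.

def retrieve(query, snippets, k=2):
    words = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, snippets):
    context = "\n".join(f"- {s}" for s in retrieve(query, snippets))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = ["Opening hours are 9am to 5pm.",
      "We are closed on public holidays.",
      "Gift cards never expire."]
prompt = build_prompt("what are your opening hours", kb)
print("9am to 5pm" in prompt)  # → True
```

The prompt then goes to the LLM, with global rules and response control filters applied to the result.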

5. Text-to-speech (TTS)

The generated response is converted to natural-sounding speech and played back to the caller. PolyAI supports:
  • Multiple TTS providers and custom voices
  • SSML markup for fine-grained control over pronunciation, pauses, and emphasis
  • Custom pronunciations using IPA notation
See also: TTS, SSML, Pronunciations
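For example, a generic SSML fragment (standard SSML elements, not PolyAI-specific syntax) can insert a pause, force an IPA pronunciation, and add emphasis:

```xml
<speak>
  Your booking is confirmed.
  <break time="400ms"/>
  We look forward to seeing you at
  <phoneme alphabet="ipa" ph="ˈnɒtɪŋəm">Nottingham</phoneme>
  <emphasis level="moderate">tomorrow</emphasis> at 7 pm.
</speak>
```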

Data storage

During a conversation, PolyAI maintains several types of data:
| Data type | Purpose | Retention |
| --- | --- | --- |
| Dialogue context | Tracks the full dialogue history, state variables, and turn data for the current call | Duration of call |
| Turn data | Stores individual exchanges (user input, agent response, intents, entities) for analytics and review | Configurable |
| Conversation metadata | Records conversation-level information (duration, variant, environment) | Configurable |
| Metrics | Records events for reporting and dashboards | Configurable |
See also: Dialogue context, Turn, Conversation metadata
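As a rough illustration, a turn record and a conversation metadata record might look like the following. Every field name here is a hypothetical example of the kinds of information listed above, not PolyAI's actual storage schema.

```python
# Hypothetical examples of stored record types; field names are
# illustrative, not PolyAI's actual schema.

turn_record = {
    "user_input": "I'd like to book a table",
    "agent_response": "For how many people?",
    "intent": "make_booking",
    "entities": {},
}

conversation_metadata = {
    "conversation_id": "call-123",
    "duration_seconds": 184,
    "variant": "A",
    "environment": "sandbox",
}

print(sorted(conversation_metadata))  # metadata keys, alphabetically
```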

Key components you configure

As a builder in Agent Studio, you control how the agent behaves through:
  • Managed Topics: Information the agent uses to answer questions
  • Flows: Structured conversation paths for complex tasks
  • Functions: Custom logic and external integrations
  • Rules: Global behavior constraints
  • Voice settings: How the agent sounds

Processing a single turn

Each turn in a conversation follows this sequence:
1. Receive input: The system captures and transcribes the caller’s speech using ASR.
2. Understand intent: The NLU component analyzes what the caller wants and extracts entities.
3. Retrieve knowledge: Relevant information is fetched from your Managed Topics using RAG (Retrieval-Augmented Generation) via the Ragdoll service.
4. Execute logic: The policy engine evaluates nodes and executes any active flows or functions.
5. Generate response: The LLM composes a response based on all available context, applying global rules and response control filters.
6. Deliver response: The response is synthesized to speech via TTS and played to the caller.
See also: Turn, Policy engine
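The per-turn sequence can be sketched as a single loop body. Every function name below is an illustrative placeholder standing in for a real subsystem, not PolyAI's API; the trivial stubs exist only so the sketch runs end to end.

```python
# Illustrative end-to-end turn: each helper stands in for a real
# subsystem (ASR, NLU, RAG, policy engine, LLM, TTS).
# All names are placeholders, not PolyAI's actual API.

def process_turn(audio, context, knowledge_base):
    text = transcribe(audio)                    # 1. receive input (ASR)
    intent, entities = understand(text)         # 2. understand intent (NLU)
    snippets = retrieve(text, knowledge_base)   # 3. retrieve knowledge (RAG)
    action = decide(intent, entities, context)  # 4. execute logic (policy engine)
    reply = generate(action, snippets, context) # 5. generate response (LLM)
    return synthesize(reply)                    # 6. deliver response (TTS)

# Trivial stand-ins so the sketch is runnable:
def transcribe(audio): return audio             # pretend audio is already text
def understand(text): return ("faq", {})
def retrieve(text, kb): return kb[:1]
def decide(intent, entities, ctx): return "answer_faq"
def generate(action, snippets, ctx): return f"Sure: {snippets[0]}"
def synthesize(reply): return reply             # pretend TTS returns audio

print(process_turn("what are your hours", {}, ["We open at 9am."]))
# → Sure: We open at 9am.
```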