Conversation flow
Understanding the step-by-step processing of a PolyAI voice agent.
This page explains how a PolyAI agent processes a conversation, from caller input to response generation.
The agent’s initial greeting is hardcoded and sent directly to TTS (Text-to-Speech) without running the LLM or processing any Rules. Write the greeting in the language you expect callers to hear. Any rules and logic begin after the greeting.
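The greeting behavior can be pictured with a minimal sketch. The names `GREETING`, `next_agent_utterance`, and `fake_llm` are hypothetical and are not part of the PolyAI platform; the sketch only illustrates that the first utterance skips the LLM.

```python
from typing import Callable

GREETING = "Hello, thanks for calling. How can I help?"  # hypothetical greeting text


def next_agent_utterance(turn_index: int, transcript: str, run_llm: Callable[[str], str]) -> str:
    """Return the fixed greeting on turn 0; use the LLM for every later turn."""
    if turn_index == 0:
        return GREETING  # sent straight to TTS, no LLM or Rules involved
    return run_llm(transcript)


llm_calls: list[str] = []


def fake_llm(transcript: str) -> str:
    """Stand-in for a real model call; records each transcript it receives."""
    llm_calls.append(transcript)
    return f"You said: {transcript}"


first = next_agent_utterance(0, "", fake_llm)
second = next_agent_utterance(1, "I'd like to book a table", fake_llm)
```

After both calls, `llm_calls` contains only the second transcript, because the greeting never reached the model.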
Processing stages
A conversation moves through the following stages:
1. Input and processing
- Caller: The caller speaks into their device.
- Audio Stream: The spoken input is captured and sent for transcription.
- ASR Provider: The system receives the raw audio.
- ASR Service: Converts the audio into text.
- ASR Processing: Searches for transcription issues and applies any relevant corrections.
- Transcript Text → Corrected Transcript: The corrected transcript is passed to Retrieval.
- Retrieval: Pulls relevant topics from the knowledge base to provide context for the response.
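As a rough illustration of the last three steps, the sketch below corrects a transcript and then looks up matching topics. The correction table, the knowledge base contents, and the word-overlap scoring are all assumptions made for the example; the source does not describe how the platform implements ASR corrections or retrieval.

```python
import re

# Hypothetical correction table: common mis-transcriptions -> intended text.
CORRECTIONS = {
    "poly a i": "PolyAI",
    "check in": "check-in",
}

# Hypothetical knowledge base: topic name -> content.
KNOWLEDGE_BASE = {
    "opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "check-in": "Check-in starts at 3pm.",
    "parking": "Free parking is available on site.",
}


def correct_transcript(text: str) -> str:
    """Apply each known correction as a whole-phrase, case-insensitive replace."""
    for wrong, right in CORRECTIONS.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)
    return text


def retrieve_topics(transcript: str, top_k: int = 2) -> list[str]:
    """Rank topics by how many words of the topic name appear in the transcript."""
    words = set(re.findall(r"[a-z0-9-]+", transcript.lower()))
    scored = []
    for name in KNOWLEDGE_BASE:
        overlap = len(words & set(name.split()))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


raw = "what time is check in and is there parking"
corrected = correct_transcript(raw)
topics = retrieve_topics(corrected)
```

Here the correction step turns "check in" into "check-in", which is what allows the retrieval step to match the "check-in" topic.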
2. Compute prompt and generate response
- Compute Prompt: The system builds an LLM prompt using retrieved topics, system knowledge, and conversation history.
- Run LLM: The LLM processes the request and determines whether to return:
  - Returned Text: A direct text response.
  - Returned Function: A function call.
- Execute Function (if applicable): Runs the function and passes the result back to the LLM.
- LLM Refinement: If a function result is returned, the LLM updates its response before proceeding.
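The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: `LLMResult`, `build_prompt`, `fake_llm`, `FUNCTIONS`, and `run_turn` are hypothetical names, and the model is replaced by a stub that requests a function until it sees a function result in the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LLMResult:
    """Either plain text or a request to call a named function."""
    text: Optional[str] = None
    function_name: Optional[str] = None
    arguments: dict = field(default_factory=dict)


def build_prompt(topics: list[str], history: list[str], system: str) -> str:
    """Join system knowledge, retrieved topics, and history into one prompt."""
    return "\n".join([system, "Topics: " + "; ".join(topics), *history])


def fake_llm(prompt: str) -> LLMResult:
    """Stub model: asks for a function until a function result is in the prompt."""
    if "FUNCTION_RESULT" in prompt:
        return LLMResult(text="Your booking is confirmed for 7pm.")
    return LLMResult(function_name="lookup_booking", arguments={"name": "Sam"})


# Hypothetical function registry available to the agent.
FUNCTIONS: dict[str, Callable[..., str]] = {
    "lookup_booking": lambda name: f"{name}: table for two at 7pm",
}


def run_turn(topics: list[str], history: list[str],
             system: str = "You are a helpful agent.", max_calls: int = 3) -> str:
    """Run the LLM, executing any requested function and feeding the result back."""
    history = list(history)  # copy so the caller's list is not mutated
    for _ in range(max_calls + 1):
        result = fake_llm(build_prompt(topics, history, system))
        if result.text is not None:
            return result.text
        output = FUNCTIONS[result.function_name](**result.arguments)
        history.append(f"FUNCTION_RESULT {result.function_name}: {output}")
    raise RuntimeError("too many function calls in one turn")


reply = run_turn(["bookings"], ["Caller: Do I have a booking?"])
```

The `max_calls` cap is a design choice for the sketch: it stops a model that keeps requesting functions from looping indefinitely.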
3. Streaming and chunking
- Chunk LLM Output: The response is broken into chunks before being sent to text-to-speech.
- Postprocess Chunks: Applies rules such as stop keywords to remove unnecessary phrases.
- Stream Partial Responses: The system sends chunks as soon as they are ready, rather than waiting for the full response.
- TTS Service: Converts text chunks into speech.
- TTS Provider: Streams the synthesized speech back to the caller.
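A minimal sketch of the chunking and postprocessing steps is shown below. It assumes chunks are cut at sentence boundaries and that stop keywords are plain phrases removed by string replacement; the source does not specify either detail, and `STOP_KEYWORDS`, `chunk_by_sentence`, and `postprocess` are hypothetical names.

```python
import re
from typing import Iterable, Iterator

# Hypothetical stop keywords: phrases to strip before speech synthesis.
STOP_KEYWORDS = ["As an AI,", "Sure thing!"]


def chunk_by_sentence(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield a chunk at each sentence end."""
    buffer = ""
    for token in tokens:
        buffer += token
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


def postprocess(chunks: Iterable[str]) -> Iterator[str]:
    """Remove stop keywords and drop chunks that end up empty."""
    for chunk in chunks:
        for keyword in STOP_KEYWORDS:
            chunk = chunk.replace(keyword, "")
        chunk = " ".join(chunk.split())  # collapse leftover whitespace
        if chunk:
            yield chunk


tokens = ["Sure ", "thing! ", "We ", "open ", "at ", "nine. ",
          "As an AI, ", "I ", "can ", "help."]
spoken = list(postprocess(chunk_by_sentence(tokens)))
```

With this input, the first chunk consists only of a stop keyword and is dropped, so two chunks remain to be sent to TTS.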
4. Post-processing and handoff
- Live Handoff (if applicable): If escalation is needed, the agent triggers a live handoff.
- Conversation Logs: The system stores conversation history and logs for analytics.
- Final Response: The caller hears the response as it streams, without waiting for the entire message to be generated.
Advanced: How response streaming works
PolyAI agents don’t wait for the full response before speaking. Instead, responses are processed and streamed in real time:
- LLM Streaming: Words are generated and sent continuously.
- Chunking: Before reaching TTS, responses are broken into chunks for controlled delivery.
- Postprocessing: Stop keywords remove unnecessary phrases before they are spoken.
- TTS Streaming: The caller hears speech as soon as it’s processed, rather than waiting for the entire response.
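The ordering this implies can be demonstrated with chained Python generators. The sketch is an assumption-laden illustration only: `llm_stream` and `tts_stream` are hypothetical stand-ins, the "audio" is just encoded text, and the `events` list exists purely to record the order in which things happen.

```python
from typing import Iterator

events: list[str] = []  # records the order of generation and playback


def llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM that yields one sentence at a time."""
    for sentence in ["Thanks for calling.", "How can I help?"]:
        events.append(f"llm generated: {sentence}")
        yield sentence


def tts_stream(chunks: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: consumes chunks as soon as they arrive."""
    for chunk in chunks:
        events.append(f"tts spoke: {chunk}")
        yield chunk.encode("utf-8")  # placeholder for synthesized audio bytes


audio = list(tts_stream(llm_stream()))
```

Because generators are lazy, `events` shows the first sentence being spoken before the second one is generated, which is the behavior the list above describes.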
Watch it in action
This video visualizes the conversation flow, showing how responses are processed, chunked, and streamed: