
The agent’s initial greeting is hardcoded and sent directly to TTS (Text-to-Speech) without running the LLM or processing any Rules. Write the greeting in the language you expect callers to hear. Rules and other agent logic take effect only after the greeting.
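As a rough illustration, the sketch below shows a hardcoded greeting being sent straight to text-to-speech before any LLM or rule processing runs. The class and function names are hypothetical stand-ins, not part of the PolyAI platform.

```python
# Hypothetical sketch: the greeting bypasses the LLM and all rule processing.
GREETING = "Thanks for calling Example Hotel. How can I help you today?"

class StubTTS:
    """Stand-in for a real TTS client (an assumption, not the PolyAI API)."""
    def speak(self, text: str) -> None:
        print(f"[TTS] {text}")

def start_call(tts: StubTTS) -> None:
    # The greeting is spoken as-is; the LLM and rules only run
    # once the caller responds.
    tts.speak(GREETING)

start_call(StubTTS())
```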
Processing stages
A conversation moves through the following stages:
1. Input and processing
- Caller: The caller speaks into their device.
- Audio Stream: The spoken input is captured and sent for transcription.
- ASR Provider: The system receives the raw audio.
- ASR Service: Converts the audio into text.
- ASR Processing: Searches for transcription issues and applies any relevant corrections.
- Transcript Text → Corrected Transcript: The corrected transcript is passed to Retrieval.
- Retrieval: Pulls relevant topics from the knowledge base to provide context for the response.
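A minimal sketch of the last two steps of stage 1, assuming simple string-replacement corrections and a keyword-overlap retriever. The correction table, knowledge base, and function names are illustrative, not the production services.

```python
# Illustrative stand-ins for ASR correction rules and a knowledge base.
CORRECTIONS = {"poly eye": "PolyAI", "check in": "check-in"}
KNOWLEDGE_BASE = {
    "check-in": "Check-in opens at 3 pm.",
    "parking": "On-site parking costs $20 per night.",
}

def correct_transcript(raw: str) -> str:
    # Apply known transcription fixes before the text reaches retrieval.
    text = raw.lower()
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text

def retrieve_topics(transcript: str) -> list[str]:
    # Naive keyword-overlap retrieval: return knowledge-base entries
    # whose topic appears in the corrected transcript.
    return [body for topic, body in KNOWLEDGE_BASE.items() if topic in transcript]

corrected = correct_transcript("What time is check in at poly eye hotel?")
print(retrieve_topics(corrected))  # ['Check-in opens at 3 pm.']
```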
2. Compute prompt and generate response
- Compute Prompt: The system builds an LLM prompt using retrieved topics, system knowledge, and conversation history.
- Run LLM: The LLM processes the request and determines whether to return:
  - Returned Text: A direct text response.
  - Returned Function: A function call.
- Execute Function (if applicable): Runs the function and passes the result back to the LLM.
- LLM Refinement: If a function result is returned, the LLM updates its response before proceeding.
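The sketch below illustrates stage 2 under the assumption that the LLM either returns text or requests a function call whose result is fed back for refinement. The prompt format, function name, and response shape are invented for illustration, not the real protocol.

```python
# Illustrative sketch of stage 2: build a prompt, run a stubbed LLM,
# execute a requested function, and refine the answer with its result.
def compute_prompt(topics: list[str], system_knowledge: str, history: list[str]) -> str:
    # Combine retrieved topics, system knowledge, and conversation history.
    return "\n".join([system_knowledge, *topics, *history])

def run_llm(prompt: str, function_result: str | None = None) -> dict:
    # Stub standing in for the model. First pass: request a function call.
    # Second pass (with a function result): return refined text.
    if function_result is None:
        return {"type": "function", "name": "lookup_booking", "args": {"ref": "ABC123"}}
    return {"type": "text", "text": f"I checked for you: {function_result}"}

def execute_function(name: str, args: dict) -> str:
    # Stand-in for a real integration call.
    return f"Booking {args['ref']} is confirmed for 2 nights."

def generate_response(topics: list[str], system_knowledge: str, history: list[str]) -> str:
    prompt = compute_prompt(topics, system_knowledge, history)
    result = run_llm(prompt)
    if result["type"] == "function":
        # Execute the requested function, then let the LLM refine its answer.
        output = execute_function(result["name"], result["args"])
        result = run_llm(prompt, function_result=output)
    return result["text"]

print(generate_response(
    ["Bookings can be checked by reference number."],
    "You are a hotel voice agent.",
    ["Caller: Can you check booking ABC123?"],
))
```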
3. Streaming and chunking
- Chunk LLM Output: The response is broken into chunks before being sent to text-to-speech.
- Postprocess Chunks: Applies rules such as stop keywords to remove unnecessary phrases.
- Stream Partial Responses: The system sends chunks as soon as they are ready, rather than waiting for the full response.
- TTS Service: Converts text chunks into speech.
- TTS Provider: Streams the synthesized speech back to the caller.
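A hedged sketch of stage 3, assuming sentence-boundary chunking and a simple stop-keyword rule. The regular expression, keyword list, and TTS stand-in are illustrative only.

```python
import re

STOP_KEYWORDS = ["STOP"]  # illustrative; real stop keywords come from the agent's rules

def chunk_output(text: str) -> list[str]:
    # Split on sentence boundaries so TTS receives natural-sounding units.
    return [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]

def postprocess(chunks: list[str]) -> list[str]:
    # Drop everything from the first stop keyword onwards.
    kept = []
    for chunk in chunks:
        if any(keyword in chunk for keyword in STOP_KEYWORDS):
            break
        kept.append(chunk)
    return kept

def stream_to_tts(chunks: list[str]) -> None:
    for chunk in postprocess(chunks):
        # Each chunk is sent to TTS as soon as it is ready,
        # rather than waiting for the full response.
        print(f"[TTS] {chunk}")

stream_to_tts(chunk_output("Your room is booked. Anything else I can help with? STOP internal note"))
```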
4. Post-processing and handoff
- Live Handoff (if applicable): If escalation is needed, the agent triggers a live handoff.
- Conversation Logs: The system stores conversation history and logs for analytics.
- Final Response: The caller hears the completed response as it streams, without waiting for the entire message.
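For illustration only, the snippet below sketches end-of-turn handling: an optional live handoff followed by appending the turn to a log file. The handoff mechanism and log format are assumptions, not the platform's actual implementation.

```python
import json
import time

def finalize_turn(response_text: str, needs_handoff: bool,
                  log_path: str = "conversation_log.jsonl") -> None:
    # Trigger a live handoff if escalation is needed (stubbed as a print),
    # then append the turn to a log file for analytics.
    if needs_handoff:
        print("[HANDOFF] Transferring the caller to a live agent...")
    record = {
        "timestamp": time.time(),
        "agent_response": response_text,
        "handoff": needs_handoff,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")

finalize_turn("Your booking is confirmed.", needs_handoff=False)
```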
Advanced: How response streaming works
PolyAI agents don’t wait for the full response before speaking. Instead, responses are processed and streamed in real time:
- LLM Streaming: Words are generated and sent continuously.
- Chunking: Before reaching TTS, responses are broken into chunks for controlled delivery.
- Postprocessing: Stop keywords remove unnecessary phrases before they are spoken.
- TTS Streaming: The caller hears speech as soon as it’s processed, rather than waiting for the entire response.
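The generator-based sketch below illustrates the streaming idea: words arrive continuously from a stubbed LLM, are grouped into small chunks, and are handed to TTS as soon as each chunk is complete. All names and the chunk-size heuristic are assumptions for illustration.

```python
import time
from typing import Iterator

def llm_stream() -> Iterator[str]:
    # Stand-in for the LLM emitting words continuously.
    for word in "Your table is booked for seven tonight.".split():
        yield word
        time.sleep(0.05)  # simulate generation latency

def chunk_stream(words: Iterator[str], chunk_size: int = 4) -> Iterator[str]:
    # Group words into small chunks for controlled delivery to TTS.
    buffer: list[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) >= chunk_size or word.endswith((".", "!", "?")):
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)

for chunk in chunk_stream(llm_stream()):
    # The caller hears each chunk as soon as it is synthesized,
    # not after the whole response has been generated.
    print(f"[TTS] {chunk}")
```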