# Safety

Ensure your agent handles risky conversations effectively.

The safety dashboard helps you monitor safety-related metrics, track risky conversations, and evaluate how well your agent handles harmful content. It's essential for ensuring your assistant meets brand standards and safety expectations.
## Metrics
- Caller utterance risk level: Shows how risky incoming messages are and how well the agent manages them.
- Total calls: Total number of calls during the selected period.
- Number of calls managed for risk: The number of calls in which safety filters were triggered.
- Percentage of calls managed for risk: The share of total calls that involved flagged content (see the sketch after this list).
- Distribution of flagged calls: Highlights trends in flagged calls over time.
- Distribution count of flagged calls: Shows peaks in flagged call volume.
- Caller utterance category distribution:
  - Broken down into hate, self-harm, sexual content, and violence.
  - Uses color-coded visuals for easy tracking.
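
The count and percentage metrics are simple aggregations that the dashboard computes for you. For illustration only, here is a minimal Python sketch, assuming a hypothetical call record with a `flagged` field (not PolyAI's actual data schema):

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    """Hypothetical call record; field names are illustrative, not PolyAI's schema."""
    call_id: str
    flagged: bool  # True if any safety filter was triggered during the call

def risk_metrics(calls: list[CallRecord]) -> dict:
    """Reproduce the dashboard's count and percentage metrics."""
    total = len(calls)
    managed = sum(1 for call in calls if call.flagged)
    return {
        "total_calls": total,
        "calls_managed_for_risk": managed,
        "pct_calls_managed_for_risk": 100 * managed / total if total else 0.0,
    }

calls = [CallRecord("a", True), CallRecord("b", False),
         CallRecord("c", False), CallRecord("d", True)]
print(risk_metrics(calls))
# {'total_calls': 4, 'calls_managed_for_risk': 2, 'pct_calls_managed_for_risk': 50.0}
```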
## Editing safety filters
To manage your filters, go to Settings in the sidebar.
PolyAI content filters are designed to catch harmful input from users and prevent inappropriate output from your assistant. Filters combine PolyAI’s models with third-party services like Azure OpenAI to keep conversations safe.
### How filters work
Content filters run on both sides of the conversation:
- User input: Catches toxic or inappropriate speech before it reaches the assistant.
- AI output: Prevents the assistant from responding with anything unsafe or non-compliant.
Filtering happens in real time and targets specific categories of risky content.
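
PolyAI manages this pipeline for you, but the mechanics can be made concrete with Azure AI Content Safety, the service that powers Azure OpenAI's content filtering. The sketch below screens both sides of a turn; the endpoint, key, and `generate_reply` callback are placeholders, and the blocking threshold is arbitrary:

```python
# pip install azure-ai-contentsafety
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key; in a PolyAI deployment this integration is managed for you.
client = ContentSafetyClient(
    "https://<your-resource>.cognitiveservices.azure.com",
    AzureKeyCredential("<your-key>"),
)

def is_unsafe(text: str, max_severity: int = 2) -> bool:
    """Screen text across hate, sexual, violence, and self-harm.

    Severity is numeric (higher is riskier); anything above the
    threshold is treated as unsafe. The threshold here is arbitrary.
    """
    analysis = client.analyze_text(AnalyzeTextOptions(text=text))
    return any((item.severity or 0) > max_severity for item in analysis.categories_analysis)

def handle_turn(user_input: str, generate_reply) -> str:
    if is_unsafe(user_input):   # user input: screened before the assistant sees it
        return "I'm sorry, I can't help with that."
    reply = generate_reply(user_input)
    if is_unsafe(reply):        # AI output: screened before it reaches the caller
        return "I'm sorry, I can't help with that."
    return reply
```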
### Filtering categories and severity levels
Filters target four core risk categories:
- Hate
- Sexual
- Violence
- Self-harm
Each category has four severity levels:
- Safe (label only — no filtering)
- Low (most content allowed)
- Medium (balanced filtering)
- High (strict filtering)
You can choose different levels per category depending on your risk appetite. Safe-level content is always labeled but never blocked.
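
Conceptually, the per-category settings reduce to a small mapping from category to level. A minimal sketch, assuming a hypothetical configuration format (this is not PolyAI's actual settings schema):

```python
# Hypothetical per-category filter configuration; illustrative, not PolyAI's schema.
SEVERITY_LEVELS = ("safe", "low", "medium", "high")

filter_settings = {
    "hate": "medium",      # balanced filtering
    "sexual": "high",      # strict filtering
    "violence": "medium",  # balanced filtering
    "self-harm": "high",   # strict filtering
}

# "safe" is label-only: such content is tagged in analytics but never blocked.
assert all(level in SEVERITY_LEVELS for level in filter_settings.values())
```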
### Category details
| Category | Description |
| --- | --- |
| Hate | Content that attacks or discriminates based on race, ethnicity, nationality, religion, gender identity, sexual orientation, disability, or appearance. Includes bullying, harassment, and slurs. |
| Sexual | Content involving explicit anatomy, sexual acts, or romantic and erotic themes, including abusive or exploitative content. Covers vulgar language, nudity, child exploitation, and grooming. |
| Violence | Physical harm, threats, weapons, terrorism, and other violent acts or intimidation. Includes mentions of guns, attacks, or stalking. |
| Self-harm | Mentions of suicide, self-injury, eating disorders, or any content about hurting oneself. |
### Additional filtering
- Jailbreak risk detection: Filters also watch for attempts to bypass or disable safety features, as sketched below.
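
How a jailbreak flag combines with the category checks isn't documented here, but one plausible composition is to treat it as an additional blocking condition. A hypothetical sketch (the `jailbreak_detected` flag and the blocking rule are assumptions, not PolyAI's documented behavior):

```python
# Hypothetical composition of category severities and a jailbreak flag.
# Neither the field names nor the rule are PolyAI's documented behavior.
def should_block(labels: dict[str, str], jailbreak_detected: bool) -> bool:
    """Block on a detected jailbreak attempt or on medium/high category severity."""
    return jailbreak_detected or any(
        severity in {"medium", "high"} for severity in labels.values()
    )

print(should_block({"hate": "safe", "violence": "low"}, jailbreak_detected=True))   # True
print(should_block({"hate": "safe", "violence": "low"}, jailbreak_detected=False))  # False
```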
### Language support
Content filters have been trained and tested in the following languages:
- English
- German
- Japanese
- Spanish
- French
- Italian
- Portuguese
- Chinese
Other languages are supported, but performance may vary. Always test thoroughly in your target language to ensure filters behave as expected.
### Best practices
- Test thoroughly: Always run your own tests to validate how filters behave with your content.
- Use the right level: Don't default to High; find a balance that avoids both harm and over-filtering.
- Standardize features: If you use filters across templates or shared projects, keep flows and function names consistent between them.
For more technical background on Microsoft's content filtering service, see the Azure OpenAI safety documentation.