# Safety

Ensure your agent handles risky conversations effectively.

The safety dashboard helps you monitor safety-related metrics, track risky conversations, and evaluate how well your agent handles harmful content. It's essential for ensuring your assistant meets brand standards and safety expectations.
## Metrics
- Caller utterance risk level: Shows how risky incoming messages are and how well the agent manages them.
- Total calls: Total number of calls during the selected period.
- Number of calls managed for risk: The number of calls in which safety filters were triggered.
- Percentage of calls managed for risk: The share of total calls that involved flagged content (see the sketch after this list).
- Distribution of flagged calls: Highlights trends in flagged calls over time.
- Distribution count of flagged calls: Shows peaks in flagged call volume.
- Caller utterance category distribution:
  - Broken down into hate, self-harm, sexual content, and violence.
  - Uses color-coded visuals for easy tracking.
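
The count and percentage metrics are simple aggregations that the dashboard computes for you. For illustration only, here is a minimal Python sketch, assuming a hypothetical call record with a `flagged` field (not PolyAI's actual data schema):

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    """Hypothetical call record; field names are illustrative, not PolyAI's schema."""
    call_id: str
    flagged: bool  # True if any safety filter was triggered during the call

def risk_metrics(calls: list[CallRecord]) -> dict:
    """Reproduce the dashboard's count and percentage metrics."""
    total = len(calls)
    managed = sum(1 for call in calls if call.flagged)
    return {
        "total_calls": total,
        "calls_managed_for_risk": managed,
        "pct_calls_managed_for_risk": 100 * managed / total if total else 0.0,
    }

calls = [CallRecord("a", True), CallRecord("b", False),
         CallRecord("c", False), CallRecord("d", True)]
print(risk_metrics(calls))
# {'total_calls': 4, 'calls_managed_for_risk': 2, 'pct_calls_managed_for_risk': 50.0}
```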
## Editing safety filters
To manage your filters, go to Settings in the sidebar.
PolyAI content filters are designed to catch harmful input from users and prevent inappropriate output from your assistant. Filters combine PolyAI’s models with third-party services like Azure OpenAI to keep conversations safe.
### How filters work
Content filters run on both sides of the conversation:
- User input: Catches toxic or inappropriate speech before it reaches the assistant.
- AI output: Prevents the assistant from responding with anything unsafe or non-compliant.
Filtering happens in real time and targets specific categories of risky content.
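
PolyAI manages this pipeline for you, but the mechanics can be made concrete with Azure AI Content Safety, the service that powers Azure OpenAI's content filtering. The sketch below screens both sides of a turn; the endpoint, key, and `generate_reply` callback are placeholders, and the blocking threshold is arbitrary:

```python
# pip install azure-ai-contentsafety
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key; in a PolyAI deployment this integration is managed for you.
client = ContentSafetyClient(
    "https://<your-resource>.cognitiveservices.azure.com",
    AzureKeyCredential("<your-key>"),
)

def is_unsafe(text: str, max_severity: int = 2) -> bool:
    """Screen text across hate, sexual, violence, and self-harm.

    Severity is numeric (higher is riskier); anything above the
    threshold is treated as unsafe. The threshold here is arbitrary.
    """
    analysis = client.analyze_text(AnalyzeTextOptions(text=text))
    return any((item.severity or 0) > max_severity for item in analysis.categories_analysis)

def handle_turn(user_input: str, generate_reply) -> str:
    if is_unsafe(user_input):   # user input: screened before the assistant sees it
        return "I'm sorry, I can't help with that."
    reply = generate_reply(user_input)
    if is_unsafe(reply):        # AI output: screened before it reaches the caller
        return "I'm sorry, I can't help with that."
    return reply
```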
### Filtering categories and severity levels
Filters target four core risk categories:
- Hate
- Sexual
- Violence
- Self-harm
Each category has four severity levels:
- Safe (label only — no filtering)
- Low (most content allowed)
- Medium (balanced filtering)
- High (strict filtering)
You can choose different levels per category depending on your risk appetite. Safe-level content is always labeled but never blocked.
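
Conceptually, the per-category settings reduce to a small mapping from category to level. A minimal sketch, assuming a hypothetical configuration format (this is not PolyAI's actual settings schema):

```python
# Hypothetical per-category filter configuration; illustrative, not PolyAI's schema.
SEVERITY_LEVELS = ("safe", "low", "medium", "high")

filter_settings = {
    "hate": "medium",      # balanced filtering
    "sexual": "high",      # strict filtering
    "violence": "medium",  # balanced filtering
    "self-harm": "high",   # strict filtering
}

# "safe" is label-only: such content is tagged in analytics but never blocked.
assert all(level in SEVERITY_LEVELS for level in filter_settings.values())
```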
### Category details
| Category | Description |
| --- | --- |
| Hate | Content that attacks or discriminates based on race, ethnicity, nationality, religion, gender identity, sexual orientation, disability, or appearance. Includes bullying, harassment, and slurs. |
| Sexual | Content involving explicit anatomy, sexual acts, or romantic and erotic themes, including abusive or exploitative content. Covers vulgar language, nudity, child exploitation, and grooming. |
| Violence | Physical harm, threats, weapons, terrorism, and other violent acts or intimidation. Includes mentions of guns, attacks, or stalking. |
| Self-harm | Mentions of suicide, self-injury, eating disorders, or any content about hurting oneself. |
### Additional filtering
- Jailbreak risk detection: Filters also watch for attempts to bypass or disable safety features, as sketched below.
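
How a jailbreak flag combines with the category checks isn't documented here, but one plausible composition is to treat it as an additional blocking condition. A hypothetical sketch (the `jailbreak_detected` flag and the blocking rule are assumptions, not PolyAI's documented behavior):

```python
# Hypothetical composition of category severities and a jailbreak flag.
# Neither the field names nor the rule are PolyAI's documented behavior.
def should_block(labels: dict[str, str], jailbreak_detected: bool) -> bool:
    """Block on a detected jailbreak attempt or on medium/high category severity."""
    return jailbreak_detected or any(
        severity in {"medium", "high"} for severity in labels.values()
    )

print(should_block({"hate": "safe", "violence": "low"}, jailbreak_detected=True))   # True
print(should_block({"hate": "safe", "violence": "low"}, jailbreak_detected=False))  # False
```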
### Language support
Content filters have been trained and tested in the following languages:
- English
- German
- Japanese
- Spanish
- French
- Italian
- Portuguese
- Chinese
Other languages are supported, but performance may vary. Always test thoroughly in your target language to ensure filters behave as expected.
### Best practices
- Test thoroughly: Always run your own tests to validate how filters behave with your content.
- Use the right level: Don't default to High; find a balance that avoids both harm and over-filtering.
- Standardize features: If you use filters across templates or shared projects, keep flows and function names consistent between them.
For more technical background on Microsoft's content filtering service, see the Azure OpenAI safety documentation.