Training data

This page describes the training data used to develop Raven, in line with PolyAI’s commitment to transparency and responsible AI development and with applicable regulatory expectations. It covers the provenance, composition, processing, and intended use of the datasets used to develop Raven v3 and v3.5.

System overview

Field	Details
System Name	Raven
Developer	PolyAI

Version	Release date
v3	16 September 2025
v3.5	10 March 2026

Dataset summary

Category	v3	v3.5
Source or Owner	Data is sourced from PolyAI customers to the extent contractually authorized by customers and permitted by applicable law, or otherwise generated by PolyAI.	Data is sourced from PolyAI customers to the extent contractually authorized by customers and permitted by applicable law, or otherwise generated by PolyAI.
Purchased or Licensed	Licensed or otherwise owned by PolyAI.	Licensed or otherwise owned by PolyAI.
Time Period of Data Collection	November 2024 – August 2025	November 2024 – February 2026
Date of First Use in Development	December 2024	January 2026
Scale of Dataset	Hundreds of thousands of conversational turns across tens of thousands of conversations.	Hundreds of thousands of conversational turns across tens of thousands of conversations.
Entirely Public Domain	No	No

Intellectual property considerations

Category	Description
Copyright, Trademark, or Patent Protection	The dataset may include information protected by copyright or trademark law belonging to PolyAI customers or PolyAI.
Ownership and Rights	All data used is licensed to or owned by PolyAI in accordance with contractual agreements and applicable law.

Personal and consumer data

Category	Description
Contains Personal Information	PolyAI takes all reasonable steps to redact personal information from the dataset prior to use.
Contains Aggregate Consumer Information	No

Synthetic data usage

Category	Description
Use of Synthetic Data	Yes. PolyAI augments real-world data with synthetic data where necessary to broaden coverage or improve specific system capabilities.

Data processing and preparation

The datasets used for Raven v3 and v3.5 undergo several processing steps to ensure quality, safety, and suitability for training customer service agents.

Processing Step	Description
Redaction	Removal of personal information.
Translation	Translation of data to support multilingual customer service use cases.
Filtering	Selection of desired data distributions to improve specific system capabilities.
Labeling	Annotation to provide efficient learning signals during system training and evaluation.
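
As an illustration only, the redaction and filtering steps in the table above could be sketched in Python along the following lines. This is not PolyAI's pipeline: the regular expressions, the function names `redact`, `keep`, and `prepare`, and the minimum-turn threshold are all assumptions made for this example, and real redaction would need far broader coverage than two patterns.

```python
import re

# Hypothetical patterns, chosen for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def keep(conversation: list[str], min_turns: int = 2) -> bool:
    """Hypothetical filter: keep conversations with at least min_turns turns."""
    return len(conversation) >= min_turns


def prepare(conversations: list[list[str]]) -> list[list[str]]:
    """Filter conversations, then redact every turn of those that remain."""
    return [
        [redact(turn) for turn in conv]
        for conv in conversations
        if keep(conv)
    ]
```

In this sketch, filtering runs before redaction so that discarded conversations are never processed further. That ordering is a design choice of the example, not a statement about how the datasets are actually prepared.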

Types of data used

Category	Description
Data Format	Conversational logs.
Labeling Methodology	Conversations are labeled as positive and/or preferred customer service interactions, assigned graded preference scores, or both.
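
To make the labeling description concrete, one possible way to represent a labeled conversation is sketched below. The class name `LabeledConversation`, its field names, and the use of a float for the graded score are hypothetical and are not taken from PolyAI's schema. The sketch only shows that a conversation may carry a positive label, a preferred label, a graded score, or any combination of these.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledConversation:
    """Hypothetical record for one labeled conversation log."""

    turns: list[str]
    positive: bool = False                    # marked as a positive interaction
    preferred: bool = False                   # marked as a preferred interaction
    preference_score: Optional[float] = None  # graded score, if one was assigned

    def has_label(self) -> bool:
        """Return True if at least one kind of label is present."""
        return (
            self.positive
            or self.preferred
            or self.preference_score is not None
        )
```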

Purpose and intended use

Category	Description
Purpose in Relation to the System	The dataset supports Raven’s intended purpose of powering agentic customer service conversations by providing real-world and synthetic examples of high-quality customer service interactions.

Ongoing governance

PolyAI regularly reviews its data practices against current legal, regulatory, and ethical standards. Dataset composition and processing methods may be updated over time to reflect improvements in safety, coverage, and system performance.
