This page provides information about the training data used in Raven v3, in line with PolyAI’s commitment to transparency, responsible AI development, and applicable regulatory expectations.
The details below describe the provenance, composition, processing, and intended use of the datasets used to develop Raven v3.
System overview
| Field | Details |
|---|---|
| System Name | Raven |
| Developer | PolyAI |
| Release Date | 16 September 2025 |
| Version | v3 |
Dataset summary
| Category | Description |
|---|---|
| Source or Owner | Data is sourced from PolyAI customers to the extent contractually authorised by customers and permitted by applicable law, or otherwise generated by PolyAI. |
| Purchased or Licensed | Licensed or otherwise owned by PolyAI. |
| Time Period of Data Collection | November 2024 – August 2025 |
| Date of First Use in Development | December 2024 |
| Scale of Dataset | Hundreds of thousands of conversational turns across tens of thousands of conversations. |
| Entirely Public Domain | No |
Intellectual property considerations
| Category | Description |
|---|---|
| Copyright, Trademark, or Patent Protection | The dataset may include information protected by copyright or trademark law belonging to PolyAI customers or PolyAI. |
| Ownership and Rights | All data used is licensed to or owned by PolyAI in accordance with contractual agreements and applicable law. |
Personal and consumer data
| Category | Description |
|---|---|
| Contains Personal Information | PolyAI takes all reasonable steps to redact personal information from the dataset prior to use. |
| Contains Aggregate Consumer Information | No |
Synthetic data usage
| Category | Description |
|---|---|
| Use of Synthetic Data | Yes. PolyAI augments real-world data with synthetic data where necessary to broaden coverage or improve specific system capabilities. |
Data processing and preparation
The dataset used for Raven v3 has undergone multiple processing steps to ensure quality, safety, and suitability for training customer service agents.
| Processing Step | Description |
|---|---|
| Redaction | Removal of personal information. |
| Translation | Translation of data to support multilingual customer service use cases. |
| Filtering | Selection of desired data distributions to improve specific system capabilities. |
| Labelling | Annotation to provide efficient learning signals during system training and evaluation. |
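As a rough illustration of how redaction and filtering steps like those in the table could be chained, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions, the placeholder tokens, the `min_turns` threshold, and the function names are hypothetical and do not describe PolyAI's actual pipeline. Translation and labelling are left out, and real redaction would need far broader coverage than two patterns.

```python
import re

# Hypothetical patterns for illustration only; they cover just simple
# email addresses and phone-like digit runs, not personal data in general.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def keep(turns: list[str], min_turns: int = 2) -> bool:
    """Hypothetical filter: keep conversations with at least `min_turns` turns."""
    return len(turns) >= min_turns


def prepare(conversations: list[list[str]]) -> list[list[str]]:
    """Redact every turn, then drop conversations the filter rejects."""
    prepared = []
    for turns in conversations:
        cleaned = [redact(turn) for turn in turns]
        if keep(cleaned):
            prepared.append(cleaned)
    return prepared
```

Redacting before filtering means no later step in this sketch ever sees unredacted text.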
Types of data used
| Category | Description |
|---|---|
| Data Format | Conversational logs. |
| Labelling Methodology | Conversations are labelled as positive and/or preferred customer service interactions, assigned graded preference scores, or both. |
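To make the labelling methodology above concrete, the following Python sketch shows one way a labelled conversation record might be represented. The class name, field names, and the 0.0 to 1.0 score range are hypothetical, chosen only for illustration; this page does not specify PolyAI's actual schema or scoring scale.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelledConversation:
    """Hypothetical record for one labelled conversation."""

    turns: list[str]
    positive: bool = False  # marked as a positive interaction
    preferred: bool = False  # marked as a preferred interaction
    preference_score: Optional[float] = None  # assumed range 0.0 to 1.0

    def __post_init__(self) -> None:
        has_label = (
            self.positive
            or self.preferred
            or self.preference_score is not None
        )
        if not has_label:
            raise ValueError("a conversation needs at least one label")
        if self.preference_score is not None and not (
            0.0 <= self.preference_score <= 1.0
        ):
            raise ValueError("preference_score must be within 0.0 and 1.0")
```

A record can carry any combination of the three labels, which mirrors the "and/or" wording in the table.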
Purpose and intended use
| Category | Description |
|---|---|
| Purpose in Relation to the System | The dataset supports Raven’s intended purpose of powering agentic customer service conversations by providing real-world and synthetic examples of high-quality customer service interactions. |
Ongoing governance
PolyAI regularly reviews its data practices to ensure alignment with evolving legal, regulatory, and ethical standards. Dataset composition and processing methods may be updated over time to reflect improvements in safety, coverage, and system performance.

Last modified on January 28, 2026