This page provides information about the training data used in Raven v3, in line with PolyAI’s commitment to transparency, responsible AI development, and applicable regulatory expectations. The details below describe the provenance, composition, processing, and intended use of the datasets used to develop Raven v3.

System overview

Field | Details
System Name | Raven
Developer | PolyAI
Release Date | 16 September 2025
Version | v3

Dataset summary

Category | Description
Source or Owner | Data is sourced from PolyAI customers to the extent contractually authorised by customers and permitted by applicable law, or otherwise generated by PolyAI.
Purchased or Licensed | Licensed or otherwise owned by PolyAI.
Time Period of Data Collection | November 2024 – August 2025
Date of First Use in Development | December 2024
Scale of Dataset | Hundreds of thousands of conversational turns across tens of thousands of conversations.
Entirely Public Domain | No

Intellectual property considerations

Category | Description
Copyright, Trademark, or Patent Protection | The dataset may include information protected by copyright or trademark law belonging to PolyAI customers or PolyAI.
Ownership and Rights | All data used is licensed to or owned by PolyAI in accordance with contractual agreements and applicable law.

Personal and consumer data

Category | Description
Contains Personal Information | PolyAI takes all reasonable steps to redact personal information from the dataset prior to use.
Contains Aggregate Consumer Information | No

Synthetic data usage

Category | Description
Use of Synthetic Data | Yes. PolyAI augments real-world data with synthetic data where necessary to broaden coverage or improve specific system capabilities.

Data processing and preparation

The dataset used for Raven v3 has undergone multiple processing steps to ensure quality, safety, and suitability for training customer service agents.

Processing Step | Description
Redaction | Removal of personal information.
Translation | Support for multilingual customer service use cases.
Filtering | Selection of desired data distributions to improve specific system capabilities.
Labelling | Annotation to provide efficient learning signals during system training and evaluation.
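
To make the redaction and filtering steps above more concrete, here is a minimal illustrative sketch in Python. It does not describe PolyAI's actual pipeline. The regular expressions, the placeholder tokens, the `min_turns` rule, and the function names `redact`, `keep_conversation`, and `prepare` are all assumptions made for the example. A production redaction system would cover many more categories of personal information.

```python
import re
from typing import List

# Hypothetical patterns for illustration only; real redaction needs far
# broader coverage (names, addresses, account numbers, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def keep_conversation(turns: List[str], min_turns: int = 2) -> bool:
    """Stand-in filter (an assumption): keep conversations with enough turns."""
    return len(turns) >= min_turns


def prepare(conversations: List[List[str]]) -> List[List[str]]:
    """Filter conversations, then redact every remaining turn."""
    prepared = []
    for turns in conversations:
        if not keep_conversation(turns):
            continue
        prepared.append([redact(turn) for turn in turns])
    return prepared
```

Redacting before any further handling means that later stages, such as labelling, only ever see placeholder tokens rather than the original values.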

Types of data used

Category | Description
Data Format | Conversational logs.
Labelling Methodology | Conversations are labelled as positive and/or preferred customer service interactions and/or assigned graded preference scores.
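
As a rough illustration of how a labelled conversation of this kind might be represented, the following Python sketch defines a record that carries a binary preferred label, a graded preference score, or both. The class name `LabelledConversation`, its field names, the score scale, and the helper `rank_by_score` are hypothetical and are not taken from PolyAI's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabelledConversation:
    """Hypothetical record for one labelled conversational log."""
    turns: List[str]
    preferred: Optional[bool] = None          # binary positive/preferred label
    preference_score: Optional[float] = None  # graded score; scale is assumed

    def has_label(self) -> bool:
        """True if at least one kind of label is present."""
        return self.preferred is not None or self.preference_score is not None


def rank_by_score(items: List[LabelledConversation]) -> List[LabelledConversation]:
    """Return conversations that have a graded score, highest score first."""
    scored = [c for c in items if c.preference_score is not None]
    return sorted(scored, key=lambda c: c.preference_score, reverse=True)
```

Keeping both fields optional reflects the "and/or" wording above: a conversation may carry either kind of label or both.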

Purpose and intended use

Category | Description
Purpose in Relation to the System | The dataset supports Raven’s intended purpose of powering agentic customer service conversations by providing real-world and synthetic examples of high-quality customer service interactions.

Ongoing governance

PolyAI regularly reviews its data practices to ensure alignment with evolving legal, regulatory, and ethical standards. Dataset composition and processing methods may be updated over time to reflect improvements in safety, coverage, and system performance.