AI is meaningful when you can naturally interact with it.

Mission

We are an audio data research company.
Our mission is to bring AI into the real world through voice, the most important interface to human interaction.

Process

We develop audio datasets with the same rigor researchers bring to models.

i.

Hypothesize

Determine an audio AI capability we wish to unlock.

ii.

Design

Architect a shape of data to teach models that capability.

iii.

Experiment

Launch a targeted data collection.

iv.

Evaluate & Iterate

Measure data quality and tune the collection until it yields a small, high-signal set.

v.

Productionize

Scale the dataset to thousands of hours.

vi.

Release

Publish the dataset and continue to improve it over time.

Two overlapping clusters of vertical bars converging in the center on a colorful gradient oval, with labels “a.” on the left and “b.” on the right, illustrating hypothesis transition.

A dense grid of vertical bars shifting toward the right into a circle filled with a rainbow gradient, representing the architected data shape for model training.

Three staggered horizontal layers of vertical bars converging on a central multicolored diamond, labeled “1.”, “2.”, and “3.”, symbolizing targeted data-collection experiments.

A broken ring of vertical bars with one quadrant shown as a rainbow gradient, annotated “A.”, “B.”, and “C.”, depicting the cycle of measuring quality and tuning a small high-signal set.

A stylized waveform made of vertical bars colored in sequential rainbow segments, illustrating the scaled dataset spanning thousands of hours.

Abstract graphic showing vertical bars, symbolizing audio dataset release.

Our datasets are used by Fortune 100 companies and research labs that work with speech recognition, translation, synthesis, and conversational AI.

Featured Datasets

A dataset suite designed for speech-to-speech, multilingual, and voice interaction systems

Converse

Our flagship English dataset consists of channel-separated, natural two-speaker conversations spanning a wide range of topics.

Atlas

A multilingual dataset spanning 15+ languages. It includes metadata on dialects and accents and follows the same format as Converse.

Chorus

A dataset of conversations involving three or more speakers, originally designed for training speaker-separation and diarization models.

Dialog

A collection of expert conversations across a range of domains.

Browse more datasets or design one with us

We offer additional proprietary datasets not listed here.
Contact us to request a sample, explore more options, or collaborate on a new dataset.

How to access our datasets

1. Request samples

We will set up a quick call to understand your use case and then send you relevant data samples.

2. Purchase access

Enter into a data license agreement for the dataset and use cases your team needs.

3. Receive data

For off-the-shelf datasets, we will grant your team access within one to two days.

Bonus: Experiment with us

We frequently partner with research teams to design new shapes of data for any use case.

Careers

Join us to shape the future of audio AI

We’re hiring for research, engineering, and operations roles.

See open roles

News

Updates on our progress

$50 Million Series B Funding Led by Meritech With Participation From NVIDIA, Alt Capital, Amplify, First Round Capital, and Y Combinator

Announcing Our $25M Series A Led by Alt Capital

Announcing Our $5M Seed Round Led by First Round
