Conversational Synthetic Data by Nurdle

Nurdle has been acquired by Duco

Learn more about Duco

Contact us

Free Data Test Tool

Conversational Synthetic Data

Get AI training data faster, cheaper, and easier

We generate labeled, text-based conversational data for AI

What makes our data so special?

We have unique expertise in unstructured text from the trillions of conversations we’ve collected, labeled, and stored.

Get a free sample now

We’ve generated data for training across many verticals:

Global Bank regulator

needed examples in Spanish and English of customer communications that mention fraudulent activity on their account.

Learn more

German media company

needed examples of comments on social posts with the following labels: offensive, scam, political, opinion.  

Learn more

AI startup

needed fine-tuning data showing how a dog would answer questions from a human.

Learn more

Financial services company

needed examples of marketing emails with labels for the intent driver of that email.

Learn more

Content moderation company

needed examples of hate speech and racist comments in multiple languages to improve their classifiers.

Learn more

Food Delivery app

needed reviews labeled by whether they are about an item, a restaurant, or a driver.

Learn more

Contact us

Schedule time to discuss your use case and get sample data

How we do it

1. Real Data Sample

Yours or ours, as few as 50 rows.

2. Nurdle Data Overlay

We compare your data with our pre-labelled LLM data vault.

3. Data Gap Analysis

We detect ideal data clusters and identify what data is missing for your use case.

4. Nurdlized Datasets

We produce high-volume lookalike data (labeled or not); use your own data to test it.
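To make the idea of a data gap analysis concrete, here is a minimal sketch of one simple form it could take: checking which expected labels are under-represented in a small sample. This is not Nurdle's method. The function name `label_gaps`, the `min_share` threshold, and the sample rows are all hypothetical, and the labels are borrowed from the food delivery example above.

```python
from collections import Counter

def label_gaps(rows, expected_labels, min_share=0.2):
    """Return expected labels whose share of the sample is below min_share.

    rows is a list of (text, label) pairs. Hypothetical helper for
    illustration only; a real gap analysis would look at far more than
    label counts.
    """
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    gaps = {}
    for label in expected_labels:
        share = counts[label] / total if total else 0.0
        if share < min_share:
            gaps[label] = share
    return gaps

# Made-up sample rows, labeled by what each review is about.
sample = [
    ("Burger was cold", "item"),
    ("Fries were soggy", "item"),
    ("Great pizza place", "restaurant"),
    ("Loved the noodles", "item"),
    ("Nice staff at the counter", "restaurant"),
]

gaps = label_gaps(sample, ["item", "restaurant", "driver"])
print(gaps)
```

In this made-up sample no review is about a driver, so "driver" is the label that would need additional data.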

Why not just use ChatGPT to make synthetic datasets?

5x more diverse unstructured text data than ChatGPT, with the same quality

The question we hear more than any other seems pretty straightforward on the surface... until you actually try to generate the size and variety of human-quality conversational data required for fine-tuning and model training.

NurdleGPT was built from one of the world’s largest data vaults of human-to-human text, chat, ratings, and post communications, giving it a significant advantage in generating unstructured text that is almost identical to how people actually communicate with one another.

Methodology

Humans don’t really chat like ChatGPT

The most similar text structure to human-generated text

When humans chat, post, and message each other in real life, they keep it short and to the point. We use shortcuts like bad punctuation, misspelled words, and slang, or even emojis, to replace whole words and sentences.

Large Language Models like ChatGPT are a bit more verbose, which is not helpful if you’re training a model to detect behaviors in real human communications.

NurdleGPT was built from hundreds of terabytes of real human-to-human communications, so its output looks and sounds like real human text, down to the same format and length.
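One rough way to check whether a synthetic dataset resembles real human messages in length and punctuation is to compare simple style statistics between the two. The sketch below is an illustration only, not Nurdle's methodology. The function `style_profile`, the two metrics, and the example messages are all assumptions made for this example.

```python
from statistics import mean

def style_profile(messages):
    """Summarize message length and punctuation habits.

    Hypothetical helper; assumes a non-empty list of strings.
    """
    word_counts = [len(m.split()) for m in messages]
    # Messages that do not end with sentence-final punctuation.
    unpunctuated = [
        m for m in messages if not m.rstrip().endswith((".", "!", "?"))
    ]
    return {
        "mean_words": mean(word_counts),
        "share_unpunctuated": len(unpunctuated) / len(messages),
    }

# Made-up examples of terse human chat and more verbose generated text.
human = ["omw", "lol same", "u there?", "k thx"]
generated = [
    "I am on my way now.",
    "That is very funny, I feel the same way.",
]

human_profile = style_profile(human)
generated_profile = style_profile(generated)
print(human_profile)
print(generated_profile)
```

A large gap between the two profiles suggests the generated text would not stand in well for real conversations. A fuller comparison would also look at vocabulary, slang, and emoji use.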

Methodology

Cheaper is Better. The future of AI is small, specialized LLMs that are cheap to run

ChatGPT is great for prototyping an AI project, but its compute costs are 100 to 500 times higher than those of smaller LLMs that have been fine-tuned for specific use cases.

This example by AnyScale shows performance differences for a specific task (summarizing emails) before and after task-specific fine-tuning.

Train open-source LLMs quickly and inexpensively with 100% privacy-safe Nurdle data.

Why Nurdle?

Justin Davis

Co-Founder and CEO

"Nurdle has been used for 6 years by Spectrum Labs to parse billions of online human interactions.

We've used Nurdle data to moderate content for Riot Games, Grindr, The Meet Group, Together Labs, and other gaming, dating, and social media platforms."

Get a sample

Try Nurdle’s synthetic data

Ready to get your labeled, synthetic data sample?

FAQ

Sorry, but no! Our core product is the synthetic data we generate, which will come to you labeled and formatted for training. Once we’re working together we’d be happy to advise you on other labeling solutions for your historical data.