Fine-Tuning Data
Custom, high-quality, privacy-compliant datasets at massive scale that you can afford
Free data testing tool
Find out about your dataset's label bias, clustering, skew, and more…
Save hours figuring out what data you're missing with Nurdle's free data-testing app. Run it locally or in Google Colab without sharing your data.
Use Custom Nurdle Datasets for
LLM Instruction Tuning & Reinforcement Learning
RLHF (Reinforcement Learning from Human Feedback) is the most effective way to train LLMs to behave within policy... but new data science techniques show that AI-created datasets can be just as effective - and far less expensive - to deploy.
Generative LLMs (Chat & semantic search)
Fine-tune your chatbot for brand or character voice or for product, industry or brand expertise. Fine-tune enterprise AI search with domain-specific expertise so it knows what you mean... even when you’re not quite sure how to ask correctly.
The question we hear more than any other seems pretty straightforward on the surface... until you actually try to generate the size and variety of human-quality conversational data required for fine-tuning and model training.
NurdleGPT was built from one of the world’s largest data vaults of human-to-human text, chat, ratings, and post communications, giving it a significant advantage in generating unstructured text that is almost identical to how people actually communicate with one another.
5x more diverse unstructured text data than ChatGPT — with the same quality
Why not just use ChatGPT to make synthetic datasets?
When humans chat, post, and message each other in real life, they keep it short and to the point. We use shortcuts like bad punctuation, misspelled words, slang, and even emojis to replace whole words and sentences.
Large Language Models like ChatGPT are a bit more verbose, which is not helpful if you’re training a model to detect behaviors in real human communications.
NurdleGPT was built from hundreds of terabytes of real human-to-human communications, so its output looks and sounds like real human text, down to the same format and length.
The most similar text structure to human-generated text
Cheaper is Better. The future of AI is small, specialized LLMs that are cheap to run
ChatGPT is great for prototyping an AI project, but the compute costs to run it are 100 to 500 times higher than those of smaller LLMs that have been fine-tuned for specific use cases.
This example by Anyscale shows performance differences for a specific task (summarizing emails) before and after task-specific fine-tuning.
Train open-source LLMs quickly and inexpensively with 100% privacy-safe Nurdle data.
Why Nurdle?
Methodology
1. Dataset Generation: We generated a dataset consisting of 5,000 rows of customer-specific data. This dataset was carefully crafted to capture the diverse range of language patterns and topics relevant to our customer's needs.
2. Fine-Tuning and Comparison: We utilized NurdleGPT, a second-generation Lookalike Data generator, along with two first-generation synthetic data generators, to create datasets for comparison. Additionally, we used a subsampled dataset of 5,000 rows for fine-tuning purposes.
3. Evaluation: Our evaluation process involved analyzing the results for precision and recall with experienced data labeling teams. We also employed metrics such as Self-BLEU scores and Earth Mover's Distance (EMD) to assess the diversity and distribution matching of the generated data.
4. Cost Analysis: Finally, we conducted a cost analysis to compare the expenses associated with each approach. This included estimating the cost per 5,000 rows of synthetic data generated by different models.
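To make the two metrics in step 3 concrete, here is a minimal Python sketch. It is not Nurdle's evaluation code: the BLEU variant is simplified (whitespace tokens, n-grams up to 2), and using message length in tokens as the feature for EMD is an assumption made purely for illustration. Lower Self-BLEU means the generated rows repeat each other less; lower EMD means the generated distribution sits closer to the real one.

```python
import math
from bisect import bisect_right
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified BLEU: clipped n-gram precision, geometric mean, brevity penalty."""
    if not hypothesis or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # hypothesis is shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis.
    closest = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    penalty = 1.0 if len(hypothesis) >= closest else math.exp(1 - closest / len(hypothesis))
    return penalty * math.exp(sum(log_precisions) / max_n)


def self_bleu(texts, max_n=2):
    """Average BLEU of each text scored against all the other texts."""
    tokenized = [t.lower().split() for t in texts]
    if len(tokenized) < 2:
        raise ValueError("need at least two texts")
    scores = [bleu(tokenized[i], tokenized[:i] + tokenized[i + 1:], max_n)
              for i in range(len(tokenized))]
    return sum(scores) / len(scores)


def emd_1d(a, b):
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


def token_lengths(texts):
    """Assumed feature for EMD: number of whitespace tokens per message."""
    return [len(t.split()) for t in texts]
```

In use, `self_bleu(generated_rows)` would be computed per generator, and `emd_1d(token_lengths(generated_rows), token_lengths(real_rows))` would compare each generator against the real sample.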
Methodology
1. Dataset Generation: We created datasets with NurdleGPT and ChatGPT, each comprising 5,000 rows of synthetic text data, aiming to mimic real human-to-human communications.
2. Text Structure Analysis: We assessed the text structure similarity to human-generated text by analyzing the output of both models, considering shortcuts, punctuation, misspellings, slang, emojis, and overall clarity.
3. Comparison Metrics: We used metrics like Self-BLEU scores and Earth Mover's Distance (EMD) to gauge the resemblance between synthetic data from NurdleGPT and ChatGPT and real human-generated text, aiming for lower Self-BLEU scores and EMD values closer to the gold standard.
4. Cost Analysis: We also compared the expenses of generating synthetic data using NurdleGPT versus ChatGPT, factoring in input/output token costs and the total cost per 5,000 rows of synthetic data.
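The arithmetic behind a cost-per-5,000-rows figure can be sketched as follows. The function name, token counts, and per-1,000-token prices are hypothetical placeholders chosen for illustration; they are not actual NurdleGPT or ChatGPT pricing.

```python
def cost_per_batch(rows, input_tokens_per_row, output_tokens_per_row,
                   input_price_per_1k, output_price_per_1k):
    """Total generation cost: input and output tokens are billed at separate rates."""
    input_cost = rows * input_tokens_per_row / 1000 * input_price_per_1k
    output_cost = rows * output_tokens_per_row / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder numbers only: 5,000 rows, 200 prompt tokens and 50 generated
# tokens per row, at made-up prices per 1,000 tokens.
example_cost = cost_per_batch(
    rows=5000,
    input_tokens_per_row=200,
    output_tokens_per_row=50,
    input_price_per_1k=0.001,
    output_price_per_1k=0.002,
)
print(f"Estimated cost for 5,000 rows: ${example_cost:.2f}")
```

Running the same function with each model's real token counts and prices gives directly comparable totals.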
How Nurdle can help
Cold Start Datasets
Get your project off the ground with the custom dataset you need to start model building and iteration.
No data? No problem. If you can specify what you need we can make it.
You’ve got data... but who can afford to clean and label it? Problem solved.
Got lots of data but not allowed to use it? Nurdle data mimics yours and is 100% privacy-compliant.
Improve your customer chatbot experience with diverse datasets to cover edge-cases and persona-based voices synthetically created from billions of real conversations.
Get persona-specific message datasets to tune your product’s voice quickly.
Fine-tune sales and support chatbots based on terabytes of real marketplace interactions for better performance.
Use 100% privacy-safe conversational data to fine-tune generative AI applications without compliance risk.
Running a small LLM that performs a specific task well is far less expensive than paying inference costs on large, general-purpose LLMs. Nurdle can provide the data needed to train it.
Reinforcement Learning faster, cheaper and with more control
New techniques show that RLAIF (Reinforcement Learning from AI Feedback) can achieve the same performance as RLHF without the high cost and project delays of human-created preference pairs. Getting synthetically generated preference pairs on demand from Nurdle lets you customize datasets to avoid naturally occurring issues such as subjective opinions, competitor suggestions, or invalid deals and discounts offered to a customer.
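For readers unfamiliar with preference pairs, the sketch below shows one common shape for them (a prompt, a preferred response, and a rejected response) and a simple filter that drops pairs whose preferred response breaks a policy. The class, the function names, and the banned-term list are hypothetical and say nothing about Nurdle's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the response the model should learn to prefer
    rejected: str  # the response the model should learn to avoid


def violates_policy(text, banned_terms):
    """Case-insensitive substring check against a list of banned terms."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in banned_terms)


def filter_pairs(pairs, banned_terms):
    """Keep only pairs whose preferred response avoids every banned term."""
    return [p for p in pairs if not violates_policy(p.chosen, banned_terms)]


# Hypothetical example: keep the bot from recommending a competitor
# or promising a discount it cannot honor.
pairs = [
    PreferencePair("Is there a cheaper plan?",
                   "Yes, the Basic plan costs less. Want a comparison?",
                   "Sure, I'll give you 50% off forever."),
    PreferencePair("Which tool is best?",
                   "Honestly, CompetitorCo is better for that.",
                   "Our product handles that. Here is how."),
]
clean = filter_pairs(pairs, banned_terms=["CompetitorCo", "50% off"])
```

A substring filter like this is only a first pass; subjective opinions and similar issues named above would need a classifier or manual review.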
Data Gap Analysis
Find out what data you're missing for better performance. If you're not sure what data you need for improvement, Nurdle can analyze your data for you.
Save hours of time figuring out what data you're missing with our free Nurdle data test tool.
Label bias, clustering, skew, and more…
Run locally or in Google Colab without sharing your data.
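As a rough idea of what a label-skew check involves, here is a minimal Python sketch. It is not the Nurdle tool; the function name and the 10% threshold are assumptions chosen for illustration.

```python
from collections import Counter


def label_report(labels, min_share=0.1):
    """Summarize label balance: per-label share, rare labels, max/min count ratio."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    # Labels whose share of the dataset falls below the chosen threshold.
    underrepresented = sorted(label for label, share in shares.items()
                              if share < min_share)
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "shares": shares,
        "underrepresented": underrepresented,
        "imbalance_ratio": imbalance_ratio,
    }
```

A high imbalance ratio or a long list of underrepresented labels points to the classes where more examples are likely needed.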
Justin Davis
Co-Founder and CEO
"Nurdle has been used for 6 years by Spectrum Labs to parse billions of online human interactions.
We've used Nurdle data to moderate content for Riot Games, Grindr, The Meet Group, Together Labs, and other gaming, dating, and social media platforms."
Apply for Nurdle’s Free Pilot Program
Available for a select group of companies.
Data Gap Analysis Report
Identify what kind and how much data is missing from your dataset to increase the accuracy of your LLM.
Preparation of Unstructured Datasets
PII scrubbing for GDPR and HIPAA compliance, cleaning, and labeling.
Augmenting Existing Data into Fine-Tuning Datasets
Use cases include (but are not limited to) conversational LLMs, Q&A LLMs, and training your LLM in multiple languages.
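To illustrate the PII scrubbing step mentioned above, the following Python sketch masks email addresses and phone-like numbers with regular expressions. The patterns and placeholder tokens are assumptions for illustration only; pattern matching of this kind covers a small part of what GDPR or HIPAA compliance requires and is not a description of Nurdle's pipeline.

```python
import re

# Simple illustrative patterns; real-world PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Names, addresses, and free-text identifiers would typically need a named-entity model on top of patterns like these.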