I am ready for a long road flight for work with a week- or months-long projects.

Home
/
Blog
/
Enhancing and Fine Tuning Large Language Models

Enhancing and Fine Tuning Large Language Models (LLMs)

Numerous businesses are excited to release generative AI-powered applications that can provide benefits to both their employees and customers. Thanks to the widespread availability of large language models (LLMs), large language models (LLMs) has opened the doors for innovation — but with a major caveat.

Join the Pilot Program

Susie Stulz

Reading time: 13 min

10.19.2023

The Cold Start Problem Machine Learning

Specialized Large Language Models: Putting Them Into Production

Messaging, Conversational, and Behavioral Data: A Look Ahead for Large Language Model Training

Data Preparation - Data Cleaning and Data Labeling

Challenges with Building Generative AI

LLMs require a tremendous amount of data to function reliably. Unfortunately, obtaining accurately labeled, brand-safe, and cleaned data is an enormous challenge for data scientists. Without these critical components, building or fine-tuning an LLM is simply not feasible. More and more data scientists are realizing that building precise models and fine-tuning LLMs is not something they can accomplish in-house. They need a partner that can help them acquire the dataset necessary to accomplish their company’s generative AI use cases.

In this document, we’ll cover the issues relating to the data needed to build or fine-tune an LLM.

The Sudden Popularity of Generative AI and LLMs

In November of 2022, OpenAI dazzled the world with the launch of ChatGPT, the first application of generative AI to capture the public’s imagination. Today, numerous start-ups are launching generative AI services, with the support of venture capitalists. In 2022, funding for such companies totaled $4.8 billion. In the first five months of 2023 alone, funding topped $12.7 billion.

LLM Use Cases

The use cases for LLMs are very compelling, prompting many executives to encourage their data science teams to pursue them. Some of the more popular requests include:

Smarter Chatbots

The potential for smarter chatbots is one of the most exciting generative AI use cases. Brands are looking forward to a future where customers can engage with a chatbot on their website using natural language, and engage with a chatbot using natural language, providing a bespoke customer experience at scale based on brand-specific data.

However, it is crucial to approach the fine-tuning and testing of the underlying LLMs with great care. The National Eating Disorder Association's chatbot, Tessa, made headlines for providing harmful and dangerous advice to users, highlighting the importance of responsible implementation.

Test an Internal Tool Against Real-World Data

Prior to launching any AI-enabled application, data scientists must test it against real-world datasets. This can present significant challenges, as real-world datasets often contain sensitive information, such as PII data or confidential conversations between a healthcare provider and a patient. More and more data scientists are realizing that building precise models and fine-tuning LLMs is not something they can accomplish in-house. They require a dependable partner who can assist them in obtaining the necessary dataset to accomplish their business's generative AI objectives.

Understand How a Unique Audience Speaks

Language varies by age, region, economic and educational background and a myriad other factors. What’s more, it evolves at a rapid clip based on current events or chatter in the Zeitgeist. Understanding how a unique audience speaks is critical, but getting a dataset that is specific to the audience can stop the launch of an LLM application in its tracks.

Recognize Illegal Activity

A typical brand isn’t likely to see a great deal of illegal activity on its site, but it is possible. For this reason, AI model training is required to recognize illegal activity immediately and refer it to a human for review.

Develop Contextually Aware Models

Context can have as much impact on the meaning of phrases and sentences as the actual words used. AI models need to learn based a full range of conversational context and parse gray areas rather than just categorizing words as “bad” or “good.”

Nurdle AI launched a free pilot program, where they provide “lookalike” datasets that have been custom-tailored specifically for your AI project, especially for the use cases listed above! You’ll get hybrid synthetic data that performs comparably to real data but at a tiny fraction of the cost.

Learn about Nurdles pilot program

A Dearth of Adequate and Appropriate AI Training Data

One major obstacle in achieving the aforementioned use cases is the scarcity of data required to properly train large language models in order to produce precise and brand safe responses.

Large language models training requires a significant amount of data, and it is important that the data used is appropriate for the intended application. For instance, general market LLMs such as ChatGPT and Bard, aren’t suited for highly specific use cases like Healthcare or Trust & Safety. Insufficient data will lead to insufficient training, which can have disastrous effects, as the NEDA example above illustrates. It can also lead to a low level of accuracy and a high variance in responses the LLM generates. ChatGPT itself warns users in its prompt part that it may produce inaccurate information.

This lack of data is thwarting many LLM projects. “We’ve seen lots of pitches from companies who may well be pursuing a brilliant application of AI, but they don’t have access to data that will give them the ability to build a powerful application, let alone proprietary data that will help them have competitive moats in their business,” Brad Svrluga, co-founder and general partner of venture-capital firm Primary Venture Partners, told the Wall Street Journal.

While it can be a challenge, generating synthetic data, data that's artificially generated rather than produced by real-world events, can help overcome some of the limitations of using only real-world data to train large language models. Synthetic data can be used to supplement existing data or address issues such as bias and discrimination.

In order to ensure accurate and brand-safe output from an LLM, data science teams can create a data vault that includes both real and synthetic data, as described in this article.

The Cold Start Problem Machine Learning

The cold start problem refers to the myriad challenges that arise when we introduce a new model that doesn’t have sufficient initial historical data to function effectively. This dearth of data hinders the model’s ability to make accurate predictions, recommendations, decisions, or responses.

The cold start problem in a generative chatbot for a brand website is like the chatbot being new to the job and not knowing much about the visitors yet. When the chatbot is just starting out or doesn't have enough information, it will struggle to understand a user’s questions and is likely to provide an irrelevant answer.

As the chatbot gathers more interactions and data from users, it gets better at providing helpful and accurate responses. The cold start problem, therefore, focuses on the initial challenges a chatbot faces when it's not familiar with the users' needs and preferences.

The objective function is a formula that guides a model so that it can make good decisions and learn from data. Specifically, the objective function tells the chatbot how to understand and respond to their questions. Even with just a few conversations, the chatbot starts to learn what works well and what doesn't.

Over time, as more people chat with the chatbot, the objective function helps it improve by finding patterns in the conversations. This way, the chatbot gets better at helping visitors, even if it starts with little knowledge about them.

This is how the objective function helps the chatbot tackle the cold start problem by guiding it to learn from the interactions it has, making it smarter and more helpful as it gathers more data.

But not every AI use case has the ability to wait for an application to gather the right amount of data to self-learn. They need a solution to the cold-start problem sooner, in order to ensure their chatbots or AI models respond appropriately from day one.

Addressing the Cold Start Challenge

As a data scientist, one understands that machine learning models are only as good as the data that trains them. Traditionally, machine-learning models are trained with “real” data, meaning that is obtained from real-world sources and reflects the patterns and characteristics that exist in the world. The inherent patterns and relationships in real data are direct reflections of the actual phenomena that AI models aim to decode. This authenticity ensures that AI models with real-world scenarios, making real data indispensable for AI model training.

Large language model training with real data

Collecting real-world data is a valuable source of knowledge for AI and machine learning models. The authenticity of this data provides a lifelike complexity and richness that aids their learning and evolution. Real data's inherent patterns and relationships reflect the real-world phenomena that AI models strive to decode, and its incorporation of spontaneous noise, outliers, and unexpected patterns found in real-world situations is essential in AI model training to predict and respond accurately to a wide range of scenarios. Therefore, real data is indispensable for their training.

A great example of this can be seen in branded chatbots. If your target market is teenagers, a chatbot trained on real data from pop culture and youth-oriented online communities will provide more authentic, applicable, and brand-safe responses to your audience.

One of the key difficulties with real-world data is that a lot of companies simply don't have enough of it to effectively train large language models. In order to create large language models, a huge amount of AI training data is needed, and the more data there is, the more precise the model will be.

Furthermore, using real data can be both time-consuming and expensive, often costing tens of millions of dollars. However, new technologies are helping to bring these costs down, as we'll explore shortly.

More concerning, real data can be highly biased, leading to outputs that are inaccurate or skewed. “If you’re building a model from scratch, you know what you’re feeding it,” Databricks Chief Executive Ali Ghodsi told the Wall Street Journal. Off-the-shelf models, which are ready to use because they have already been trained on internet data, are filled with extraneous information that can skew results, Ghodsi said. Many companies are cautious about sharing their data in models built by outside vendors due to privacy and security concerns, he said.

Real data can also contain sensitive or private information, which means to use it to train large language models, it requires the data science team to comply with ethical guidelines and privacy regulations. Recently, the Biden Administration announced it received commitments from the top seven AI companies to guard against the risks posed by AI, including the potential to expose sensitive information, among other issues. Therefore, all personally identifiable information (PII) must be removed from real data before using it for training.

Using synthetic data to train large language models

Synthetic data can significantly reduce the cost of obtaining data to train large language models. It achieves this by both supplementing real data, as well as addressing its shortcomings. Synthetic datasets have been used to train models since the early 1990s.

Synthetic data is data that’s generated artificially through simulations, algorithms, or other methods, and is designed to resemble the properties and statistical features of real data. According to Gartner, 60% of all data used in AI applications will be synthetic by this time next year. In other words, synthetic data will account for the majority of data used for AI development.

The most significant advantage of synthetic data is its abundance and flexibility. Since it's computer-generated, synthetic data can be created in vast quantities and designed to reflect a wide range of scenarios that may be difficult, expensive, or ethically complicated to collect from reality. This makes synthetic data an ideal solution for the shortage of LLM data.

In addition, synthetic data addresses privacy concerns that arise when working with real data that often includes sensitive personal information. Synthetic data, on the other hand, contains no personal details, thereby reducing privacy concerns.

Moreover, synthetic data can also eliminate many of the biases inherent in real-world data, ensuring that LLMs don't discriminate or perpetuate those biases. With these advantages, synthetic data is an indispensable tool for researchers and data scientists in various fields.

When it comes to LLM applications, a significant goal of the model is to generate outputs that sound genuinely human. But, synthetic data poses certain challenges. As we move to specific applications, such as a chatbot that resides on a brand’s website, synthetic data can’t produce outputs that match the brand’s voice as it pulls data from the open web. In the Trust & Safety world, we’ve seen people couch toxicity and threats in highly creative ways, and it is difficult to generate synthetic data that can catch this level of creativity.

Additionally, synthetic data is created with algorithms that pull from the open web. As a result, it does not retain many of the advantages of well-vetted real data, and should never be used on its own to train large language models.

Although generating synthetic data offers many advantages, it can never be a complete replacement for real data, due to its inability to capture the subtle complexities and unforeseen patterns of real-world data, limiting models trained solely on synthetic data.

Learn More: Data Training Strategies for AI Models

Specialized Large Language Models: Putting Them Into Production

In the dynamic field of artificial intelligence, integrating specialized Large Language Models (LLMs) into production is a crucial step for businesses aiming at seamless functionality.

Understanding the LLM's Potential
The integration process begins with a thorough analysis of the LLM's capabilities, aligning its strengths with the unique needs of the production environment. This initial stage includes a deep dive into computational resource assessment, data input analysis, and identification of relevant use cases.

Rigorous Testing and Optimization
After laying the groundwork, the LLM undergoes rigorous testing in a controlled environment. This phase is about ensuring seamless functionality, fine-tuning parameters, and optimizing algorithms to enhance accuracy and efficiency.

Strategic Deployment
Successful testing sets the stage for deployment, where careful integration into the production infrastructure takes place. Considerations for scalability, reliability, and real-time processing ensure the model meets technical standards.

Continuous Monitoring and Adaptation
In the ever-changing landscape of technology, continuous monitoring, and feedback loops are established to adapt the LLM to evolving requirements. Strategic optimization ensures the model remains aligned with the latest trends and standards.

Empowering Users
Comprehensive documentation and training materials are developed to facilitate a smooth integration process, empowering users to leverage the specialized LLM effectively.

In conclusion, the integration of a specialized LLM into production is a holistic journey that demands a meticulous blend of technical expertise. By approaching this process comprehensively, businesses can unlock the full potential of their LLM, achieving seamless functionality in their production environments.

Learn More: Putting Specialized Large Language Models Into Production

Messaging, Conversational, and Behavioral Data:
A Look Ahead for Large Language Model Training

Data scientists have access to various resources for training an LLM. Open-source datasets are available for both conversation datasets and behavioral data. Conversation datasets consist of freely available data showcasing dialogues or conversations between individuals. These datasets are used to train machine learning models for conversational AI systems, enabling them to understand and generate human-like responses.

For instance, Nurdle’s data vault has billions of rows of messaging data from gaming, social media, gaming, and e-learning platforms. The data can be grouped and labeled by demographic, category, or to train machine learning models.
 
The data has been scrubbed of any personally identifiable information (PII) and categorized according to toxicity levels. This allows for a machine learning model (LLM) to be trained to alert customer service when a customer uses inappropriate language while interacting with a brand's chatbot.

Behavioral data refers to the actions and patterns of behavior displayed by individuals and groups. This data is collected from various sources, including online interactions, user activity logs, sensor data, and social media posts, and is used to train machine learning models that can predict human behavior. The data provides valuable insights into how people behave, make decisions, and interact with their environment. Online retailers can use behavioral data to increase conversions by suggesting similar products that may be of interest to their visitors.

Advantages & Disadvantages of Conversation Datasets

The obvious benefit of using conversation datasets to train large language models is that they exist, they’re often large, and they’re readily available. This spares your company the burden and financial costs of collecting and annotating datasets internally.

Many of the conversation datasets contain plenty of examples of varied and realistic speech, which can improve the accuracy of LLM models. Certain datasets, such as Nurdles, excel in capturing the inherent relationship between context and response. This relationship aims to comprehend how sentences or phrases are interrelated in a conversation. By comprehending this relationship, Language Model Machines (LLMs) can better understand the context of a conversation and produce responses that are contextually relevant. Nurdle was constructed using contextual AI. To develop contextual AI, our LLMs were trained on conversation data.

Finally, these datasets have been deployed by other companies, so they’re market-tested. When you use an open source dataset, you can benchmark and compare the performance of your LLM model against established baselines and state-of-the-art models.

Conversation datasets, such as the Cornell Movie Dialogs Corpus, can have a specific domain focus. This dataset, in particular, was trained on movie scripts, and real people don’t necessarily speak the way actors do in movies. This level of specificity may limit your LLM’s ability to respond in other scenarios.

Many of the conversation datasets are static, yet language evolves on a continuous basis.

Phrases that may seem innocuous on the surface may have evolved to take on a sinister meaning in the Zeitgeist. This can lead to brand safety issues.

Although conversationl datasets are quite comprehensive, they can still be limited in terms of representing diverse demographics, cultures, or language variations. As a result, the trained model can be affected by biases or skewed representations.

It's important to note that datasets that contain real conversations from social media posts or Reddit may contain Personally Identifiable Information (PII) and can violate privacy and security regulations, as is the case with real-world data.

The Importance of AI Model Training with Conversational Data

Conversational data is data that has been extracted from digital conversations in a wide array of communications channels, such as online forums, social media platforms, reader comments, customer service interactions, email and so on. It creates structured data out of vast volumes of unstructured data. This data is used to participate in dialogs by predicting the next response.

Conversational data can be active or passive data. Active data are the prompts a user enters (e.g. “do these shoes come in red?”). Passive data is collected without explicit requests, and includes things as the timestamps on each interaction.

Conversational data is data that has been extracted from digital conversations across various communication channels, such as online forums, social media platforms customer service interactions, email, and more. This type of data is unstructured and voluminous, but it can be transformed into structured data.

Conversational data is used to engage in dialogues by predicting the next response. Conversation datasets is classified into two types: active and passive data. Active data includes the prompts that a user enters, such as "Do these shoes come in red?" Passive data, on the other hand, is collected without explicit requests and includes details like timestamps on each interaction.

LLMs can be used for a wide range of applications, but if you want to use it for things like chatbots and virtual agents, your LLM must be trained on conversational data.

Conversational data is unique in many ways. First, it captures the nuances of human language, including the way we phrase things, filler words (think of all those “likes” we interject), slang and other variables. It focuses on the flow and structure of conversations, rather than just individual words or sentences.

Conversational data can also provide context by capturing the back-and-forth interactions between people. Interactions provide important clues, including the sequence of text responses. This contextual understanding is ultimately what allows for meaningful and relevant responses.

It is important to be aware that conversational data, which is essential for building generative AI chatbots, can contain biases that might impact the performance of the models. Therefore, it is necessary to adjust for these biases before deploying the chatbot to ensure its accuracy and fairness.

Data Preparation - Data Cleaning and Data Labeling

Data cleaning and data labeling are critical steps to train large language models. They’re also the most time consuming and least enjoyable tasks on the data scientist’s plate.

Data Cleaning

Data cleaning is the process of identifying and removing inaccurate, incomplete, or irrelevant data from your dataset. This process ensures that your data remains accurate and reliable for analysis and modeling purposes. Data cleansing encompasses a wide variety of tasks, from handling missing values, removing noise, dealing with duplications, validating data and removing inconsistencies.

Why invest time and resources into data cleaning?

When you focus your LLM on relevant and high-quality data for the task at hand (e.g. help a website visitor select a product that’s right for her), it can learn more efficiently and apply its knowledge more effectively to new, unseen inputs. Without cleansing, the model's output may not be trustworthy or dependable.

What is PII scrubbing and why is it important

In April 2023, OpenAI made headlines when it revealed that sensitive consumer data, including credit card numbers, were exposed to other users. The exposure was due to a glitch that OpenAI fixed, but other security issues can occur if PII data isn’t scrubbed from the all datasets.

As a matter of course, datasets should neither collect nor store PII data. It’s a good idea to work with a third-party that will remove all PII prior to importing it as part of a dataset to train an LLM.

For instance, pseudonymization is a process that replaces every piece of personal data with an artificial identifier or pseudonym. To keep PII data out of AI training data, it should be sent to a third-party that masks any data in order to ensure the LLM is never trained on data that can contribute to the identification of a user.

Personally identifiable information (PII) in an AI training dataset can have immense risks, including violating consumer privacy rights outlined in GDPR, CCPA, and other emerging data privacy regulations.

What is Data Labeling

Data labeling is a critical component of data preparation for any machine-learning application. Data labeling involves identifying, describing, and categorizing specific elements in the data to train large language models. Data labeling involves identifying, describing, and categorizing specific elements in the data to train large language models. The labeled data serves as a reference for the model to process, predict, and respond to real-life data more accurately and efficiently. In addition to improving the model's performance, data labeling also helps to identify and correct errors or inconsistencies in the AI training data. Human validation of the labels is especially helpful in ensuring that the labels align with the desired model output.

Labeling data ensures that the data is annotated consistently, following predefined guidelines or standards. Consistent labeling is essential for training LLMs, as it helps the model understand the desired output and learn the correct associations between input and output.

When you clean and label data, it gives you valuable insights into the characteristics and properties of a dataset. This understanding helps you to analyze the data distribution, recognize any biases or anomalies, and make informed choices regarding data preprocessing and AI model training. As a result, the LLM becomes more accurate, dependable, and efficient in its language understanding and generation capabilities.

Steps include:

Learn More: Data Preparation with Nurdle AI

Fine Tune LLM with Featurized Data and Normalized Data

Featurization, also called feature engineering, is a method used to improve machine learning models by adding new data to a dataset. It is a process that transforms raw data into a format that can be efficiently utilized as input for machine learning algorithms. Essentially, featurization helps to convert text and images into numerical features that capture relevant information for the learning task.

There are a few reasons why you should featurize your AI training data. First, it allows for more efficient processing. By converting the data into a more compact and structured representation, models are able process and analyze the data more efficiently.

Featurizing data lets the model understand inputs better. Rather than deal with raw text directly, the model extracts important features or patterns from the data, enabling it to capture deeper meanings and contexts of the information. This process improves the model's ability to grasp the nuances and improve responses.

Normalization is a process in which data is put into a standard format so that it can easily be compared, analyzed and processed. It involves applying various transformations to the data so that it is more consistent and suitable for your LLM’s tasks.

Some common techniques include:

which eliminates inconsistencies in capitalization and ensures that similar words are treated the same way

so that the model can focus on the textual content that’s relevant to the task at hand

which is the process of splitting the input and output texts into smaller units that can be processed by the LLM AI models

Converting all text to lowercase

Removing punctuation marks

Tokenization

“Stemming is a process that stems or removes the last few characters from a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.” (source)

“Stop” words are those that don’t carry significant meaning in a specific context. Examples of stop words include “and” and “the.” We remove them to create more efficient and faster data computations

Stemming & Lemmatization

Stop words removal

Data normalization helps LLMs to better understand the underlying patterns and relationships within the text, ultimately leading to improved performance in fundamental tasks of LLMs, such as language generation, and sentiment analysis. Normalization also helps to reduce the dimensionality of the data -- aka the number of input features or variables that the model considers while processing and learning from the data. This, in turn, makes the data more manageable for processing and analysis.

Learn More: Fine-Tuning with Nurdle AI Data

Challenges with Building Generative AI

It's not difficult to launch a generative AI application, such as a chatbot. The hard part is ensuring it provides accurate and brand-safe responses to users. Before you embark on such an endeavor, you should know the challenges and expenses you will face.

A Lack of Data & The Cold Start Problem

We already spoke about the cold start problem, and the need for sufficient volumes of data. But how much data is required? The amount can vary depending on several factors, but in general it’s a lot. LLMs typically have at least one billion or more parameters, which require substantial amounts of training data. LLMs are trained on massive datasets, known as corpuses, which are measured in petabytes of data.

Additionally, the sheer size of an LLM and its AI training data requires a massive infrastructure encompassing large clusters of accelerated machine learning instances for training, which is as expensive as it sounds.

High Cost of Data Labeling

The cost of labeling a large dataset is astronomically high, as it requires expertise in the language used by the target audience (see Understanding How an Audience Speaks above), significant manual labor to collect and build the dataset, applying data science resources to ensure the data is of sufficient quality to answer the task at hand, as well as the cost of removing PII data.

The more complex the language, the higher the costs. The cost for a single annotation can range from $0.015 to $0.08. Multiple this by several million annotations and we can see how quickly these costs add up.

Privacy Compliance

When consumers interact with brands, they willingly provide an array of PII data, including their names, email addresses and other contact data, account numbers and other sensitive data. It is essential that none of this data ends up in the dataset used to train large language models; if it does, this sensitive and PII data may be exposed to other users, in violation of GDPR, CCPA and other privacy regulations, as well as the White House’s AI Bill of Rights.

The fines for violating existing privacy regulations can be steep. For instance, GDPR has two tiers of fines, depending on the severity of the violation. The less severe infringements can result in a fine of up to €10 million or 2% of the company's annual global revenue from the preceding fiscal year, whichever is higher. More serious violations can result in a fine of up to €20 million or 4% of the company's annual global revenue from the preceding fiscal year, whichever is higher.

Data Hallucinations

We’ve all read instances of ChatGPT hallucinations:

Georgia radio host, Mark Walters sued OpenAI when he learned that ChatGPT was spreading false information about him and accusing him of embezzling money.

Steven Schwartz, an attorney representing a plaintiff in a lawsuit, submitted a ChatGPT-created brief to a court that cited cases ChatGPT had hallucinated.

NBC New York reporters said that when researching a story on Michael Bloomberg, “ChatGPT seemed to make up entirely fabricated quotes from phony anonymous sources.” One fictitious source claimed Bloomberg said, “’It’s not about giving back, it’s about buying influence.”

What are LLM hallucinations? These are instances where an LLMs generates text that is factually incorrect, nonsensical, or not real. (Note: ChatGPT warns users of its propensity to hallucinate below it’s prompt bar.)

These hallucinations occur for a variety of reasons, such as incomplete or contradictory AI training data, overfitting (i.e. instances where a model performs extremely well on the AI training data but not on new, unseen data, or vague prompts).

Hallucinations can also occur based on the varied datasets used to train the model. The model may make associations between specific words or phrases and certain concepts, even if those associations are not entirely accurate or overlay specific.

The types of hallucinations described above occur when a model generates content that seems contextually appropriate but is, in fact, false information. Ryan Treichler, Spectrum Labs VP of Product Management, explained to AdMonsters that LLMs are word calculators that are really good at predicting what the next word should be.

Data Retention

Data minimization is the foundation of our privacy policy. We accomplish this through a process known as pseudonymization, which essentially replaces every piece of personal data with an artificial identifier or pseudonym. In our case, user data is sent to a third-party that performs the masking for us, which means that Spectrum never receives, stores or processes any data that can contribute to the identification of a user.

The second step in Spectrum’s privacy protection design is the de-identification of the data saved by Spectrum. This process is done by ID replacement, using a keyed hash function. The key is managed in AWS' KMS (key management service), and deleted no less frequently than every 30 days. This keyed hash enables data controllers the ability to take necessary action on individuals in the short term while protecting privacy long term.

Nurdle AI

Nurdle creates synthetic data based on our real-world data vault and your own real-world data. By combining both, Nurdle provides synthetic data that is brand-safe, personalized to your customers, privacy-compliant, and scrubbed of any toxicity. We can help you scale high-quality datasets for better performance.

Nurdle’s data vault is built on millions of rows of real-world conversational data from online gaming, dating, social media, and e-learning platforms. Our datasets have been cleaned of toxic content, scrubbed of personally identifiable information (PII) for GDPR compliance, and come accurately labeled for immediate use.

New Data Source for your Use Case

Spectrum Labs offers data to train your proprietary generative AI chatbot. This dataset begins with data collected from multiple datasets, including scraped data, public datasets, purchased data and your own data.

Next that data is thoroughly cleansed and scrubbed. All PII data is been removed, and any bias, child safety concerns and brand-safety concerns have been identified.

Once it is scrubbed, we generate synthetic data, using customer data for analysis. Spectrum then leverages LLms to create synthetic data sets for DaaS customers.

Meet with one of our data experts to unlock Nurdle's scalability for data creation, preparation, and measurement

Contact Our Team