1:09 Darren
We do a lot of fine-tuning of open-source foundation models. Our basic premise is that smaller, specialized open-source models, when they're fine-tuned appropriately, can deliver performance that is state of the art, as good or almost as good as the OpenAIs and the Anthropics, at a dramatically lower cost, in a way that is fit for purpose and can be integrated much more easily into a company's workflow. So we have fine-tuned models from about a billion parameters to 7 billion, some 13 and 20 billion, and we've put a whole collection out into open source: a CPU-oriented set of models. Because when you're doing something with sensitive data, you actually want to be able to do it on your laptop without sending sensitive company information out to a public API. We were still trying to evaluate whether this would even work, so we built a series of instruction-fine-tuned models, from one billion to 3 billion parameters, that run on a laptop. That way you can do RAG where all the data, all the documents, and all the embeddings never leave your laptop. So we've done a lot of work with those kinds of smaller models, evaluating what they can and can't do.
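To make that concrete, here is a minimal sketch of the kind of fully local RAG loop Darren describes, where the documents and embeddings stay on the machine. The model name, sample documents, and helper function are illustrative assumptions, not his actual stack.

```python
# Minimal local RAG sketch: embeddings and documents never leave the machine.
# Assumes sentence-transformers is installed; model name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

# Small embedding model that runs comfortably on a laptop CPU.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

documents = [
    "Q3 revenue grew 12% year over year, driven by subscriptions.",
    "The incident report cites a misconfigured load balancer.",
    "Employee handbook: remote work requires manager approval.",
]

# Embed the corpus once; store vectors locally (here, just in memory).
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# The retrieved passages would then be passed as grounding context to a small
# local instruct model (e.g., a 1B-3B parameter model run via llama.cpp).
print(retrieve("What caused the outage?"))
```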
2:33 Chris
We know that LLMs will hallucinate wildly, so using them as a way to fact-check is actually really dangerous. So instead of the generated text, what I tend to use is the embeddings. Specifically, I'm working on a bunch of research problems around detecting large-scale misinformation campaigns, using the embeddings to link very different sets of claims that might be made by the same group or person but span disparate topics. If you're a Russian troll farm, you may be discussing something about a volcano going off in Iceland, but you also might be talking about Trump; to a regular human, we wouldn't be able to connect those. But using the embedding models, we may be able to group those within a large geometric space and surface those connections a little better.
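As an illustration of the technique Chris describes, grouping disparate claims by how close their embeddings sit in that geometric space, here is a small sketch. The claims, model choice, and clustering threshold are all hypothetical, not his actual research pipeline.

```python
# Sketch of linking disparate claims via embedding proximity; illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

claims = [
    "The Iceland volcano eruption was staged by Western media.",
    "Reports of the eruption in Iceland are fabricated.",
    "Trump's rally drew the largest crowd in state history.",
    "Crowd estimates at the Trump rally were suppressed by the press.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(claims, normalize_embeddings=True)

# Group claims whose embeddings sit close together; clusters can surface
# topically or stylistically linked messaging across otherwise unrelated posts.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, metric="cosine", linkage="average"
)
labels = clustering.fit_predict(vectors)
for claim, label in zip(claims, labels):
    print(label, claim)
```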
3:38 Leo
At Warner Bros, we have lots of video content to discover: a lot of news, sports, and entertainment. That also comes with text, and obviously, as we're speaking, all of this can be transcribed into text; therefore, you have this entire archive of text. Not to mention that lots of movies and shows generate lots of content online: your reactions to watching Game of Thrones, the plot descriptions, somebody publishing an article about our IP. That's a lot of text to grab from different sources. So one of the very easy, low-hanging-fruit use cases is to have a normalized process for describing the content across the gamut of various creative forms. Our general philosophy for utilizing not just LLMs but machine learning in general is to involve human experts. What we're trying to deploy, or trying to experiment with, is converting a totally manual process into a human-machine process where the data is fact-checked, and, not only that, where you use the ground-truth labels to continue training your LLMs or your other machine learning models so they keep performing better, and specifically perform better in your space: your content space and your data space.
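A schematic sketch of that human-in-the-loop cycle: the model drafts descriptions, an expert verifies them, and the verified labels accumulate as ground truth for later fine-tuning. Every name here is a hypothetical stand-in, not Warner Bros' actual pipeline.

```python
# Human-in-the-loop labeling loop: model drafts, expert verifies, and the
# verified pairs feed the next fine-tuning round. All stand-ins, for shape only.
from dataclasses import dataclass, field

@dataclass
class LabelStore:
    examples: list[tuple[str, str]] = field(default_factory=list)

def draft_description(content: str) -> str:
    # Stand-in for a call to a fine-tuned description model.
    return f"Auto-draft description of: {content}"

def expert_review(draft: str) -> str:
    # Stand-in for a human expert correcting or approving the draft.
    return draft.replace("Auto-draft", "Verified")

def human_in_the_loop(batch: list[str], store: LabelStore) -> None:
    for item in batch:
        draft = draft_description(item)
        verified = expert_review(draft)          # fact-checked ground truth
        store.examples.append((item, verified))  # feeds the next fine-tune
    # Once enough verified examples accumulate, a fine-tuning run would use
    # them so the model keeps improving on this specific content space.

store = LabelStore()
human_in_the_loop(["Game of Thrones S1E1", "Evening news segment"], store)
print(store.examples)
```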
5:20 Nina
At Nurdle, we're creating content and text for people who are often working on classifying user-generated content, which is a very different style from what many of the assistant-like models were trained on. So we ended up needing to fine-tune models in order to be able to create data like that because it's just not something that we can find without fine-tuning.
5:48 Leo
General machine learning techniques can automate video or text editing tasks that were previously done manually. However, if we consider how ChatGPT was constructed, it is designed to optimize natural language, to make it feel like you're having a real conversation with someone. These models generate a lot of text, even when it's not necessary, to provide context. On the other hand, when you introduce a level of automation, machine learning tasks are very specific and require human verification. Using a large LLM can cause hallucinations and add unnecessary context. Moreover, it can be slow to run and require expensive in-house GPU resources. These issues can cause problems where the system fails to deliver.
So we start to turn to the smaller proprietary models, and also the open-source ones, rather than just using OpenAI. With a public API, you're sending out very proprietary and sensitive data, and oftentimes creative data where the rights don't belong to our company; they belong to the creators. Those are instances where you don't want to send that over to OpenAI, even with your NDAs or whatever agreements are in place. So we find ourselves having to host a lot of these models in-house and fine-tune them, such that we end up owning the models themselves as IP, not just the data.
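For illustration, here is a minimal sketch of running a small open-source instruct model entirely in-house, so sensitive or rights-encumbered text never leaves your infrastructure. The model name and prompt are assumptions; any locally hosted instruct model would work the same way.

```python
# Running a small open-source model locally via the transformers pipeline API.
# Nothing in the prompt ever leaves the host.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small enough for modest hardware
    device_map="auto",
)

prompt = (
    "Summarize the following scene description in one sentence:\n"
    "INT. NEWSROOM - NIGHT. An editor reviews breaking footage...\n"
)
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```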
8:04 Chris
One of the important things we need to do is look at customer complaint data on social media against, for example, regulated banks. Customers might be complaining that their deposits are not showing up, their remittances from abroad aren't arriving, or the ATMs have stopped working, which can be indicative of something going wrong. But especially when you're a small, new startup like us, that would essentially require manual tagging of every single comment, and that's tens of thousands, hundreds of thousands of comments.
So we need things like synthetic data, and that's why we're working with Nurdle on the pilot: so that we can train not just LLMs but other types of machine learning models as well, because this data is good for machine learning broadly, and LLMs are just one very specific part of it. We can train these models to better recognize and classify comments we might not have seen, without having to hire a ton of people with financial knowledge, because unfortunately you need that experience to be able to go through and tag every single one.
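A small sketch of what that could look like in practice: training a lightweight complaint classifier on synthetic, labeled data. The example texts, labels, and model choice are purely illustrative, not the actual pilot.

```python
# Training a lightweight classifier on synthetic labeled complaint data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic (text, label) pairs standing in for generated training data.
texts = [
    "My deposit from last week still hasn't shown up in my account.",
    "The remittance my family sent from abroad never arrived.",
    "Every ATM in my neighborhood has been down for two days.",
    "Love the new mobile app design, great job!",
]
labels = ["deposit_issue", "remittance_issue", "atm_outage", "not_a_complaint"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Classify a comment the team may never have seen before.
print(model.predict(["ATM ate my card and the branch is closed"]))
```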
9:25 Darren
When you look at what's led to so many of the dramatic improvements we've seen, both in the models and in the open-source community, I think you have to say the biggest contributor to LLMs has been data. And I think that shows up in three places. First, in the base pre-training of the model: data is everything in the sense that it ultimately gives the model its foundational capability. Second, data is everything when it comes to fine-tuning; it's a data exercise. Once you have your secret recipe, your secret sauce, your secret data, you can get a model to exhibit almost whatever behavior you want if you've designed that dataset the right way. Then, finally, data is everything because it's your data, a company's data. Most of the interesting work that will be done with LLMs in the enterprise will, in one form or another, tie into private knowledge from that enterprise: working knowledge of that data and, ultimately, the application of model inference to that data. That's really what's going to enable all the really interesting stuff with AI.
11:33 Nina
At Nurdle, one of our big concerns is the license of the model we're using. Depending on what our design partners are interested in, if it's data for fine-tuning a model or for a classifier, then as long as the model has a commercial license, that's okay. But if we're creating data for someone to pre-train their foundation model, we must abide by the licenses, so some models are off the table in that scenario. We ended up having to choose a few different models to work with, which of course brings some stumbling blocks as you try to shift your workflows between them. At Nurdle we don't have a lot of other data coming in that we have to worry about, but at Spectrum it was exactly the opposite: we were moderating 15,000 records per second for six years. Fortunately, we could keep all that data to use as the seed for some of our fine-tuning, since it spans a wide range of topics and languages. But we had that data PII-scrubbed before it ever touched any system that anyone at the company had access to.
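For illustration, here is a minimal sketch of the kind of PII scrubbing step Nina describes, applied before records enter any internal system. Real pipelines use far more thorough detectors (for example, NER-based tools); these regex patterns are assumptions for demonstration only.

```python
# Minimal PII scrub applied at ingestion, before data touches internal systems.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace detected PII with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Contact Jane at jane.doe@example.com or +1 (555) 867-5309."
print(scrub(record))  # -> "Contact Jane at [EMAIL] or [PHONE]."
```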
12:49 Chris
The nice thing is that all the data I work with is naturally public. The big difference for us is regulatory variation across jurisdictions; very little of the work I do is based stateside. But when I have to deal with the Philippines and Ghana in the same 12 hours, it gets really complicated and very expensive, because they have very specific differences in their data privacy laws: one doesn't have one, another is restrictive but only around specific topics. That's why I really like synthetic data, because then I don't have to worry about those problems, especially when I'm testing a hypothesis, maybe not in full production, just to see if it might actually work. Then I can start calling up lawyers and various regulators to make sure that what we're doing is kosher.
13:54 Leo
So thinking about PII: obviously there's GDPR and CCPA, which very much govern consumer data. Your digital footprint, purchase history, transactions, your biometrics, those should belong to you, and you can file a request to have them deleted and so on. But the flip side, now that we're working with LLMs and much more text data, is that a lot of this text, and the images behind all these image models, came from creative talent. That raises a question not just about who owns the data, but about who owns the resulting models trained or fine-tuned on that data and, very importantly, who gets to profit from it. The consumer side has been built up through very strict governance; these are the new challenges we're considering in the LLM industry.