Event Recording: Customization of Production LLMs

Expert Panel Recording:
Jan 30, 2024 — Santa Monica

Improving LLM Performance with Specialized Data

For AI builders, including product managers, data scientists, CTOs, and machine-learning engineers.

Featuring:

Nina Lopatina

Head of Innovation, Nurdle

Othmane Rifki

Principal Applied Scientist, Nurdle

Blake Hunter

Sr. Director of Data Science, DRINKS

Ali Hassanzadeh

Senior Applied Scientist, Spotter

Elizabeth Hutton

Senior Machine Learning Engineer, Cisco

Questions covered in this expert panel:

0:00 What are the kinds of systems that you are working on?

4:18 What kinds of data shortages have you faced in your work and how have you overcome them?

11:09 How do you define synthetic data and how do you benchmark it?

15:47 Where do you see the future of LLMs headed?

0:00 What are the kinds of systems that you are working on?

0:06 Elizabeth

For the most part, I work on NLP-related features. The products that you are probably most familiar with are Webex Meetings and the Contact Center, and there are a lot of NLP use cases there. Some of the more recent projects that I'm working on using LLMs are summarization, action item extraction, and automatic chapterization for meetings and for Contact Center calls.

0:36 Othmane

For me, my expertise is primarily in language processing, either in text or audio. More recently, I've been mainly focused on semantic search, which is essentially the capability of finding content similar to some input that you get. This is kind of similar to what Google does, but instead of relying on keywords, our current models have the ability to understand language at unprecedented accuracy. As you have probably experienced if you've played a little bit with ChatGPT, it actually can understand what you say. Obviously, it cannot, but it gives the impression that it does. So, we’re trying to take these capabilities, and the technology and models behind them, and scale them up to be able to search for data at scale, more or less. It has been a very interesting, challenging problem that I’ve been tackling in recent years. And before that, at Spectrum Labs, we worked on content moderation. The centerpiece of that was really trying to understand user-generated content: trying to understand the intent, “is a message safe or not?”, and trying to detect what class that message falls into, what type of content or behavior it is. “Is that positive or negative?” type of thing, and kind of going down that path.
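As a rough illustration of the semantic search idea described here, the sketch below ranks documents by cosine similarity between vectors rather than by shared keywords. The three-dimensional vectors are hand-written, hypothetical stand-ins for what an embedding model would produce; nothing here reflects the panelists' actual systems.

```python
import math

# Toy, hand-written "embeddings" (hypothetical values for illustration).
# In a real system these vectors would come from a language model.
DOCS = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to recover account access": [0.8, 0.2, 0.1],
    "Best red wines under $20": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, docs, top_k=2):
    """Return the top_k document texts most similar to the query vector."""
    ranked = sorted(docs, key=lambda text: cosine(query_vec, docs[text]), reverse=True)
    return ranked[:top_k]

# Hypothetical embedding of the query "I forgot my login".
results = search([0.85, 0.15, 0.05], DOCS)
```

Note that the query shares no keywords with "Steps to recover account access", yet that document still ranks near the top because its vector points in a similar direction.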

2:12 Blake

So at DRINKS, we started off looking at wine. Wine is one of these unique products where the stuff inside the bottle only tells part of the story. It’s also about the place it’s from, it's about the varietals, it's about the tasting notes, it’s about how it's aged… there's a huge story that goes behind every single bottle, and when people buy wine, it's a really, really complex process. So my team has been trying to understand that process. You might have a bottle of wine from California that's really, really successful in the US. You get the exact same grapes in, let's say, Italy; they have very similar tasting notes, it might come from the same varietal. But while that bottle might be really successful from California, you put that same label on an Italian bottle, and no one's gonna buy it. You have different expectations of something from Italy than you do of something from California. Similarly, you might have a tasting note that appeals to some customers but turns off other customers. If I call a Chardonnay oaky, a lot of customers might say, “Oh, I don't want an oaky Chardonnay.” If I said that that same Chardonnay was actually barrel-aged, you might say “Oh, I actually like barrel-aged Chardonnays, I just don't like oaky Chardonnays.” I tried to describe that exact same tasting note, but different customers are getting turned on or turned off by that. So my team tries to understand how you might use different language to describe the exact same thing to different types of customers, or use different types of labels to appeal to one customer versus another customer. There are all these different aspects: you have different expectations for different price points from different regions, there are many different combinations, and that’s a lot of what my team has been focusing on.

3:50 Ali

The main thing we try to do, in general, is consider what components make a video, a movie, a YouTube video, or anything, interesting. It has to be music, right? Transcript, the video itself, the narration, novelty of the idea, and creativity, of course. Any maintenance, plus many, many other things.

4:18 What kinds of data shortages have you faced in your work and how have you overcome them?

4:23 Ali

So in order to train, or actually fine-tune, the model to understand these kinds of edge cases, we don't have enough real data that’s actually labeled. So we created, for example, I believe 50 to 100 protagonist images in the foreground and the background, and AI basically labeled them and said “this is the protagonist” in the prompt, and we basically started fine-tuning on it. And that's really worked well. So, synthetic data, it's going to be the future.

5:03 Blake

We work with a lot of really small wineries that might not have the capability to produce hundreds and hundreds of different labels, test them with large focus groups, and understand what's effective and what's not effective. A lot of small wineries might work with the neighborhood elementary school. Maybe there's a child who drew a picture, and it's like, “oh, this is a great picture of wine drawn by the local fifth grader, we're gonna put that on our label, because that's what everybody in town knows!” But that label might not be very successful with a larger audience. So if you're trying to sell across the nation, you don't necessarily know what's effective and what's not effective. So for us, we generate a lot of labels for partners to try to figure out what's effective and what's not, and what might work really well for certain groups that you might be targeting. You think your story is being told by your label because, well, somebody from your community created this thing, but it might not resonate the exact same way. So we try to understand that and try to put those into synthetic labels. Image generation was always really, really difficult when we first started. It was very, very clunky. And all of a sudden we have Midjourney coming out, we have DALL-E coming out, and all these different image generation tools that can generate hundreds and thousands of different images with just simple text as an input. And that's revolutionized what we've been working on.

6:22 Othmane

We used to deal with trying to detect toxic content in different kinds of contexts. And one of the main problems, or challenges, that we had was ‘how do you source data that is highly specialized, covering very specific behaviors, in quantities large enough that you can train your models on it?’ And what we realized is that this really impacted the turnaround time for producing solutions by a significant amount: trying to find the right data, label it, check its quality, and then train your model. Actually training the model becomes the easy part. The hard part is finding the right data to train your models, data that is of high quality with accurate labels. And that's kind of the motivation for Nurdle: if we ran into that problem at Spectrum Labs, then many, many other companies ran into the same problem of just not getting the correct data that they need for their target application. And so in my experience, having the ability to source data that you can specify criteria for is one thing, but then labeling it, testing it, and determining its quality in a cheap and fast way is really key to turning the crank for generating models and building solutions powered by them. And that's something that I'm sure many, many companies have struggled with, as we did back when we were at Spectrum Labs.

7:56 Elizabeth

You would think working at Cisco, I'm just swimming in data, right? Got millions of meetings, millions of Contact Center calls every month, thousands of customers. Must just be swimming in data. I'm not, okay, I have zero data. Cisco takes its reputation very seriously as a responsible gatekeeper of your data. And because of that, even I, as somebody who's worked at the company for many years, have almost no access to real customer data. This is my number one biggest headache of working at Cisco, the data shortage. And, you know, we try to source public datasets, we scrape the internet for things, but those datasets only come so close to real-world data. Even when supplemented by the few datasets we have from customers and ourselves that we've scraped together, it's very small and not necessarily representative of the wide distribution that we're trying to cover. So we have attempted synthetic data generation, using both models and humans, with varied success. We've found that, in general, models can do a pretty good job of synthetic data generation, but it breaks down as soon as you get to something more complex, like trying to generate an entire synthetic transcript of a call between two people, let alone three people, or a meeting. You know, Contact Center calls are on the order of minutes, but the average meeting is 45 minutes long. And I remember trying to get ChatGPT to generate synthetic transcripts between two people. And you know, every once in a while it would produce something that kind of sounded legit, it was really, you know, not bad, maybe a little bit basic. And maybe the sentiment was a little bit too high on the chart of like, really, really angry customer, which was always funny. But the majority of the time, it just produced nonsense or would repeat itself over and over again.
And then the challenge became ‘well, now we need to design some system to filter out whatever it's producing, because only a very small percentage of it is actually usable.’ For those more complex tasks, we often turn to human annotators for generating synthetic data and we have had some success, generating big, big transcripts using human annotators. But it's very expensive, very time consuming, and not scalable at all.
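One simple way to picture the kind of filter described here is a heuristic that rejects generated transcripts that are too short or that loop on the same lines. The sketch below is only an assumption about what such a filter might look like; the function names and thresholds are hypothetical, and a production filter would likely need stronger checks, such as a model-based quality score.

```python
def repetition_ratio(transcript: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip().lower() for ln in transcript.splitlines() if ln.strip()]
    if not lines:
        return 1.0  # treat empty output as fully degenerate
    return 1.0 - len(set(lines)) / len(lines)

def keep_transcript(transcript: str, min_lines: int = 4, max_repetition: float = 0.3) -> bool:
    """Keep a generated transcript only if it is long enough and not looping."""
    lines = [ln for ln in transcript.splitlines() if ln.strip()]
    return len(lines) >= min_lines and repetition_ratio(transcript) <= max_repetition

# Hypothetical generated outputs.
plausible = (
    "Agent: Thanks for calling, how can I help?\n"
    "Customer: My router keeps dropping the connection.\n"
    "Agent: Let me check the line status for you.\n"
    "Customer: Thanks, it started yesterday evening.\n"
)
looping = "Agent: How can I help?\nCustomer: Hello?\n" * 3

usable = [t for t in (plausible, looping) if keep_transcript(t)]
```

Only the first transcript survives the filter; the second repeats the same two lines and is dropped.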

11:09 How do you define synthetic data and how do you benchmark it?

11:13 Nina

One of the challenges, of course, with synthetic data is that these models are so large that they can overfit and memorize the data. So what we're going for is something that's pretty specific in the generation, which should make it distinct from any of the training data. But then we also validate and check that nothing has been memorized, even with a kind of fuzzy search, to make sure that it's not regurgitating. With really short text, you might sometimes get a match by chance, but typically we're getting something a little bit more distinct. You know, at a certain level, when you're doing data science and AI, you have to get into the weeds and just look at the data. When you’re prototyping, that's going to be the most useful signal. But then in terms of metrics that we're using to evaluate on a broader scale, we do have subjective evaluations, based on the criteria that we're looking for in that particular dataset, that are done by another model. And then we're generally trying to match the distribution of the real data along a very long list of attributes. We find that the closer we are to that distribution, the more realistic the data will seem. And it does end up being kind of specific to the use case, so it would be something different for transcripts compared to shorter text. We also check for label accuracy. I think that's one of the areas where synthetic data really shines: you can generate the data with the label, because the label that was trained in the fine-tuning is what you're then requesting at inference. So it comes out with a label, and then we double-check it and make sure that it matches through another system, an evaluation scoring system. And when those two match, that's going to be much more likely to be an accurate label.
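Two of the checks mentioned here, a fuzzy search for regurgitated training text and a label double-check against a second system, can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not Nurdle's implementation: the 0.9 threshold, the function names, and the idea that the evaluator's label arrives as a plain string are all hypothetical.

```python
from difflib import SequenceMatcher

def max_similarity(candidate, training_texts):
    """Highest fuzzy-match ratio (0.0 to 1.0) against any training text."""
    return max(
        (SequenceMatcher(None, candidate.lower(), t.lower()).ratio()
         for t in training_texts),
        default=0.0,
    )

def is_regurgitated(candidate, training_texts, threshold=0.9):
    """Flag a generated text that nearly duplicates a training example."""
    return max_similarity(candidate, training_texts) >= threshold

def keep_example(example, training_texts, evaluator_label):
    """Keep a synthetic example only if it is novel and both labels match."""
    novel = not is_regurgitated(example["text"], training_texts)
    return novel and example["label"] == evaluator_label

# Hypothetical data: one training text and two generated examples.
training = ["My order never arrived and I want a refund."]
copied = {"text": "my order never arrived and I want a refund.", "label": "complaint"}
fresh = {"text": "Could you recommend a light red wine for dinner?", "label": "question"}
```

Here `evaluator_label` stands in for whatever the independent scoring system returns; an example is kept only when the requested label and the evaluator's label agree.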

13:15 Elizabeth

I'd say that the question of benchmarking for synthetic data is very similar to the question of benchmarking for any kind of generative AI model. There are multiple layers to it. The first thing you do is a sanity check: look at the results and see, ‘Are these even close to making sense?’ And then maybe the second step is, ‘Yes, let's get some human annotators to look at these.’ We call this internally a Turing test. You present people with a couple of outputs, like two outputs, and you ask them which one is model-generated and which one is real. And that can be a signal as to how well your generative system is performing. If you're able to fool humans some percentage of the time, that's a good signal that your model is doing a good job, you know, generating what you want it to.
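The pairwise test described here reduces to a simple statistic: how often annotators fail to pick out the model-generated item. The sketch below assumes each trial is recorded as a boolean (whether the annotator correctly identified the model-generated output); that encoding and the function name are hypothetical. With two choices per trial, a fool rate near 0.5 means annotators are doing no better than guessing.

```python
def fool_rate(trials):
    """Share of pairwise trials where the annotator picked the wrong item.

    Each entry in `trials` is True if the annotator correctly identified
    the model-generated output, False if they were fooled.
    """
    if not trials:
        raise ValueError("need at least one trial")
    return sum(1 for correct in trials if not correct) / len(trials)

# Hypothetical annotation results for eight pairs.
trials = [True, False, True, True, False, False, True, False]
rate = fool_rate(trials)
```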

14:19 Blake

We do a lot of combinations of these as well. One other thing we do is really try to find outliers in the data. So ‘does anything stand out? Can we detect it as an outlier?’ Those usually jump off the page, like ‘there’s something wrong with this, this doesn't look like it's legit.’ So we check first: can it pass that test, does it not look like an outlier? We use lots of synthetic data. We might generate synthetic data for recommendation systems to understand customer journeys; we might want to see what happens when customers have a really bad experience. We don't want an actual customer to have a bad experience, we want the synthetic data to be there to see what would happen. And in some of those, maybe there are bad actors in your systems. We find those all the time: you might have someone that hits your recommendation system 100 times, puts 100 items in your cart and pulls 100 items out of your cart, and does nothing else, or they'll do this 1000 times. That's probably a bad actor. But it's also a really good way to figure out ‘is that synthetic data legit or not?’ because it doesn't look human. So trying to find things that aren't human is a really good way for us.
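The cart example suggests a very simple "does this look human?" rule: a session with a large number of events that consists of nothing but adding and removing cart items. The sketch below encodes that rule; the event names, the threshold of 100, and the function name are assumptions for illustration, and a real system would likely combine several such signals.

```python
from collections import Counter

CART_CHURN = {"add_to_cart", "remove_from_cart"}

def looks_automated(events, min_events=100):
    """Flag sessions that are long and contain only cart adds/removes."""
    counts = Counter(events)
    only_cart_churn = set(counts) <= CART_CHURN
    return len(events) >= min_events and only_cart_churn

# Hypothetical sessions, real or synthetic.
bot_session = ["add_to_cart"] * 100 + ["remove_from_cart"] * 100
human_session = ["view_item", "add_to_cart", "view_item", "checkout"]

flagged = [s for s in (bot_session, human_session) if looks_automated(s)]
```

The same check can be run over synthetic sessions: any that trip it do not look like plausible human behavior.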

15:22 Ali

Evaluation, for me personally, I'd say in general comes before synthetic data. You want to have an evaluator or scoring method to look at your prompt or your output. And if it doesn't get to that score, now you want to add synthetic data. That's one of the main applications, right? Because you want to increase the accuracy.

15:47 Where do you see the future of LLMs headed?

15:50 Blake

So for me, these large language models are awesome. I think they’re the future, I think there are a lot of things that are there. But I think the large language models in the near future are going to start getting smaller. And I think synthetic data is going to be a big part of them becoming smaller. I think it's going to be about having more curated data, better, cleaner data, data that's really fine-tuned to your domain. Very, very domain specific. You're gonna have to start bringing synthetic data into these applications if you don't have access to the billions and billions of open-source data points that are out there for your specific application. And I really think that that's the future.

16:26 Othmane

One of the problems that we're starting to know is, how do you move these tables or kind of prototyping environments into production? And I think that's one of the problems that needs to be addressed at scale, and needs to be solved. And part of it is, instead of having such a large language model that can do everything, we can have much smaller models with high quality data that can solve your problem. Because that’s all you care about at the end of the day. You don't need it to do everything, just one thing very well.

17:03 Elizabeth

If we had the data, it would be great to fine-tune some models on the specific tasks that we need, because at the scale at which we're working at Cisco, using a state-of-the-art GPT model, even the cheapest version of it, is prohibitively expensive. And if we were able to collect a small… you know, it doesn't take much to fine-tune a large language model, really. It really takes less than you think. But you've got to have high-quality data to accomplish it.

The ever-evolving landscape of AI presents formidable challenges — ranging from data scarcity and regulatory landscapes to nuanced edge cases and the demands of specialized models. In this panel discussion, experts in the field dive into their strategies for overcoming these hurdles. Through the lens of real-world examples and use cases, they unveil innovative solutions and showcase the pivotal role of synthetic data — particularly the advancements in second-generation synthetic data.

At Nurdle’s Santa Monica LLM event, leaders from Cisco, DRINKS, Spotter, and Nurdle shared their approaches and their views on the evolving landscape of AI. The panel discussion covered the strategies, technologies, and ethical considerations driving the resolution of AI challenges. The recording also covers the pivotal role of synthetic data in shaping the future of AI development and deployment while ensuring responsible and ethical use.
