0:28 Deepak Ramachandran, Staff Research Engineer, Google
I've worked on various projects in the past, one of which involved using BARD to explain recommendations to Assistant users. For instance, we would recommend a movie and explain why it would be a good movie for them. I have also worked on predictive search, where we predict what the user might search for in the future based on their current search behavior. For example, if someone searches for “How to break up with my boyfriend?”, we might predict that in a week they will search for “How do I get back together with my boyfriend?” That was one of our successful predictions.
1:01 Charlene Chambliss, Senior Software Engineer, Aquarium
At Tidepool, we use LLMs for qualitative data analysis. We can take a definition of something you care about, like whether a question is about programming or not, and create a classifier that determines whether a given question is related to programming. We do this with LLM-generated labels, and by fine-tuning a classifier on them. All of this is made possible by generative models.
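As a rough illustration of the workflow Charlene describes — use an LLM to label data cheaply, then distill those labels into a small, fast classifier. The `llm_label` stub and the toy Naive Bayes below are stand-ins for a real API call and real fine-tuning; none of this is Tidepool's actual pipeline.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a real LLM call; in practice this would be an
# API request with a prompt like "Is this question about programming?"
def llm_label(text: str) -> str:
    keywords = {"python", "compile", "bug", "function", "code"}
    return "programming" if any(w in text.lower() for w in keywords) else "other"

# 1. Use the LLM to cheaply label a pile of unlabeled questions.
unlabeled = [
    "How do I fix a Python import bug?",
    "What is the best hiking trail near Denver?",
    "Why won't my function compile?",
    "Which blender is good for smoothies?",
]
labeled = [(q, llm_label(q)) for q in unlabeled]

# 2. Distill those labels into a small classifier (a toy Naive Bayes
#    over bag-of-words, standing in for real fine-tuning).
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in labeled:
    class_counts[label] += 1
    word_counts[label].update(text.lower().split())

def classify(text: str) -> str:
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior plus add-one-smoothed log likelihood of each word.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + 1
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("How do I debug this Python code?"))  # → programming
```

The distilled model never calls the LLM at inference time, which is the point: the expensive model runs once over the training pool, not on every production request.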
1:40 What are some key sticking points for getting a custom LLM into production?
1:41 Deepak Ramachandran, Staff Research Engineer, Google
When you've got a custom problem in mind, you take some data, maybe your project manager gives you a lot of data, and you manage to train and fine-tune a model. And you find one that has just a little bit of extra performance compared to your baseline, like your pre-trained model. But that's just the first step. The question is the cost of actually deciding to launch that model, which involves things like deployment costs and maintenance costs. You have to have somebody tracking all the metrics for all the different models out there. So it's just extra overhead each time.
2:42 Nina Lopatina, Director of Data Science, Nurdle
One of them was performance, of course: precision and accuracy. And then we also had pretty stringent timeliness requirements, because we were processing data in real time, so sub-10-millisecond latency was demanded. So we had to get larger models down to a size that could run within those constraints.
3:05 Charlene Chambliss, Senior Software Engineer, Aquarium
A lot of the question, when deciding whether or not to deploy a fine-tuned information extraction model, was: is it good enough? If you put the results into a product and show them to people, are they going to say, “That's not true”? So a lot of that is getting the quantitative metrics: what are the F1 score and precision, and you get those for each class. For generative LLMs, you also have to do the eyeball test, because there's not really a good way to quantitatively evaluate them yet.
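The per-class quantitative metrics Charlene mentions can be computed directly from parallel lists of gold and predicted labels; a minimal sketch (the label names are made up):

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Per-class precision, recall, and F1 from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where it wasn't
            fn[t] += 1  # missed a true t
    metrics = {}
    for label in set(y_true) | set(y_pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[label] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics

truth = ["PER", "ORG", "ORG", "LOC", "PER"]
preds = ["PER", "ORG", "LOC", "LOC", "ORG"]
metrics = per_class_metrics(truth, preds)
print(metrics)
```

Breaking the score out per class, rather than reporting one aggregate number, is what surfaces the kind of single-class failures discussed later in the panel.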
3:45 What criteria have you evaluated to release a custom LLM?
3:46 Deepak Ramachandran, Staff Research Engineer, Google
It's useful to consider how you might frame problems, like having evaluation sets that might be out of domain. This is a very big problem for machine learning models: the bigger ones get really, really good at doing the task on the exact data distribution they receive. Say your data set, what you're getting from production, consists of 75% medical claims and 5% dating advice. The model gets really good at reaching high accuracy on that. But when that percentage shifts, and life changes, or the season changes, performance can drop much faster than many people expect, because the data goes out of distribution, as we call it. So having test sets that cover different mixes of data that you can evaluate against is really important.
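One cheap way to act on Deepak's point is to measure accuracy on each domain slice separately and then re-weight those numbers under different hypothetical traffic mixes. The domain names and accuracy figures below are invented for illustration:

```python
# Per-domain accuracy measured on held-out slices (hypothetical numbers).
domain_accuracy = {"medical_claims": 0.95, "dating_advice": 0.70, "other": 0.80}

# Candidate traffic mixes: today's production mix, plus a shifted mix
# meant to simulate seasonal or world-event distribution shift.
mixes = {
    "production_today": {"medical_claims": 0.75, "dating_advice": 0.05, "other": 0.20},
    "shifted": {"medical_claims": 0.30, "dating_advice": 0.40, "other": 0.30},
}

def expected_accuracy(mix, per_domain):
    """Accuracy the model would achieve if traffic followed this mix."""
    return sum(weight * per_domain[d] for d, weight in mix.items())

for name, mix in mixes.items():
    print(name, round(expected_accuracy(mix, domain_accuracy), 4))
```

A model that looks great on today's mix can lose several points under a plausible shift, which is exactly the failure mode worth catching before launch rather than after.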
4:44 Nina Lopatina, Director of Data Science, Nurdle
Something like GPT-4, which is an extremely performant model, probably can't easily create user-generated-content-style data or social-media-style data. But you always run the experiment; you don't make any assumptions. If there were a way to quantify that “how do you do, fellow kids?” quality, that would be awesome, but for now it's an eyeball metric. So, as to what Charlene was saying earlier, for generative models the evaluation becomes really challenging. At Spectrum Labs, one of our main challenges was a huge class imbalance. Most of the people on these various apps are saying nice things to each other; they're talking about games and other things. It's a pretty small proportion that is on there to be a troll. So you're not necessarily going to source a lot of trues, and when you move beyond English, that is even more true. We might try sampling for trues, but in lower-resource languages, where we're not performing as well, we're not able to find as many. This is one of the reasons that Nurdle AI spun out of Spectrum Labs: it is actually easier to generate these rarer examples than to try to source them, or at least it's a faster and cheaper way to be able to evaluate correctly, which is the first step to improving performance.
6:15 Charlene Chambliss, Senior Software Engineer, Aquarium
I also did a project where I was evaluating our named entity recognition model on different news sources from different countries, including non-English-speaking countries, and I discovered the NER model was actually doing much worse on certain countries, because they just use English differently relative to how English-speaking publications like The New York Times use it, and our model was trained on all of this very clean, CNN-style data. So that was one very obvious point of improvement. One dead giveaway was the Chinese data set: I noticed that performance on organizations was really, really low, around 0.2. And in that evaluation data set, the model was failing on the exact same entity over and over again, a particular organization name. So it's also worth looking at your evaluation data to make sure that it's diverse and accurately represents the data that's actually going to run in production, because it turned out Primer did have a lot of non-English sources, or non-English-speaking countries writing in English, that tended to be very different. A lot of what I've been doing is just looking at the data. Especially in traditional NLP, where you're doing an information extraction task, a lot of people think, “I can use quantitative metrics, so I'm just going to look at the F1 score and precision and recall.” But it actually helps a lot, from a qualitative perspective, to look at the actual results and see where exactly the model is failing, and which examples it is failing on, because that points you directly to the kinds of data you then need to collect to fix the model. Just looking at the data is very underrated as a strategy. It doesn't super-scale, but at the beginning, when you're trying to get your thing off the ground and make sure that it works, it works very well.
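The "same entity over and over" failure mode Charlene describes is easy to surface mechanically: count errors per gold entity and look at the top offenders. The entity names here are hypothetical examples, not the actual entities from that evaluation:

```python
from collections import Counter

# Hypothetical (gold_entity, predicted_entity) pairs from an NER eval run;
# None means the model missed the entity entirely.
errors = [
    ("Xinhua News Agency", None),
    ("Xinhua News Agency", None),
    ("Xinhua News Agency", "Xinhua"),
    ("New York Times", "New York"),
    ("Xinhua News Agency", None),
]

# Count which gold entities the model fails on most often: a handful of
# repeated offenders usually points straight at missing training data.
failure_counts = Counter(gold for gold, pred in errors if pred != gold)
for entity, n in failure_counts.most_common(3):
    print(entity, n)
```

A skewed histogram like this is the quantitative version of "looking at the data": one entity dominating the error count tells you exactly which examples to collect next.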
8:40 Tips and tricks you learned to overcome challenges with putting LLMs into production?
8:41 Nina Lopatina, Director of Data Science, Nurdle
One of the things we had to do at Spectrum Labs was to see how the model was performing on more recent data, as well as on a gold-standard dataset, to check that we weren't seeing unexpected drops. Testing on more recent data matters because language and slang change so rapidly; things can shift pretty quickly, similar to what Deepak mentioned about COVID becoming a new term. We had a number of tests that we would run before we put any updates out.
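Pre-release tests of the kind Nina describes can be expressed as simple gates: no regression beyond a tolerance on the frozen gold set, and a minimum score on freshly sampled data. The function, thresholds, and numbers below are an illustrative sketch, not Spectrum Labs' actual checks:

```python
# Hypothetical pre-release checks: the candidate model must not regress on
# a frozen gold-standard set, and must clear a floor on recent data.
def release_checks(gold_f1, recent_f1, baseline_gold_f1,
                   max_gold_drop=0.01, recent_floor=0.80):
    checks = {
        # Frozen gold set: small drops tolerated, large ones block release.
        "gold_regression": gold_f1 >= baseline_gold_f1 - max_gold_drop,
        # Freshly sampled data: catches drift from new slang and topics.
        "recent_data_floor": recent_f1 >= recent_floor,
    }
    return all(checks.values()), checks

ok, detail = release_checks(gold_f1=0.91, recent_f1=0.78, baseline_gold_f1=0.90)
print(ok, detail)
```

Here the candidate improves on the gold set but still fails the recent-data floor, which is precisely the case a gold-set-only check would miss.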
Deepak Ramachandran, Staff Research Engineer, Google
You have these automatic metrics on a benchmark. It's really hard to use them reliably as a way to know whether model version x is better than model version x minus one. So at the beginning of BARD, we ended up doing a lot of human evals. We basically took over the entire pool of raters that Google Search uses for deciding whether one search result is better than another. And that became very, very expensive and very slow. So we trained models, versions of BARD, that were optimized for actually making these judgments, deciding whether output A was better than output B. That turned out to be very, very useful. But you can't rely too much on them. You can build a model that mimics a rater accurately for model version x, but at the point at which model version x becomes x plus k, the correlation between the autorater's ratings and the human raters' ratings starts going negative, because it's a different set of outputs from a different model, and the autorater just doesn't know how to rate them anymore. So this was a research problem in itself, keeping autoraters in line with the actual model. But it's an invaluable technique that I think is going to be more prevalent in the future, because the old way of evaluating really strong models on static datasets just doesn't work anymore. You get 99% really fast, and it's meaningless.
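The drift Deepak describes can be monitored by periodically scoring the same outputs with both humans and the autorater and tracking their correlation. The rating values below are invented to show the healthy and stale cases:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two parallel rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical preference scores on the same set of model outputs.
human = [0.9, 0.2, 0.7, 0.4, 0.8]
autorater_fresh = [0.8, 0.3, 0.6, 0.5, 0.9]  # trained on model version x
autorater_stale = [0.1, 0.8, 0.3, 0.9, 0.2]  # same autorater, outputs from x+k

print(round(pearson(human, autorater_fresh), 2))
print(round(pearson(human, autorater_stale), 2))
```

A high correlation means the autorater can stand in for humans on this model version; a correlation near or below zero is the signal to re-calibrate the autorater with a fresh round of human ratings.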
11:00 How do you evaluate whether or not the data you're using are good enough for your users?
11:02 Deepak Ramachandran, Staff Research Engineer, Google
It's useful to have continuous metrics on the training data, to answer questions like: am I getting enough training data from the domains that I really care about? Has the average sentence length suddenly become very short? It's also important to continuously monitor what is a very complex pipeline, because pipelines break without you realizing it. If someone forgets to renew the license to some database that you're ingesting, suddenly you're not getting data from that database anymore. So a lot of an ML engineer's time actually gets spent on the data part.
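The continuous data checks Deepak mentions amount to computing a few summary statistics on each incoming batch and alerting on anomalies. The domain names, example texts, and thresholds below are hypothetical:

```python
# Hypothetical batch of incoming training examples with a domain tag.
batch = [
    {"text": "Claim denied due to missing documentation.", "domain": "medical_claims"},
    {"text": "Ok.", "domain": "chat"},
    {"text": "Coverage starts on the first of the month.", "domain": "medical_claims"},
]

def data_health(batch, min_per_domain=1, min_avg_words=3.0):
    """Continuous checks on the training-data feed, not the model."""
    counts = {}
    total_words = 0
    for ex in batch:
        counts[ex["domain"]] = counts.get(ex["domain"], 0) + 1
        total_words += len(ex["text"].split())
    avg_words = total_words / len(batch)
    alerts = []
    for domain, n in counts.items():
        if n < min_per_domain:
            alerts.append(f"too few examples from {domain}")
    if avg_words < min_avg_words:
        alerts.append("average example length suspiciously short")
    return {"counts": counts, "avg_words": avg_words, "alerts": alerts}

report = data_health(batch)
print(report)
```

Run on every batch, a check like this catches the silent failures Deepak describes, such as a whole domain vanishing when an upstream license lapses, before the model ever retrains on the broken feed.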
11:55 Nina Lopatina, Director of Data Science, Nurdle
In an earlier role, prior to Spectrum Labs, I did a lot of work with open-source data. That was definitely a place where I would look at the data first. In some cases, such as when I was working on a machine translation quality estimation task, it was Russian into English, and I speak both. I looked at the data and noticed it was not really useful for evaluating that task: some of the translations were really bad, yet the scores were still high. The very first step was just looking at a few lines and seeing if they made sense. Then at Spectrum Labs, where the data we had was higher quality than open-source data sets, we had a lot of measurements around quantities: we had to make sure that we had enough trues, and enough across languages. And for synthetic data, we look at a lot of metrics around the diversity of the data, because that is really important. If you just have 10,000 rows of very similar things, that's not going to help you. So even though BLEU tends not to be a useful metric by itself, self-BLEU, which compares all the rows of data against each other, is quite helpful, as are embedding-based methods that capture the same thing.
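The idea behind the diversity check Nina describes can be sketched with a simplified pairwise n-gram overlap; this is a stand-in in the spirit of self-BLEU, not the actual BLEU computation, and the example rows are invented:

```python
from itertools import combinations

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def self_similarity(rows):
    """Mean pairwise bigram overlap (Jaccard) across generated rows.
    High values suggest the synthetic data is repetitive; low values
    suggest diversity. A simplified stand-in for self-BLEU."""
    scores = []
    for a, b in combinations(rows, 2):
        ga, gb = bigrams(a), bigrams(b)
        if ga or gb:
            scores.append(len(ga & gb) / len(ga | gb))
    return sum(scores) / len(scores)

repetitive = ["this game is so fun", "this game is so fun", "this game is so good"]
diverse = ["this game is so fun", "anyone up for ranked tonight", "gg wp that was close"]
print(round(self_similarity(repetitive), 2))
print(round(self_similarity(diverse), 2))
```

Embedding-based variants replace the bigram sets with sentence vectors and cosine similarity, which catches paraphrases that share no surface n-grams.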
13:29 Charlene Chambliss, Senior Software Engineer, Aquarium
I like to think about it in a very top-down way: what do users actually want to do with this, and what are they going to care about in terms of the outputs and the quality of the outputs? For something like named entity recognition at Primer, the audience was largely intelligence analysts, and they cared a lot about the quality of the people and places predictions of the named entity recognition model, and not as much about the miscellaneous category of various product names. So thinking about it from the perspective of your user can really help you focus on which areas of the model are most important to improve, because otherwise you could just dump endless time into improving your dataset. And actually, related to what you were saying, I have managed to make a model worse by giving it more data.