1:09 Darren
We do a lot of fine-tuning of open-source foundation models. Our basic premise is that smaller, specialized open-source models, when they're fine-tuned appropriately, can deliver performance that is state of the art, as good or almost as good as the OpenAIs and the Anthropics, at a dramatically lower cost, in a way that is fit for purpose and can be integrated much more easily into a company's workflow. So we have fine-tuned models from about a billion parameters to 7 billion, some 13 and 20 billion, and we've put a whole collection out into open source: a CPU-oriented set of models. Because when you're doing something with sensitive data, you actually want to be able to do it on your laptop without sending sensitive company information out to a public API. We were still trying to evaluate whether this would even work, so we built a series of instruction-fine-tuned models, from one billion to 3 billion parameters, that run on a laptop. That way you can do RAG where all the data, all the documents, and all the embeddings never leave your laptop. So we've done a lot of work with those kinds of smaller models, evaluating what they can and can't do.
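To make that concrete, here is a minimal sketch of the kind of fully local RAG loop Darren describes, where the documents and embeddings stay on the machine. The model name, sample documents, and helper function are illustrative assumptions, not his actual stack.

```python
# Minimal local RAG sketch: embeddings and documents never leave the machine.
# Assumes sentence-transformers is installed; model name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

# Small embedding model that runs comfortably on a laptop CPU.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

documents = [
    "Q3 revenue grew 12% year over year, driven by subscriptions.",
    "The incident report cites a misconfigured load balancer.",
    "Employee handbook: remote work requires manager approval.",
]

# Embed the corpus once; store vectors locally (here, just in memory).
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# The retrieved passages would then be passed as grounding context to a small
# local instruct model (e.g., a 1B-3B parameter model run via llama.cpp).
print(retrieve("What caused the outage?"))
```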
2:33 Chris
We know that LLMs will hallucinate wildly, so using them as a way to fact-check is actually really dangerous. So instead of the generated text, what I tend to use is the embeddings. Specifically, I'm working on a bunch of research problems around detecting large-scale misinformation campaigns, using the embeddings to link very different sets of claims that might be made by the same group or person but span disparate topics. If you're a Russian troll farm, you may be discussing something about a volcano going off in Iceland, but you also might be talking about Trump; to a regular human, we wouldn't be able to connect those. But using the embedding models, we may be able to group those within a large geometric space and surface those connections a little better.
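As an illustration of the technique Chris describes, grouping disparate claims by how close their embeddings sit in that geometric space, here is a small sketch. The claims, model choice, and clustering threshold are all hypothetical, not his actual research pipeline.

```python
# Sketch of linking disparate claims via embedding proximity; illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

claims = [
    "The Iceland volcano eruption was staged by Western media.",
    "Reports of the eruption in Iceland are fabricated.",
    "Trump's rally drew the largest crowd in state history.",
    "Crowd estimates at the Trump rally were suppressed by the press.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(claims, normalize_embeddings=True)

# Group claims whose embeddings sit close together; clusters can surface
# topically or stylistically linked messaging across otherwise unrelated posts.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, metric="cosine", linkage="average"
)
labels = clustering.fit_predict(vectors)
for claim, label in zip(claims, labels):
    print(label, claim)
```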
3:38 Leo
At Warner Bros, we have lots of video content to discover: a lot of news, sports, and entertainment. That also comes with text, and obviously, as we're speaking, all of this can be transcribed into text; therefore, you have this entire archive of text. Not to mention that lots of movies and shows generate lots of content online: your reactions to watching Game of Thrones, the plot descriptions, somebody publishing an article about our IP. That's a lot of text to grab from different sources. So one of the very easy, low-hanging-fruit use cases is to have a normalized process for describing the content across the gamut of various creative forms. Our general philosophy for utilizing not just LLMs but machine learning in general is to involve human experts. What we're trying to deploy, or trying to experiment with, is converting a totally manual process into a human-machine process where the data is fact-checked, and, not only that, where you use the ground-truth labels to continue training your LLMs or your other machine learning models so they keep performing better, and specifically perform better in your space: your content space and your data space.
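A schematic sketch of that human-in-the-loop cycle: the model drafts descriptions, an expert verifies them, and the verified labels accumulate as ground truth for later fine-tuning. Every name here is a hypothetical stand-in, not Warner Bros' actual pipeline.

```python
# Human-in-the-loop labeling loop: model drafts, expert verifies, and the
# verified pairs feed the next fine-tuning round. All stand-ins, for shape only.
from dataclasses import dataclass, field

@dataclass
class LabelStore:
    examples: list[tuple[str, str]] = field(default_factory=list)

def draft_description(content: str) -> str:
    # Stand-in for a call to a fine-tuned description model.
    return f"Auto-draft description of: {content}"

def expert_review(draft: str) -> str:
    # Stand-in for a human expert correcting or approving the draft.
    return draft.replace("Auto-draft", "Verified")

def human_in_the_loop(batch: list[str], store: LabelStore) -> None:
    for item in batch:
        draft = draft_description(item)
        verified = expert_review(draft)          # fact-checked ground truth
        store.examples.append((item, verified))  # feeds the next fine-tune
    # Once enough verified examples accumulate, a fine-tuning run would use
    # them so the model keeps improving on this specific content space.

store = LabelStore()
human_in_the_loop(["Game of Thrones S1E1", "Evening news segment"], store)
print(store.examples)
```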
5:20 Nina
At Nurdle, we're creating content and text for people who are often working on classifying user-generated content, which is a very different style from what many of the assistant-like models were trained on. So we ended up needing to fine-tune models in order to be able to create data like that because it's just not something that we can find without fine-tuning.
5:48 Leo
General machine learning techniques can automate video or text editing tasks that were previously done manually. However, if we consider how ChatGPT was constructed, it is designed to optimize natural language, to make it feel like you're having a real conversation with someone. These models generate a lot of text, even when it's not necessary, to provide context. On the other hand, when you introduce a level of automation, machine learning tasks are very specific and require human verification. Using a large LLM can cause hallucinations and add unnecessary context. Moreover, it can be slow to run and require expensive in-house GPU resources. These issues can cause problems where the system fails to deliver.
So we start to turn to the smaller proprietary models, and also the open-source ones, rather than just using OpenAI. With a public API, you're sending out very proprietary and sensitive data, and oftentimes creative data where the rights don't belong to our company; they belong to the creators. Those are instances where you don't want to send that over to OpenAI, even with your NDAs or whatever agreements are in place. So we find ourselves having to host a lot of these models in-house and fine-tune them, such that we end up owning the models themselves as IP, not just the data.
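For illustration, here is a minimal sketch of running a small open-source instruct model entirely in-house, so sensitive or rights-encumbered text never leaves your infrastructure. The model name and prompt are assumptions; any locally hosted instruct model would work the same way.

```python
# Running a small open-source model locally via the transformers pipeline API.
# Nothing in the prompt ever leaves the host.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small enough for modest hardware
    device_map="auto",
)

prompt = (
    "Summarize the following scene description in one sentence:\n"
    "INT. NEWSROOM - NIGHT. An editor reviews breaking footage...\n"
)
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```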
8:04 Chris
One of the important things we need to do is look at customer complaint data on social media against, for example, regulated banks. Customers might be complaining that their deposits are not showing up, their remittances from abroad aren't arriving, or the ATMs have stopped working, which can be indicative of something going wrong. But especially when you're a small, new startup like us, that would essentially require manual tagging of every single comment, and that's tens of thousands, hundreds of thousands of comments.
So we need things like synthetic data, and that's why we're working with Nurdle on the pilot: so that we can train not just LLMs but other types of machine learning models as well, because this data is good for machine learning broadly, and LLMs are just one very specific part of it. We can train these models to better recognize and classify comments we might not have seen, without having to hire a ton of people with financial knowledge, because unfortunately you need that experience to be able to go through and tag every single one.
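A small sketch of what that could look like in practice: training a lightweight complaint classifier on synthetic, labeled data. The example texts, labels, and model choice are purely illustrative, not the actual pilot.

```python
# Training a lightweight classifier on synthetic labeled complaint data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic (text, label) pairs standing in for generated training data.
texts = [
    "My deposit from last week still hasn't shown up in my account.",
    "The remittance my family sent from abroad never arrived.",
    "Every ATM in my neighborhood has been down for two days.",
    "Love the new mobile app design, great job!",
]
labels = ["deposit_issue", "remittance_issue", "atm_outage", "not_a_complaint"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Classify a comment the team may never have seen before.
print(model.predict(["ATM ate my card and the branch is closed"]))
```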
9:25 Darren
When you look at what's led to so many of the dramatic improvements we've seen, both in the models and in the open-source community, I think you have to say the biggest contributor to LLMs has been data. And I think that shows up in three places. First, in the base pre-training of the model: data is everything in the sense that it ultimately gives the model its foundational capability. Second, data is everything when it comes to fine-tuning; it's a data exercise. Once you have your secret recipe, your secret sauce, your secret data, you can get a model to exhibit almost whatever behavior you want if you've designed that dataset the right way. Then, finally, data is everything because it's your data, a company's data. Most of the interesting work that will be done with LLMs in the enterprise will, in one form or another, tie into private knowledge from that enterprise: working knowledge of that data and, ultimately, the application of model inference to that data. That's really what's going to enable all the really interesting stuff with AI.
11:33 Nina
At Nurdle, one of our big concerns is the license of the model we're using. Depending on what our design partners are interested in, if it's data for fine-tuning a model or for a classifier, then as long as the model has a commercial license, that's okay. But if we're creating data for someone to pre-train their foundation model, we must abide by the licenses, so some models are off the table in that scenario. We ended up having to choose a few different models to work with, which of course brings some stumbling blocks as you try to shift your workflows between them. At Nurdle we don't have a lot of other data coming in that we have to worry about, but at Spectrum it was exactly the opposite: we were moderating 15,000 records per second for six years. Fortunately, we could keep all that data to use as the seed for some of our fine-tuning, since it spans a wide range of topics and languages. But we had that data PII-scrubbed before it ever touched any system that anyone at the company had access to.
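For illustration, here is a minimal sketch of the kind of PII scrubbing step Nina describes, applied before records enter any internal system. Real pipelines use far more thorough detectors (for example, NER-based tools); these regex patterns are assumptions for demonstration only.

```python
# Minimal PII scrub applied at ingestion, before data touches internal systems.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace detected PII with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Contact Jane at jane.doe@example.com or +1 (555) 867-5309."
print(scrub(record))  # -> "Contact Jane at [EMAIL] or [PHONE]."
```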
12:49 Chris
The nice thing is that all the data I work with is naturally public. The big difference for us is regulatory variation across jurisdictions; very little of the work I do is based stateside. But when I have to deal with the Philippines and Ghana in the same 12 hours, it gets really complicated and very expensive, because they have very specific differences in their data privacy laws: one doesn't have one, another is restrictive but only around specific topics. That's why I really like synthetic data, because then I don't have to worry about those problems, especially when I'm testing a hypothesis, maybe not in full production, just to see if it might actually work. Then I can start calling up lawyers and various regulators to make sure that what we're doing is kosher.
13:54 Leo
So thinking about PII: obviously there's GDPR and CCPA, which very much govern consumer data. Your digital footprint, purchase history, transactions, your biometrics, those should belong to you, and you can file a request to have them deleted and so on. But the flip side, now that we're working with LLMs and much more text data, is that a lot of this text, and the images behind all these image models, came from creative talent. That raises a question not just about who owns the data, but about who owns the resulting models trained or fine-tuned on that data and, very importantly, who gets to profit from it. The consumer side has been built up through very strict governance; these are the new challenges we're considering in the LLM industry.