Chances are, the data you’re looking for already exists - or at least a small set you can use to start your project. Specifically, repository sites like AWS, Google, Hugging Face, Kaggle, and the UCI Machine Learning Repository are a goldmine of data and models you can repurpose for your projects. Hugging Face, in particular, has 82,000+ datasets and hundreds of thousands of pre-trained models that are ripe for the picking. Kaggle, by contrast, is a data science competition site where teams compete to improve models.
Instead of starting from scratch, you can source a few models from Hugging Face and test them against your use cases. Once you’ve landed on one that works well (or even just okay), you can fine-tune it using datasets from the Hugging Face vault. Hugging Face also has an ‘AutoTrain’ feature that simplifies the process of training and deploying AI models. It’s a good fit for intermediate developers who want to launch their projects quickly and efficiently.
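To make the "test a few models" step concrete, here is a minimal Python sketch of one way to compare candidates on a handful of your own labeled examples. The helper names (`accuracy`, `rank_candidates`, `hub_predictor`, `compare_hub_models`) are hypothetical and only for illustration. The Hub part assumes the `transformers` library is installed and that each model is a text-classification model whose labels match the ones in your examples. Treat it as a starting point rather than a full evaluation setup.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (input text, expected label)
Predictor = Callable[[str], str]   # maps input text to a predicted label


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label equals the expected label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def rank_candidates(
    candidates: Dict[str, Predictor], examples: List[Example]
) -> List[Tuple[str, float]]:
    """Score every candidate and return (name, accuracy) pairs, best first."""
    scores = [(name, accuracy(fn, examples)) for name, fn in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def hub_predictor(model_id: str) -> Predictor:
    """Wrap a Hugging Face text-classification pipeline as a Predictor.

    Assumption: `transformers` is installed and `model_id` names a
    text-classification model on the Hub. Imported lazily so the scoring
    helpers above work without the library.
    """
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # For a single string, the pipeline returns a list with one
    # {"label": ..., "score": ...} dict.
    return lambda text: classifier(text)[0]["label"]


def compare_hub_models(
    model_ids: List[str], examples: List[Example]
) -> List[Tuple[str, float]]:
    """Download each model from the Hub and rank them on your examples."""
    candidates = {model_id: hub_predictor(model_id) for model_id in model_ids}
    return rank_candidates(candidates, examples)
```

For example, calling `compare_hub_models(["distilbert-base-uncased-finetuned-sst-2-english"], examples)` with a few `("text", "POSITIVE")` / `("text", "NEGATIVE")` pairs would download that sentiment model and score it. Because the scoring helpers accept any callable, you can also rank a fine-tuned model against the original with the same code.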
However, repository data has its limits. Datasets from online repositories are generally made for testing rather than widespread commercial use. They often won’t cover the use cases, industry knowledge, or edge cases your AI project needs to be accurate enough for production. So, once you’ve built the bones with Hugging Face assets, you’ll still need to source additional datasets elsewhere to fine-tune your model.