Chapter 5. Secure and Trustworthy Data Generation
Anyone who has spent even a little time in machine learning understands the importance of data. Still, it’s underappreciated just how important data is. In 2020, OpenAI published a paper on the scaling laws of large language models and concluded that scaling up model size would be enough to get more capable models on a given dataset.1 However, with its 2022 Chinchilla paper, DeepMind demonstrated that parameters alone don’t make the model: the dataset needs to scale with the size of the model.2 While the evidence for this scaling law is compelling, in practice many teams are far more constrained by data than by parameter count.
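To make the Chinchilla result concrete, here is a minimal sketch of the compute-optimal heuristic it implies: training tokens should grow roughly linearly with parameter count, at about 20 tokens per parameter. The ratio below is an approximation of the paper’s fitted results, not an exact prescription.

    # A rough sketch of the compute-optimal heuristic implied by the
    # Chinchilla paper: training tokens scale roughly linearly with
    # parameters, at about 20 tokens per parameter. The constant is an
    # approximation, not an exact fitted value.
    TOKENS_PER_PARAM = 20

    def compute_optimal_tokens(n_params: float) -> float:
        """Approximate number of training tokens for a given model size."""
        return TOKENS_PER_PARAM * n_params

    for n_params in (1e9, 10e9, 70e9):
        tokens = compute_optimal_tokens(n_params)
        print(f"{n_params / 1e9:>4.0f}B params -> ~{tokens / 1e12:.2f}T tokens")

    # 1B  -> ~0.02T tokens
    # 10B -> ~0.20T tokens
    # 70B -> ~1.40T tokens

Chinchilla itself, a 70-billion-parameter model, was trained on roughly 1.4 trillion tokens, which is where this 20:1 rule of thumb comes from.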
One of the most common barriers on any machine learning project is simply getting sufficient training data for your models.3 Data acquisition can be challenging because the sample you collect needs to be large enough to be representative of the entire population of data, as the sketch below makes concrete. In some cases, the challenge is getting any data at all to feed into the model. With these concerns in mind, it’s important to remember the common pitfalls of sourcing datasets.
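How large is “large enough”? One classical way to put a floor on it is the standard sample-size formula for estimating a proportion, n = z^2 * p(1 - p) / e^2. The 95% confidence level and 5% margin of error below are illustrative assumptions, not values from this chapter.

    import math

    def min_sample_size(margin_of_error: float = 0.05,
                        confidence_z: float = 1.96,
                        proportion: float = 0.5) -> int:
        """Minimum samples to estimate a class proportion within the given
        margin of error; proportion=0.5 is the worst (most conservative) case."""
        n = (confidence_z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
        return math.ceil(n)

    print(min_sample_size())        # about 385 samples for +/-5% at 95% confidence
    print(min_sample_size(0.01))    # about 9,604 samples for +/-1%

Note that this only bounds the sampling error for a single proportion; being representative across many features and subgroups typically demands far more data than such a formula suggests.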
Earlier in this book, we covered a few examples of failure cases for machine learning models. These include societal and non-societal biases (see Chapter 2), focusing on the wrong features in the data (see Chapter 3), and failing to capture the full distribution of a phenomenon ...