Chapter 4. Data Parsimony
“Data is the new oil” was a common idiom in the early 2010s, used in the context of generating value from digital data. It also, unintentionally, captures the growing carbon footprint of storing and processing vast amounts of data. Lifecycle emissions for hard drive storage are estimated at anywhere between 2 and 20 kg CO2e per terabyte per year, as Figure 4-1 illustrates for commonly used storage devices.
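These per-terabyte figures make it easy to bound the storage footprint of a given dataset. The following sketch (illustrative only; the function name and the 100 TB corpus are our own assumptions, while the 2–20 kg CO2e per terabyte-year range comes from the estimate above) computes the low and high ends of that bound:

```python
# Bound the annual storage emissions of a dataset using the
# 2-20 kg CO2e per terabyte-year lifecycle range cited above
# for hard drive storage.

LOW_KG_PER_TB_YEAR = 2.0    # optimistic lifecycle estimate
HIGH_KG_PER_TB_YEAR = 20.0  # pessimistic lifecycle estimate

def storage_footprint_kg(terabytes: float, years: float = 1.0) -> tuple[float, float]:
    """Return (low, high) estimated emissions in kg CO2e for storing
    `terabytes` of data on hard drives for `years` years."""
    return (terabytes * years * LOW_KG_PER_TB_YEAR,
            terabytes * years * HIGH_KG_PER_TB_YEAR)

# Example: a hypothetical 100 TB training corpus retained for 3 years.
low, high = storage_footprint_kg(100, years=3)
print(f"{low:.0f}-{high:.0f} kg CO2e")  # prints "600-6000 kg CO2e"
```

Even this rough bound shows why trimming a corpus before long-term retention is worthwhile: halving the stored data halves the storage footprint across the entire range of estimates.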
Figure 4-1. Typical GHG emissions across the lifecycle of storage devices. (Source: Seagate Sustainability Report.)
Large-scale computation on massive amounts of data has been essential to progress in AI model development, with the most recent LLMs trained on datasets of more than 15 trillion data points (tokens).1 Not all of the data used to train ML models is informative, however. Uninformative or duplicate data contributes to the AI waste introduced in Chapter 3. Reducing the amount of data used can considerably lower the energy consumption and carbon footprint of selecting and developing AI models.
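One of the simplest ways to shrink a dataset without losing information is to remove exact (or near-exact) duplicates before training. As a minimal sketch, not a method from this book, here is hash-based exact deduplication of a text corpus; the light normalization (stripping whitespace, lowercasing) is an assumption about what counts as a duplicate:

```python
import hashlib

def deduplicate(samples: list[str]) -> list[str]:
    """Drop exact duplicates, keeping the first occurrence of each sample.

    Samples are compared via a SHA-256 hash of their normalized text
    (whitespace-stripped, lowercased), so trivially repeated records
    are removed without holding every full string in a comparison set.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for text in samples:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

corpus = ["The cat sat.", "the cat sat.  ", "A dog ran."]
print(deduplicate(corpus))  # prints "['The cat sat.', 'A dog ran.']"
```

Exact deduplication only scratches the surface; the data-parsimony methods discussed in this chapter go further by identifying which of the remaining, non-duplicate points are actually informative.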
In this chapter, we introduce methods for identifying informative data points and extracting useful information from them. The chapter offers a paradigm for developing DL models while reducing AI waste from a data perspective, which we refer to as data parsimony. It ...