Chapter 5. AI Data Pipeline

In God we trust; all others bring data.

W. Edwards Deming

There are now more mobile devices than people on the planet, and each one collects data every second on our habits, physical activity, locations, and daily preferences. Every day, we create 2.5 quintillion bytes of data, and it comes from everywhere: IoT sensors in the home, social media posts, pictures, videos, purchase transactions, and GPS location data that tracks our every move.

Data is even touted as more important and valuable than oil. For that reason, companies are building vast repositories of raw data, both historical and real-time, typically called data lakes. Applying AI to this enormous quantity of data is a goal for many companies across industries. To do so, you have to pick the right set of tools, not only to store the data but also to access it as efficiently as possible. These tools are evolving, and how you store and present your data must change accordingly; failing to adapt will leave you and your data behind. The payoff is measurable: a study by MIT professor Erik Brynjolfsson found that firms using data-driven decision making are 5% more productive and profitable than their competitors. Additional research shows that organizations using analytics see a payback of $9.01 for every dollar spent.

As we’ve seen so far, if large amounts of high-quality data ...
