Chapter 1. Distributed Machine Learning Terminology and Concepts
Remember when data scientists ran their machine learning algorithms on datasets that fit in a laptop’s memory? Or generated their own data? It wasn’t because of a lack of data in the world; we had already entered the Zettabyte Era.1 For many, the data was there, but it was locked in the production systems that created, captured, copied, and processed data at a massive scale. Data scientists knew that gaining access to it would allow them to build better, more insightful machine learning models. But this wasn’t the only problem. What about computation? In many cases, data scientists didn’t have access to sufficient computing power or tools to support running machine learning algorithms on large datasets. Because of this, they had to sample their data and work with CSV or text files.
When the public cloud revolution occurred around 2016–2017, we could finally get hold of that desired computation capacity. All we needed was a credit card in one hand and a computer mouse in the other. One click of a button and boom, hundreds of machines were available to us! But we still lacked the proper open source tools to process huge amounts of data. There was a need for distributed compute and for automated tools with a healthy ecosystem.
The growth in digitalization, where businesses use digital technologies to change their business model and create new revenue streams and value-producing opportunities, increased data scientists’ ...