March 2022
Beginner to intermediate
456 pages
13h
English
While computers have been getting more powerful and more capable of chewing through ever-larger data sets, our appetite for data has grown even faster. Consequently, we built new tools to scale big data jobs across multiple machines. This did not come for free: early tools required users to manage not only the data program itself but also the health and performance of the cluster of machines running it. I recall trying to scale my own programs, only to be met with the advice to “just sample your data set and get on with your day.”
PySpark changes the game. Starting with the popular Python programming language, it provides a clear and readable API to manipulate very large data sets. Still, while in the ...