Preface
Apache Spark’s long lineage of predecessors, from MPI (Message Passing Interface) to MapReduce, made it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, the field of big data has become so closely tied to them that its scope is defined by what these frameworks can handle. Spark’s original promise was to take this a little further: to make writing distributed programs feel like writing regular programs.
The rise in Spark’s popularity coincided with that of the Python data (PyData) ecosystem, so it makes sense that Spark’s Python API, PySpark, has grown significantly in popularity over the last few years. Although the PyData ecosystem has recently produced distributed programming options of its own, Apache Spark remains one of the most popular choices for working with large datasets across industries and domains. Thanks to recent efforts to integrate PySpark with the other PyData tools, learning the framework can significantly boost your productivity as a data science practitioner.
We think that the best way to teach data science is by example. To that end, we have put together a book of applications, trying to touch on the interactions between the most common algorithms, datasets, and design patterns in large-scale analytics. This book isn’t meant to be read cover to cover: page to a chapter that looks like something you’re trying to accomplish, or that simply ignites your interest, and start there.