1 Introduction

This chapter covers

  • What PySpark is
  • Why PySpark is a useful tool for analytics
  • The versatility of the Spark platform and its limitations
  • PySpark’s way of processing data

According to pretty much every news outlet, data is everything, everywhere. It’s the new oil, the new electricity, the new gold, plutonium, even bacon! We call it powerful, intangible, precious, dangerous. At the same time, data itself is not enough: it is what you do with it that matters. After all, for a computer, any piece of data is a collection of zeroes and ones, and it is our responsibility, as users, to make sense of how it translates to something useful.

Just like oil, electricity, gold, plutonium, and bacon (especially bacon!), our appetite for data ...

