2 Your first data program in PySpark

This chapter covers

  • Launching and using the pyspark shell for interactive development
  • Reading and ingesting data into a data frame
  • Exploring data using the DataFrame structure
  • Selecting columns using the select() method
  • Reshaping single-nested data into distinct records using explode()
  • Applying simple functions to your columns to modify the data they contain
  • Filtering records using the where() method

Data-driven applications, no matter how complex, all boil down to what we can think of as three meta steps, which are easy to distinguish in a program:

  1. We start by loading or reading the data we wish to work with.

  2. We transform the data, either via a few simple instructions or a very complex machine learning ...
