2 Your first data program in PySpark
This chapter covers
- Launching and using the pyspark shell for interactive development
- Reading and ingesting data into a data frame
- Exploring data using the DataFrame structure
- Selecting columns using the select() method
- Reshaping single-nested data into distinct records using explode()
- Applying simple functions to your columns to modify the data they contain
- Filtering columns using the where() method
Data-driven applications, no matter how complex, all boil down to what we can think of as three meta steps, which are easy to distinguish in a program:
- We start by loading or reading the data we wish to work with.
- We transform the data, either via a few simple instructions or a very complex machine learning ...