3 Submitting and scaling your first PySpark program

This chapter covers

  • Summarizing data using groupby and a simple aggregate function
  • Ordering results for display
  • Writing data from a data frame
  • Using spark-submit to launch your program in batch mode
  • Simplifying PySpark writing using method chaining
  • Scaling your program to multiple files at once

Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to keep only lowercase words. If we bring out our outline, only steps 4 and 5 remain to complete:

  1. [DONE] Read: Read the input data (we’re assuming a plain text file).

  2. [DONE] Token: Tokenize each word.

  3. [DONE] Clean: Remove any punctuation and/or tokens that aren’t words; lowercase each word.

  4. Count: Count the frequency of each word present in the text.

  5. Answer: Return the most frequent words, ordered for display.
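
To give a feel for where we’re headed, here is a minimal sketch of steps 4 and 5 in PySpark. The data frame and column names (words_clean, word) are placeholders standing in for the cleaned data frame produced in chapter 2; groupby, count, orderBy, show, and write are standard PySpark data frame methods.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("word_count").getOrCreate()

    # Placeholder for the cleaned, one-column data frame built in chapter 2.
    words_clean = spark.createDataFrame(
        [["the"], ["cat"], ["sat"], ["the"]], ["word"]
    )

    # Step 4 (count): group identical words, then count each group.
    word_counts = words_clean.groupby("word").count()

    # Step 5 (answer): order by descending frequency and show the top words.
    word_counts.orderBy(F.col("count").desc()).show(10)

    # Writing the results out as CSV (the path is illustrative).
    word_counts.write.csv("./results.csv")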

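The same two steps also illustrate the method-chaining style this chapter promotes. Because each transformation returns a new data frame, the intermediate variable can be dropped and the steps read top to bottom as one expression. A sketch, continuing from the code above with the same placeholder names:

    # Chained version: each method returns a data frame, so the
    # whole pipeline can be written as a single expression.
    results = (
        words_clean.groupby("word")
        .count()
        .orderBy(F.col("count").desc())
    )
    results.show(10)

Saved as a script (say, word_count.py; the name is illustrative), such a program runs in batch mode with spark-submit word_count.py rather than through the interactive shell. Scaling to many input files needs no new code either: spark.read.text() accepts a directory or a glob pattern such as data/*.txt.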