Data Analysis with Python and PySpark

by Jonathan Rioux
March 2022
Beginner to intermediate
456 pages
13h
English
Manning Publications

3 Submitting and scaling your first PySpark program

This chapter covers

  • Summarizing data using groupby and a simple aggregate function
  • Ordering results for display
  • Writing data from a data frame
  • Using spark-submit to launch your program in batch mode
  • Simplifying PySpark writing using method chaining
  • Scaling your program to multiple files at once

Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to only keep lowercase words. If we bring out our outline, we only have steps 4 and 5 to complete:

  1. [DONE] Read: Read the input data (we’re assuming a plain text file).

  2. [DONE] Token: Tokenize each word.

  3. [DONE] Clean: Remove any punctuation and/or tokens ...
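
To make the shape of the word frequency program concrete, here is a minimal sketch in plain Python. It is a stand-in that uses the standard library, not PySpark and not the book's code. The function name `word_frequencies`, the `top_n` parameter, and the cleaning regular expression are assumptions made for illustration. The tokenize and clean steps mirror the outline above; the count and order steps follow this chapter's topics (summarizing with a simple aggregate, ordering results for display).

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10):
    """Return the top_n (word, count) pairs found in text.

    Hypothetical helper for illustration only; plain Python, not PySpark.
    """
    # Tokenize: split the input on whitespace.
    tokens = text.split()

    # Clean: lowercase each token and keep its first run of letters a-z,
    # dropping tokens that contain no letters at all (assumed cleaning rule).
    cleaned = []
    for token in tokens:
        match = re.search(r"[a-z]+", token.lower())
        if match:
            cleaned.append(match.group(0))

    # Summarize: count the occurrences of each word.
    counts = Counter(cleaned)

    # Order: most frequent first, ties broken alphabetically so the
    # result is deterministic.
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:top_n]


if __name__ == "__main__":
    sample = "The cat and the hat. The END!"
    for word, count in word_frequencies(sample, top_n=3):
        print(word, count)
```

On a cluster, the same steps would run as data frame operations instead of Python loops, so the work can be distributed across many files.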


ISBN: 9781617297205