3 Submitting and scaling your first PySpark program

This chapter covers

  • Summarizing data using groupby and a simple aggregate function
  • Ordering results for display
  • Writing data from a data frame
  • Using spark-submit to launch your program in batch mode
  • Simplifying PySpark writing using method chaining
  • Scaling your program to multiple files at once

Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to keep only lowercase words. If we bring out our outline, only steps 4 and 5 remain to complete:

  1. [DONE] Read: Read the input data (we’re assuming a plain text file).

  2. [DONE] Token: Tokenize each word.

  3. [DONE] Clean: Remove any punctuation and/or tokens that aren’t words; lowercase each word.

  4. Count: Count the frequency of each word present in the text.

  5. Answer: Return the most frequent words, ordered for display.
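
To give a feel for where we’re headed, here is a minimal sketch of steps 4 and 5 in PySpark. The data frame and column names (words_clean, word) are placeholders standing in for the cleaned data frame produced in chapter 2; groupby, count, orderBy, show, and write are standard PySpark data frame methods.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("word_count").getOrCreate()

    # Placeholder for the cleaned, one-column data frame built in chapter 2.
    words_clean = spark.createDataFrame(
        [["the"], ["cat"], ["sat"], ["the"]], ["word"]
    )

    # Step 4 (count): group identical words, then count each group.
    word_counts = words_clean.groupby("word").count()

    # Step 5 (answer): order by descending frequency and show the top words.
    word_counts.orderBy(F.col("count").desc()).show(10)

    # Writing the results out as CSV (the path is illustrative).
    word_counts.write.csv("./results.csv")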

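The same two steps also illustrate the method-chaining style this chapter promotes. Because each transformation returns a new data frame, the intermediate variable can be dropped and the steps read top to bottom as one expression. A sketch, continuing from the code above with the same placeholder names:

    # Chained version: each method returns a data frame, so the
    # whole pipeline can be written as a single expression.
    results = (
        words_clean.groupby("word")
        .count()
        .orderBy(F.col("count").desc())
    )
    results.show(10)

Saved as a script (say, word_count.py; the name is illustrative), such a program runs in batch mode with spark-submit word_count.py rather than through the interactive shell. Scaling to many input files needs no new code either: spark.read.text() accepts a directory or a glob pattern such as data/*.txt.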