Chapter 4. Making Predictions with Decision Trees and Decision Forests
Classification and regression are the oldest and most well-studied types of predictive analytics. Most algorithms you are likely to encounter in analytics packages and libraries are classification or regression techniques, such as support vector machines, logistic regression, neural networks, and deep learning. The common thread linking regression and classification is that both involve predicting one (or more) values given one (or more) other values. To do so, both require a body of inputs and outputs to learn from: they must be fed both questions and known answers. For this reason, they are known as types of supervised learning.
PySpark MLlib offers implementations of a number of classification and regression algorithms. These include decision trees, naïve Bayes, logistic regression, and linear regression. The exciting thing about these algorithms is that they can help predict the future—or at least, predict the things we don’t yet know for sure, like the likelihood you will buy a car based on your online behavior, whether an email is spam given the words it contains, or which acres of land are likely to grow the most crops given their location and soil chemistry.
In this chapter, we will focus on a popular and flexible algorithm for both classification and regression, the decision tree, and its extension, the random decision forest. First, we will understand the basics of decision trees and ...
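To build intuition before diving in, the core idea behind a decision tree can be sketched in a few lines of plain Python: at each node, the tree picks the feature threshold that best separates the labels, commonly measured by Gini impurity. The sketch below is an illustration of that idea only, not MLlib's distributed implementation; the function names and the tiny soil-pH dataset are invented for the example.

```python
# A toy, pure-Python sketch of how a decision tree chooses a split:
# try candidate thresholds and keep the one that minimizes the
# weighted Gini impurity of the two resulting groups.

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return (threshold, impurity) for the split x <= threshold that
    minimizes the size-weighted Gini impurity over a single feature."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Invented toy data: feature = soil pH, label = crop yield class.
xs = [5.1, 5.8, 6.2, 6.9, 7.4, 7.9]
ys = ["poor", "poor", "poor", "good", "good", "good"]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)  # → 6.2 0.0 (a perfect split)
```

A real decision tree applies this search recursively to each resulting group, and a random forest trains many such trees on random subsets of the data and features, averaging their votes.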