Preface
Apache Spark’s long lineage of predecessors, from MPI (Message Passing Interface) to MapReduce, made it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, the field of big data has become so closely tied to them that its scope is defined by what these frameworks can handle. Spark’s original promise was to take this a little further: to make writing distributed programs feel like writing regular programs.
The rise in Spark’s popularity coincided with that of the Python data (PyData) ecosystem, so it makes sense that Spark’s Python API, PySpark, has grown significantly in popularity over the last few years. Although the PyData ecosystem has recently produced distributed programming options of its own, Apache Spark remains one of the most popular choices for working with large datasets across industries and domains. Thanks to recent efforts to integrate PySpark with the other PyData tools, learning the framework can significantly boost your productivity as a data science practitioner.
We think that the best way to teach data science is by example. To that end, we have put together a book of applications, trying to touch on the interactions between the most common algorithms, datasets, and design patterns in large-scale analytics. This book isn’t meant to be read cover to cover: page to a chapter that looks like something you’re trying to accomplish, or that simply ignites your interest, and start there.