Advanced Analytics with PySpark

by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills

June 2022

Beginner to intermediate

233 pages

6h 28m

English

O'Reilly Media, Inc.

Preface
  Why Did We Write This Book Now?
  How This Book Is Organized
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Analyzing Big Data
  Working with Big Data
  Introducing Apache Spark and PySpark
  Components
  PySpark
  Ecosystem
  Spark 3.0
  PySpark Addresses Challenges of Data Science
  Where to Go from Here
2. Introduction to Data Analysis with PySpark
  Spark Architecture
  Installing PySpark
  Setting Up Our Data
  Analyzing Data with the DataFrame API
  Fast Summary Statistics for DataFrames
  Pivoting and Reshaping DataFrames
  Joining DataFrames and Selecting Features
  Scoring and Model Evaluation
  Where to Go from Here
3. Recommending Music and the Audioscrobbler Dataset
  Setting Up the Data
  Our Requirements for a Recommender System
  Alternating Least Squares Algorithm
  Preparing the Data
  Building a First Model
  Spot Checking Recommendations
  Evaluating Recommendation Quality
  Computing AUC
  Hyperparameter Selection
  Making Recommendations
  Where to Go from Here
4. Making Predictions with Decision Trees and Decision Forests
  Decision Trees and Forests
  Preparing the Data
  Our First Decision Tree
  Decision Tree Hyperparameters
  Tuning Decision Trees
  Categorical Features Revisited
  Random Forests
  Making Predictions
  Where to Go from Here
5. Anomaly Detection with K-means Clustering
  K-means Clustering
  Identifying Anomalous Network Traffic
  KDD Cup 1999 Dataset
  A First Take on Clustering
  Choosing k
  Visualization with SparkR
  Feature Normalization
  Categorical Variables
  Using Labels with Entropy
  Clustering in Action
  Where to Go from Here
6. Understanding Wikipedia with LDA and Spark NLP
  Latent Dirichlet Allocation
  LDA in PySpark
  Getting the Data
  Spark NLP
  Setting Up Your Environment
  Parsing the Data
  Preparing the Data Using Spark NLP
  TF-IDF
  Computing the TF-IDFs
  Creating Our LDA Model
  Where to Go from Here
7. Geospatial and Temporal Data Analysis on Taxi Trip Data
  Preparing the Data
  Converting Datetime Strings to Timestamps
  Handling Invalid Records
  Geospatial Analysis
  Intro to GeoJSON
  GeoPandas
  Sessionization in PySpark
  Building Sessions: Secondary Sorts in PySpark
  Where to Go from Here
8. Estimating Financial Risk
  Terminology
  Methods for Calculating VaR
  Variance-Covariance
  Historical Simulation
  Monte Carlo Simulation
  Our Model
  Getting the Data
  Preparing the Data
  Determining the Factor Weights
  Sampling
  The Multivariate Normal Distribution
  Running the Trials
  Visualizing the Distribution of Returns
  Where to Go from Here
9. Analyzing Genomics Data and the BDG Project
  Decoupling Storage from Modeling
  Setting Up ADAM
  Introduction to Working with Genomics Data Using ADAM
  File Format Conversion with the ADAM CLI
  Ingesting Genomics Data Using PySpark and ADAM
  Predicting Transcription Factor Binding Sites from ENCODE Data
  Where to Go from Here
10. Image Similarity Detection with Deep Learning and PySpark LSH
  PyTorch
  Installation
  Preparing the Data
  Resizing Images Using PyTorch
  Deep Learning Model for Vector Representation of Images
  Image Embeddings
  Import Image Embeddings into PySpark
  Image Similarity Search Using PySpark LSH
  Nearest Neighbor Search
  Where to Go from Here
11. Managing the Machine Learning Lifecycle with MLflow
  Machine Learning Lifecycle
  MLflow
  Experiment Tracking
  Managing and Serving ML Models
  Creating and Using MLflow Projects
  Where to Go from Here
Index
About the Authors

Content preview from Advanced Analytics with PySpark

Chapter 3. Recommending Music and the Audioscrobbler Dataset

The recommender engine is one of the most popular examples of large-scale machine learning; most people are familiar with Amazon’s, for instance. It is a common denominator because recommender engines are everywhere, from social networks to video sites to online retailers. We can also directly observe them in action. We’re aware that a computer is picking tracks to play on Spotify, whereas we don’t necessarily notice that Gmail is deciding whether inbound email is spam.

The output of a recommender is more intuitively understandable than that of other machine learning algorithms. It’s exciting, even. For as much as we think that musical taste is personal and inexplicable, recommenders do a surprisingly good job of identifying tracks we didn’t know we would like. For domains like music or movies, where recommenders are often deployed, it’s comparatively easy to reason about why a recommended piece of music fits with someone’s listening history. Not all clustering or classification algorithms match that description. For example, a support vector machine classifier is a set of coefficients, and it’s hard even for practitioners to articulate what the numbers mean when they make predictions.

It seems fitting to kick off the next three chapters, which will explore key machine learning algorithms on PySpark, with a chapter built around recommender engines, and recommending music in particular. It’s an accessible way to introduce ...


Publisher Resources

ISBN: 9781098103644
