Chapter 9. Analyzing Genomics Data and the BDG Project

The advent of next-generation DNA sequencing (NGS) technology has rapidly transformed the life sciences into a data-driven field. However, making the best use of this data is hampered by a traditional computational ecosystem built on difficult-to-use, low-level primitives for distributed computing and a jungle of semistructured, text-based file formats.

This chapter will serve two primary purposes. First, we introduce a pair of popular serialization and file formats, Avro and Parquet, that simplify many problems in data management. These serialization technologies convert data into compact, machine-friendly binary representations, which makes it easier to move data across networks and to share it across programming languages. Although we apply these techniques to genomics data here, the concepts are useful whenever you process large amounts of data.
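To make this concrete, here is a minimal sketch of round-tripping a small DataFrame through Parquet with PySpark. The `/tmp/reads.parquet` path and the toy read schema are illustrative assumptions, not part of any genomics library; writing Avro works the same way via `df.write.format("avro")`, though that requires the external `spark-avro` package on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serialization-demo").getOrCreate()

# A toy dataset standing in for genomic reads (illustrative schema).
df = spark.createDataFrame(
    [("chr1", 1000, "ACGT"), ("chr1", 1004, "TTAG")],
    ["contig", "start", "sequence"],
)

# Parquet stores the data column-oriented and compressed, with the
# schema embedded, so any Parquet-aware tool in any language can read it.
df.write.mode("overwrite").parquet("/tmp/reads.parquet")

# Reading it back recovers both data and schema; no text parsing needed.
reads = spark.read.parquet("/tmp/reads.parquet")
reads.printSchema()
reads.show()
```

Contrast this with a text format such as SAM or BED, where every consumer must re-parse strings and re-infer types on each read.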

Second, we show how to perform typical genomics tasks in the PySpark ecosystem. Specifically, we’ll use PySpark and the open source ADAM library to manipulate large quantities of genomics data and to process data from multiple sources into a dataset for predicting transcription factor (TF) binding sites. For this, we will join genome annotations from the ENCODE dataset. This chapter will serve as a tutorial to the ADAM project, which comprises a set of genomics-specific Avro schemas, PySpark-based APIs, and command-line tools.
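As a preview of that workflow, the sketch below loads aligned reads from a BAM file and ENCODE annotations from a BED file using ADAM's Python API, then converts both to DataFrames for a naive overlap join. It assumes the `bdgenomics.adam` package is installed, that the paths `sample.bam` and `encode_peaks.bed` exist, and that field names follow recent ADAM schemas (`referenceName`, `start`, `end`); treat it as an outline of the pattern rather than the chapter's final pipeline.

```python
from pyspark.sql import SparkSession
from bdgenomics.adam.adamContext import ADAMContext

spark = SparkSession.builder.appName("adam-demo").getOrCreate()
ac = ADAMContext(spark)

# Load aligned reads; ADAM parses the BAM into its Avro-based
# alignment schema and exposes it as a genomic dataset.
alignments = ac.loadAlignments("sample.bam")      # path is an assumption

# Load ENCODE annotations (e.g., TF binding peaks) from a BED file.
features = ac.loadFeatures("encode_peaks.bed")    # path is an assumption

# Both datasets convert to Spark DataFrames, so standard SQL-style
# operations (filters, joins, aggregations) apply.
reads_df = alignments.toDF()
peaks_df = features.toDF()

# Two intervals on the same reference overlap when each one starts
# before the other ends.
overlaps = reads_df.join(
    peaks_df,
    (reads_df.referenceName == peaks_df.referenceName)
    & (reads_df.start < peaks_df.end)
    & (reads_df.end > peaks_df.start),
)
overlaps.show()
```

A Cartesian-style overlap join like this can be expensive at scale; ADAM and related libraries provide region-aware join strategies for exactly this reason, which we will rely on later in the chapter.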
