Essential PySpark for Scalable Data Analytics

Name: Essential PySpark for Scalable Data Analytics
Author: Sreeram Nudurupati
ISBN: 9781800568877

October 2021

Beginner to intermediate

322 pages

7h 27m

English

Packt Publishing

Essential PySpark for Scalable Data Analytics
Contributors; About the author; About the reviewers
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share your thoughts
Section 1: Data Engineering
Chapter 1: Distributed Computing Primer
Technical requirements; Distributed Computing; Introduction to Distributed Computing; Data Parallel Processing; Data Parallel Processing using the MapReduce paradigm; Distributed Computing with Apache Spark; Introduction to Apache Spark; Data Parallel Processing with RDDs; Higher-order functions; Apache Spark cluster architecture; Getting started with Spark; Big data processing with Spark SQL and DataFrames; Transforming data with Spark DataFrames; Using SQL on Spark; What's new in Apache Spark 3.0?; Summary
Chapter 2: Data Ingestion
Technical requirements; Introduction to Enterprise Decision Support Systems; Ingesting data from data sources; Ingesting from relational data sources; Ingesting from file-based data sources; Ingesting from message queues; Ingesting data into data sinks; Ingesting into data warehouses; Ingesting into data lakes; Ingesting into NoSQL and in-memory data stores; Using file formats for data storage in data lakes; Unstructured data storage formats; Semi-structured data storage formats; Structured data storage formats; Building data ingestion pipelines in batch and real time; Data ingestion using batch processing; Data ingestion in real time using structured streaming; Unifying batch and real time using Lambda Architecture; Lambda Architecture; The Batch layer; The Speed layer; The Serving layer; Summary
Chapter 3: Data Cleansing and Integration
Technical requirements; Transforming raw data into enriched meaningful data; Extracting, transforming, and loading data; Extracting, loading, and transforming data; Advantages of choosing ELT over ETL; Building analytical data stores using cloud data lakes; Challenges with cloud data lakes; Overcoming data lake challenges with Delta Lake; Consolidating data using data integration; Data consolidation via ETL and data warehousing; Integrating data using data virtualization techniques; Data integration through data federation; Making raw data analytics-ready using data cleansing; Data selection to eliminate redundancies; De-duplicating data; Standardizing data; Optimizing ELT processing performance with data partitioning; Summary
Chapter 4: Real-Time Data Analytics
Technical requirements; Real-time analytics systems architecture; Streaming data sources; Streaming data sinks; Stream processing engines; Real-time data consumers; Real-time analytics industry use cases; Real-time predictive analytics in manufacturing; Connected vehicles in the automotive sector; Financial fraud detection; IT security threat detection; Simplifying the Lambda Architecture using Delta Lake; Change Data Capture; Handling late-arriving data; Stateful stream processing using windowing and watermarking; Multi-hop pipelines; Summary
Section 2: Data Science
Chapter 5: Scalable Machine Learning with PySpark
Technical requirements; ML overview; Types of ML algorithms; Business use cases of ML; Scaling out machine learning; Techniques for scaling ML; Introduction to Apache Spark's ML library; Data wrangling with Apache Spark and MLlib; Data preprocessing; Data cleansing; Data manipulation; Summary
Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
Technical requirements; The machine learning process; Feature extraction; Feature transformation; Transforming categorical variables; Transforming continuous variables; Transforming the date and time variables; Assembling individual features into a feature vector; Feature scaling; Feature selection; Feature store as a central feature repository; Batch inferencing using the offline feature store; Delta Lake as an offline feature store; Structure and metadata with Delta tables; Schema enforcement and evolution with Delta Lake; Support for simultaneous batch and streaming workloads; Delta Lake time travel; Integration with machine learning operations tools; Online feature store for real-time inferencing; Summary

Chapter 7: Supervised Machine Learning
Technical requirements; Introduction to supervised machine learning; Parametric machine learning; Non-parametric machine learning; Regression; Linear regression; Regression using decision trees; Classification; Logistic regression; Classification using decision trees; Naïve Bayes; Support vector machines; Tree ensembles; Regression using random forests; Classification using random forests; Regression using gradient boosted trees; Classification using GBTs; Real-world supervised learning applications; Regression applications; Classification applications; Summary
Chapter 8: Unsupervised Machine Learning
Technical requirements; Introduction to unsupervised machine learning; Clustering using machine learning; K-means clustering; Hierarchical clustering using bisecting K-means; Topic modeling using latent Dirichlet allocation; Gaussian mixture model; Building association rules using machine learning; Collaborative filtering using alternating least squares; Real-world applications of unsupervised learning; Clustering applications; Association rules and collaborative filtering applications; Summary
Chapter 9: Machine Learning Life Cycle Management
Technical requirements; Introduction to the ML life cycle; Introduction to MLflow; Tracking experiments with MLflow; ML model tuning; Tracking model versions using MLflow Model Registry; Model serving and inferencing; Offline model inferencing; Online model inferencing; Continuous delivery for ML; Summary
Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Technical requirements; Scaling out EDA; EDA using pandas; EDA using PySpark; Scaling out model inferencing; Model training using embarrassingly parallel computing; Distributed hyperparameter tuning; Scaling out arbitrary Python code using pandas UDF; Upgrading pandas to PySpark using Koalas; Summary
Section 3: Data Analysis
Chapter 11: Data Visualization with PySpark
Technical requirements; Importance of data visualization; Types of data visualization tools; Techniques for visualizing data using PySpark; PySpark native data visualizations; Using Python data visualizations with PySpark; Considerations for PySpark to pandas conversion; Introduction to pandas; Converting from PySpark into pandas; Summary
Chapter 12: Spark SQL Primer
Technical requirements; Introduction to SQL; DDL; DML; Joins and sub-queries; Row-based versus columnar storage; Introduction to Spark SQL; Catalyst optimizer; Spark SQL data sources; Spark SQL language reference; Spark SQL DDL; Spark DML; Optimizing Spark SQL performance; Summary
Chapter 13: Integrating External Tools with Spark SQL
Technical requirements; Apache Spark as a distributed SQL engine; Introduction to Hive Thrift JDBC/ODBC Server; Spark connectivity to SQL analysis tools; Spark connectivity to BI tools; Connecting Python applications to Spark SQL using Pyodbc; Summary
Chapter 14: The Data Lakehouse
Moving from BI to AI; Challenges with data warehouses; Challenges with data lakes; The data lakehouse paradigm; Key requirements of a data lakehouse; Data lakehouse architecture; Examples of existing lakehouse architectures; Apache Spark-based data lakehouse architecture; Advantages of data lakehouses; Summary
Why subscribe?
Other Books You May Enjoy; Packt is searching for authors like you; Share your thoughts

Content preview from Essential PySpark for Scalable Data Analytics

Chapter 3: Data Cleansing and Integration

In the previous chapter, you were introduced to the first step of the data analytics process – that is, ingesting raw, transactional data from various source systems into a cloud-based data lake. Once we have the raw data available, we need to process, clean, and transform it into a format that helps with extracting meaningful, actionable business insights. This process of cleaning, processing, and transforming raw data is known as data cleansing and integration. This is what you will learn about in this chapter.

Raw data sourced from operational systems is not conducive to data analytics in its original form. In this chapter, you will learn about various data integration techniques, which are useful in ...


