
Essential PySpark for Scalable Data Analytics

by Sreeram Nudurupati

October 2021

Beginner to intermediate

322 pages

7h 27m

English

Packt Publishing


Contributors; About the author; About the reviewers
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share your thoughts
Technical requirements; Distributed Computing; Introduction to Distributed Computing; Data Parallel Processing; Data Parallel Processing using the MapReduce paradigm; Distributed Computing with Apache Spark; Introduction to Apache Spark; Data Parallel Processing with RDDs; Higher-order functions; Apache Spark cluster architecture; Getting started with Spark; Big data processing with Spark SQL and DataFrames; Transforming data with Spark DataFrames; Using SQL on Spark; What's new in Apache Spark 3.0?; Summary
Technical requirements; Introduction to Enterprise Decision Support Systems; Ingesting data from data sources; Ingesting from relational data sources; Ingesting from file-based data sources; Ingesting from message queues; Ingesting data into data sinks; Ingesting into data warehouses; Ingesting into data lakes; Ingesting into NoSQL and in-memory data stores; Using file formats for data storage in data lakes; Unstructured data storage formats; Semi-structured data storage formats; Structured data storage formats; Building data ingestion pipelines in batch and real time; Data ingestion using batch processing; Data ingestion in real time using structured streaming; Unifying batch and real time using Lambda Architecture; Lambda Architecture; The Batch layer; The Speed layer; The Serving layer; Summary
Technical requirements; Transforming raw data into enriched meaningful data; Extracting, transforming, and loading data; Extracting, loading, and transforming data; Advantages of choosing ELT over ETL; Building analytical data stores using cloud data lakes; Challenges with cloud data lakes; Overcoming data lake challenges with Delta Lake; Consolidating data using data integration; Data consolidation via ETL and data warehousing; Integrating data using data virtualization techniques; Data integration through data federation; Making raw data analytics-ready using data cleansing; Data selection to eliminate redundancies; De-duplicating data; Standardizing data; Optimizing ELT processing performance with data partitioning; Summary
Technical requirements; Real-time analytics systems architecture; Streaming data sources; Streaming data sinks; Stream processing engines; Real-time data consumers; Real-time analytics industry use cases; Real-time predictive analytics in manufacturing; Connected vehicles in the automotive sector; Financial fraud detection; IT security threat detection; Simplifying the Lambda Architecture using Delta Lake; Change Data Capture; Handling late-arriving data; Stateful stream processing using windowing and watermarking; Multi-hop pipelines; Summary
Technical requirements; ML overview; Types of ML algorithms; Business use cases of ML; Scaling out machine learning; Techniques for scaling ML; Introduction to Apache Spark's ML library; Data wrangling with Apache Spark and MLlib; Data preprocessing; Data cleansing; Data manipulation; Summary
Technical requirements; The machine learning process; Feature extraction; Feature transformation; Transforming categorical variables; Transforming continuous variables; Transforming the date and time variables; Assembling individual features into a feature vector; Feature scaling; Feature selection; Feature store as a central feature repository; Batch inferencing using the offline feature store; Delta Lake as an offline feature store; Structure and metadata with Delta tables; Schema enforcement and evolution with Delta Lake; Support for simultaneous batch and streaming workloads; Delta Lake time travel; Integration with machine learning operations tools; Online feature store for real-time inferencing; Summary

Technical requirements; Introduction to supervised machine learning; Parametric machine learning; Non-parametric machine learning; Regression; Linear regression; Regression using decision trees; Classification; Logistic regression; Classification using decision trees; Naïve Bayes; Support vector machines; Tree ensembles; Regression using random forests; Classification using random forests; Regression using gradient boosted trees; Classification using GBTs; Real-world supervised learning applications; Regression applications; Classification applications; Summary
Technical requirements; Introduction to unsupervised machine learning; Clustering using machine learning; K-means clustering; Hierarchical clustering using bisecting K-means; Topic modeling using latent Dirichlet allocation; Gaussian mixture model; Building association rules using machine learning; Collaborative filtering using alternating least squares; Real-world applications of unsupervised learning; Clustering applications; Association rules and collaborative filtering applications; Summary
Technical requirements; Introduction to the ML life cycle; Introduction to MLflow; Tracking experiments with MLflow; ML model tuning; Tracking model versions using MLflow Model Registry; Model serving and inferencing; Offline model inferencing; Online model inferencing; Continuous delivery for ML; Summary
Technical requirements; Scaling out EDA; EDA using pandas; EDA using PySpark; Scaling out model inferencing; Model training using embarrassingly parallel computing; Distributed hyperparameter tuning; Scaling out arbitrary Python code using pandas UDF; Upgrading pandas to PySpark using Koalas; Summary
Technical requirements; Importance of data visualization; Types of data visualization tools; Techniques for visualizing data using PySpark; PySpark native data visualizations; Using Python data visualizations with PySpark; Considerations for PySpark to pandas conversion; Introduction to pandas; Converting from PySpark into pandas; Summary
Technical requirements; Introduction to SQL; DDL; DML; Joins and sub-queries; Row-based versus columnar storage; Introduction to Spark SQL; Catalyst optimizer; Spark SQL data sources; Spark SQL language reference; Spark SQL DDL; Spark DML; Optimizing Spark SQL performance; Summary
Technical requirements; Apache Spark as a distributed SQL engine; Introduction to Hive Thrift JDBC/ODBC Server; Spark connectivity to SQL analysis tools; Spark connectivity to BI tools; Connecting Python applications to Spark SQL using Pyodbc; Summary
Moving from BI to AI; Challenges with data warehouses; Challenges with data lakes; The data lakehouse paradigm; Key requirements of a data lakehouse; Data lakehouse architecture; Examples of existing lakehouse architectures; Apache Spark-based data lakehouse architecture; Advantages of data lakehouses; Summary
Other Books You May Enjoy; Packt is searching for authors like you; Share your thoughts

Content preview from Essential PySpark for Scalable Data Analytics

Chapter 1: Distributed Computing Primer

This chapter introduces you to the Distributed Computing paradigm and shows you how Distributed Computing can help you to easily process very large amounts of data. You will learn about the concept of Data Parallel Processing using the MapReduce paradigm and, finally, learn how Data Parallel Processing can be made more efficient by using an in-memory, unified data processing engine such as Apache Spark.

Then, you will dive deeper into the architecture and components of Apache Spark along with code examples. Finally, you will get an overview of what's new with the latest 3.0 release of Apache Spark.
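The MapReduce-style Data Parallel Processing mentioned above can be illustrated with a minimal sketch in plain Python. This is not code from the book and it does not use Spark: the function names (`map_partition`, `merge_counts`, `word_count`) and the sample data are hypothetical, and a local thread pool stands in for the worker nodes of a cluster. Each partition is counted independently (the map step), and the partial results are then merged (the reduce step).

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool standing in for cluster workers


def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    return Counter(word.lower() for line in lines for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


def word_count(partitions):
    """Run the map step on every partition in parallel, then reduce the results."""
    with Pool(processes=max(1, len(partitions))) as pool:
        partial_counts = pool.map(map_partition, partitions)
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical input: the data set is already split into two partitions.
partitions = [
    ["spark makes data parallel processing simple"],
    ["data parallel processing splits data into partitions"],
]
totals = word_count(partitions)
print(totals.most_common(3))
```

Because the map step never needs data from another partition, the same pattern scales from threads on one machine to many machines, which is the idea engines such as Apache Spark build on.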

In this chapter, the key skills that you will acquire include an understanding of the basics of the Distributed ...


Publisher Resources

ISBN: 9781800568877

