Essential PySpark for Scalable Data Analytics

by Sreeram Nudurupati

October 2021

Beginner to intermediate

322 pages

7h 27m

English

Packt Publishing

Contributors; About the author; About the reviewers
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share your thoughts
Technical requirements; Distributed Computing; Introduction to Distributed Computing; Data Parallel Processing; Data Parallel Processing using the MapReduce paradigm; Distributed Computing with Apache Spark; Introduction to Apache Spark; Data Parallel Processing with RDDs; Higher-order functions; Apache Spark cluster architecture; Getting started with Spark; Big data processing with Spark SQL and DataFrames; Transforming data with Spark DataFrames; Using SQL on Spark; What's new in Apache Spark 3.0?; Summary
Technical requirements; Introduction to Enterprise Decision Support Systems; Ingesting data from data sources; Ingesting from relational data sources; Ingesting from file-based data sources; Ingesting from message queues; Ingesting data into data sinks; Ingesting into data warehouses; Ingesting into data lakes; Ingesting into NoSQL and in-memory data stores; Using file formats for data storage in data lakes; Unstructured data storage formats; Semi-structured data storage formats; Structured data storage formats; Building data ingestion pipelines in batch and real time; Data ingestion using batch processing; Data ingestion in real time using structured streaming; Unifying batch and real time using Lambda Architecture; Lambda Architecture; The Batch layer; The Speed layer; The Serving layer; Summary
Technical requirements; Transforming raw data into enriched meaningful data; Extracting, transforming, and loading data; Extracting, loading, and transforming data; Advantages of choosing ELT over ETL; Building analytical data stores using cloud data lakes; Challenges with cloud data lakes; Overcoming data lake challenges with Delta Lake; Consolidating data using data integration; Data consolidation via ETL and data warehousing; Integrating data using data virtualization techniques; Data integration through data federation; Making raw data analytics-ready using data cleansing; Data selection to eliminate redundancies; De-duplicating data; Standardizing data; Optimizing ELT processing performance with data partitioning; Summary
Technical requirements; Real-time analytics systems architecture; Streaming data sources; Streaming data sinks; Stream processing engines; Real-time data consumers; Real-time analytics industry use cases; Real-time predictive analytics in manufacturing; Connected vehicles in the automotive sector; Financial fraud detection; IT security threat detection; Simplifying the Lambda Architecture using Delta Lake; Change Data Capture; Handling late-arriving data; Stateful stream processing using windowing and watermarking; Multi-hop pipelines; Summary
Technical requirements; ML overview; Types of ML algorithms; Business use cases of ML; Scaling out machine learning; Techniques for scaling ML; Introduction to Apache Spark's ML library; Data wrangling with Apache Spark and MLlib; Data preprocessing; Data cleansing; Data manipulation; Summary
Technical requirements; The machine learning process; Feature extraction; Feature transformation; Transforming categorical variables; Transforming continuous variables; Transforming the date and time variables; Assembling individual features into a feature vector; Feature scaling; Feature selection; Feature store as a central feature repository; Batch inferencing using the offline feature store; Delta Lake as an offline feature store; Structure and metadata with Delta tables; Schema enforcement and evolution with Delta Lake; Support for simultaneous batch and streaming workloads; Delta Lake time travel; Integration with machine learning operations tools; Online feature store for real-time inferencing; Summary
Technical requirements; Introduction to supervised machine learning; Parametric machine learning; Non-parametric machine learning; Regression; Linear regression; Regression using decision trees; Classification; Logistic regression; Classification using decision trees; Naïve Bayes; Support vector machines; Tree ensembles; Regression using random forests; Classification using random forests; Regression using gradient boosted trees; Classification using GBTs; Real-world supervised learning applications; Regression applications; Classification applications; Summary
Technical requirements; Introduction to unsupervised machine learning; Clustering using machine learning; K-means clustering; Hierarchical clustering using bisecting K-means; Topic modeling using latent Dirichlet allocation; Gaussian mixture model; Building association rules using machine learning; Collaborative filtering using alternating least squares; Real-world applications of unsupervised learning; Clustering applications; Association rules and collaborative filtering applications; Summary
Technical requirements; Introduction to the ML life cycle; Introduction to MLflow; Tracking experiments with MLflow; ML model tuning; Tracking model versions using MLflow Model Registry; Model serving and inferencing; Offline model inferencing; Online model inferencing; Continuous delivery for ML; Summary
Technical requirements; Scaling out EDA; EDA using pandas; EDA using PySpark; Scaling out model inferencing; Model training using embarrassingly parallel computing; Distributed hyperparameter tuning; Scaling out arbitrary Python code using pandas UDF; Upgrading pandas to PySpark using Koalas; Summary
Technical requirements; Importance of data visualization; Types of data visualization tools; Techniques for visualizing data using PySpark; PySpark native data visualizations; Using Python data visualizations with PySpark; Considerations for PySpark to pandas conversion; Introduction to pandas; Converting from PySpark into pandas; Summary
Technical requirements; Introduction to SQL; DDL; DML; Joins and sub-queries; Row-based versus columnar storage; Introduction to Spark SQL; Catalyst optimizer; Spark SQL data sources; Spark SQL language reference; Spark SQL DDL; Spark DML; Optimizing Spark SQL performance; Summary
Technical requirements; Apache Spark as a distributed SQL engine; Introduction to Hive Thrift JDBC/ODBC Server; Spark connectivity to SQL analysis tools; Spark connectivity to BI tools; Connecting Python applications to Spark SQL using Pyodbc; Summary
Moving from BI to AI; Challenges with data warehouses; Challenges with data lakes; The data lakehouse paradigm; Key requirements of a data lakehouse; Data lakehouse architecture; Examples of existing lakehouse architectures; Apache Spark-based data lakehouse architecture; Advantages of data lakehouses; Summary
Other Books You May Enjoy; Packt is searching for authors like you; Share your thoughts

Content preview from Essential PySpark for Scalable Data Analytics

Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark

In Chapter 5, Scalable Machine Learning with PySpark, you learned how you could use the power of Apache Spark's distributed computing framework to train and score machine learning (ML) models at scale. Spark's native ML library provides good coverage of standard tasks that data scientists typically perform; however, there is a wide variety of functionality provided by standard single-node Python libraries that were not designed to work in a distributed manner. This chapter deals with techniques for horizontally scaling out standard Python data processing and ML libraries such as pandas, scikit-learn, XGBoost, and more. It also covers scaling out of typical data science tasks ...
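As a rough illustration of the idea behind scaling out single-node code, the sketch below fits one small model per group of records, with each fit running independently of the others. It is not taken from the book and does not use Spark: a standard-library thread pool stands in for the cluster, and the `rows` data, the `store_a`/`store_b` keys, and the `fit_line` helper are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on one group's points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: (group key, x, y).
rows = [
    ("store_a", 1.0, 3.0), ("store_a", 2.0, 5.0), ("store_a", 3.0, 7.0),
    ("store_b", 1.0, 1.0), ("store_b", 2.0, 0.0), ("store_b", 3.0, -1.0),
]

# Split: collect each group's points under its key.
groups = defaultdict(list)
for key, x, y in rows:
    groups[key].append((x, y))

# Apply: fit every group independently; no fit needs data from another group.
# Combine: pair each fitted (slope, intercept) with its group key.
with ThreadPoolExecutor(max_workers=2) as pool:
    fitted = pool.map(fit_line, groups.values())
    models = dict(zip(groups.keys(), fitted))

for key, (slope, intercept) in models.items():
    print(f"{key}: slope={slope}, intercept={intercept}")
```

Because the groups share no state, the same single-node function could in principle be handed to a distributed engine unchanged. In PySpark, this split-apply-combine shape is commonly expressed with `groupBy(...).applyInPandas(...)`, where the per-group function receives and returns a pandas DataFrame.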