Data Engineering with Scala and Spark

ISBN: 9781804612583

by Eric Tome, Rupam Bhattacharjee, David Radford

January 2024

Intermediate to advanced

300 pages

6h 36m

English

Packt Publishing

Data Engineering with Scala and Spark
Contributors
About the authors; About the reviewers
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Conventions used; Get in touch; Share Your Thoughts; Download a free PDF copy of this book
Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup
Chapter 1: Scala Essentials for Data Engineers
Technical requirements; Understanding functional programming; Understanding objects, classes, and traits; Classes; Object; Trait; Working with higher-order functions (HOFs); Examples of HOFs from the Scala collection library; Understanding polymorphic functions; Variance; Option type; Collections; Understanding pattern matching; Wildcard patterns; Constant patterns; Variable patterns; Constructor patterns; Sequence patterns; Tuple patterns; Typed patterns; Implicits in Scala; Summary; Further reading
Chapter 2: Environment Setup
Technical requirements; Setting up a cloud environment; Leveraging cloud object storage; Using Databricks; Local environment setup; The build tool; Summary; Further reading
Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL
Technical requirements; Working with Apache Spark; How do Spark applications work?; What happens on executors?; Creating a Spark application using Scala; Spark stages; Shuffling; Understanding the Spark Dataset API; Understanding the Spark DataFrame API; Spark SQL; The select function; Creating temporary views; Summary
Chapter 4: Working with Databases
Technical requirements; Understanding the Spark JDBC API; Working with the Spark JDBC API; Loading the database configuration; Creating a database interface; Creating a factory method for SparkSession; Performing various database operations; Working with databases; Updating the Database API with Spark read and write; Summary
Chapter 5: Object Stores and Data Lakes
Understanding distributed file systems; Data lakes; Object stores; Streaming data; Working with streaming sources; Processing and sinks; Aggregating streams; Summary
Chapter 6: Understanding Data Transformation
Technical requirements; Understanding the difference between transformations and actions; Using Select and SelectExpr; Filtering and sorting; Learning how to aggregate, group, and join data; Leveraging advanced window functions; Working with complex dataset types; Summary
Chapter 7: Data Profiling and Data Quality
Technical requirements; Understanding components of Deequ; Performing data analysis; Leveraging automatic constraint suggestion; Defining constraints; Storing metrics using MetricsRepository; Detecting anomalies; Summary
Part 3 – Software Engineering Best Practices for Data Engineering in Scala
Chapter 8: Test-Driven Development, Code Health, and Maintainability
Technical requirements; Introducing TDD; Creating unit tests; Performing integration testing; Checking code coverage; Running static code analysis; Installing SonarQube locally; Creating a project; Running SonarScanner; Understanding linting and code style; Linting code with WartRemover; Formatting code using scalafmt; Summary
Chapter 9: CI/CD with GitHub
Technical requirements; Introducing CI/CD and GitHub; Understanding Continuous Integration (CI); Understanding Continuous Delivery (CD); Understanding the big picture of CI/CD; Working with GitHub; Cloning a repository; Understanding branches; Writing, committing, and pushing code; Creating pull requests; Reviewing and merging pull requests; Understanding GitHub Actions; Workflows; Jobs; Steps; Summary
Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning
Chapter 10: Data Pipeline Orchestration
Technical requirements; Understanding the basics of orchestration; Understanding core features of Apache Airflow; Apache Airflow’s extensibility; Extending beyond operators; Monitoring and UI; Hosting and deployment options; Designing data pipelines with Airflow; Working with Argo Workflows; Installing Argo Workflows; Understanding the core components of Argo Workflows; Taking a short detour; Creating an Argo workflow; Using Databricks Workflows; Leveraging Azure Data Factory; Primary components of ADF; Summary
Chapter 11: Performance Tuning
Introducing the Spark UI; Navigating the Spark UI; The Jobs tab – overview of job execution; Leveraging the Spark UI for performance tuning; Identifying performance bottlenecks; Optimizing data shuffling; Memory management and garbage collection; Scaling resources; Analyzing SQL query performance; Right-sizing compute resources; Understanding the basics; Understanding data skewing, indexing, and partitioning; Data skew; Indexing and partitioning; Summary
Part 5 – End-to-End Data Pipelines
Chapter 12: Building Batch Pipelines Using Spark and Scala
Understanding our business use case; What’s our marketing use case?; Understanding the data; Understanding the medallion architecture; The end-to-end pipeline; Ingesting the data; Transforming the data; Checking data quality; Creating a serving layer; Orchestrating our batch process; Summary
Chapter 13: Building Streaming Pipelines Using Spark and Scala
Understanding our business use case; What’s our IoT use case?; Understanding the data; The end-to-end pipeline; Ingesting the data; Transforming the data; Creating a serving layer; Orchestrating our streaming process; Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you; Share Your Thoughts; Download a free PDF copy of this book

Content preview from Data Engineering with Scala and Spark

Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

Apache Spark is written in Scala and has become the dominant distributed data processing framework because it can ingest, enrich, and prepare data at scale for analytical use cases. As a data engineer, you will eventually have to work with data volumes that cannot be processed on a single machine. This chapter teaches you how to use Spark and its various APIs to process that data on a cluster of machines.

In this chapter, we’re going to cover the following main topics:

Working with Apache Spark
Creating a Spark application using Scala
Understanding the Spark Dataset API
Understanding the Spark DataFrame API

Technical requirements

Please refer ...
