
Distributed Data Systems with Azure Databricks

Name: Distributed Data Systems with Azure Databricks
Author: Alan Bernardo Palacio
ISBN: 9781838647216

by Alan Bernardo Palacio

May 2021

Intermediate to advanced

414 pages

8h 35m

English

Packt Publishing


Distributed Data Systems with Azure Databricks
Contributors
  About the author
  About the reviewer
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Download the color images
  Conventions used
  Get in touch
  Reviews
Section 1: Introducing Databricks
Chapter 1: Introduction to Azure Databricks
  Technical requirements
  Introducing Apache Spark
  Introducing Azure Databricks
  Examining the architecture of Databricks
  Discovering core concepts and terminology
  Interacting with the Azure Databricks workspace
  Workspace assets
  Workspace object operations
  Using Azure Databricks notebooks
  Creating and managing notebooks
  Notebooks and clusters
  Exploring data management
  Databases and tables
  Viewing databases and tables
  Importing data
  Creating a table
  Table details
  Exploring computation management
  Displaying clusters
  Starting a cluster
  Terminating a cluster
  Deleting a cluster
  Cluster information
  Cluster logs
  Exploring authentication and authorization
  Clustering access control
  Folder permissions
  Notebook permissions
  MLflow Model permissions
  Summary
Chapter 2: Creating an Azure Databricks Workspace
  Technical requirements
  Using the Azure portal UI
  Accessing the Workspace UI
  Configuring an Azure Databricks cluster
  Creating a new notebook
  Examining Azure Databricks authentication
  Access control
  Working with VNets in Azure Databricks
  Virtual network requirements
  Deploying to your own VNet
  Azure Resource Manager templates
  Creating an Azure Databricks workspace with an ARM template
  Reviewing deployed resources
  Cleaning up resources
  Setting up the Azure Databricks CLI
  Authentication through an access token
  Authentication using an Azure AD token
  Validating the installation
  Workspace CLI
  Using the CLI to explore the workplace
  Clusters CLI
  Jobs CLI
  Groups API
  The Databricks CLI from Azure Cloud Shell
  Summary
Section 2: Data Pipelines with Databricks
Chapter 3: Creating ETL Operations with Azure Databricks
  Technical requirements
  Using ADLS Gen2
  Setting up a basic ADLS Gen2 data lake
  Uploading data to ADLS Gen2
  Accessing ADLS Gen2 from Azure Databricks
  Loading data from ADLS Gen2
  Using S3 with Azure Databricks
  Connecting to S3
  Loading data into a Spark DataFrame
  Using Azure Blob storage with Azure Databricks
  Setting up Azure Blob storage
  Uploading files and access keys
  Setting up the connection to Azure Blob storage
  Transforming and cleaning data
  Spark data frames
  Querying using SQL
  Writing back table data to Azure Data Lake
  Orchestrating jobs with Azure Databricks
  ADF
  Creating an ADF resource
  Creating an ETL in ADF
  Scheduling jobs with Azure Databricks
  Scheduling a notebook as a job
  Job logs
  Summary
Chapter 4: Delta Lake with Azure Databricks
  Technical requirements
  Introducing Delta Lake
  Ingesting data using Delta Lake
  Partner integrations
  The COPY INTO SQL command
  Auto Loader
  Batching table read and writes
  Creating a table
  Reading a Delta table
  Partitioning data to speed up queries
  Querying past states of a table
  Using time travel to query tables
  Working with past and present data
  Schema validation
  Streaming table read and writes
  Streaming from Delta tables
  Managing table updates and deletes
  Specifying an initial position
  Streaming modes
  Optimization with Delta Lake
  Summary
Chapter 5: Introducing Delta Engine
  Technical requirements
  Optimizing file management with Delta Engine
  Merging small files using bin-packing
  Skipping data
  Using Z-order clustering
  Managing data recency
  Understanding checkpoints
  Automatically optimizing files with Delta Engine
  Using caching to improve performance
  Delta and Apache Spark caching
  Caching a subset of the data
  Configuring the Delta cache
  Optimizing queries using DFP
  Using DFP
  Using Bloom filters
  Understanding Bloom filters
  Bloom filters in Azure Databricks
  Creating a Bloom filter index
  Optimizing join performance
  Range join optimization
  Enabling range join optimization
  Skew join optimization
  Relationships and columns
  Summary
Chapter 6: Introducing Structured Streaming
  Technical requirements
  Structured Streaming model
  Using the Structured Streaming API
  Mapping, filtering, and running aggregations
  Windowed aggregations on event time
  Merging streaming and static data
  Interactive queries
  Using different sources with continuous streams
  Using a Delta table as a stream source
  Azure Event Hubs
  Auto Loader
  Apache Kafka
  Avro data
  Data sinks
  Recovering from query failures
  Optimizing streaming queries
  Triggering streaming query executions
  Different kinds of triggers
  Trigger examples
  Visualizing data on streaming data frames
  Example on Structured Streaming
  Summary

Section 3: Machine and Deep Learning with Databricks
Chapter 7: Using Python Libraries in Azure Databricks
  Technical requirements
  Installing libraries in Azure Databricks
  Workspace libraries
  Cluster libraries
  Notebook-scoped Python libraries
  PySpark API
  Main functionalities of PySpark
  Operating with PySpark DataFrames
  pandas DataFrame API (Koalas)
  Using the Koalas API
  Using SQL in Koalas
  Working with PySpark
  Visualizing data
  Bokeh
  Matplotlib
  Plotly
  Summary
Chapter 8: Databricks Runtime for Machine Learning
  Loading data
  Reading data from DBFS
  Reading CSV files
  Feature engineering
  Tokenizer
  Binarizer
  Polynomial expansion
  StringIndexer
  One-hot encoding
  VectorIndexer
  Normalizer
  StandardScaler
  Bucketizer
  Element-wise product
  Time-series data sources
  Joining time-series data
  Using the Koalas API
  Handling missing values
  Extracting features from text
  TF-IDF
  Word2vec
  Training machine learning models on tabular data
  Engineering the variables
  Building the ML model
  Registering the model in the MLflow Model Registry
  Model serving
  Summary
Chapter 9: Databricks Runtime for Deep Learning
  Technical requirements
  Loading data for deep learning
  Using TFRecords for distributed learning
  Structuring TFRecords files
  Managing data using TFRecords
  Automating schema inference
  Using TFRecordDataset to load data
  Using Petastorm for distributed learning
  Introducing Petastorm
  Generating a dataset
  Reading a dataset
  Using Petastorm to prepare data for deep learning
  Data preprocessing and featurization
  Featurization using a pre-trained model for transfer learning
  Featurization using pandas UDFs
  Applying featurization to the DataFrame of images
  Summary
Chapter 10: Model Tracking and Tuning in Azure Databricks
  Technical requirements
  Tuning hyperparameters with AutoML
  Automating model tracking with MLflow
  Managing MLflow runs
  Automating MLflow tracking with MLlib
  Hyperparameter tuning with Hyperopt
  Hyperopt concepts
  Defining a search space
  Applying best practices in Hyperopt
  Optimizing model selection with scikit-learn, Hyperopt, and MLflow
  Summary
Chapter 11: Managing and Serving Models with MLflow and MLeap
  Technical requirements
  Managing machine learning models
  Using MLflow notebook experiments
  Registering a model using the MLflow API
  Transitioning a model stage
  Model Registry example
  Exporting and loading pipelines with MLeap
  Serving models with MLflow
  Scoring a model
  Summary
Chapter 12: Distributed Deep Learning in Azure Databricks
  Technical requirements
  Distributed training for deep learning
  The ring allreduce technique
  Using the Horovod distributed learning library in Azure Databricks
  Installing the horovod library
  Using the horovod library
  Training a model on a single node
  Distributing training with HorovodRunner
  Distributing hyperparameter tuning using Horovod and Hyperopt
  Using the Spark TensorFlow Distributor package
  Summary
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Leave a review - let other readers know what you think

Content preview from Distributed Data Systems with Azure Databricks

Chapter 1: Introduction to Azure Databricks

Modern information systems work with massive amounts of data, arriving in a constant flow that grows every day at an exponential rate. This data comes from many sources, including sales information, transactional data, social media, and more. To extract value from it, organizations must run it through processes such as transformation and aggregation and build applications on the results.

Apache Spark was developed to process data at this scale. Azure Databricks is built on top of Apache Spark: it abstracts away most of the complexity of implementing Spark and adds the benefits of integration with other Azure services. This book aims to provide an introduction ...

