Databricks Fundamentals Bootcamp

Beginner

Building end-to-end data engineering solutions

Course outcomes

Understand how to get started with the Databricks platform
Use Apache Spark on Databricks to explore, clean, and transform data
Store data reliably in a data lake using Delta Lake
Understand what Databricks Unity Catalog is and how to use it
Learn how to build production-ready workflows with Databricks Workflows

Course description

The amount of data in the world is growing at an exponential pace. Unlocking the potential of this data requires the ability to process, store, and analyze it at scale, tasks that the Databricks platform can help you perform.

Join expert Mohit Batra to learn the fundamentals of the Databricks platform and how to use it to work with large volumes of data. You’ll learn how to set up a Databricks workspace and environment, how to use Apache Spark to extract data from different file formats, and how to use Spark to explore, clean, and transform the data. You’ll also learn how to load the processed data into a data lake, and how to use Delta Lake to bring consistency to the stored data and build data warehouse-like features on top of the data lake. Next, you’ll learn how to manage and govern data using Databricks Unity Catalog. Finally, you’ll learn how to build production-ready workflows using Databricks Workflows.

NOTE: With today’s registration, you’ll be signed up for both sessions. Although you can attend either of the sessions individually, we recommend participating in both.

What you’ll learn and how you can apply it

Set up and navigate the Databricks platform
Use Apache Spark to build ETL pipelines by extracting, cleaning, and transforming the data
Build data warehouse-like features on top of a data lake by reliably storing data in the Delta format
Use Databricks Unity Catalog to manage and govern data
Build production-ready pipelines using Databricks Workflows

This live event is for you because...

You’re a data engineer or data architect who wants to understand Databricks’ capabilities in data engineering and to integrate best practices into your workflow.
You're a data professional who wants to utilize Databricks in your projects to build data engineering solutions.
You want to become a Databricks data engineer.

Prerequisites

A free Azure trial subscription (preferably with a business/work email account) or a Databricks Community Edition account
Basic understanding of cloud platforms
(Optional) Basic Python and SQL knowledge

Recommended follow-up:

Read Delta Lake: The Definitive Guide (book)
Read Practical Lakehouse Architecture (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Day 1: Exploring, Cleaning, and Transforming Data with Databricks

Introduction to Databricks (20 minutes)

Presentation: Overview of the Databricks platform and its features
Group discussion: Benefits of using the Databricks platform for organizations
Q&A

Understanding Apache Spark architecture (30 minutes)

Presentation: How Apache Spark performs distributed data processing; Spark on Databricks
Group discussion: How organizations are using Spark
Q&A
Break

Setting up and exploring Databricks (60 minutes)

Demos: Setting up Databricks workspace; walkthrough of Databricks workspace; creating and using interactive clusters; using serverless compute; using notebooks; working with dbutils
Hands-on exercises: Set up Databricks workspace; configure cluster; use dbutils to work with file system
Q&A
Break

Reading data from multiple file formats (50 minutes)

Presentation: Connecting to external storage; understanding Databricks File System (DBFS); understanding Spark DataFrames; working with file formats like CSV, Parquet, and JSON
Hands-on exercises: Upload files to Databricks File System; read files using Spark's DataFrames API; infer or apply schema to files
Q&A
Break

Cleaning and transforming data using PySpark (55 minutes)

Demos: Operations to clean and transform data; using Databricks Assistant for generating code
Hands-on exercises: Apply clean-up operations (removing duplicates and nulls, filling missing values, and filtering records); apply transformations (selecting or renaming columns, creating derived columns, etc.)
Q&A

Working with Spark SQL and visualizing data (25 minutes)

Demos: Running SQL queries on DataFrames; creating visualizations
Hands-on exercises: Create temporary SQL views on DataFrames; build reports; create charts; add charts to dashboards
Q&A

Day 2: Storing Data with Delta Lake and Building Workflows with Databricks

Introduction to Delta Lake (30 minutes)

Presentation: Challenges with data lakes; Delta format and transaction log; ACID guarantees on data lakes; competing table formats
Group discussion: How Delta Lake can help build a data warehouse on a data lake
Q&A

Storing data in a data lake using the Delta format (40 minutes)

Demos: Writing DataFrames in Delta format; checking transaction log; creating and managing Delta tables; audit history; table constraints
Q&A
Break

Working with Delta Lake features (55 minutes)

Demos: Working with Delta Lake features
Hands-on exercises: Perform CRUD operations; work with schema enforcement and evolution, time travel, and optimization; compare performance with Parquet
Q&A
Break

Working with Unity Catalog (60 minutes)

Presentation: Why Unity Catalog is required; setting up a metastore and catalog; writing a table to catalog and schema
Hands-on exercises: Create a metastore; assign workspace to metastore; create catalog and schema; add a table to catalog
Q&A
Break

Building workflows with Databricks (55 minutes)

Presentation: Understanding different types of pipelines; differences between all-purpose and job clusters; orchestrating tasks and sharing clusters
Hands-on exercises: Create a job; connect tasks; create a job cluster; schedule the job
Q&A

Your Instructor

Mohit Batra
Mohit Batra is a data engineer, a Microsoft Certified Trainer (MCT), and the founder of Crystal Talks, a training and consultancy company. He has 17+ years of experience architecting large-scale Business Intelligence, Data Warehousing, and Big Data solutions for leading investment banks, as well as companies like Microsoft. Mohit has often shared his knowledge of Azure, Spark, BI, and Big Data at various public forums and as a corporate trainer. In his free time, Mohit enjoys reading, photography, and music.

Skill covered

Databricks
