Databricks Fundamentals Bootcamp
Published by O'Reilly Media, Inc.
Building end-to-end data engineering solutions
Course outcomes
- Understand how to get started with the Databricks platform
- Use Apache Spark on Databricks to explore, clean, and transform data
- Store data reliably in a data lake using Delta Lake
- Understand what Databricks Unity Catalog is and how to use it
- Learn how to build production-ready workflows with Databricks Workflows
Course description
The amount of data in the world is growing at an exponential pace. To unlock the potential of this data, you need to process, store, and analyze it at scale, tasks that the Databricks platform can help you perform.
Join expert Mohit Batra to learn the fundamentals of the Databricks platform and how to use it to work with large volumes of data. You’ll learn how to set up a Databricks workspace and environment, how to use Apache Spark to extract data from different file formats, and how to use Spark to explore, clean, and transform that data. You’ll also learn how to load the processed data into a data lake and how to use Delta Lake to keep the stored data consistent and build data warehouse-like features on top of the data lake. Next, you’ll learn how to manage and govern data using Databricks Unity Catalog. Finally, you’ll learn how to build production-ready workflows using Databricks Workflows.
NOTE: With today’s registration, you’ll be signed up for both sessions. Although you can attend either of the sessions individually, we recommend participating in both.
What you’ll learn and how you can apply it
- Set up and navigate the Databricks platform
- Use Apache Spark to build ETL pipelines by extracting, cleaning, and transforming the data
- Build data warehouse-like features on top of a data lake by reliably storing data in Delta format
- Use Databricks Unity Catalog to manage and govern data
- Build production-ready pipelines using Databricks Workflows
This live event is for you because...
- You’re a data engineer or data architect who wants to understand Databricks’ capabilities in data engineering and to integrate best practices into your workflow.
- You’re a data professional who wants to use Databricks in your projects to build data engineering solutions.
- You want to become a Databricks data engineer.
Prerequisites
- A free Azure trial subscription (preferably with a business/work email account) or a Databricks Community Edition account
- Basic understanding of cloud platforms
- (Optional) Basic Python and SQL knowledge
Recommended follow-up:
- Read Delta Lake: The Definitive Guide (book)
- Read Practical Lakehouse Architecture (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Day 1: Exploring, Cleaning, and Transforming Data with Databricks
Introduction to Databricks (20 minutes)
- Presentation: Overview of the Databricks platform and its features
- Group discussion: Benefits of using the Databricks platform for organizations
- Q&A
Understanding Apache Spark architecture (30 minutes)
- Presentation: How Apache Spark performs distributed data processing; Spark on Databricks
- Group discussion: How organizations are using Spark
- Q&A
- Break
Setting up and exploring Databricks (60 minutes)
- Demos: Setting up Databricks workspace; walkthrough of Databricks workspace; creating and using interactive clusters; using serverless compute; using notebooks; working with dbutils
- Hands-on exercises: Set up a Databricks workspace; configure a cluster; use dbutils to work with the file system
- Q&A
- Break
Reading data from multiple file formats (50 minutes)
- Presentation: Connecting to external storage; understanding Databricks File System (DBFS); understanding Spark DataFrames; working with file formats like CSV, Parquet, and JSON
- Hands-on exercises: Upload files to Databricks File System; read files using Spark's DataFrame API; infer a schema or apply one explicitly when reading files
- Q&A
- Break
Cleaning and transforming data using PySpark (55 minutes)
- Demos: Operations to clean and transform data; using Databricks Assistant for generating code
- Hands-on exercises: Apply clean-up operations (removing duplicates and nulls, filling missing values, and filtering records); apply transformations (selecting or renaming columns, creating derived columns, etc.)
- Q&A
Working with Spark SQL and visualizing data (25 minutes)
- Demos: Running SQL queries on DataFrames; creating visualizations
- Hands-on exercises: Create temporary SQL views on DataFrames; build reports; create charts; add charts to dashboards
- Q&A
Day 2: Storing Data with Delta Lake and Building Workflows with Databricks
Introduction to Delta Lake (30 minutes)
- Presentation: Challenges with data lakes; Delta format and transaction log; ACID guarantees on data lakes; Delta Lake competitors
- Group discussion: How Delta Lake can help build a data warehouse on a data lake
- Q&A
Storing data in a data lake using Delta format (40 minutes)
- Demos: Writing DataFrames in Delta format; checking transaction log; creating and managing Delta tables; audit history; table constraints
- Q&A
- Break
Working with Delta Lake features (55 minutes)
- Demos: Working with Delta Lake features
- Hands-on exercises: Perform CRUD operations; use schema enforcement and evolution, time travel, and optimization; compare performance with Parquet
- Q&A
- Break
Working with Unity Catalog (60 minutes)
- Presentation: Why Unity Catalog is needed; setting up a metastore and catalog; writing a table to a catalog and schema
- Hands-on exercises: Create a metastore; assign workspace to metastore; create catalog and schema; add a table to catalog
- Q&A
- Break
Building workflows with Databricks (55 minutes)
- Presentation: Understanding different types of pipelines; differences between all-purpose and job clusters; orchestrating tasks and sharing clusters
- Hands-on exercises: Create a job; connect tasks; create a job cluster; schedule the job
- Q&A
Your Instructor
Mohit Batra
Mohit Batra is a data engineer, a Microsoft Certified Trainer (MCT), and the founder of Crystal Talks, a training and consultancy company. He has more than 17 years of experience architecting large-scale Business Intelligence, Data Warehousing, and Big Data solutions for leading investment banks, as well as companies like Microsoft. Mohit has often shared his knowledge of Azure, Spark, BI, and Big Data at public forums and as a corporate trainer. In his free time, Mohit enjoys reading, photography, and music.