CI/CD for Data Lakes

Published by O'Reilly Media, Inc.

Content level: Intermediate to advanced

Managing your data like code

This live event utilizes Jupyter Notebook technology

Today, data lakes offer many advantages for cloud users. Anyone who needs scalable storage can build a data lake on object stores such as Amazon S3, Azure Blob Storage, MinIO, and others: they're cost-effective, relatively easy to use, and offer high throughput and a rich application ecosystem. Yet data-intensive systems that combine open source software with cloud native services also bring challenges. As data practitioners, we find it tough to experiment with, compare, and reproduce data-intensive workloads. Copying large-scale data for experimentation gets pricey. On top of the expense is the difficulty of enforcing best practices like schema validation, since schemas can change on the fly when ingesting data from outside sources. Lastly, it's hard to ensure high-quality data.

To start working on solutions to these problems, it's necessary to acknowledge that our systems are made up of both data and code. We already have tools such as Git and CI/CD to manage code, so why not apply the same logic to data? With open source tools like lakeFS, it's possible to manage data at scale using Git-like capabilities.
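To make this concrete, here's a minimal sketch of what a Git-like workflow over a data lake can look like. It's written against the high-level lakeFS Python SDK (the lakefs package) and assumes a running lakeFS installation with credentials already configured via lakectl or environment variables; the repository name, branch name, and object path are hypothetical.

    import lakefs

    # Assumes a running lakeFS installation and configured credentials;
    # "example-repo" and the paths below are hypothetical.
    repo = lakefs.repository("example-repo")

    # Branch the lake: an isolated, zero-copy view of production data.
    experiment = repo.branch("experiment").create(source_reference="main")

    # Write into the branch without touching main.
    experiment.object("datasets/events.parquet").upload(data=b"...")

    # Commit the change with a message and metadata, just like code.
    experiment.commit(message="Add events dataset", metadata={"job": "ingest"})

    # Once validation passes, merge the branch back into main.
    experiment.merge_into(repo.branch("main"))

Because branches are metadata only, the experiment doesn't require a second copy of the data, which is what makes experimentation on large lakes affordable.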

Join expert Adi Polak to discover how CI/CD principles can be applied to data and learn how to better manage your data lake with open formats and open source tools. You'll explore the challenges of managing data alongside code, how to take full advantage of modern object stores, the capabilities and internals of lakeFS, and how a Git-like workflow improves everyday work with data. By the end of the training, you'll be able to make better decisions when architecting your data-intensive systems, anticipate future data flow challenges, and build processes to solve them.

What you’ll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to take full advantage of modern object stores like S3, ADLS, and MinIO, and understand their pros and cons
  • How to apply a Git-like workflow for your data
  • LakeFS capabilities and internals
  • How to run lakeFS locally with the CLI, UI, and from code (a minimal sketch follows this list)
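As a taste of the "from code" piece, here's a minimal sketch of connecting to a locally running lakeFS instance from Python, assuming the high-level lakefs SDK; the endpoint, credentials, and repository name are placeholders to adjust for your installation.

    import lakefs
    from lakefs.client import Client

    # Placeholder endpoint and credentials for a local lakeFS server;
    # substitute the access key and secret your installation generated.
    clt = Client(
        host="http://localhost:8000",
        username="<ACCESS_KEY_ID>",
        password="<SECRET_ACCESS_KEY>",
    )

    # "quickstart" is hypothetical; use a repository that exists locally.
    repo = lakefs.Repository("quickstart", client=clt)
    for branch in repo.branches():
        print(branch.id)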

And you’ll be able to:

  • Make better decisions when architecting your data-intensive system
  • Use the tools presented here in your local environment
  • Anticipate future complex data flow challenges and build processes to solve them
  • Identify and recover from production data errors faster
  • Publish business-critical datasets confidently
  • Drive faster development cycles for data pipelines

This live event is for you because...

  • You’re a data architect, data engineer, or machine learning engineer.
  • You work with data lakes, data-intensive apps, data pipelines, and object storage.
  • You want to become better at architecting your system and overcoming data-intensive challenges.
  • You need better tools for working with your data and improving engineering processes.

Prerequisites

  • Basic understanding of data pipelines/flow
  • Familiarity with Git
  • Familiarity with production and testing environments

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Data lakes and continuous principles (60 minutes)

  • Group discussion: assessment of audience experience; the challenges we're facing; the industry's adoption of CI/CD and Git for application lifecycle management
  • Presentation: today's cloud offerings for data lakes; applying continuous application lifecycle management principles to data lakes
  • Jupyter notebook: Simple Data Pipeline over a Data Lake with Spark (a minimal sketch follows this session outline)
  • Q&A
  • Break
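For orientation, here's a minimal sketch of the kind of pipeline this notebook builds: read raw data from object storage with Spark, transform it, and write the result back as Parquet. The bucket and paths are hypothetical, and S3A credentials are assumed to be configured already.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("simple-lake-pipeline").getOrCreate()

    # Hypothetical bucket and prefix; assumes fs.s3a.* settings are in place.
    raw = spark.read.json("s3a://example-bucket/raw/events/")

    # A simple transformation: keep valid events, count them per user per day.
    daily = (
        raw.filter(F.col("event_type").isNotNull())
           .groupBy("user_id", F.to_date("timestamp").alias("day"))
           .count()
    )

    # Write back to the lake in an analytics-friendly columnar format.
    daily.write.mode("overwrite").partitionBy("day").parquet(
        "s3a://example-bucket/curated/daily_counts/"
    )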

What is lakeFS? (60 minutes)

  • Presentation: lakeFS CLI; lakeFS internals
  • Jupyter notebooks: Work with lakeFS CLI; Use lakeFS with a Complex Data Flow
  • Q&A
  • Break

How to apply CI/CD to data (60 minutes)

  • Presentation: continuous data integration; continuous data deployment; Git-like capabilities on data
  • Jupyter notebook: Work with lakeFS UI
  • Q&A

Your Instructor

  • Adi Polak

    Adi Polak is an official Databricks ambassador, author of the book Scaling Machine Learning with Spark, and a respected speaker worldwide. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques. As an engineer, she brought her hands-on ML experience into various Fortune 500 companies' products and services by building on cutting-edge and emerging technologies.


Skill covered

Data Lake