CI/CD for Data Lakes

Published by O'Reilly Media, Inc.

Content level: Intermediate to advanced

Managing your data like code

This live event utilizes Jupyter Notebook technology

Today, data lakes offer many advantages for cloud users. Anyone who needs scalable storage can build a data lake on object stores such as Amazon S3, Azure Blob Storage, MinIO, and others: they're cost-effective, relatively easy to use, and offer high throughput and a rich application ecosystem. Yet data-intensive systems that combine open source software with cloud native services also bring challenges. As data practitioners, we find it tough to experiment with, compare, and reproduce data-intensive workloads. Copying large-scale data for experimentation gets pricey. On top of the expense is the difficulty of enforcing best practices like schema validation, since schemas can change on the fly when ingesting data from outside sources. Lastly, it's hard to ensure high-quality data.

To start working on solutions to these problems, it's necessary to acknowledge that our systems are made up of both data and code. We already have tools such as Git and CI/CD to manage code, so why not apply the same logic to data? With open source tools like lakeFS, it's possible to manage data at scale using Git-like capabilities.
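To make this concrete, here's a minimal sketch of what a Git-like workflow over a data lake can look like. It's written against the high-level lakeFS Python SDK (the lakefs package) and assumes a running lakeFS installation with credentials already configured via lakectl or environment variables; the repository name, branch name, and object path are hypothetical.

    import lakefs

    # Assumes a running lakeFS installation and configured credentials;
    # "example-repo" and the paths below are hypothetical.
    repo = lakefs.repository("example-repo")

    # Branch the lake: an isolated, zero-copy view of production data.
    experiment = repo.branch("experiment").create(source_reference="main")

    # Write into the branch without touching main.
    experiment.object("datasets/events.parquet").upload(data=b"...")

    # Commit the change with a message and metadata, just like code.
    experiment.commit(message="Add events dataset", metadata={"job": "ingest"})

    # Once validation passes, merge the branch back into main.
    experiment.merge_into(repo.branch("main"))

Because branches are metadata only, the experiment doesn't require a second copy of the data, which is what makes experimentation on large lakes affordable.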

Join expert Adi Polak to discover how CI/CD principles can be applied to data and learn how to better manage your data lake with open formats and open source tools. You'll explore the challenges of managing data alongside code, how to take full advantage of modern object stores, the capabilities and internals of lakeFS, and how a Git-like workflow improves everyday work with data. By the end of the training, you'll be able to make better decisions when architecting your data-intensive systems, anticipate future data flow challenges, and build processes to solve them.

What you’ll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to take full advantage of modern object stores like S3, ADLS, and MinIO, and understand their pros and cons
  • How to apply a Git-like workflow for your data
  • LakeFS capabilities and internals
  • How to run lakeFS locally with the CLI, UI, and from code (a minimal sketch follows this list)
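As a taste of the "from code" piece, here's a minimal sketch of connecting to a locally running lakeFS instance from Python, assuming the high-level lakefs SDK; the endpoint, credentials, and repository name are placeholders to adjust for your installation.

    import lakefs
    from lakefs.client import Client

    # Placeholder endpoint and credentials for a local lakeFS server;
    # substitute the access key and secret your installation generated.
    clt = Client(
        host="http://localhost:8000",
        username="<ACCESS_KEY_ID>",
        password="<SECRET_ACCESS_KEY>",
    )

    # "quickstart" is hypothetical; use a repository that exists locally.
    repo = lakefs.Repository("quickstart", client=clt)
    for branch in repo.branches():
        print(branch.id)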

And you’ll be able to:

  • Make better decisions when architecting your data-intensive system
  • Use the tools presented here in your local environment
  • Anticipate future complex data flow challenges and build processes to solve them
  • Identify and recover from production data errors faster
  • Publish business-critical datasets confidently
  • Drive faster development cycles for data pipelines

This live event is for you because...

  • You’re a data architect, data engineer, or machine learning engineer.
  • You work with data lakes, data-intensive apps, data pipelines, and object storage.
  • You want to become better at architecting your system and overcoming data-intensive challenges.
  • You need better tools for working with your data and improving engineering processes.

Prerequisites

  • Basic understanding of data pipelines/flow
  • Familiarity with Git
  • Familiarity with production and testing environments

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Data lakes and continuous principles (60 minutes)

  • Group discussion: assessment of audience experience; the challenges we're facing; the industry's adoption of CI/CD and Git for application lifecycle management
  • Presentation: today's cloud offerings for data lakes; applying continuous application lifecycle management principles to data lakes
  • Jupyter notebook: Simple Data Pipeline over a Data Lake with Spark (a minimal sketch follows this session outline)
  • Q&A
  • Break
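For orientation, here's a minimal sketch of the kind of pipeline this notebook builds: read raw data from object storage with Spark, transform it, and write the result back as Parquet. The bucket and paths are hypothetical, and S3A credentials are assumed to be configured already.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("simple-lake-pipeline").getOrCreate()

    # Hypothetical bucket and prefix; assumes fs.s3a.* settings are in place.
    raw = spark.read.json("s3a://example-bucket/raw/events/")

    # A simple transformation: keep valid events, count them per user per day.
    daily = (
        raw.filter(F.col("event_type").isNotNull())
           .groupBy("user_id", F.to_date("timestamp").alias("day"))
           .count()
    )

    # Write back to the lake in an analytics-friendly columnar format.
    daily.write.mode("overwrite").partitionBy("day").parquet(
        "s3a://example-bucket/curated/daily_counts/"
    )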

What is lakeFS? (60 minutes)

  • Presentation: lakeFS CLI; lakeFS internals
  • Jupyter notebooks: Work with lakeFS CLI; Use lakeFS with a Complex Data Flow
  • Q&A
  • Break

How to apply CI/CD to data (60 minutes)

  • Presentation: continuous data integration; continuous data deployment; Git-like capabilities on data
  • Jupyter notebook: Work with lakeFS UI
  • Q&A

Your Instructor

  • Adi Polak

    Adi Polak is an official Databricks ambassador, author of the book Scaling Machine Learning with Spark, and a respected speaker worldwide. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques. As an engineer, she brought her hands-on ML experience into various Fortune 500 companies' products and services by building on cutting-edge and emerging technologies.


Skill covered

Data Lake