
The Data Engineering Toolkit: From Notebook to Production

Published by Pearson

Content level: Beginner to intermediate

Take a data project from prototype to production quality with the modern data stack

  • Get a complete, high-level overview of today’s data engineering tools
  • Learn best practices for data architecture and design
  • Benefit whether you are new to data engineering or a practicing data engineer looking to keep up with current tools and methods

What is this training about, and why is it important?

A high-level introduction to all of data engineering in three hours. In this fast-paced survey course, you’ll learn how to take a data project from prototype to a production-quality data platform. We’ll explore techniques to scale up to big data sets and scale out to more features and larger teams, and we’ll discuss data architecture and design principles while exploring up-and-coming data tools.

What you’ll learn and how you can apply it

By the end of the live online course, you’ll understand:

  • Essential tools and design principles for the modern data stack.
  • How to improve performance by matching storage format to query type.
  • Strategies for data observability and quality.

And you’ll be able to:

  • Accelerate data warehouse development by writing reusable SQL with dbt.
  • Reduce time-to-market by running Jupyter notebooks in production with Papermill.
  • Build robust, reliable data science pipelines with Apache Airflow.

This live event is for you because...

  • Novice and experienced data and MLOps engineers will learn cutting-edge best practices in this rapidly changing field.
  • Back-end engineers transitioning to data engineering will gain critical skills to work effectively on data-intensive applications.
  • Data scientists, analysts, and analytics engineers will understand the data platform development process.
  • Chief Data Officers (CDOs) will gain the technical background necessary to oversee data projects.
  • DevOps engineers will be able to deploy and monitor data systems.

Prerequisites

  • Familiarity with software engineering and architecture in the cloud environment.
  • Comfort with Python, SQL, and Jupyter Notebooks is helpful but not required.
  • Some experience with back-end development, data science, or databases is required. This is not an introductory course for new developers.

Course Set-up

  • No setup is required. Links to explore the recommended open source data engineering tools will be included in the slide deck.


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Data Modeling (50 min)

  • Course Overview
  • Principles of Data Engineering: Review of design principles for architecting solutions on the modern data stack.
  • ETL or ELT? When ingesting data, store raw data early for later transformation. Use Meltano to extract data from cloud APIs and Benthos to load and hydrate data.
  • Data Lakes: Data lakes are the original source of truth and the input/output location for data pipelines. Choose a storage format that matches query types for improved performance.
  • Data Warehouses: Data cubes are dead. Long live the distributed data warehouse. Effective ways to use these workhorse data stores. Why all SQL is not the same.
  • dbt: Closer look at dbt, an exciting new tool for SQL-based data transformation that’s transforming data teams.
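
The ELT pattern in this segment — land raw data early, then transform it with SQL inside the warehouse — can be sketched with nothing but the Python standard library. This is an illustration of the idea only, not Meltano's or dbt's actual API; the table and field names are invented for the example.

```python
import json
import sqlite3

# E + L: land the raw API payload untouched, so it can be re-transformed later.
raw_payload = json.dumps([
    {"id": 1, "amount": "19.99", "status": "paid"},
    {"id": 2, "amount": "5.00", "status": "refunded"},
])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (payload TEXT)")
con.execute("INSERT INTO raw_orders VALUES (?)", (raw_payload,))

# T: transform inside the database with SQL, dbt-style — the raw table
# stays intact, and the cleaned model is derived from it.
con.execute("""
    CREATE TABLE orders AS
    SELECT
        json_extract(value, '$.id')                   AS order_id,
        CAST(json_extract(value, '$.amount') AS REAL) AS amount,
        json_extract(value, '$.status')               AS status
    FROM raw_orders, json_each(raw_orders.payload)
""")

paid_total = con.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
).fetchone()[0]
print(paid_total)
```

In practice, Meltano handles the extract/load step and dbt compiles Jinja-templated SQL models into statements like the CREATE TABLE above, with dependency management and testing on top.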

Q&A: 5 minutes

5-minute break

Segment 2: Data Flow (50 min)

  • Data Pipelines: Complex processing and transformation beyond what’s possible in a SQL-based data warehouse. Use Pandas APIs in Spark with Koalas. Use Papermill to run Jupyter Notebooks in production.
  • Dashboards and Discovery: Dashboards are the UI of data systems. Build custom dashboards with Apache Superset. Communicate data status and timeliness to end users.
  • Workflow Orchestration: How does anything happen? Workflow orchestration is a better approach than fragile and frustrating cron jobs.
  • Apache Airflow: A closer look at Apache Airflow, the leading orchestration engine.
  • Streaming: The many varieties of “real time” and the (limited) role of streaming systems such as Kafka.
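
At its core, what an orchestrator replaces cron with is running tasks in dependency order. A minimal sketch of that idea using the standard library's graphlib — this is not Airflow's API (Airflow declares dependencies on task objects instead), and the task names are made up:

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "load_raw": {"extract"},
    "transform": {"load_raw"},
    "dashboard": {"transform"},
    "quality_checks": {"transform"},
}

# An orchestrator's core job: schedule every task only after its
# upstream tasks have finished.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

An orchestrator like Airflow adds the parts cron can't give you: retries, backfills, alerting, and visibility into exactly which task in the graph failed.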

Q&A: 5 minutes

5-minute break

Segment 3: Data in Production (50 min)

  • Notebooks in Production: Use Papermill to run a Jupyter Notebook like a command-line script and nbdev to create libraries.
  • Data Quality: Bad data is toxic and invisible. Maintain data quality with Great Expectations and deequ.
  • Testing: How to test software when there are no right answers. Write better tests with Coverage. Test DataFrames with pandera.
  • Debugging and Performance Tuning: Long iterations and lack of reproducibility make debugging hard. Data lineage can help. Be a scientist when tuning performance.
  • Operations: Achieve reproducibility with data CI/CD. Practice DevOps, not ClickOps!
  • Machine Learning Operations and Data Engineering: More similar than you might think. Monitor models with Evidently AI and track experiments with MLflow.
  • The Future of Data: Emerging trends in data systems, including closing the loop with Reverse ETL.
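
Tools like Great Expectations and deequ express data-quality rules declaratively. As a plain-Python sketch of the kind of checks involved (the column names, thresholds, and helper functions here are invented for illustration; the "cf." references name real Great Expectations expectations):

```python
# A batch of records as it might arrive from an upstream system.
rows = [
    {"user_id": 1, "email": "a@example.com", "age": 34},
    {"user_id": 2, "email": "b@example.com", "age": 29},
    {"user_id": 3, "email": None,            "age": 41},
]

def check_not_null(rows, column):
    """No NULLs in the column (cf. expect_column_values_to_not_be_null)."""
    return all(r[column] is not None for r in rows)

def check_unique(rows, column):
    """Column values are unique (cf. expect_column_values_to_be_unique)."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_between(rows, column, lo, hi):
    """Values fall in a plausible range — catches silently bad data."""
    return all(lo <= r[column] <= hi for r in rows)

results = {
    "user_id is unique": check_unique(rows, "user_id"),
    "email has no nulls": check_not_null(rows, "email"),
    "age between 0 and 120": check_between(rows, "age", 0, 120),
}
print(results)
```

The payoff of the declarative tools is that failed checks surface as reports and pipeline failures instead of silently toxic data downstream.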

Q&A: 10 minutes

Your Instructor

  • Pete Fein

    Pete Fein, principal at Snakedev, is an interdisciplinary consultant with 12 years' experience solving hard problems for technical clients. He's a solutions architect and subject matter expert in data engineering who has been programming in Python for over 20 years. Pete is supported by an extensive network of collaborators in the Python and data communities. He is currently writing a book, Principles of Data Engineering, for Pearson (due out in 2023) and teaches workshops for private clients and on the O'Reilly platform. More info at snake.dev.


Skill covered

Data Engineering