Data Engineering for Data Scientists
Build resilient pipelines to support stronger models
Topic: Data

Organizations, big and small, are making significant investments in data science. New data hires are told to model anything and everything so that the organization might find a competitive edge. The problem is that few companies are investing enough in infrastructure or hiring enough data engineers to support those modeling efforts. As a result, data scientists arrive on the scene and quickly burn out: they can’t build the models they want to build because the pipelines don’t exist.
In this course, Max Humber will teach you how to build resilient pipelines with industry-leading tools. Specifically, this course introduces Airflow (the open source standard for automating data pipeline workflows), Python Fire (a library for automatically generating command-line interfaces), and pandas and scikit-learn (for data wrangling and modeling). All so that you can get back to actual data science!
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- How to build data pipelines
- How to monitor the performance of your models
- How to validate data before passing it to your models
And you’ll be able to:
- Productionize the outputs of your data models
- Author and execute Airflow jobs
- Write Airflow-compatible SQL and Python scripts
This training course is for you because...
- You are a new Data Engineer or a Data Scientist on a small team
- You work with machine learning models
- You want your models to be supported by industry-leading tools
Prerequisites
- Experience with pandas and scikit-learn, and at least some experience with SQL databases.
- Optionally, it may be helpful to have ownership over models running in production.
Recommended preparation:
- Install Airflow on your local machine before the course begins.
Recommended follow-up:
- Read Architecting Modern Data Platforms (book)
About your instructor
Max Humber is a distinguished faculty member at General Assembly and the author of Personal Finance with Python. Previously, he was the first data scientist at Borrowell and the second data engineer at Wealthsimple.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction (5 minutes)
- Who am I and who are you?
- Poll: Machine learning models in production? Ownership? # of DS/DEs on your team?
- Introduce the “Data Hierarchy of Needs”
- Learning agenda
Model Extending (50 minutes)
- Migrate code from Jupyter notebooks to Python scripts
- Exercise: Make models “command-line compatible” with Python Fire
- Protect against invalid data with DataFrameMapper
- Solve the “Hamburger Emoji” Problem
- Add model performance logging with Rollbar
- Q&A
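As a taste of the Python Fire exercise above, here is a minimal sketch of making a model script “command-line compatible.” The function name, file names, and placeholder body are illustrative assumptions, not the course materials:

```python
def predict(input_path, output_path="predictions.csv"):
    """Score the rows in input_path and write predictions to output_path."""
    # A real implementation would load a trained model and the data here;
    # this placeholder just reports what it would do.
    return f"Scored {input_path} -> {output_path}"

if __name__ == "__main__":
    import fire  # pip install fire
    # Fire turns the function signature into CLI flags, e.g.:
    #   python predict.py data.csv --output_path out.csv
    fire.Fire(predict)
```

Because Fire generates the interface from the function signature, moving a notebook function into a script like this is often the entire migration step.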
- Break (5 minutes)
Model Saving (20 minutes)
- Move away from flat csv files
- Query SQL data with pandas
- Introduce python-dotenv for managing secrets
- Exercise: Write model results to SQL using pandas
- Q&A
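The exercise above can be sketched with pandas’ SQL helpers. In the course a real database and python-dotenv are used; this self-contained sketch substitutes SQLite, and the environment variable name and table name are made up:

```python
import os
import sqlite3
import pandas as pd

# With python-dotenv you would call load_dotenv() first so the connection
# details come from a local .env file; RESULTS_DB is a hypothetical name.
db_path = os.environ.get("RESULTS_DB", ":memory:")
conn = sqlite3.connect(db_path)

# Write model results to SQL with pandas...
results = pd.DataFrame({"id": [1, 2], "score": [0.91, 0.34]})
results.to_sql("model_results", conn, if_exists="append", index=False)

# ...and query them back the same way.
df = pd.read_sql("SELECT * FROM model_results", conn)
```

Keeping credentials in a .env file (loaded via python-dotenv) rather than in the script is what lets the same code run locally and in production.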
Model Scheduling (40 minutes)
- Configure Airflow
- Author and execute Airflow jobs
- Exercise: Move SQL and Python modeling scripts over to an Airflow job
- Monitor the schedule and job performance
- Q&A
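To give a sense of what moving scripts into an Airflow job looks like, here is a minimal DAG sketch. It assumes Airflow 2.x import paths; the DAG id, task names, and script names are placeholders, not the course’s actual pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily pipeline: run the SQL extract script, then the modeling script.
with DAG(
    dag_id="model_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    train = BashOperator(task_id="train", bash_command="python train.py")

    # The >> operator declares the dependency: extract must succeed first.
    extract >> train
```

Once the DAG file is in Airflow’s dags folder, the scheduler picks it up and the web UI shows per-task run history, which is where the schedule and job-performance monitoring in this section happens.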