Live online training

Python and Dask: Scaling the Dataframe

Big Data in Python with Dask

Topic: Data
Daniel Gerlanc

Python's most popular data science libraries (pandas, NumPy, and scikit-learn) were designed to run on a single computer and, in some cases, on a single processor. Whether that computer is a laptop or a server with 96 cores, your compute and memory are limited by the size of the biggest computer you have access to.

In this course, you'll learn how to use Dask, a Python library for parallel and distributed computing, to bypass this constraint by scaling your compute and memory across multiple cores and machines. Dask integrates with Python libraries like pandas, NumPy, and scikit-learn, so you can scale your computations without having to learn completely new libraries or significantly refactor your code.

What you'll learn and how you can apply it

  • Understand the options for installing and running Dask on your laptop or a cluster
  • Know when to use the different APIs provided by Dask
  • Accelerate data science and engineering workflows using Dask

This training course is for you because...

  • You have an intermediate level of experience with Python programming
  • You are a data scientist and want to be able to perform feature engineering and model fitting using multiple cores and on larger-than-memory datasets
  • You are a data engineer or software developer and need a modern, Python-based task scheduler and distributed computing framework

Prerequisites

  • Intermediate-level programming ability in Python. Attendees should know the difference between a dict, list, and tuple. Familiarity with control flow (if/else/for/while) and error handling (try/except) is required.
  • Experience working with data frames, either in Python with pandas or in another language or framework (e.g., R, PySpark)

Course Set-up

  • Step-by-step instructions for setting up a working Python environment using Anaconda are available here. You will need a working environment to complete the exercises in Jupyter notebooks. Alternatively, you may view the notebooks here.

Recommended Preparation

Recommended Follow-up

About your instructor

  • Daniel Gerlanc is the Founder and President of EnPlus Advisors, a consultancy specializing in data science and custom software development. He started EnPlus in 2011 after working as a hedge fund quant for 5 years. At EnPlus, he focuses on projects that require expertise in both data analysis and software engineering. He has coauthored several open source R packages, published in peer-reviewed journals, and been an invited speaker at conferences including ODSC and PGConf. He is a graduate of Williams College.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment: Intro to Dask (40 min)

  • Training Overview
  • 10 Minutes to Dask
  • Why Dask?
  • Dask APIs

Segment 1 Exercises (15 min)

Instructor demonstrates solving exercise (10 min)

Q&A (5 min)

Break (10 min)

Segment: Dask Schedulers (40 min)

  • TL;DR: use distributed except for testing and debugging
  • Distributed
  • Single Machine

Segment Exercises (15 min)

Instructor demonstrates solving exercise (10 min)

Q&A (5 min)

Break (10 min)

Segment: Profiling Dask (40 min)

  • Dask Distributed Dashboard
  • Visualizing Task Graphs
  • Debugging Dask

Segment Exercises (15 min)

Instructor demonstrates solving exercise (10 min)