Techniques for Data Science with Big Datasets
Effective data science at scale with a new generation of open tools
As Apache Spark matures and moves toward legacy status, data scientists face new opportunities and challenges in working with large-scale data without a single all-encompassing tool.
Join expert Adam Breindel to get hands-on practice with a suite of easier, more data-science-friendly tools that work for your team and let you build more future-proof systems. You’ll learn new tools for working with big datasets, like Dask and Ray; understand how to continue to make the best use of existing tools, like SparkSQL, Presto, and Kafka; and discover how to choose and combine the right tools for feature extraction, feature engineering, modeling, tuning, and other data science tasks.
What you'll learn, and how you can apply it
By the end of this live online course, you’ll understand:
- The world of Python-friendly large-scale data science tools including Dask and Ray
- Areas where existing tools like SparkSQL, Presto, and Kafka are still critical
- How to combine these tools and their data formats into sensible systems
And you’ll be able to:
- Take on large-scale data science work with Python tools, even if your data is in Hive or Spark tables
- Choose the right tools for feature extraction, feature engineering, modeling, and tuning
- Cut through the hype and explain the strengths and weaknesses of key open source projects
This training course is for you because...
- You’re a data scientist, team lead, manager, or architect in charge of data science projects at scale.
- You’ve relied on Apache Spark or home-built solutions in the past and would like to learn or migrate to newer best-of-breed tools.
- You’re already using a Python big data tool like Dask or Ray, and you’d like to fill in the gaps to cover more use cases.
Prerequisites
- A basic understanding of the data science lifecycle in enterprise settings
- Familiarity with the challenges of big data systems (useful but not required)
Recommended preparation
- Read the relevant sections of Kafka: The Definitive Guide, second edition, and Presto: The Definitive Guide to close any gaps in your knowledge
- Think about the challenges you’ve faced with data science at scale and be ready to discuss any questions you have in class
- Take Kafka Fundamentals (live online training course with Petter Graff)
- Watch Meet the Expert: Dean Wampler on Scaling ML/AI Applications with Ray (video, 57m)
About your instructor
Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at some of North America's largest banks; since then, he's worked with numerous startups, where he’s enjoyed getting to build things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Welcome to large-scale data science in 2020! What’s changed? (15 minutes)
- Presentation and group discussion: Data science in 2016—What to do with data? R, Python, and Spark; 2017—the broad rise of deep learning; 2018–2019—the decline of Hadoop/Spark for data science; 2020—new open tools and hybrid architectures
Interactive survey: Current architectures (10 minutes)
- Presentation and discussion: Dataset size; data storage; SQL and extraction; feature engineering and modeling
Changing definition of large-scale data (10 minutes)
- Presentation and discussion: The largest dataset for ML versus the largest dataset tractable on a single node; new definitions for small, medium, and “big” data
Obtaining data (feature extraction) (35 minutes)
- Presentation and discussion: Legacy tools (e.g., Hive); current tools (e.g., SparkSQL, Presto); future tools (e.g., Kartothek, BlazingSQL); easiest compromises for working in the real world; working without SQL?
- Hands-on exercise: Try SparkSQL
Feature engineering (40 minutes)
- Presentation and demos: The DataFrame model; from pandas to Dask; the array model; Dask Array
- Hands-on exercise: DataFrame and Array at scale
Break (10 minutes)
Modeling (55 minutes)
- Presentation and demos: Unsupervised learning and dimensionality reduction with Dask, Ray, and Horovod; classic ML with Dask, XGBoost, and GPU acceleration; deep learning with RaySGD, Ray RLlib, or Horovod; simulations and agent-based models with Ray
- Hands-on exercise: Modeling with Dask and Ray
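One of the techniques listed above, dimensionality reduction with Dask, can be sketched as an SVD-based, PCA-style projection on a chunked array; this is a simplification (centering is skipped, and the shapes are illustrative), assuming dask and NumPy are installed.

```python
# Sketch: dimensionality reduction on a chunked matrix with Dask.
# Assumes dask is installed; shapes are illustrative, centering is omitted.
import dask.array as da

# Tall-skinny matrix, chunked along the long axis
X = da.random.random((10_000, 20), chunks=(2_000, 20))

# Exact SVD for tall-skinny arrays runs chunk-by-chunk
u, s, vt = da.linalg.svd(X)

# Project onto the top 5 components (PCA-style, without mean-centering)
X_reduced = (u[:, :5] * s[:5]).compute()
print(X_reduced.shape)
```

The same pattern extends to matrices far larger than memory, since the SVD never materializes the full array on one machine.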
Break (10 minutes)
Model tuning (25 minutes)
- Presentation and demos: RayTune—easy access to state-of-the-art scale-out tuning; tuning options and lite AutoML with Dask
- Hands-on exercise: RayTune
Model scoring, online learning, and orchestration (30 minutes)
- Presentation and demos: Batch, request/response, and streaming scoring; cached scoring; real versus “near real” online learning; Airflow and Prefect for orchestration
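Of the scoring patterns above, cached scoring is the simplest to sketch using only the standard library; the scoring function below is a stand-in for real model inference.

```python
# Sketch: cached scoring for request/response serving.
# The scorer is a stand-in; any deterministic model inference fits this shape.
from functools import lru_cache

def score(user_id: int) -> float:
    # Imagine an expensive model inference here
    return user_id * 0.1

@lru_cache(maxsize=10_000)
def cached_score(user_id: int) -> float:
    # Identical requests within the cache window skip the model entirely
    return score(user_id)

cached_score(42)  # computed
cached_score(42)  # served from cache
print(cached_score.cache_info())
```

Real deployments typically replace `lru_cache` with a shared cache such as Redis, but the trade-off is the same: repeated requests skip inference at the cost of some staleness.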