Section 2: Data Science
Once we have clean data in a data lake, we can start performing data science and machine learning on the historical data. This section explains why scalable machine learning matters and when it is needed. The chapters in this section show how to perform exploratory data analysis, feature engineering, and machine learning model training in a scalable, distributed fashion using PySpark. The section also introduces MLflow, an open source machine learning life cycle management tool for tracking machine learning experiments and productionizing machine learning models, along with some techniques for scaling out single-machine machine learning libraries based on standard Python. ...