October 2021
Beginner to intermediate
322 pages
7h 27m
English
Once we have clean data in a data lake, we can start performing data science and machine learning on the historical data. This section explains why scalable machine learning is needed. Its chapters show how to perform exploratory data analysis, feature engineering, and model training in a scalable, distributed fashion using PySpark. The section also introduces MLflow, an open source tool for managing the machine learning life cycle, which is useful for tracking experiments and productionizing models, along with some techniques for scaling out single-machine machine learning libraries based on standard Python. ...