Chapter 12. Feature Engineering in PySpark

This chapter covers design patterns for working with features of data, the measurable attributes of your data (anything from car prices and education levels to gene values and hemoglobin counts), when building machine learning models. Extracting, transforming, and selecting features, collectively known as feature engineering, are essential steps in building effective machine learning models. Feature engineering is one of the most important topics in machine learning, because the success or failure of a model at predicting the future depends largely on the features you choose.

Spark provides a comprehensive machine learning API for many well-known algorithms, including linear regression, logistic regression, and decision trees. The goal of this chapter is to present the fundamental tools and techniques in PySpark that you can use to build all sorts of machine learning pipelines; it introduces Spark's machine learning utilities with examples written against the PySpark API. The skills you learn here will be useful to any aspiring data scientist or data engineer. My goal is not to familiarize you with well-known machine learning algorithms such as linear regression, principal component analysis, or support vector machines, since these are already covered in many books, but to equip you with the tools (normalization, standardization, string indexing, and so on) that you can use to clean data and build models for a wide range of machine learning algorithms. ...
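
To give a flavor of the kinds of transformations this chapter covers, here is a minimal sketch that applies string indexing and standardization with PySpark's built-in feature transformers (StringIndexer, VectorAssembler, and StandardScaler). The DataFrame, its column names, and its values are made up for illustration and are not taken from the book's data sets.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("feature-engineering-sketch").getOrCreate()

# Hypothetical car-sales data: one categorical and two numeric features.
df = spark.createDataFrame(
    [("high_school", 21000.0, 35000.0),
     ("bachelors",   34000.0, 12000.0),
     ("masters",     45000.0,  8000.0)],
    ["education", "price", "mileage"])

# String indexing: map the categorical column to numeric label indices.
indexer = StringIndexer(inputCol="education", outputCol="education_index")
indexed = indexer.fit(df).transform(df)

# Assemble the numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=["price", "mileage"], outputCol="features")
assembled = assembler.transform(indexed)

# Standardization: rescale each feature to zero mean and unit variance.
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)
scaled.select("education_index", "scaled_features").show(truncate=False)

Each transformer follows the same estimator/transformer pattern (fit() where a model must learn statistics, then transform()), which is what lets these steps be chained into the machine learning pipelines discussed later in the chapter.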
