Data Algorithms with Spark

by Mahmoud Parsian
April 2022
Content level: Intermediate to advanced
435 pages
9h 44m
English
O'Reilly Media, Inc.

Chapter 12. Feature Engineering in PySpark

This chapter covers design patterns for feature engineering: working with the features of your data (any measurable attributes, from car prices to gene values, hemoglobin counts, or education levels) when building machine learning models. The three processes involved (extracting, transforming, and selecting features) are essential to building effective models. Feature engineering is one of the most important topics in machine learning, because the success or failure of a model at predicting the future depends mainly on the features you choose.

Spark provides a comprehensive machine learning API for many well-known algorithms, including linear regression, logistic regression, and decision trees. This chapter presents fundamental tools and techniques in PySpark that you can use to build all sorts of machine learning pipelines, with examples using the PySpark API. These skills will be useful to an aspiring data scientist or data engineer. My goal is not to familiarize you with famous machine learning algorithms such as linear regression, principal component analysis, or support vector machines, since these are already covered in many books, but to equip you with tools (normalization, standardization, string indexing, etc.) that you can use in cleaning data and building models for a wide range of machine learning algorithms. ...
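
To make the three tools named above concrete, here is a minimal sketch in plain Python of what normalization, standardization, and string indexing compute. It is not the chapter's own code and does not use Spark: the function names (`min_max_normalize`, `standardize`, `string_index`) and the sample data are hypothetical, chosen only for illustration. The sketch assumes min-max scaling to [0, 1], the sample standard deviation, and index assignment by descending label frequency. In PySpark, the corresponding transformers live in `pyspark.ml.feature` as `MinMaxScaler`, `StandardScaler`, and `StringIndexer`.

```python
from collections import Counter
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def string_index(labels):
    """Map each distinct label to an integer, most frequent label first.

    Ties in frequency are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: i for i, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


# Hypothetical sample data: a numeric feature and a categorical feature.
prices = [10.0, 20.0, 30.0, 40.0]
education = ["BS", "MS", "BS", "PhD", "BS", "MS"]

normalized = min_max_normalize(prices)
standardized = standardize(prices)
indexed, mapping = string_index(education)

print(normalized)
print(standardized)
print(indexed, mapping)
```

Normalization and standardization put numeric features on comparable scales, while string indexing turns categorical labels into the numeric codes that most algorithms require. Note that `standardize` as written fails on a constant column, because the standard deviation is zero; a production pipeline would need to handle that case.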

Publisher Resources

ISBN: 9781492082378