4 DIMENSION REDUCTION

In this chapter, we describe the important step of dimension reduction. The dimension of a dataset, namely the number of variables, must often be reduced for machine learning algorithms to operate efficiently. This step belongs to the pilot/prototype phase of supervised learning and is carried out before a model is deployed. We present and discuss several dimension reduction approaches: (1) incorporating domain knowledge to remove or combine categories, (2) using data summaries to detect information overlap between variables (and remove or combine redundant variables or categories), (3) using data conversion techniques, such as converting categorical variables into numerical variables, and (4) employing automated reduction techniques, such as principal component analysis (PCA), in which a new set of variables (each a weighted average of the original variables) is created. These new variables are uncorrelated, and a small subset of them usually contains most of their combined information; hence we can reduce dimension by using only a subset of the new variables. Finally, we mention supervised learning methods, such as regression models and classification and regression trees, that can be used for removing redundant variables and for combining "similar" categories of a categorical variable.
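The book carries out these techniques in JMP; as a programmatic illustration of the PCA idea only, the following is a minimal sketch using Python's scikit-learn on a small synthetic dataset (the data, sizes, and the 90% variance threshold are assumptions for illustration, not from the book). It shows how the principal components are uncorrelated weighted combinations of the original variables, and how a few of them capture most of the combined variance, so only that subset needs to be kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: 200 records, 5 numerical variables with heavy overlap
# (two pairs of near-duplicate columns plus one independent column).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),  # nearly a copy of the first column
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),  # nearly a copy of the third column
    rng.normal(size=200),
])

# PCA creates new, uncorrelated variables (principal components); each
# component is a weighted combination of the original variables.
pca = PCA()
scores = pca.fit_transform(X)

# The explained-variance ratios show that a small subset of components
# carries most of the combined information of the original variables.
print(pca.explained_variance_ratio_.round(3))

# Keep just enough components to retain (say) 90% of the total variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_var, 0.90) + 1)
reduced = scores[:, :n_keep]
print(f"Reduced from {X.shape[1]} variables to {n_keep} components")
```

In practice, the variables are usually standardized before applying PCA so that no single variable dominates the components simply because of its scale.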

Dimension reduction in JMP: All the methods discussed in this chapter are available in the standard version of JMP.

4.1 INTRODUCTION

In machine learning, one often ...
