Chapter 6. Algorithm Chains and Pipelines
For many machine learning algorithms, the particular representation of
the data that you provide is very important, as we discussed in Chapter 4. This starts with scaling the data and combining features by hand and
goes all the way to learning features using unsupervised machine
learning, as we saw in Chapter 3. Consequently, most machine learning
applications require not only the application of a single algorithm, but
the chaining together of many different processing steps and machine
learning models. In this chapter, we will cover how to use the Pipeline
class to simplify the process of building chains of transformations and
models. In particular, we will see how we can combine Pipeline and
GridSearchCV to search over parameters for all processing steps at
once.
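As a preview of where this chapter is headed, here is a minimal sketch of combining Pipeline and GridSearchCV on the cancer dataset. The step names ("scaler", "svm") and the particular parameter grid are illustrative choices, not prescribed by the text; note the `<step name>__<parameter name>` convention for addressing parameters inside a pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# chain scaling and an SVM into a single estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# pipeline parameters are addressed as <step name>__<parameter name>;
# this grid is an illustrative choice
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1, 10]}

# the grid search refits the scaler inside each cross-validation split,
# so no information leaks from the validation folds into preprocessing
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))
```

Because the scaler is a pipeline step, the cross-validation inside the grid search automatically rescales each training fold separately, which is the main payoff of combining the two classes.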
As an example of the importance of chaining models, recall that we
can greatly improve the performance of a kernel SVM on the cancer
dataset by using the MinMaxScaler for preprocessing. Here’s code for
splitting the data, computing the minimum and maximum, scaling the data, and
training the SVM:
In[1]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# compute minimum and maximum on the training data
scaler = MinMaxScaler().fit(X_train)

# rescale the training data
X_train_scaled = scaler.transform(X_train)

# learn an SVM on the scaled training data
svm = SVC()
svm.fit(X_train_scaled, y_train)

# scale the test data with the same scaler and score the SVM
X_test_scaled = scaler.transform(X_test)
print("Test score: {:.2f}".format(svm.score(X_test_scaled, y_test)))
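The same four steps can be expressed much more compactly once the scaler and the SVM are chained into a single estimator. The following is a minimal sketch of that pattern; `make_pipeline` is a convenience function that names each step automatically, and the choice of it over the Pipeline class here is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# one estimator that scales, then fits the SVM
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X_train, y_train)

# scoring transforms the test data with the fitted scaler automatically
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))
```

Calling `fit` on the pipeline fits the scaler on the training data and transforms it before the SVM ever sees it, so there is no way to accidentally fit the scaler on the test set.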