Chapter 6. Algorithm Chains and Pipelines
For many machine learning algorithms, the particular representation of
the data that you provide is very important, as we discussed in Chapter 4. This starts with scaling the data and combining features by hand and
goes all the way to learning features using unsupervised machine
learning, as we saw in Chapter 3. Consequently, most machine learning
applications require not only the application of a single algorithm, but
the chaining together of many different processing steps and machine
learning models. In this chapter, we will cover how to use the Pipeline
class to simplify the process of building chains of transformations and
models. In particular, we will see how we can combine Pipeline
and
GridSearchCV
to search over parameters for all processing steps at
once.
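As a preview of what such a combination looks like, here is a minimal sketch: a two-step pipeline of scaling and an SVM, searched with GridSearchCV. The step names and the particular grid values are illustrative assumptions, not the book's definitive example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# chain scaling and the SVM into a single estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# parameters of a pipeline step are addressed as "stepname__parameter";
# these grid values are illustrative
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
```

Because the scaler is inside the pipeline, each cross-validation split refits it on that split's training fold only, which avoids leaking information from the validation fold.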
As an example of the importance of chaining models, recall that we can greatly improve the performance of a kernel SVM on the cancer dataset by using the MinMaxScaler for preprocessing. Here's code for splitting the data, computing the minimum and maximum, scaling the data, and training the SVM:
In[1]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# compute minimum and maximum on the training ...