Machine Learning Pocket Reference

by Matt Harrison
August 2019
Intermediate to advanced
318 pages
4h 40m
English
O'Reilly Media, Inc.

Chapter 7. Preprocess Data

This chapter will explore common preprocessing steps using this data:

>>> X2 = pd.DataFrame(
...     {
...         "a": range(5),
...         "b": [-100, -50, 0, 200, 1000],
...     }
... )
>>> X2
   a     b
0  0  -100
1  1   -50
2  2     0
3  3   200
4  4  1000

Standardize

Some algorithms, such as SVM, perform better when the data is standardized, meaning each column is rescaled to have a mean of 0 and a standard deviation of 1. Sklearn's StandardScaler provides a .fit_transform method that combines .fit and .transform:

>>> from sklearn import preprocessing
>>> std = preprocessing.StandardScaler()
>>> std.fit_transform(X2)
array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])
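
Since .fit_transform combines .fit and .transform, the two-step form can be sketched as follows. This is a minimal illustration that assumes the same X2 frame; X_new is a hypothetical later sample, not data from the book.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Two-step form: learn the column statistics, then apply them.
std = preprocessing.StandardScaler()
std.fit(X2)
two_step = std.transform(X2)

# One-step form on a fresh scaler; the result should match.
one_step = preprocessing.StandardScaler().fit_transform(X2)
same = np.allclose(two_step, one_step)

# Hypothetical new sample (an assumption for illustration): reuse the
# already-fitted scaler rather than refitting on the new data.
X_new = pd.DataFrame({"a": [2], "b": [210]})
new_scaled = std.transform(X_new)
```

Keeping .fit separate is useful when the scaler is fit on training data and then applied unchanged to data seen later.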

After fitting, there are various attributes we can inspect:

>>> std.scale_
array([  1.41421356, 407.92156109])
>>> std.mean_
array([  2., 210.])
>>> std.var_
array([2.000e+00, 1.664e+05])
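
These attributes are related in a simple way. The sketch below checks this, assuming the scaler was fit on X2 with default settings: scale_ is the square root of var_, and the transform is (X - mean_) / scale_. It also uses .inverse_transform to recover the original values.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})
std = preprocessing.StandardScaler()
scaled = std.fit_transform(X2)

# scale_ is the population standard deviation, i.e. sqrt(var_).
scale_matches = np.allclose(std.scale_, np.sqrt(std.var_))

# Reproduce the transform by hand from the fitted attributes.
manual = (X2.to_numpy() - std.mean_) / std.scale_
manual_matches = np.allclose(manual, scaled)

# Undo the scaling to get back the original values.
restored = std.inverse_transform(scaled)
restored_matches = np.allclose(restored, X2.to_numpy())
```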

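Here is a pandas version. Note that the pandas .std method defaults to the sample standard deviation (ddof=1), whereas StandardScaler uses the population standard deviation (ddof=0), so these values differ slightly from the sklearn output above. Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you later use for prediction must be standardized with those same values: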

>>> X_std = (X2 - X2.mean()) / X2.std()
>>> X_std
          a         b
0 -1.264911 -0.679720
1 -0.632456 -0.570088
2  0.000000 -0.460455
3  0.632456 -0.021926
4  1.264911  1.732190

>>> X_std.mean()
a    4.440892e-17
b    0.000000e+00
dtype: float64

>>> X_std.std()
a    1.0
b    1.0
dtype: float64

The fastai library also implements this:

>>> X3 = X2.copy()
>>> from fastai.structured ...

ISBN: 9781492047537