Chapter 7. Preprocess Data
This chapter will explore common preprocessing steps using this data:
>>> X2 = pd.DataFrame(
...     {
...         "a": range(5),
...         "b": [-100, -50, 0, 200, 1000],
...     }
... )
>>> X2
   a     b
0  0  -100
1  1   -50
2  2     0
3  3   200
4  4  1000
Standardize
Some algorithms, such as SVM, perform better when the data is standardized, meaning each column is rescaled to have a mean of 0 and a standard deviation of 1.
Sklearn provides a .fit_transform method that combines both .fit and .transform:
>>> from sklearn import preprocessing
>>> std = preprocessing.StandardScaler()
>>> std.fit_transform(X2)
array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])
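As a minimal sketch of how .fit_transform relates to .fit and .transform, the block below rebuilds X2 and checks that both routes give the same array. The names std_a, std_b, one_step, and two_step are illustrative and do not come from the original text.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# One step: learn the column means and scales, then apply them
std_a = preprocessing.StandardScaler()
one_step = std_a.fit_transform(X2)

# Two steps: learn the statistics first, then apply them separately
std_b = preprocessing.StandardScaler()
std_b.fit(X2)
two_step = std_b.transform(X2)

# Both routes should produce the same standardized array
same = np.allclose(one_step, two_step)
print(same)
```

Keeping the two steps separate is useful when the scaler is fit on training data and later applied to new data with .transform alone.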
After fitting, there are various attributes we can inspect:
>>> std.scale_
array([  1.41421356, 407.92156109])
>>> std.mean_
array([  2., 210.])
>>> std.var_
array([2.000e+00, 1.664e+05])
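The sketch below suggests how these attributes fit together, assuming the same X2 as above: scale_ is the square root of var_, and subtracting mean_ then dividing by scale_ reproduces what .transform returns. The variable names (manual, restored, and the boolean flags) are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})
std = preprocessing.StandardScaler().fit(X2)

# scale_ is the square root of var_ (population variance, ddof=0)
scale_matches = np.allclose(std.scale_, np.sqrt(std.var_))

# Standardizing by hand with the stored attributes
manual = (X2.to_numpy() - std.mean_) / std.scale_
manual_matches = np.allclose(manual, std.transform(X2))

# inverse_transform maps standardized values back to the original units
restored = std.inverse_transform(std.transform(X2))
roundtrip_ok = np.allclose(restored, X2.to_numpy())

print(scale_matches, manual_matches, roundtrip_ok)
```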
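Here is a pandas version. Remember that you will need to track the original mean and standard deviation if you use this for preprocessing, because any sample that you later use for prediction must be standardized with those same values. Note that the numbers differ slightly from the StandardScaler output: pandas .std defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population standard deviation (ddof=0):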
>>> X_std = (X2 - X2.mean()) / X2.std()
>>> X_std
          a         b
0 -1.264911 -0.679720
1 -0.632456 -0.570088
2  0.000000 -0.460455
3  0.632456 -0.021926
4  1.264911  1.732190
>>> X_std.mean()
a    4.440892e-17
b    0.000000e+00
dtype: float64
>>> X_std.std()
a    1.0
b    1.0
dtype: float64
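One possible way to track the original statistics with pandas is sketched below. The names train_mean, train_std, and new_sample are hypothetical, and the new sample's values are made up for illustration. The last part checks that passing ddof=0 to .std makes the pandas result agree with StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Save the training statistics so they can be reused at prediction time
train_mean = X2.mean()
train_std = X2.std()  # pandas default: sample standard deviation (ddof=1)

# Hypothetical new sample, standardized with the *training* values
new_sample = pd.DataFrame({"a": [2], "b": [210]})
new_std = (new_sample - train_mean) / train_std

# With ddof=0 (population std), pandas matches StandardScaler
X_pop = (X2 - X2.mean()) / X2.std(ddof=0)
matches_sklearn = np.allclose(
    X_pop.to_numpy(), preprocessing.StandardScaler().fit_transform(X2)
)
print(matches_sklearn)
```

Because the new sample here sits exactly at the training means, both of its standardized values come out as zero.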
The fastai library also implements this:
>>> X3 = X2.copy()
>>> from fastai.structured ...