August 2019
Intermediate to advanced
318 pages
4h 40m
English
This chapter will explore common preprocessing steps using this data:
>>> import pandas as pd
>>> X2 = pd.DataFrame(
...     {
...         "a": range(5),
...         "b": [-100, -50, 0, 200, 1000],
...     }
... )
>>> X2
   a     b
0  0  -100
1  1   -50
2  2     0
3  3   200
4  4  1000
Some algorithms, such as SVM, perform better when the data is standardized, meaning
each column is rescaled to have a mean value of 0 and a standard deviation of 1.
Sklearn provides a .fit_transform method that combines both .fit and
.transform:
>>> from sklearn import preprocessing
>>> std = preprocessing.StandardScaler()
>>> std.fit_transform(X2)
array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])
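As a rough illustration of what .fit_transform does, the sketch below runs .fit and .transform as separate steps and compares the result with a hand-computed z-score, z = (x - mean) / std. It rebuilds the same X2 frame so it can run on its own; the variable names two_step, one_step, and manual are only illustrative. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Two-step version: learn the column statistics, then apply them.
std = preprocessing.StandardScaler()
std.fit(X2)
two_step = std.transform(X2)

# One-step version on a fresh scaler.
one_step = preprocessing.StandardScaler().fit_transform(X2)

# Manual version: z = (x - mean) / std, using the population
# standard deviation (NumPy's default, ddof=0).
values = X2.to_numpy(dtype=float)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(two_step, one_step))
print(np.allclose(one_step, manual))
```

Both comparisons should agree, which suggests that .fit_transform is a convenience for the two separate calls rather than a different computation.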
After fitting, there are various attributes we can inspect:
>>> std.scale_
array([  1.41421356, 407.92156109])
>>> std.mean_
array([  2., 210.])
>>> std.var_
array([2.000e+00, 1.664e+05])
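These fitted attributes are what the scaler reuses on later data. The following is a minimal sketch of that idea: it checks that scale_ is the square root of var_, scales a new row with the statistics learned from X2, and then reverses the scaling. The X_new frame is a hypothetical sample invented for this example, chosen to sit exactly at the column means.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})
std = preprocessing.StandardScaler().fit(X2)

# scale_ is the square root of var_ (the population variance).
print(np.allclose(std.scale_, np.sqrt(std.var_)))

# Hypothetical new sample, scaled with the mean and scale learned from X2.
# Its values equal the column means, so both scaled values are 0.
X_new = pd.DataFrame({"a": [2], "b": [210]})
scaled_new = std.transform(X_new)
print(scaled_new)

# inverse_transform maps scaled values back to the original units.
restored = std.inverse_transform(scaled_new)
print(restored)
```

Calling .transform instead of .fit_transform on new data keeps the training statistics fixed, so the new sample is measured on the same scale the model was trained on.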
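Here is a pandas version. Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you will use to predict later will need to be standardized with those same values. Note that the pandas .std method computes the sample standard deviation (ddof=1) by default, while StandardScaler uses the population standard deviation, so the results differ slightly from the scikit-learn output: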
>>> X_std = (X2 - X2.mean()) / X2.std()
>>> X_std
          a         b
0 -1.264911 -0.679720
1 -0.632456 -0.570088
2  0.000000 -0.460455
3  0.632456 -0.021926
4  1.264911  1.732190
>>> X_std.mean()
a    4.440892e-17
b    0.000000e+00
dtype: float64
>>> X_std.std()
a    1.0
b    1.0
dtype: float64
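One way to track the original statistics is to keep them in variables and reuse them for later samples. The sketch below assumes that approach; the names train_mean, train_std, and X_new are illustrative and not from the text, and X_new is a made-up sample that repeats the last row of X2. It also shows that passing ddof=0 to .std appears to reproduce the scikit-learn values.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Save the training statistics so later samples can reuse them.
train_mean = X2.mean()
train_std = X2.std()  # pandas default: sample std (ddof=1)

X_std = (X2 - train_mean) / train_std

# Hypothetical new sample, standardized with the saved values
# rather than with its own mean and standard deviation.
X_new = pd.DataFrame({"a": [4], "b": [1000]})
X_new_std = (X_new - train_mean) / train_std
print(X_new_std)

# With ddof=0, the pandas result lines up with StandardScaler.
X_std_pop = (X2 - X2.mean()) / X2.std(ddof=0)
sk_result = preprocessing.StandardScaler().fit_transform(X2)
print(np.allclose(X_std_pop.to_numpy(), sk_result))
```

Because X_new repeats the last row of X2, its standardized values should match the last row of X_std, which is a quick sanity check that the saved statistics were applied correctly.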
The fastai library also implements this:
>>> X3 = X2.copy()
>>> from fastai.structured ...