Data preprocessing

We are now able to import datasets, even big, problematic ones. Next, we need to learn the basic preprocessing routines that make the data suitable for the next data science step.

First, if you need to apply a function to a limited subset of rows, you can create a mask. A mask is a series of Boolean values (that is, True or False) that tells you whether each row is selected or not.

For example, let's say we want to select all the rows of the Iris dataset that have a sepal length greater than 6. We can simply do the following:

In: mask_feature = iris['sepal_length'] > 6.0
In: mask_feature
Out:   0      False
       1      False
       ...
       146     True
       147     True
       148     True
       149    False

In the preceding simple example, we can immediately see which observations ...
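As a minimal, self-contained sketch of the masking idea described above, the following builds a mask and then uses it both to select rows and to apply an operation to those rows only. It assumes pandas is available; the small `iris` DataFrame holds a few made-up rows standing in for the real Iris data, and the names `selected` and `scaled`, along with the multiply-by-10 step, are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Iris data: a few made-up rows, not the real dataset.
iris = pd.DataFrame({
    'sepal_length': [5.1, 6.3, 7.0, 5.8, 6.0],
    'sepal_width': [3.5, 3.3, 3.2, 2.7, 2.2],
})

# The element-wise comparison returns a Boolean Series aligned with the row index.
mask_feature = iris['sepal_length'] > 6.0

# Passing the mask to .loc keeps only the rows where the mask is True.
selected = iris.loc[mask_feature]

# The same mask can restrict an operation to the selected rows;
# rows where the mask is False are left untouched.
scaled = iris.copy()
scaled.loc[mask_feature, 'sepal_width'] = scaled.loc[mask_feature, 'sepal_width'] * 10

print(mask_feature)
print(selected)
print(scaled)
```

Note that the comparison is strict, so a row with a sepal length of exactly 6.0 is not selected. Working on a copy, as done here, leaves the original DataFrame unchanged.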
