First, we want to see how many individuals of each class we have. This matters because a heavily imbalanced class distribution (1 to 100, for example) makes classification models hard to train. You can access data frame columns using dot notation. For example, df.label returns the label column as a pandas Series (a single labeled column), not as a new data frame. Both the Series and data frame classes have many useful methods for calculating summary statistics. The value_counts() method returns the number of occurrences of each unique value in the column:
In []:
df.label.value_counts()
Out[]:
platyhog       520
rabbosaurus    480
Name: label, dtype: int64
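To judge imbalance, it can help to look at class shares and the ratio between the largest and smallest class rather than raw counts. The following is a minimal sketch, not the book's own code: the data frame is a hypothetical stand-in rebuilt from the counts shown above, and the names counts, shares, and imbalance_ratio are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's data frame, rebuilt from the
# counts printed above (520 platyhogs, 480 rabbosauruses).
df = pd.DataFrame({"label": ["platyhog"] * 520 + ["rabbosaurus"] * 480})

# Absolute number of rows per class, sorted from most to least frequent.
counts = df["label"].value_counts()

# normalize=True returns fractions of the total instead of raw counts.
shares = df["label"].value_counts(normalize=True)

# Ratio of the most frequent class to the least frequent one;
# a value close to 1 means the classes are roughly balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Bracket access (df["label"]) is equivalent to df.label here, and it also works for column names that contain spaces or that clash with existing data frame attributes.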
The class distribution looks okay for our purposes. Now let's explore the features.
We need to ...