Chapter 6. Exploring

It has been said that it is easier to take a subject matter expert (SME) and train them in data science than the reverse. I’m not sure I agree with that 100%, but it is true that data has nuance, and an SME can help tease that apart. By understanding both the business and the data, SMEs are able to create better models and have a greater impact on their business.

Before I create a model, I will do some exploratory data analysis. This gives me a feel for the data, and it is also a great excuse to meet with the business units that control that data and discuss issues with them.

Data Size

Again, we are using the Titanic dataset here. The pandas .shape property will return a tuple of the number of rows and columns:

>>> X.shape
(1309, 13)

We can see that this dataset has 1,309 rows and 13 columns.
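As a minimal sketch of working with .shape, the snippet below unpacks the tuple into separate row and column counts. The small DataFrame is a hypothetical stand-in for X, not the actual Titanic data:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic feature matrix X used in the text.
X = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, None, 40.0, 2.0],
        "fare": [211.34, 7.25, 13.0, 21.08],
    }
)

# .shape is accessed without parentheses and yields (rows, columns).
n_rows, n_cols = X.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the row count only.
assert len(X) == n_rows
```

Unpacking the tuple is handy when only one of the two numbers is needed later, for example to compare a column's non-missing count against the total number of rows.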

Summary Stats

We can use pandas to get summary statistics for our data. The .describe method will also give us the count of non-NaN values. Let’s look at the results for the first and last columns:

>>> X.describe().iloc[:, [0, -1]]
            pclass   embarked_S
count  1309.000000  1309.000000
mean     -0.012831     0.698243
std       0.995822     0.459196
min      -1.551881     0.000000
25%      -0.363317     0.000000
50%       0.825248     1.000000
75%       0.825248     1.000000
max       0.825248     1.000000

The count row tells us that both of these columns are filled in: each count equals the 1,309 rows in the dataset, so there are no missing values. We also have the mean, standard deviation, minimum, maximum, and quartile values.
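The same check can be done programmatically. The sketch below compares the count row from .describe with the total number of rows and cross-checks the result with .isna. The DataFrame here is a small hypothetical example with one missing value, not the Titanic data:

```python
import pandas as pd

# Hypothetical data: the "age" column has one missing value.
X = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, None, 40.0, 2.0],
    }
)

summary = X.describe()

# The count row excludes NaN, so any count below the total number
# of rows signals missing values in that column.
missing = len(X) - summary.loc["count"]
print(missing)

# Cross-check by counting NaN values directly.
print(X.isna().sum())
```

By default, .describe only summarizes numeric columns, so the .isna check is the safer choice when a DataFrame also holds strings or categories.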

Note

A pandas DataFrame has an iloc attribute that we can do index operations on. It will let us pick ...
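As a brief illustration of position-based indexing with iloc, here is a sketch on a small hypothetical DataFrame (the column names and index labels are made up for the example):

```python
import pandas as pd

# Hypothetical DataFrame with string index labels.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# A single integer selects a row by position and returns a Series.
first_row = df.iloc[0]

# A row selector and a column selector, separated by a comma:
# all rows, first and last columns.
first_last_cols = df.iloc[:, [0, -1]]

# Slices work as they do for Python lists (the end is exclusive).
top_left = df.iloc[:2, :2]

# For comparison, .loc selects by label rather than by position.
by_label = df.loc["x", "a"]
```

The `[:, [0, -1]]` form is the one used with the .describe output in this section.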
