Chapter 6. Exploring

It has been said that it is easier to take a subject matter expert (SME) and train them in data science than the reverse. I'm not sure I agree with that 100%, but it is true that data has nuance and that an SME can help tease it apart. By understanding both the business and the data, they are able to create better models and have a greater impact on their business.

Before I create a model, I do some exploratory data analysis. This gives me a feel for the data, and it is also a great excuse to meet with the business units that control that data and discuss issues with them.

Data Size

Again, we are using the Titanic dataset here. The pandas .shape property will return a tuple of the number of rows and columns:

>>> X.shape
(1309, 13)

We can see that this dataset has 1,309 rows and 13 columns.
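As a minimal sketch of the same idea, the snippet below builds a tiny stand-in DataFrame and unpacks its shape into two names. The frame `df` and its values are made up for illustration; they are not the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the feature matrix X (values are invented)
df = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, None, 40.0, 2.0],
        "fare": [211.34, 7.25, 13.0, 21.08],
    }
)

# .shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple this way is handy when a later step needs the row count on its own, for example to compare against the count row of a summary.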

Summary Stats

We can use pandas to get summary statistics for our data. The .describe method will also give us the count of non-NaN values. Let’s look at the results for the first and last columns:

>>> X.describe().iloc[:, [0, -1]]
            pclass   embarked_S
count  1309.000000  1309.000000
mean     -0.012831     0.698243
std       0.995822     0.459196
min      -1.551881     0.000000
25%      -0.363317     0.000000
50%       0.825248     1.000000
75%       0.825248     1.000000
max       0.825248     1.000000

The count row shows 1,309 non-NaN values for both of these columns, which matches the number of rows, so neither column has missing values. We also have the mean, standard deviation, minimum, maximum, and quartile values.
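The comparison between the count row and the number of rows can be automated. The following is a sketch on a small made-up DataFrame (the names `df`, `summary`, and `cols_with_missing` are illustrative assumptions), where one column deliberately contains a missing value.

```python
import pandas as pd

# Hypothetical data; "age" has one missing value on purpose
df = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, None, 40.0, 2.0],
        "fare": [211.34, 7.25, 13.0, 21.08],
    }
)

summary = df.describe()

# First and last columns of the summary, selected by position
first_last = summary.iloc[:, [0, -1]]

# The count row holds the number of non-NaN values per column,
# so subtracting it from the row count gives the missing values
missing = len(df) - summary.loc["count"]
cols_with_missing = missing[missing > 0].index.tolist()
print(cols_with_missing)
```

Note that `.describe` only summarizes numeric columns by default, so this check would not cover string columns unless `include="all"` is passed.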

Note

A pandas DataFrame has an iloc attribute that we can do index operations on. It will let us pick ...

Machine Learning Pocket Reference by Matt Harrison
