Chapter 6. Exploring
It has been said that it is easier to take a subject matter expert (SME) and train them in data science than the reverse. I’m not sure I agree with that 100%, but it is true that data has nuance and an SME can help tease that apart. By understanding both the business and the data, they are able to create better models and have a greater impact on their business.
Before I create a model, I will do some exploratory data analysis. This gives me a feel for the data, and it is also a great excuse to meet with the business units that control that data and discuss issues with them.
Again, we are using the Titanic dataset here. The pandas .shape property returns a tuple of the number of rows and columns. For this dataset that tuple is (1309, 13), so we have 1,309 rows and 13 columns.
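As a minimal sketch of reading .shape, the snippet below builds a tiny stand-in DataFrame rather than loading the Titanic data. The variable name X, the column names, and the values are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic features; not the real dataset.
X = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, 22.0, 35.0, 4.0],
        "embarked_S": [1, 0, 1, 1],
    }
)

# .shape is a property (no parentheses) holding (rows, columns).
n_rows, n_cols = X.shape
print(X.shape)
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple into two names is handy when only one of the two numbers is needed later, for example to report how many rows survive a filtering step.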
We can use pandas to get summary statistics for our data. The .describe method will also give us the count of non-NaN values. Let’s look at the results for the first and last columns:
count 1309.000000 1309.000000
mean -0.012831 0.698243
std 0.995822 0.459196
min -1.551881 0.000000
25% -0.363317 0.000000
50% 0.825248 1.000000
75% 0.825248 1.000000
max 0.825248 1.000000
The count row tells us that both of these columns are filled in. There are no missing values. We also have the mean, standard deviation, minimum, maximum, and quartile values.
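A summary like the one above could be produced along the following lines. This is a sketch on a small made-up DataFrame, not the Titanic data: the name X, the columns, and the values are assumptions, and one NaN is planted in the middle column to show how the count row reacts. Picking the first and last columns uses positional selection with .iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the NaN in "age" is deliberate.
X = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3, 1],
        "age": [29.0, np.nan, 35.0, 4.0, 58.0],
        "embarked_S": [1, 0, 1, 1, 0],
    }
)

# Summary statistics for every numeric column.
summary = X.describe()

# Keep all rows of the summary, but only the first and last columns.
first_last = summary.iloc[:, [0, -1]]
print(first_last)

# The count row excludes NaN, so a lower count flags missing values.
print(summary.loc["count"])
```

In this toy data the count for age is one lower than for the other columns, which is the signal to look for when checking whether a column is completely filled in.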
A pandas DataFrame has an .iloc attribute that we can do index operations on. It will let us pick ...