Chapter 6. Exploring
It has been said that it is easier to take a subject matter expert (SME) and train them in data science than the reverse. I’m not sure I agree with that 100%, but it is true that data has nuance and that an SME can help tease it apart. By understanding both the business and the data, they can build better models and have a greater impact on the business.
Before I create a model, I do some exploratory data analysis. This gives me a feel for the data and is also a great excuse to meet with the business units that control that data and discuss issues with them.
Data Size
Again, we are using the Titanic dataset here. The pandas .shape property will return a tuple of the number of rows and columns:
>>> X.shape
(1309, 13)
We can see that this dataset has 1,309 rows and 13 columns.
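As a minimal sketch of the same idea, the snippet below builds a small hypothetical DataFrame named toy (a stand-in for X, not the real Titanic data) and unpacks its shape into a row count and a column count:

```python
import pandas as pd

# Hypothetical stand-in for X; the values here are made up for illustration.
toy = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, 2.0, 30.0, 25.0],
        "fare": [211.34, 151.55, 26.0, 7.9],
    }
)

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

Unpacking the tuple is handy when only one of the two numbers is needed later, for example to compare row counts before and after dropping records.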
Summary Stats
We can use pandas to get summary statistics for our data. The .describe method will also give us the count of non-NaN values. Let’s look at the results for the first and last columns:
>>> X.describe().iloc[:, [0, -1]]
            pclass   embarked_S
count  1309.000000  1309.000000
mean     -0.012831     0.698243
std       0.995822     0.459196
min      -1.551881     0.000000
25%      -0.363317     0.000000
50%       0.825248     1.000000
75%       0.825248     1.000000
max       0.825248     1.000000
The count row tells us that both of these columns are filled in. There are no missing values. We also have the mean, standard deviation, minimum, maximum, and quartile values.
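To see how the count row reveals missing values, here is a small sketch that assumes a hypothetical DataFrame named toy with one NaN in its age column (the real X has none in the columns shown above). Comparing each count with the number of rows gives the number of missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "age" deliberately contains one missing value.
toy = pd.DataFrame(
    {
        "pclass": [1.0, 3.0, 2.0, 3.0],
        "age": [29.0, np.nan, 30.0, 25.0],
        "embarked_S": [1, 0, 1, 1],
    }
)

# Summary statistics for the first and last columns only.
summary = toy.describe().iloc[:, [0, -1]]
print(summary)

# The "count" row holds the number of non-NaN values in each column.
counts = toy.describe().loc["count"]

# Rows minus non-NaN count gives the number of missing values per column.
missing = len(toy) - counts
print(missing)
```

A column whose count is lower than the number of rows, such as age in this sketch, has missing values that need attention before modeling.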
Note
A pandas DataFrame has an iloc attribute that we can do index operations on. It will let us pick ...