4 Exploratory Data Analysis
- Definition of EDA
- Box Plot and Five Numbers
4.1 Group by Analysis
When a numeric quantity is summarized across the levels of a factor or categorical variable, the operation is known as a group by analysis. The best tool depends on the type of data:
- Numerical: Summaries of numerical variables can be produced with the describe and group by commands.
- Categorical: Summaries are best produced with cross tab and group by operations.
- Datetime: Datetime data is best handled with the datetime library.
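As a minimal sketch of the three cases above, the following uses pandas on a small made-up table. The column names and values are hypothetical and chosen only for illustration, and the datetime step assumes the pandas `.dt` accessor rather than the standard library datetime module.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "size": ["small", "large", "small", "small", "large"],
    "weight": [1.0, 3.0, 2.0, 4.0, 6.0],
    "measured": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-01", "2020-02-15", "2020-02-28",
    ]),
})

# Numerical: summarize a numeric column across the levels of a categorical column.
mean_weight = df.groupby("species")["weight"].mean()

# Categorical: count how often each pair of categories occurs together.
counts = pd.crosstab(df["species"], df["size"])

# Datetime: extract a date part (here the month) and use it as the grouping key.
by_month = df.groupby(df["measured"].dt.month)["weight"].sum()

print(mean_weight)
print(counts)
print(by_month)
```

The same `groupby` call accepts other aggregations, such as `.describe()`, when a fuller per-group summary is wanted.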
4.2 Numerical Data
Let’s take some data in http://nbviewer.jupyter.org/gist/decisionstats/4142e98375445c5e4174 (Figure 4.1).
For numerical data, the describe command in pandas acts the same way as the summary command in R. Describe in pandas gives you count, mean, std, min, 25%, 50%, 75%, and max. Summary in R gives you min, first quartile (25th percentile), median, mean, third quartile (75th percentile), and max.
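A short sketch of the pandas side of this comparison follows. The values in the series are hypothetical and stand in for any numeric column.

```python
import pandas as pd

# Hypothetical numeric values, used only to show the shape of the output.
s = pd.Series([2, 4, 4, 5, 7, 9, 10, 12])

# describe() returns a Series labeled count, mean, std, min, 25%, 50%, 75%, max.
summary = s.describe()
print(summary)

# Individual statistics can be read by label; the 50% entry is the median.
print(summary["50%"])
```

Calling `describe()` on a whole DataFrame produces the same statistics for each numeric column.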
There is another function in R called fivenum, and it gives you Tukey’s five numbers for exploratory data analysis (min, lower-hinge, median, upper-hinge, max).
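pandas has no built-in equivalent named fivenum, so the sketch below is one possible pure-Python version. The function name `fivenum` is borrowed from R for readability, the hinge positions follow the same index arithmetic that R's fivenum uses, and missing values are assumed to have been removed beforehand.

```python
import math

def fivenum(values):
    """Return Tukey's five numbers: min, lower hinge, median, upper hinge, max.

    Illustrative helper, not part of pandas. Assumes no missing values.
    """
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("fivenum requires at least one value")
    # Depth of the hinges, counted in 1-based positions from each end.
    n4 = math.floor((n + 3) / 2) / 2
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # A fractional position means averaging the two neighboring order statistics.
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]

print(fivenum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The hinges can differ slightly from the 25% and 75% values reported by describe, because the two methods interpolate between order statistics differently.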
R has a more detailed function in the Hmisc package, also called describe (yes, it can be confusing to go back and forth between pandas and R). Hmisc::describe gives you a more elaborate numerical exploration (n, missing, unique, mean, the .05, .10, .25, .50, .75, .90, and .95 quantiles, and the 5 lowest and 5 highest values). In Python we can do the same using quantiles for percentiles (Figure 4.2).
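One way to approximate that output in pandas is sketched here. The helper name `describe_like_hmisc` and the sample scores are hypothetical. The quantiles use pandas' default linear interpolation, which may not match Hmisc exactly, and the lowest and highest lists keep duplicate values.

```python
import pandas as pd

def describe_like_hmisc(s, k=5):
    """Collect an Hmisc-style summary of a numeric Series (illustrative helper)."""
    clean = s.dropna()
    probs = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]
    return {
        "n": int(clean.count()),             # non-missing observations
        "missing": int(s.isna().sum()),      # missing observations
        "unique": int(clean.nunique()),      # distinct non-missing values
        "mean": float(clean.mean()),
        "quantiles": clean.quantile(probs),  # Series indexed by probability
        "lowest": clean.nsmallest(k).tolist(),
        "highest": clean.nlargest(k).tolist(),
    }

# Hypothetical scores with one missing value.
scores = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, None, 5.0])
result = describe_like_hmisc(scores)

for name, value in result.items():
    print(name, value, sep=": ")
```

Passing a list of probabilities to `quantile` returns all the requested percentiles in a single call, which keeps the helper short.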