4 Exploratory Data Analysis

  • Definition of EDA
  • Box Plot and Five Numbers

4.1 Group by Analysis

When a numeric quantity is summarized across the levels of a factor or categorical variable, that is known as a group by analysis. The best tool depends on the type of the variable:

  • Numerical: summaries of numerical variables can be done with the describe and groupby commands.
  • Categorical: summaries are best done with crosstab and groupby operations.
  • Datetime: datetime data is best handled by the datetime library.
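As a minimal sketch of the three cases above, the following uses a small made-up DataFrame; the column names and values are hypothetical and do not come from the book's dataset. For the datetime case it uses pandas' own `pd.to_datetime` and the `.dt` accessor rather than the standard datetime library, which is an assumption about what is most convenient inside a DataFrame.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "segment": ["A", "B", "A", "A", "B"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
    "order_date": ["2016-01-05", "2016-01-20", "2016-02-03",
                   "2016-02-10", "2016-03-01"],
})

# Numerical: summarize a numeric column across the levels of a factor
mean_sales = df.groupby("region")["sales"].mean()

# Categorical: count combinations of two categorical columns
counts = pd.crosstab(df["region"], df["segment"])

# Datetime: parse the strings, then group by a date component (month)
df["order_date"] = pd.to_datetime(df["order_date"])
monthly_sales = df.groupby(df["order_date"].dt.month)["sales"].sum()

print(mean_sales)
print(counts)
print(monthly_sales)
```

The `groupby` call plays roughly the role that `aggregate` or `tapply` would play in R, and `pd.crosstab` is the closest counterpart to R's `table` for two factors.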

4.2 Numerical Data

Let’s take some data in http://nbviewer.jupyter.org/gist/decisionstats/4142e98375445c5e4174 (Figure 4.1).


Figure 4.1 Describe function.

For numerical data, the describe command in pandas acts the same way as the summary command in R. describe in pandas gives you count, mean, std, min, 25%, 50%, 75%, and max. summary in R gives you the mean, median, 25th and 75th percentiles (the first and third quartiles), min, and max.
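To make the list of statistics concrete, here is a small sketch that calls `describe` on a made-up Series; the values are hypothetical and stand in for a numeric column of the dataset.

```python
import pandas as pd

# Hypothetical numeric column, invented for illustration only
scores = pd.Series([1, 2, 3, 4, 5], name="score")

# describe() returns a Series indexed by the name of each statistic
summary = scores.describe()
print(summary)

# Individual statistics can be pulled out by label
print(summary["mean"], summary["50%"])
```

Calling `describe` on a whole DataFrame gives the same statistics, one column per numeric variable.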

There is another function in R called fivenum, and it gives you Tukey’s five numbers for exploratory data analysis (min, lower-hinge, median, upper-hinge, max).
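pandas does not ship a function named fivenum, so the following is only a sketch of how Tukey's five numbers could be computed in plain Python. The helper name `fivenum` and the hinge rule it uses (which is intended to mirror the way R's fivenum locates the hinges) are assumptions made for illustration, not something taken from the book.

```python
import math


def fivenum(values):
    """Hypothetical helper: min, lower hinge, median, upper hinge, max."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("fivenum needs at least one value")
    # 1-based depth of the hinges, following Tukey's rule
    n4 = math.floor((n + 3) / 2) / 2
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # A fractional depth averages the two neighbouring order statistics
    return [0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) for d in depths]


print(fivenum([1, 2, 3, 4, 5, 6]))
```

Hinges can differ slightly from the 25% and 75% values reported by describe, because pandas interpolates quantiles linearly by default.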

R has a better function in the Hmisc package, also called describe (yes, it can be confusing to go back and forth between pandas and R). Hmisc::describe gives you a more elaborate numerical exploration (n, missing, unique, mean, the .05, .10, .25, .50, .75, .90, and .95 quantiles, and the 5 lowest and 5 highest values). In Python we can do it using quantiles for percentiles (Figure 4.2).

Figure 4.2 Quantile ...
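As a rough sketch of how the pieces of an Hmisc-style summary could be assembled in pandas, the following uses a made-up Series with one missing value. The data is hypothetical, and this is one possible combination of standard pandas calls rather than the exact code shown in the figure.

```python
import pandas as pd

# Hypothetical numeric column with one missing value
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None], name="score")

n = scores.count()            # non-missing observations
missing = scores.isna().sum() # missing observations
unique = scores.nunique()     # distinct non-missing values
mean = scores.mean()

# The same percentiles that Hmisc's describe reports
percentiles = scores.quantile([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95])

# Five lowest and five highest values
lowest = scores.nsmallest(5)
highest = scores.nlargest(5)

print(n, missing, unique, mean)
print(percentiles)
print(list(lowest), list(highest))
```

`quantile` accepts a list of probabilities and returns a Series indexed by those probabilities, so any set of percentiles can be requested in one call.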
