Chapter 2. A Single Variable: Shape and Distribution

When dealing with univariate data, we are usually concerned mostly with the overall shape of the distribution. Some of the initial questions we may ask include:

  • Where are the data points located, and how far do they spread? What are typical, as well as minimal and maximal, values?

  • How are the points distributed? Are they spread out evenly or do they cluster in certain areas?

  • How many points are there? Is this a large data set or a relatively small one?

  • Is the distribution symmetric or asymmetric? In other words, is the tail of the distribution much larger on one side than on the other?

  • Are the tails of the distribution relatively heavy (i.e., do many data points lie far away from the central group of points), or are most of the points—with the possible exception of individual outliers—confined to a restricted region?

  • If there are clusters, how many are there? Is there only one, or are there several? Approximately where are the clusters located, and how large are they—both in terms of spread and in terms of the number of data points belonging to each cluster?

  • Are the clusters possibly superimposed on some form of unstructured background, or does the entire data set consist only of the clustered data points?

  • Does the data set contain any significant outliers—that is, data points that seem to be different from all the others?

  • And lastly, are there any other unusual or significant features in the data set—gaps, sharp cutoffs, unusual values, ...
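Several of these questions (location, spread, symmetry, and outliers) can be given a rough first answer with a handful of summary numbers. The following is a minimal sketch using only Python's standard library, not a method prescribed by the text. The function name `describe_shape` is hypothetical, and flagging outliers with the 1.5 × IQR rule is an assumed convention chosen here for illustration. Questions about clustering, gaps, and cutoffs are generally better answered by looking at a plot of the data.

```python
import statistics


def describe_shape(values):
    """Return a few numbers that hint at the shape of a univariate data set.

    Hypothetical helper for illustration; the choice of statistics is an
    assumption, not something specified by the text.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")

    # Quartiles locate the central half of the data and measure its spread.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1

    mean = statistics.fmean(data)
    stdev = statistics.pstdev(data)

    # Moment-based skewness: positive when the right tail is longer,
    # negative when the left tail is longer, near zero when symmetric.
    if stdev > 0:
        skewness = sum((x - mean) ** 3 for x in data) / (n * stdev ** 3)
    else:
        skewness = 0.0

    # Assumed convention: points more than 1.5 * IQR beyond the quartiles
    # are flagged as outlier candidates.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low or x > high]

    return {
        "count": n,
        "min": data[0],
        "max": data[-1],
        "median": median,
        "iqr": iqr,
        "skewness": skewness,
        "outliers": outliers,
    }
```

Calling `describe_shape([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])` reports ten points ranging from 1 to 40, a positive skewness, and flags 40 as an outlier candidate. Numbers like these are a starting point only: two data sets with very different shapes can share the same summary values.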
