Chapter 5 Preliminary data analysis and visualization

Data visualization is an important part of preliminary statistical analysis. Many data features, including skewness and heavy tails, the presence of outliers, clusters, and nonlinearity, can and should be seen and evaluated before starting statistical inference. The reason is that standard statistical models, such as the t‐test and linear regression, typically assume a normal distribution, absence of clusters, and linearity. The best way to detect possible violations of these assumptions is to look at the data. The saying “a picture is worth a thousand words” applies well here. We start this chapter by showing how the empirical cumulative distribution function (cdf) can be used to visualize and compare iid samples of observations. The receiver operating characteristic (ROC) curve emerges naturally when one cdf is plotted against another. Other visualization techniques, such as the histogram, the q‐q plot, the box plot, and one‐ and two‐dimensional kernel density estimation, will be discussed as well. At the end of the chapter, we apply visualization techniques to spatial data for disease mapping.
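
The idea of plotting one empirical cdf against another can be sketched in a few lines. The snippet below is a minimal illustration, written in Python with the standard library only rather than in R, and it is not the book's own code. The function names `ecdf` and `cdf_vs_cdf` are hypothetical, the two samples are made-up numbers, and the reading of the result as an ROC curve assumes that smaller values indicate a case.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return F


def cdf_vs_cdf(sample_x, sample_y):
    """Return points (F_x(t), F_y(t)) over all pooled observed thresholds t."""
    F_x, F_y = ecdf(sample_x), ecdf(sample_y)
    thresholds = sorted(set(sample_x) | set(sample_y))
    # Start at the origin: below the smallest observation both cdfs are 0
    return [(0.0, 0.0)] + [(F_x(t), F_y(t)) for t in thresholds]


# Made-up illustrative data (assumption: lower values indicate a case)
controls = [1.2, 2.5, 3.1, 4.0, 5.3]
cases = [0.4, 0.9, 1.5, 2.2, 3.6]

curve = cdf_vs_cdf(controls, cases)
for fx, fy in curve:
    print(f"{fx:.1f}  {fy:.1f}")
```

Under the stated assumption, the first coordinate of each point is the fraction of controls at or below the threshold (the false positive rate) and the second is the fraction of cases at or below it (the sensitivity). A curve that rises well above the diagonal suggests that the two samples come from clearly different distributions.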

5.1 Comparison of random variables using the cdf

If and are numbers, it is ...


Advanced Statistics with Applications in R by Eugene Demidenko
