Chapter 5. Exploratory Data Analysis with R and Python

Exploratory data analysis (EDA) is a preparation step that influences every subsequent step in the business analytics cycle and helps ensure that models are built on well-understood, appropriately processed data. EDA is used to understand the characteristics of data, identify errors and inconsistencies, uncover relationships between features, validate assumptions, and make informed decisions about which models are likely to perform best. This chapter explores each of the major steps involved in EDA.
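As a rough illustration of the tasks just listed, the following minimal Python sketch summarizes one column, counts its missing values, and measures the linear relationship between two features. The dataset, the column names (`ad_spend`, `revenue`), and the helper functions `summarize` and `pearson` are hypothetical and invented for this example; a real project would more likely use a data analysis library than the standard library alone.

```python
import math
import statistics

# Hypothetical toy dataset: each row is one observation; None marks a missing value.
rows = [
    {"ad_spend": 10.0, "revenue": 25.0},
    {"ad_spend": 20.0, "revenue": 45.0},
    {"ad_spend": 30.0, "revenue": None},
    {"ad_spend": 40.0, "revenue": 85.0},
    {"ad_spend": 50.0, "revenue": 105.0},
]


def summarize(rows, column):
    """Return basic summary statistics for one column, ignoring missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(rows, x, y):
    """Pearson correlation over the rows where both columns are present."""
    pairs = [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return covariance / (spread_x * spread_y)


# Characteristics of one feature, including how many values are missing.
revenue_summary = summarize(rows, "revenue")
# Strength of the linear relationship between two features.
spend_revenue_corr = pearson(rows, "ad_spend", "revenue")

print(revenue_summary)
print(round(spend_revenue_corr, 3))
```

The sketch drops rows with missing values before computing each statistic. That is only one possible choice, and it should be made deliberately once the quality of the data is understood.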

John Tukey, a pioneering statistician, played a crucial role in the development of EDA, emphasizing the importance of using visual methods to understand data. Tukey advocated for EDA as a way to uncover underlying patterns, spot anomalies, and test assumptions before applying formal statistical models. His approach encourages analysts to interact with data through visualization and summary statistics to gain insights and intuition about its structure and relationships. Tukey’s work laid the foundation for modern data analysis, highlighting the value of EDA in the initial stages of data understanding.

The first step in the EDA process is exploring the quality of the data to be used in the analytics project, so we’ll start the chapter with this topic.

Data Quality

Data quality refers to the condition of a dataset. It’s particularly important in EDA because the value of the different data points depends on the ...
