CHAPTER 9
Exploratory Data Analysis with SQL

Exploratory Data Analysis (EDA) is often discussed in a data science context as a first step in the predictive modeling process, when a data scientist explores what the data in a provided dataset looks like prior to using it to build a predictive model. The SQL we'll be using in this chapter could be used at that point in the process, to explore an already-prepared dataset. But what if you don't have a dataset to work with yet?

Here we'll show examples that could occur even earlier in the data pipeline, as we explore raw data straight from the database tables (as opposed to an already-aggregated dataset, in which the raw data has been combined and transformed using SQL so that it is ready to be ingested into a model). If you are given access to a database for the first time, these are the types of queries you can run to familiarize yourself with the tables and data in it.
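As a rough illustration of what such first-look queries might be, here is a minimal sketch that runs them from Python against an in-memory SQLite database. Everything in it is an assumption made for the example: the `product` table, its columns, and its rows are invented stand-ins, not the actual schema explored in this chapter. The catalog query is also SQLite-specific; other systems expose table listings differently (MySQL and PostgreSQL, for example, provide `information_schema.tables`).

```python
import sqlite3

# Hypothetical stand-in for a database you have just been given access to.
# The table name, columns, and rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER
    );
    INSERT INTO product VALUES
        (1, 'Habanero Peppers', 1),
        (2, 'Jalapeno Peppers', 1),
        (3, 'Whole Wheat Bread', 3),
        (4, 'Carrots', NULL);
""")

# 1. Which tables exist? (sqlite_master is SQLite's own catalog table.)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# 2. Peek at a few rows to see what the values look like.
sample_rows = conn.execute(
    "SELECT * FROM product ORDER BY product_id LIMIT 3"
).fetchall()

# 3. Summarize one column: total rows, distinct values, and missing values.
row_count, distinct_categories, missing_categories = conn.execute("""
    SELECT COUNT(*),
           COUNT(DISTINCT category_id),
           SUM(CASE WHEN category_id IS NULL THEN 1 ELSE 0 END)
    FROM product
""").fetchone()

print("tables:", tables)
print("sample rows:", sample_rows)
print("rows:", row_count,
      "| distinct categories:", distinct_categories,
      "| missing categories:", missing_categories)
```

The three steps move from the broadest question (what tables are here?) to a narrower one (what does a single column contain?). The same `SELECT` statements could be run directly in a SQL editor; the Python wrapper only makes the sketch self-contained.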

There are, of course, many ways to conduct EDA, including in a Jupyter notebook with Python code, in a Tableau workbook, or using SQL. (I regularly do all three in my job as a data scientist.) In later-stage EDA, once a dataset has been prepared, the focus is often on distributions of values, relationships between columns, and identifying correlations between input features and the target variable (the column with values to be predicted by the model). Here, we will use the types of queries we've covered so far in this book to explore some tables in the Farmer's Market database, ...
