SQL for Data Analysis

by Cathy Tanimura
September 2021
Beginner to intermediate
357 pages
9h 53m
English
O'Reilly Media, Inc.
Content preview from SQL for Data Analysis

Chapter 2. Preparing Data for Analysis

Estimates of how long data scientists spend preparing their data vary, but it’s safe to say that this step takes up a significant part of the time spent working with data. In 2014, the New York Times reported that data scientists spend from 50% to 80% of their time cleaning and wrangling their data. A 2016 survey by CrowdFlower found that data scientists spend 60% of their time cleaning and organizing data in order to prepare it for analysis or modeling work. Preparing data is such a common task that terms have sprung up to describe it, such as data munging, data wrangling, and data prep. (“Mung” is an acronym for Mash Until No Good, which I have certainly done on occasion.) Is all this data preparation work just mindless toil, or is it an important part of the process?

Data preparation is easier when a data set has a data dictionary, a document or repository that has clear descriptions of the fields, possible values, how the data was collected, and how it relates to other data. Unfortunately, this is frequently not the case. Documentation often isn’t prioritized, even by people who see its value, or it becomes out-of-date as new fields and tables are added or the way data is populated changes. Data profiling creates many of the elements of a data dictionary, so if your organization already has a data dictionary, this is a good time to use it and contribute to it. If no data dictionary exists currently, consider starting one! This is one ...
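
To make the link between data profiling and a data dictionary concrete, here is a minimal sketch of how a first draft of dictionary entries might be generated from a table. It assumes an in-memory SQLite database reached through Python's standard `sqlite3` module. The `orders` table, its columns, and the `profile_table` helper are hypothetical names invented for illustration and do not come from the book.

```python
import sqlite3


def profile_table(conn, table):
    """Return one draft data-dictionary entry per column of `table`.

    Hypothetical helper: `table` is interpolated into the SQL text,
    so pass only trusted identifiers.
    """
    cur = conn.cursor()
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = cur.execute(f"PRAGMA table_info({table})").fetchall()
    entries = []
    for _cid, name, declared_type, *_rest in columns:
        # COUNT(col) skips NULLs, so total rows minus COUNT(col) is the NULL count.
        total, non_null, distinct = cur.execute(
            f"SELECT COUNT(*), COUNT({name}), COUNT(DISTINCT {name}) FROM {table}"
        ).fetchone()
        entries.append({
            "field": name,
            "declared_type": declared_type,
            "rows": total,
            "nulls": total - non_null,
            "distinct_values": distinct,
        })
    return entries


# Illustrative sample data (assumed, not from the source).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 20.0), (2, "shipped", None), (3, "pending", 35.5), (4, None, 12.0)],
)

profile = profile_table(conn, "orders")
for entry in profile:
    print(entry)
```

A profile like this covers only the mechanical parts of a dictionary: field names, declared types, null counts, and the number of distinct values. How the data was collected and how it relates to other data still has to be written down by someone who knows the source.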



Publisher Resources

ISBN: 9781492088776