SQL for Data Analysis

Book description

With the explosion of data, computing power, and cloud data warehouses, SQL has become an even more indispensable tool for the savvy analyst or data scientist. This practical book reveals new and hidden ways to improve your SQL skills, solve problems, and make the most of SQL as part of your workflow.

You'll learn how to use both common and exotic SQL functions such as joins, window functions, subqueries, and regular expressions in new, innovative ways--as well as how to combine SQL techniques to accomplish your goals faster, with understandable code. If you work with SQL databases, this is a must-have reference.

  • Learn the key steps for preparing your data for analysis
  • Perform time series analysis using SQL's date and time manipulations
  • Use cohort analysis to investigate how groups change over time
  • Use SQL's powerful functions and operators for text analysis
  • Detect outliers in your data and replace them with alternate values
  • Establish causality using experiment analysis, also known as A/B testing

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Analysis with SQL
    1. What Is Data Analysis?
    2. Why SQL?
      1. What Is SQL?
      2. Benefits of SQL
      3. SQL Versus R or Python
      4. SQL as Part of the Data Analysis Workflow
    3. Database Types and How to Work with Them
      1. Row-Store Databases
      2. Column-Store Databases
      3. Other Types of Data Infrastructure
    4. Conclusion
  3. 2. Preparing Data for Analysis
    1. Types of Data
      1. Database Data Types
      2. Structured Versus Unstructured
      3. Quantitative Versus Qualitative Data
      4. First-, Second-, and Third-Party Data
      5. Sparse Data
    2. SQL Query Structure
    3. Profiling: Distributions
      1. Histograms and Frequencies
      2. Binning
      3. n-Tiles
    4. Profiling: Data Quality
      1. Detecting Duplicates
      2. Deduplication with GROUP BY and DISTINCT
    5. Preparing: Data Cleaning
      1. Cleaning Data with CASE Transformations
      2. Type Conversions and Casting
      3. Dealing with Nulls: coalesce, nullif, nvl Functions
      4. Missing Data
    6. Preparing: Shaping Data
      1. For Which Output: BI, Visualization, Statistics, ML
      2. Pivoting with CASE Statements
      3. Unpivoting with UNION Statements
      4. pivot and unpivot Functions
    7. Conclusion
  4. 3. Time Series Analysis
    1. Date, Datetime, and Time Manipulations
      1. Time Zone Conversions
      2. Date and Timestamp Format Conversions
      3. Date Math
      4. Time Math
      5. Joining Data from Different Sources
    2. The Retail Sales Data Set
    3. Trending the Data
      1. Simple Trends
      2. Comparing Components
      3. Percent of Total Calculations
      4. Indexing to See Percent Change over Time
    4. Rolling Time Windows
      1. Calculating Rolling Time Windows
      2. Rolling Time Windows with Sparse Data
      3. Calculating Cumulative Values
    5. Analyzing with Seasonality
      1. Period-over-Period Comparisons: YoY and MoM
      2. Period-over-Period Comparisons: Same Month Versus Last Year
      3. Comparing to Multiple Prior Periods
    6. Conclusion
  5. 4. Cohort Analysis
    1. Cohorts: A Useful Analysis Framework
    2. The Legislators Data Set
    3. Retention
      1. SQL for a Basic Retention Curve
      2. Adjusting Time Series to Increase Retention Accuracy
      3. Cohorts Derived from the Time Series Itself
      4. Defining the Cohort from a Separate Table
      5. Dealing with Sparse Cohorts
      6. Defining Cohorts from Dates Other Than the First Date
    4. Related Cohort Analyses
      1. Survivorship
      2. Returnship, or Repeat Purchase Behavior
      3. Cumulative Calculations
    5. Cross-Section Analysis, Through a Cohort Lens
    6. Conclusion
  6. 5. Text Analysis
    1. Why Text Analysis with SQL?
      1. What Is Text Analysis?
      2. Why SQL Is a Good Choice for Text Analysis
      3. When SQL Is Not a Good Choice
    2. The UFO Sightings Data Set
    3. Text Characteristics
    4. Text Parsing
    5. Text Transformations
    6. Finding Elements Within Larger Blocks of Text
      1. Wildcard Matches: LIKE, ILIKE
      2. Exact Matches: IN, NOT IN
      3. Regular Expressions
    7. Constructing and Reshaping Text
      1. Concatenation
      2. Reshaping Text
    8. Conclusion
  7. 6. Anomaly Detection
    1. Capabilities and Limits of SQL for Anomaly Detection
    2. The Data Set
    3. Detecting Outliers
      1. Sorting to Find Anomalies
      2. Calculating Percentiles and Standard Deviations to Find Anomalies
      3. Graphing to Find Anomalies Visually
    4. Forms of Anomalies
      1. Anomalous Values
      2. Anomalous Counts or Frequencies
      3. Anomalies from the Absence of Data
    5. Handling Anomalies
      1. Investigation
      2. Removal
      3. Replacement with Alternate Values
      4. Rescaling
    6. Conclusion
  8. 7. Experiment Analysis
    1. Strengths and Limits of Experiment Analysis with SQL
    2. The Data Set
    3. Types of Experiments
      1. Experiments with Binary Outcomes: The Chi-Squared Test
      2. Experiments with Continuous Outcomes: The t-Test
    4. Challenges with Experiments and Options for Rescuing Flawed Experiments
      1. Variant Assignment
      2. Outliers
      3. Time Boxing
      4. Repeated Exposure Experiments
    5. When Controlled Experiments Aren’t Possible: Alternative Analyses
      1. Pre-/Post-Analysis
      2. Natural Experiment Analysis
      3. Analysis of Populations Around a Threshold
    6. Conclusion
  9. 8. Creating Complex Data Sets for Analysis
    1. When to Use SQL for Complex Data Sets
      1. Advantages of Using SQL
      2. When to Build into ETL Instead
      3. When to Put Logic in Other Tools
    2. Code Organization
      1. Commenting
      2. Capitalization, Indentation, Parentheses, and Other Formatting Tricks
      3. Storing Code
    3. Organizing Computations
      1. Understanding Order of SQL Clause Evaluation
      2. Subqueries
      3. Temporary Tables
      4. Common Table Expressions
      5. grouping sets
    4. Managing Data Set Size and Privacy Concerns
      1. Sampling with %, mod
      2. Reducing Dimensionality
      3. PII and Data Privacy
    5. Conclusion
  10. 9. Conclusion
    1. Funnel Analysis
    2. Churn, Lapse, and Other Definitions of Departure
    3. Basket Analysis
    4. Resources
      1. Books and Blogs
      2. Data Sets
    5. Final Thoughts
  11. Index

Product information

  • Title: SQL for Data Analysis
  • Author(s): Cathy Tanimura
  • Release date: September 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492088783