
SQL for Data Analysis

by Cathy Tanimura

September 2021

Beginner to intermediate

357 pages

9h 53m

English

O'Reilly Media, Inc.


Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
What Is Data Analysis? | Why SQL? | What Is SQL? | Benefits of SQL | SQL Versus R or Python | SQL as Part of the Data Analysis Workflow | Database Types and How to Work with Them | Row-Store Databases | Column-Store Databases | Other Types of Data Infrastructure | Conclusion
Types of Data | Database Data Types | Structured Versus Unstructured | Quantitative Versus Qualitative Data | First-, Second-, and Third-Party Data | Sparse Data | SQL Query Structure | Profiling: Distributions | Histograms and Frequencies | Binning | n-Tiles | Profiling: Data Quality | Detecting Duplicates | Deduplication with GROUP BY and DISTINCT | Preparing: Data Cleaning | Cleaning Data with CASE Transformations | Type Conversions and Casting | Dealing with Nulls: coalesce, nullif, nvl Functions | Missing Data | Preparing: Shaping Data | For Which Output: BI, Visualization, Statistics, ML | Pivoting with CASE Statements | Unpivoting with UNION Statements | pivot and unpivot Functions | Conclusion
Date, Datetime, and Time Manipulations | Time Zone Conversions | Date and Timestamp Format Conversions | Date Math | Time Math | Joining Data from Different Sources | The Retail Sales Data Set | Trending the Data | Simple Trends | Comparing Components | Percent of Total Calculations | Indexing to See Percent Change over Time | Rolling Time Windows | Calculating Rolling Time Windows | Rolling Time Windows with Sparse Data | Calculating Cumulative Values | Analyzing with Seasonality | Period-over-Period Comparisons: YoY and MoM | Period-over-Period Comparisons: Same Month Versus Last Year | Comparing to Multiple Prior Periods | Conclusion
Cohorts: A Useful Analysis Framework | The Legislators Data Set | Retention | SQL for a Basic Retention Curve | Adjusting Time Series to Increase Retention Accuracy | Cohorts Derived from the Time Series Itself | Defining the Cohort from a Separate Table | Dealing with Sparse Cohorts | Defining Cohorts from Dates Other Than the First Date | Related Cohort Analyses | Survivorship | Returnship, or Repeat Purchase Behavior | Cumulative Calculations | Cross-Section Analysis, Through a Cohort Lens | Conclusion
Why Text Analysis with SQL? | What Is Text Analysis? | Why SQL Is a Good Choice for Text Analysis | When SQL Is Not a Good Choice | The UFO Sightings Data Set | Text Characteristics | Text Parsing | Text Transformations | Finding Elements Within Larger Blocks of Text | Wildcard Matches: LIKE, ILIKE | Exact Matches: IN, NOT IN | Regular Expressions | Constructing and Reshaping Text | Concatenation | Reshaping Text | Conclusion
Capabilities and Limits of SQL for Anomaly Detection | The Data Set | Detecting Outliers | Sorting to Find Anomalies | Calculating Percentiles and Standard Deviations to Find Anomalies | Graphing to Find Anomalies Visually | Forms of Anomalies | Anomalous Values | Anomalous Counts or Frequencies | Anomalies from the Absence of Data | Handling Anomalies | Investigation | Removal | Replacement with Alternate Values | Rescaling | Conclusion
Strengths and Limits of Experiment Analysis with SQL | The Data Set | Types of Experiments | Experiments with Binary Outcomes: The Chi-Squared Test | Experiments with Continuous Outcomes: The t-Test | Challenges with Experiments and Options for Rescuing Flawed Experiments | Variant Assignment | Outliers | Time Boxing | Repeated Exposure Experiments | When Controlled Experiments Aren’t Possible: Alternative Analyses | Pre-/Post-Analysis | Natural Experiment Analysis | Analysis of Populations Around a Threshold | Conclusion
When to Use SQL for Complex Data Sets | Advantages of Using SQL | When to Build into ETL Instead | When to Put Logic in Other Tools | Code Organization | Commenting | Capitalization, Indentation, Parentheses, and Other Formatting Tricks | Storing Code | Organizing Computations | Understanding Order of SQL Clause Evaluation | Subqueries | Temporary Tables | Common Table Expressions | grouping sets | Managing Data Set Size and Privacy Concerns | Sampling with %, mod | Reducing Dimensionality | PII and Data Privacy | Conclusion
Funnel Analysis | Churn, Lapse, and Other Definitions of Departure | Basket Analysis | Resources | Books and Blogs | Data Sets | Final Thoughts

Content preview from SQL for Data Analysis

Chapter 8. Creating Complex Data Sets for Analysis

In Chapters 3 through 7, we looked at a number of ways in which SQL can be used to perform analysis on data in databases. In addition to these specific use cases, sometimes the goal of a query is to assemble a data set that is specific yet general-purpose enough that it can be used to perform a variety of further analyses. The destination might be a database table, a text file, or a business intelligence tool. The SQL that is needed might be simple, requiring only a few filters or aggregations. Often, however, the code or logic needed to achieve the desired data set can become very complex. Additionally, such code is likely to be updated over time, as stakeholders request additional data points or calculations. The organization, performance, and maintainability of your SQL code become critical in a way that isn’t the case for one-time analyses.

In this chapter, I’ll discuss principles for organizing code so that it’s easier to share and update. Then I’ll discuss when to keep query logic in the SQL and when to consider moving to permanent tables via ETL (extract-transform-load) code. Next, I’ll explain the options for storing intermediate results—subqueries, temp tables, and common table expressions (CTEs)—and considerations for using them in your code. Finally, I’ll wrap up with a look at techniques for reducing data set size and ideas for handling data privacy and removing personally identifiable information (PII).
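As a rough illustration of the three options for storing intermediate results named above, the sketch below computes the same per-customer totals with a subquery, a CTE, and a temp table, then aggregates them. It is not taken from the book: the `orders` table, its columns, and the sample rows are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for whatever database you actually use, so syntax details may differ elsewhere.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 7.5), (3, 20.0), (3, 5.0)],
)

# Option 1: a subquery in the FROM clause holds the intermediate result.
subquery_sql = """
    SELECT count(*) AS customers, avg(total) AS avg_total
    FROM (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) AS t
"""

# Option 2: a common table expression names the intermediate result up front.
cte_sql = """
    WITH totals AS (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT count(*) AS customers, avg(total) AS avg_total
    FROM totals
"""

# Option 3: a temp table materializes the intermediate result for the session.
conn.execute("""
    CREATE TEMPORARY TABLE temp_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
""")
temp_sql = "SELECT count(*) AS customers, avg(total) AS avg_total FROM temp_totals"

# Run all three final queries; each returns one (customers, avg_total) row.
results = [conn.execute(sql).fetchone() for sql in (subquery_sql, cte_sql, temp_sql)]
```

All three queries return the same row here. The practical differences lie in readability, whether the intermediate result can be reused by later queries, and how a given database executes each form.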

When to Use ...


Publisher Resources

ISBN: 9781492088776
