Advanced SQL

by Rui Machado, Hélder Russa, Pedro Esmeriz

July 2026

Intermediate to advanced

387 pages

10h 45m

English

O'Reilly Media, Inc.

Preface
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
I. The Modern SQL
1. Evolution of SQL
    Historical Overview: From Relational to Modern SQL
    Standard SQL-86: Laying the Foundations
    Standard SQL-89: Adding Integrity and Security
    Standard SQL-92: Adding Modelling Capabilities
    Standard SQL-1999: Handling Industry Needs
    From the SQL-2003 Standard to SQL-2008
    From the SQL-2011 Standard to SQL-2016
    The Graph Revolution with SQL:2023
    Transition to Multi-Purpose SQL
    Semi-Structured Data Handling in SQL
    Temporal Data Management
    Extended Analytical and OLAP Functions
    Navigating Graph Structures in SQL
    Dealing with Unstructured Data in SQL
    Integrating Machine Learning and LLMs in SQL Platforms
    SQL for Data Transformations
    Embedded Analytics with DuckDB
    Comparison of SQL Dialects over Time
    Summary
2. Advanced SQL Techniques
    Recursive Queries, Window Functions, and CTEs
    Common Table Expressions and Subquery Factoring
    Recursive CTEs for Hierarchical and Iterative Queries
    Advanced Window Functions for Analytic Queries
    Performance Optimization Strategies
    Indexing and Efficient Data Access
    Query Planning and Cost-Based Optimization
    Query Rewriting, Caching, and Materialized Views
    Avoiding Common SQL Anti-Patterns
    Tuning in Cloud Data Warehouses: BigQuery Versus Snowflake
    Summary
3. Architecting SQL-Driven Solutions
    Foundational Design Principles for SQL Data Pipelines
    Designing Modular SQL Data Pipelines
    Formalizing Data Agreements: Data Contracts
    Architectural Patterns for Data Pipelines
    Batch Versus Streaming: Choosing the Right SQL Pipeline Approach
    Transformation Patterns: ETL, ELT, and Streaming
    Beyond ETL/ELT and Streaming: Zero-ETL and Data Sharing
    Data Mesh and Data Products
    Architectures for Scalability
    Traditional RDBMS Scaling
    Cloud Native Scaling
    Concurrency and Isolation in Practice
    Hybrid and Multi-Cloud Patterns
    Building and Operationalizing Scalable SQL Data Platforms
    Layered Architecture: Separating Storage, Transformation, and Orchestration
    From Design to Production: Scaling SQL Data Solutions
    Avoiding Common Architectural Anti-Patterns
    Tight Coupling Between Pipeline Components
    Monolithic “All-in-One” SQL Scripts
    Missing Lineage and Monitoring
    Over-Coupling to Specific Technologies
    Not Designing for Evolution
    Skipping Tests for SQL Models
    Ignoring Governance and Security
    Summary
II. SQL for Data Engineering and Data Science
4. Advanced Data Modeling with SQL
    Traditional SQL Data Modeling Techniques
    Third Normal Form—Structured, Relational Modeling
    Dimensional Modeling: Star Schema for Analytics
    Snowflake Schemas: Adding Normalization to Dimensions
    One Big Table—The Wide Denormalized Table Approach
    Lakehouse Modeling and ELT Workflows
    Summary of Traditional Models
    Knowledge Graphs and Semantic Modeling—The New Frontier for SQL
    What is a Knowledge Graph?
    Graph Technologies - Neo4J and Cypher
    Querying with Cypher versus SQL
    Semantic Models and Ontologies (RDF & OWL)
    RDF and OWL: The Foundations of Semantic Modeling
    Property Graphs and Cypher
    PostgreSQL and GraphDB
    SQL and Graphs: Complementary, Not Competing
    Summary
    Lakehouse Modeling and ELT Workflows
    SQL and Graph Ecosystem
5. Building Data Engineering Solutions with SQL
    Pipeline Overview and Design Principles
    Setting Up the GCP Environment
    Data Ingestion in a SQL-Centric Architecture
    Streaming Ingestion with Spark Structured Streaming and a Flavor of Apache Flink SQL
    Batch Ingestion for Bulk Data
    Storage Layer
    The Data Lakehouse
    Data Storage and Management with Apache Iceberg
    Transformation and Data Modeling with SQL
    Silver and Gold Model Design in dbt: Building a Medallion Architecture
    Orchestration and Workflow Management
    Cloud Workflows Orchestration Pipeline
    Scheduling the Workflows
    Alternative Orchestration options
    Consumption Layer
    A Stable Interface for Data Consumption
    Bridging Semantics to Execution
    SQL-First Semantic Execution
    Semantic Contracts are Testable Contracts
    Operationalizing the Semantic Layer
    Live Data in Consumption Layer
    Security and Governance Considerations
    Identity and Execution Boundaries
    Dataset and Schema Boundaries as Governance
    Governance Via dbt (Documentation, Lineage and Exposures)
    Security vs. Semantics: Consistent Definitions as Control
    Summary
6. SQL in Data Science
    Data Processing & Analysis
    Dataset Statistical Description
    Correlation and Association
    Regression from Aggregates
    Outlier Detection
    Advanced Statistical Analysis
    Normality Testing
    Confidence Intervals: Estimating the Mean
    Two-sample Tests
    Multiple Groups Testing
    Machine Learning
    Advanced Feature Engineering Patterns
    Model Creation and Training
    Model Evaluation and Validation
    Model Deployment and Monitoring
    Summary
III. SQL in Emerging Technologies
7. Generative AI in SQL
    Generative AI Inside SQL
    Core Patterns of SQL + GenAI
    AI-Ready Data
    When SQL Should Run GenAI Workloads
    Primitives: LLMs, Embeddings & Search
    LLM Functions Inside SQL
    Embeddings, Search, and Retrieval
    Patterns: RAG, Agents, and MCPs
    Grounded Generation (the “G” in RAG)
    SQL-Based Agents and MCPs
    Productionizing Generative AI in the Data Warehouse
    Logging, Monitoring & Evaluating GenAI Outputs
    Cost & Performance Engineering
    Security & Privacy Controls
    Practical Deployment Architectures
    Production Readiness Checklist
    Summary
8. Innovations in SQL Syntax and the Future of SQL
    GoogleSQL Pipe Syntax
    Core Syntax and Semantics of Pipe Queries
    Pipe Operators Overview
    Applied PipeSQL by Examples
    Real-World Use Cases and Applications of PipeSQL
    Best Practices and Style Guidelines for PipeSQL Syntax
    Parallel Innovations in SQL Expression
    PRQL: A Pipelined Query Language
    Malloy: A Semantic Modeling Language
    The Future of SQL
    AI and Autonomous Databases
    The Quantum Horizon: Quantum-Assisted Optimization
    Declarative Governance and Policies
    Summary
Index
About the Author(s)

Content preview from Advanced SQL

Chapter 6. SQL in Data Science

For decades, SQL has been the basis of data management. Nearly every data professional, regardless of their specific domain, is likely to have encountered SQL as their entry point into data. Its presence is so fundamental that it often serves as the connective tissue between vast data warehouses and the analytic tools that drive business decisions, as seen in previous chapters.

However, SQL’s role in data science is often perceived as limited—a mere vehicle for data extraction. Once the data is retrieved, the prevailing wisdom is to shift to more specialized tools like Python, R, or Julia for analysis, statistics, and machine learning applications. This view, while widespread, underestimates both the capabilities and the evolving potential of SQL itself.

Is this separation always necessary, or even optimal? Every tool has its strengths and intended use cases, but handing the work to another language by default may prematurely sideline SQL’s capabilities. Modern SQL engines have expanded far beyond simple SELECT statements. With advanced analytical functions, window operations, and even native machine learning capabilities (as seen in platforms like Google BigQuery and Snowflake), SQL is steadily encroaching on territory traditionally reserved for general-purpose programming languages. Newer capabilities are added frequently, for example LLMs exposed as built-in functions in several data platforms.

But before relegating SQL to a supporting role in the data ...

Publisher Resources

ISBN: 9798341627475
