Chapter 6. SQL in Data Science
For decades, SQL has been the basis of data management. Nearly every data professional, regardless of domain, has likely encountered SQL as an entry point into data. It is so fundamental that it often serves as the connective tissue between vast data warehouses and the analytic tools that drive business decisions, as seen in previous chapters.
However, SQL’s role in data science is often perceived as limited—a mere vehicle for data extraction. Once the data is retrieved, the prevailing wisdom is to shift to more specialized tools like Python, R, or Julia for analysis, statistics, and machine learning applications. This view, while widespread, underestimates both the capabilities and the evolving potential of SQL itself.
Is this separation always necessary, or even optimal? Every tool has its strengths and intended use cases, but handing the work to another language by default may sideline SQL prematurely. Modern SQL engines have expanded far beyond simple SELECT statements. With advanced analytical functions, window operations, and even native machine learning capabilities (as seen in platforms like Google BigQuery and Snowflake), SQL is steadily encroaching on territory traditionally reserved for general-purpose programming languages. New capabilities are added frequently; for example, several data platforms now include LLMs as built-in functions.
But before relegating SQL to a supporting role in the data ...