Chapter 5. Text Analysis

In the last two chapters, we explored applications of dates and numbers with time series analysis and cohort analysis. But data sets are often more than just numeric values and associated timestamps. From qualitative attributes to free text, character fields are often loaded with potentially interesting information. Although databases excel at numeric calculations such as counting, summing, and averaging things, they are also quite good at performing operations on text data.

I’ll begin this chapter by providing an overview of the types of text analysis tasks that SQL is good for, and of those for which another programming language is a better choice. Next, I’ll introduce our data set of UFO sightings. Then we’ll get into coding, covering text characteristics and profiling, parsing data with SQL, making various transformations, constructing new text from parts, and finally finding elements within larger blocks of text, including with regular expressions.

Why Text Analysis with SQL?

Among the huge volumes of data generated every day, a large portion consists of text: words, sentences, paragraphs, and even longer documents. Text data used for analysis can come from a variety of sources, including descriptors populated by humans or computer applications, log files, support tickets, customer surveys, social media posts, or news feeds. Text in databases ranges from structured (where data is in different table fields with distinct meanings) to semistructured ...

Get SQL for Data Analysis now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.