11

Working with Unstructured and Semistructured Data

Unstructured data is a misleading term, but we are going to use it for lack of a better one. With the exception of absolute chaos, all data should be considered to have a certain degree of organization. Consider a paperback novel: it has a table of contents; text is organized into chapters with paragraphs; and sentences have commas, hyphens, and periods. The information is usually arranged in a logical progression; you can read it cover to cover, select chapters, look at the pictures, and so on. Yet in the context of relational databases in general, and this book in particular, the novel would be an example of unstructured information because it is extremely difficult to apply information technology to this type of data. The term unstructured, therefore, describes how suitable data is for computer processing rather than whether it has any organization at all.

We are surrounded by a multitude of unstructured data: books, magazines, conversations, pictures, movies, text messages, newspapers, TV shows, and music, to name a few. Most of it passes by and disappears into oblivion, or at least it used to before cheap storage came into existence, along with hardware powerful enough to handle the data in a timely manner and software to manage it. For data to be managed under an RDBMS, it has to be digitized first; once pictures and texts are stored as long sequences of ones and zeroes, the data can be further categorized into character data and binary data. ...
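The split between character data and binary data can be sketched with a small example. This is only an illustration under stated assumptions: it uses SQLite through Python's standard `sqlite3` module rather than any particular RDBMS discussed in the book, and the table name `library_item`, its columns, and the sample values are hypothetical. Other database systems use their own type names for the same two categories.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for "an RDBMS".
conn = sqlite3.connect(":memory:")

# Hypothetical table: one column for character data, one for binary data.
conn.execute(
    """
    CREATE TABLE library_item (
        id           INTEGER PRIMARY KEY,
        title        TEXT,
        content_text TEXT,  -- character data (digitized text)
        content_bin  BLOB   -- binary data (digitized picture, audio, ...)
    )
    """
)

# Character data: a digitized piece of text.
novel_excerpt = "It was a dark and stormy night."

# Binary data: the first four bytes of a PNG file signature,
# standing in for a digitized picture.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])

conn.execute(
    "INSERT INTO library_item (title, content_text, content_bin) "
    "VALUES (?, ?, ?)",
    ("sample", novel_excerpt, image_bytes),
)

# Ask the database how it classified each stored value.
row = conn.execute(
    "SELECT typeof(content_text), typeof(content_bin), length(content_bin) "
    "FROM library_item"
).fetchone()

print(row)
conn.close()
```

Both values end up as sequences of bits on disk, but the database treats them differently: the text column can be searched and compared as characters, while the binary column is stored and returned as an opaque sequence of bytes.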

