Chapter 11. Unstructured Data and the Data Warehouse

For years, two worlds have grown up side by side: the world of unstructured data and its related processing, and the world of structured data and its related processing. It is a shame that these worlds have had so little intersection, because a plethora of business opportunities opens up when an interface between the two is created.

The world of unstructured data is dominated by casual, informal activities such as those found on the personal computer and the Internet. The following data formats are typical of unstructured data:

  • Emails

  • Spreadsheets

  • Text files

  • Documents

  • Portable Document Format (.PDF) files

  • Microsoft PowerPoint (.PPT) files

Figure 11-1 shows the world of unstructured data.

The polar opposite of unstructured data is structured data. Structured data is typified by standard DBMSs, reports, indexes, databases, fields, records, and the like. Figure 11-2 depicts the structured world.

The unstructured environment is aptly named because it contains practically no format, records, or keys. People get on the Internet and say what they want with no guidance from anyone else. People build and change spreadsheets with no instructions from anyone. People write reports and memos to their own satisfaction alone. In short, there is no structure whatsoever to the unstructured environment. Furthermore, the unstructured environment contains a lot of what can be called "blather." Blather is simply communications ...
