Impediments to Connecting Data

Hopefully you're starting to be convinced that there are huge advantages to being able to easily integrate data from many different sources. But there are a few reasons people aren't doing it already.

The Representation Problem

Perhaps the most basic problem with attempting to connect data sets is the fact that most data is stored in very inflexible structures. First of all, a surprising amount of important data in science and business is kept in Excel spreadsheets, which are stored locally on people's computers, are inaccessible to others, and were never designed for integration anyway.

Even in companies where databases are made accessible, data is classically stored in relational databases, most of which have predefined schemas to fit the data that was initially believed to be important. Figure 20-2 shows a simple example of a relational schema for restaurant data. A fixed schema works well for large, predictable data sets, because relational databases perform very well when properly configured, but it causes problems when the application frequently needs new kinds of data, new fields, or new relationships.


Figure 20-2. A relational schema for restaurant data.
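To make the rigidity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (restaurant, cuisine, price_range, delivers) and the sample rows are invented for illustration and are not taken from Figure 20-2. The sketch shows that a field nobody planned for cannot simply be stored: the schema has to be altered first, and rows that already exist are left with no value for the new field.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the fields believed important at design time.
conn.execute(
    "CREATE TABLE restaurant ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " cuisine TEXT,"
    " price_range TEXT)"
)
conn.execute(
    "INSERT INTO restaurant (name, cuisine, price_range) VALUES (?, ?, ?)",
    ("Example Diner", "American", "$"),
)


def try_insert_with_delivery(connection):
    """Try to store a row that carries a field the schema may not have."""
    try:
        connection.execute(
            "INSERT INTO restaurant (name, cuisine, price_range, delivers)"
            " VALUES (?, ?, ?, ?)",
            ("Corner Noodle House", "Chinese", "$$", 1),
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement because the column is unknown.
        return False


# The new kind of data is rejected until the schema itself changes.
accepted_before = try_insert_with_delivery(conn)

# Schema migration: every new field needs a step like this one.
conn.execute("ALTER TABLE restaurant ADD COLUMN delivers INTEGER")
accepted_after = try_insert_with_delivery(conn)

# Rows that predate the change have NULL (None) for the new column.
rows = conn.execute(
    "SELECT name, delivers FROM restaurant ORDER BY id"
).fetchall()
print(accepted_before, accepted_after, rows)
```

A single added column is cheap here, but the same step has to be repeated for every new field or relationship, which is where frequent change becomes painful.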

I've seen people solve this problem in a number of ways, but two really stand out, mostly because they're opposite ends of a spectrum. The traditional approach is to continually ...
