Chapter 20. Connecting Data

Toby Segaran

Every year, people invent dozens of new or refined statistical and machine-learning techniques for combing through data sets. What almost all of these have in common is that they presuppose the existence of a clean data set containing all the information needed for the task at hand, and such a data set is often lacking in real-world situations. As Andreas Weigend, former chief scientist at Amazon, put it, "People are always asking 'what great technique can I use on this data set?' when they should be asking 'what's the best data set I can get?'"

Meanwhile, scientists are generating terabytes of data every day through their research and experiments and putting it online; governments all over the world are allowing downloads of data they have collected in operations; and the proliferation of user-generated content has created massive databases of restaurants, science fiction novels, and geolocations of streets where there was simply no comprehensive data before. Much of this data is available but sits unused except by a few specialists for whom it is sufficient on its own; for everyone else it remains upsettingly free of the one or two pieces of context that would make it 10 times more valuable.

I believe some of the biggest challenges and opportunities for the current generation of data wranglers lie in connecting disparate data sets to create new sets for analysis, and in taking advantage of the proliferation of data, new techniques that have been ...
