1

Tabular Formats

Tidy datasets are all alike, but every messy dataset is messy in its own way.

–Hadley Wickham (cf. Leo Tolstoy)

A great deal of data both does and should live in tabular formats; to put it flatly, this means formats that have rows and columns. In a theoretical sense, it is possible to represent every collection of structured data in terms of multiple “flat” or “tabular” collections if we also have a concept of relations. Relational database management systems (RDBMSs) have had a great deal of success since 1970, and a very large part of all the world’s data lives in RDBMSs. Another large share lives in formats that are not relational as such, but that are nonetheless tabular, wherein relationships may be imputed in an ad ...
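To make the claim about "flat" collections plus relations concrete, here is a minimal sketch using only the Python standard library. The nested records, the table names (`owners`, `pets`), and the helpers `flatten` and `rejoin` are all hypothetical, invented for illustration; they are not taken from the book. The sketch splits a nested structure into two tables linked by a key, then recovers the association with a join.

```python
import sqlite3

# Hypothetical nested records, invented purely for illustration.
nested = [
    {"name": "Ada", "pets": [{"pet": "Rex", "species": "dog"},
                             {"pet": "Tom", "species": "cat"}]},
    {"name": "Bo", "pets": [{"pet": "Nemo", "species": "fish"}]},
]


def flatten(records):
    """Split nested records into two flat tables linked by owner_id."""
    owners, pets = [], []
    for owner_id, rec in enumerate(records, start=1):
        owners.append((owner_id, rec["name"]))
        for p in rec["pets"]:
            # The owner_id column is the "relation" back to the owners table.
            pets.append((owner_id, p["pet"], p["species"]))
    return owners, pets


def rejoin(owners, pets):
    """Load both flat tables into an in-memory SQLite database and join them."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE owners (owner_id INTEGER PRIMARY KEY, name TEXT)")
    con.execute(
        "CREATE TABLE pets ("
        "owner_id INTEGER REFERENCES owners(owner_id), pet TEXT, species TEXT)"
    )
    con.executemany("INSERT INTO owners VALUES (?, ?)", owners)
    con.executemany("INSERT INTO pets VALUES (?, ?, ?)", pets)
    rows = con.execute(
        "SELECT o.name, p.pet, p.species "
        "FROM owners AS o JOIN pets AS p ON p.owner_id = o.owner_id "
        "ORDER BY o.owner_id, p.pet"
    ).fetchall()
    con.close()
    return rows


owners, pets = flatten(nested)
rows = rejoin(owners, pets)
for row in rows:
    print(row)
```

Each of the two tables is plain rows and columns; the nesting survives only as the shared `owner_id` key. SQLite is used here simply because it ships with Python, and any relational engine would serve the same illustrative purpose.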

