Chapter 5: Cleaning Text Data

Preparing text data takes place on several levels. From the bottom up, there are technical issues (encoding); string-level issues, which relate to the structure of the storage format, case folding, and so on; and semantic issues, which relate to the meaning of the text.

Cleaning on the technical level mostly concerns solving encoding issues. That is, can we properly interpret the sequence of bytes comprising a file as a sequence of symbols? We will find that there is no definitive answer to this question, but there are tools implementing heuristics that work well in many cases.
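To make the encoding question concrete, here is a minimal sketch of one such heuristic, written in Python for illustration. It simply tries a list of candidate encodings in order and returns the first that decodes without error. The function name decode_with_fallback and the candidate list are assumptions made for this example, not a tool discussed in the text, and dedicated detection tools use more refined heuristics.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first candidate encoding that raises no error.

    Hypothetical helper for illustration. The order of candidates matters:
    strict encodings such as UTF-8 should come first, because permissive
    single-byte encodings such as latin-1 accept any byte sequence.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The same word stored under two different encodings.
utf8_bytes = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'
cp1252_bytes = "caf\u00e9".encode("cp1252")  # b'caf\xe9'

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(cp1252_bytes)

# Why there is no definitive answer: these two bytes are valid UTF-8
# (one character) and also valid cp1252 (two characters).
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")
as_cp1252 = ambiguous.decode("cp1252")
```

Both readings of the last byte sequence are technically valid, so a decoder can only guess which one the author intended. That is why the result of any such heuristic should be checked rather than trusted blindly.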

At the string level, we find issues like checking the file format (is a file proper CSV or HTML?), cleaning up punctuation or spurious white space, and locating typos by approximate matching against a list of known or allowed terms. It also includes extracting the interesting portions of a text string, for example, by harvesting the URL that is stored in an <a href="[url]"> tag of an HTML file. Activities at the string level therefore consist of more or less classical techniques for finding and replacing substrings, as well as approximate matching of strings based on string metrics. We shall find that methods based on pattern matching and methods based on approximate matching are, in a way, complementary approaches to common text cleaning problems, and that it is effective to combine them.
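The string-level activities described above can be sketched as follows, in Python using only the standard library. This is an illustration under stated assumptions: the function names (extract_urls, normalize, match_term), the regular expression, the list of allowed terms, and the similarity cutoff of 0.8 are all hypothetical choices for this example. The similarity score comes from Python's difflib, which is one of many possible string metrics, and a regular expression is only a rough way to read HTML.

```python
import re
from difflib import get_close_matches

# Pattern matching: capture the value of a double-quoted href attribute
# inside an <a ...> tag. A real HTML parser is more robust than a regex.
HREF_PATTERN = re.compile(r'<a\s[^>]*?href\s*=\s*"([^"]*)"', re.IGNORECASE)


def extract_urls(html):
    """Return all href values found in <a> tags of an HTML string."""
    return HREF_PATTERN.findall(html)


def normalize(text):
    """Collapse white space, strip surrounding punctuation, and fold case."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.strip(".,;:!?").casefold()


def match_term(raw, allowed, cutoff=0.8):
    """Map a raw string to an allowed term, or None if nothing is close.

    Pattern-based cleaning runs first; approximate matching is used only
    when the cleaned string is not already an allowed term.
    """
    cleaned = normalize(raw)
    if cleaned in allowed:
        return cleaned
    close = get_close_matches(cleaned, allowed, n=1, cutoff=cutoff)
    return close[0] if close else None


html = '<p><a class="x" href="https://example.org/a">A</a> <A HREF="b.html">B</A></p>'
urls = extract_urls(html)

allowed_terms = ["male", "female"]
results = [match_term(s, allowed_terms) for s in ["MALE", "  Femal, ", "xyz"]]
```

In this sketch the pattern-based step removes the variation that is easy to describe by rule (case, white space, punctuation), so that the approximate step only has to deal with genuine typos. This is one simple way of combining the two approaches.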

The semantic domain involves activities that make use of the meaning of text. Here, one may ...
