Chapter 4. R Data, Part 3: Text and Factors

A lot of data comes in character (“string”) form, sometimes because it really is text, and sometimes because it was originally intended to be numeric but included a small number of non-numeric items such as the word “Missing.” Almost every data cleaning problem requires manipulating text in some way: to find entries that include particular strings, to modify column names, or something else. In this chapter, we describe some of the operations you can perform on character data. These include extracting pieces of strings, formatting numbers as text, and searching for matches inside text.
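As a rough illustration of the kinds of operations listed above, here is a minimal sketch using base R functions only. The vectors `raw` and `col_names` are made-up examples, not data from the book, and the chapter itself may use different functions for these tasks.

```r
# Hypothetical column that was meant to be numeric but contains "Missing"
raw <- c("12.5", "Missing", "7", "103.25")

# Searching for matches: which entries contain the literal string "Missing"?
is_missing <- grepl("Missing", raw, fixed = TRUE)

# Replace the non-numeric entries with NA before converting to numeric
values <- as.numeric(ifelse(is_missing, NA, raw))

# Extracting pieces of strings: the first three characters of each entry
prefix <- substr(raw, 1, 3)

# Formatting numbers as text: two decimal places for the non-missing values
formatted <- sprintf("%.2f", values[!is_missing])

# Modifying column names: replace spaces with underscores
col_names <- c("Patient ID", "Visit Date")   # hypothetical names
clean_names <- gsub(" ", "_", col_names, fixed = TRUE)
```

Setting `fixed = TRUE` asks `grepl()` and `gsub()` to treat the pattern as a literal string rather than a regular expression, which is usually the safer choice when no pattern matching is needed.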

However, there are really two ways that character data can be stored in R. One is as a vector of character strings, as we saw in Chapter 2. The tools we mentioned above are primarily for this sort of data. A second way text can appear in R is as a factor, which is a way of storing individual text entries as integers, together with a set of character labels that match the integers back to the text. Factors are important in many R modeling functions, but they can cause trouble. We discuss factors in Section 4.6.
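The difference between the two representations can be seen in a short sketch. The vectors below are invented for illustration. The second half shows one well-known way a factor can cause trouble, which may or may not be among the cases the chapter goes on to discuss.

```r
# A character vector and the factor built from it
colors <- c("red", "blue", "red", "green")
f <- factor(colors)

lev   <- levels(f)        # the character labels: "blue" "green" "red"
codes <- as.integer(f)    # the stored integers:  3 1 3 2
back  <- as.character(f)  # labels matched back to the original text

# A common pitfall: numbers that were read in as a factor
f_num <- factor(c("10", "20", "10"))
wrong <- as.numeric(f_num)                # integer codes: 1 2 1
right <- as.numeric(as.character(f_num))  # the intended values: 10 20 10
```

Converting through `as.character()` first recovers the labels, so the final `as.numeric()` sees the text "10" and "20" rather than the internal codes.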

One consideration has become much more important in recent years: handling text from alphabets other than the English one. We are very often called on to deal with text containing accented characters from Western European languages, and increasingly, particularly as a result of data from social media sources, we find ourselves with text ...
