8 Regular expressions and essential string functions

The Web consists predominantly of unstructured text. One of the central tasks in web scraping is to collect the relevant information for our research problem from heaps of textual data. Within the unstructured text we are often interested in systematic information, especially when we want to analyze the data using quantitative methods. Systematic structures can be numbers or recurrent names like countries or addresses. We usually proceed in three steps. First, we gather the unstructured text; second, we determine the recurring patterns behind the information we are looking for; and third, we apply these patterns to the unstructured text to extract the information. This chapter will focus on the last two steps.

Consider HTML documents from the previous chapters as an example. In principle, they are nothing but collections of text. Our goal is always to identify and extract those parts of the document that contain the relevant information. Ideally we can do so using XPath, but sometimes the crucial information is hidden within atomic values. In some settings, relevant information might be scattered across an HTML document, rendering approaches that exploit the document structure useless. In this chapter we introduce a powerful tool that helps retrieve data in such settings: regular expressions. Regular expressions provide us with a syntax for systematically accessing patterns in text.
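To make the last two steps concrete, here is a minimal sketch written in Python with its standard `re` module. The sample text, the phone number format, and the names `raw_text` and `phone_pattern` are all assumptions made up for illustration; they do not come from the chapter. The sketch describes a recurring pattern (step two) and then applies it to the text to pull out the matching pieces (step three).

```python
import re

# Step 1 (assumed already done): unstructured text gathered from a web page.
# This sample text is invented purely for illustration.
raw_text = (
    "Contact our offices: Berlin +49 30 1234567, "
    "Paris +33 1 23456789, or write to info@example.org."
)

# Step 2: describe the recurring pattern behind the information we want.
# Assumed format: a plus sign, a 1-3 digit country code, then one or more
# groups of digits, each preceded by a single space.
phone_pattern = re.compile(r"\+\d{1,3}(?: \d+)+")

# Step 3: apply the pattern to the unstructured text to extract the matches.
phone_numbers = phone_pattern.findall(raw_text)

print(phone_numbers)  # ['+49 30 1234567', '+33 1 23456789']
```

The pattern is deliberately narrow: it only recognizes numbers written in the assumed format, so real data with other separators (hyphens, parentheses) would need a more permissive expression.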

Consider the following short example. Imagine ...

