13 Parsing information from semistructured documents

We have learned how to handle and extract information from well-defined data structures like XML or JSON. There are standardized methods for translating these formats into R data structures. Content on the Web is highly heterogeneous, however. Occasionally we are confronted with data that are structured, but in a format for which no parser exists.

In this chapter, we demonstrate how to construct a parser that transforms pure character data into R data structures. As an example, we use climate data offered by the Natural Resources Conservation Service at the United States Department of Agriculture. We focus on a set of text files that can be downloaded from an FTP server. While the download procedure is simple, the files cannot be put into an R data structure directly. An excerpt from one of these files is shown in Figure 13.1. The displayed data are structured in a way that is human-readable but not (yet) understandable by a computer program. The main goal is to describe this structure in a way that a computer can handle.

Over the course of the case study, we use RCurl to list the files on an FTP server and retrieve them, and we draw on R’s text manipulation capabilities to build a parser for the data files. Regular expressions are a crucial tool for solving this task.
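To make the idea of a regular-expression-based parser concrete, the following is a minimal sketch in base R. It does not use the files discussed in this chapter: the text layout, the station names, the values, and the function name parse_stations() are all invented for illustration. The sketch only shows the general pattern of classifying lines with regular expressions and then assembling the matching lines into a data frame.

```r
# Hypothetical excerpt: this layout and these values are invented for
# illustration and are NOT the format of the files used in the case study.
raw_text <- c(
  "Station: Alpha Ridge",
  "Year  Jan   Feb   Mar",
  "2001  1.2   0.8   2.5",
  "2002  0.0   1.1   3.4",
  "",
  "Station: Beta Creek",
  "Year  Jan   Feb   Mar",
  "2001  2.2   1.9   0.7"
)

# parse_stations() is a hypothetical helper, not a function from any package.
parse_stations <- function(lines) {
  # Lines that open a new station block
  is_header <- grepl("^Station:", lines)
  # Data lines start with a four-digit year followed by whitespace
  is_data <- grepl("^[0-9]{4}\\s", lines)

  # Running count of headers assigns every line to its station block
  station_id <- cumsum(is_header)
  station_names <- sub("^Station:\\s*", "", lines[is_header])

  # Split each data line on runs of whitespace and convert to numbers
  fields <- strsplit(trimws(lines[is_data]), "\\s+")
  values <- do.call(rbind, lapply(fields, as.numeric))

  data.frame(
    station = station_names[station_id[is_data]],
    year = as.integer(values[, 1]),
    jan = values[, 2],
    feb = values[, 3],
    mar = values[, 4],
    stringsAsFactors = FALSE
  )
}

climate <- parse_stations(raw_text)
print(climate)
```

The sketch assumes that every data line has the same number of fields and that station headers always precede their data. A real file would likely need additional patterns for missing values, footers, and irregular spacing.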

13.1 Downloading data from the FTP server

First, we load RCurl and stringr. As we have learned, RCurl provides functionality ...
