Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining by Dominic Nyhuis, Peter Meissner, Christian Rubba, Simon Munzert

9 Scraping the Web

Having learned much about the basics of the architecture of the Web, we now turn to data collection in practice. In this chapter, we address three main aspects of web scraping with R. The first is how to retrieve data from the Web in different scenarios (Section 9.1). Recall Figure 1.4. The first part of the chapter looks at the stage where we try to get resources from servers into R. The principal technology to deal with in this step is HTTP. We offer a set of real-life scenarios that demonstrate how to use libcurl to gather data in various settings. In addition to examples based on HTTP or FTP communication, we introduce the use of web services (web application programming interfaces [APIs]) and a related authentication standard, OAuth. We also offer a solution for the problem of scraping dynamic content that we described in Chapter 6. Section 9.1.9 provides an introduction to Selenium, a browser automation tool that can be used to gather content from JavaScript-enriched pages.

The second part of the chapter turns to strategies for extracting information from gathered resources (Section 9.2). We are already familiar with the necessary technologies: regular expressions (Chapter 8) and XPath (Chapter 4). From a technology-based perspective, this corresponds to the second column of Figure 1.4. In this part we shed light on these techniques from a more practical perspective, providing a stylized sketch of the strategies and discussing their advantages and disadvantages. ...