Chapter 9. Web Scraper
This chapter covers the following:
- Scraping contents from an HTML website
- Running a headless browser
- Collecting scraped data and offering it as an API
In this chapter, you’ll build a web scraper to collect the contents of HTML websites and process them in your Node server. Web scraping is the process of extracting data from a website, and it has existed for nearly as long as the internet itself.
Before businesses made APIs public and accessible, the only way to get updated product pricing, the latest news headlines, or static web content was to visit a URL manually and read the resulting web page, much as most of us still use the internet today.
There are now more ways to process and use data than ever before, but still not enough APIs to feed those processing systems. Even where APIs exist, the type of data an end user can access may be restricted. For example, a food delivery service may provide an API for the top restaurants in a neighborhood but not allow filtering by dietary restriction. To get that data, you could instead visit the URL for results filtered by dietary restriction and scrape the resulting contents. In this chapter, you’ll explore the options available within the Node ecosystem for scraping web pages and using that data in your own application.
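To make the food delivery example concrete, here is a minimal sketch of the idea: fetch a page of filtered results and pull the restaurant names out of the HTML. Everything specific in it is an assumption made for illustration: the markup shape (`<li class="restaurant" data-diet="...">`), the function names `extractRestaurants` and `scrapeRestaurants`, and the sample data. It assumes Node 18 or later, where `fetch` is available globally. A regular expression is used only to keep the sketch dependency-free; it is brittle against real-world markup, and a proper HTML parser or a headless browser would normally be the better choice.

```javascript
// Hypothetical markup: each result is rendered as
// <li class="restaurant" data-diet="vegan">Restaurant Name</li>
const RESTAURANT_PATTERN =
  /<li class="restaurant" data-diet="([^"]*)">([^<]*)<\/li>/g;

// Extract { diet, name } pairs from a string of HTML.
function extractRestaurants(html) {
  const results = [];
  for (const match of html.matchAll(RESTAURANT_PATTERN)) {
    results.push({ diet: match[1], name: match[2].trim() });
  }
  return results;
}

// Download a page and scrape it. The URL is supplied by the caller;
// no real service or endpoint is assumed here.
async function scrapeRestaurants(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractRestaurants(await response.text());
}

// Sample HTML standing in for a downloaded results page.
const sampleHtml = `
<ul>
  <li class="restaurant" data-diet="vegan">Green Bowl</li>
  <li class="restaurant" data-diet="gluten-free">Rice House</li>
</ul>`;

console.log(extractRestaurants(sampleHtml));
```

Keeping the parsing step separate from the network request means the extraction logic can be checked against a saved HTML string without contacting any website.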