Chapter 11
Scraping Web Pages
IN THIS CHAPTER
Understanding screen scraping
Scraping elements from web pages
Extracting data from web pages
Automating internet data extraction
In the previous chapter, you automated the web browser to fill out forms. The star of that show was the Selenium library. In this chapter, you automate the browser to extract data from websites instead of entering it.
The technique you’ll use is usually called web scraping. It’s also sometimes called screen scraping, because it seems as though the code is pulling content right from the screen. In reality, the code reads the page’s underlying source, the .html or .htm file that the browser renders. That means you can extract Hypertext Markup Language (HTML) tags along with any other content on the page.
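To make that concrete, here is a minimal sketch that uses only the standard library's html.parser module. The page source in PAGE and the TagAndTextCollector class are made up for illustration. A real scraper would download the markup rather than hard-code it.

```python
from html.parser import HTMLParser

# Hypothetical page source (an assumption for this sketch).
PAGE = """<html><body>
<h1>Store Hours</h1>
<p>Open <b>daily</b> from 9 to 5.</p>
<a href="https://example.com/contact">Contact us</a>
</body></html>"""


class TagAndTextCollector(HTMLParser):
    """Record every opening tag and every non-blank run of text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, such as <h1> or <a>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.text.append(data.strip())


collector = TagAndTextCollector()
collector.feed(PAGE)
collector.close()

print(collector.tags)
print(collector.text)
```

The code never touches the screen. It works entirely from the markup text, which is why both the tags and the visible words are available to it.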
Picking the Right Tools for Web Scraping
The most widely used module for web scraping is BeautifulSoup, which you import from the bs4 package. BeautifulSoup relies on a parser to read the page. The html.parser module is part of the Python standard library, so it's always available. An optional secondary tool, lxml, offers some speed advantages over html.parser when extracting content from the web page.
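The following sketch shows one way the parser choice might look in practice. It assumes the beautifulsoup4 package is installed. The HTML string and the make_soup helper are hypothetical, invented for this example, and the code falls back to html.parser when lxml isn't available.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption for this sketch).
HTML = """<html><body>
<h1>Sample Catalog</h1>
<ul>
  <li><a href="/items/1">First item</a></li>
  <li><a href="/items/2">Second item</a></li>
</ul>
</body></html>"""


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        import lxml  # noqa: F401  (imported only to check availability)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(markup, parser)


soup = make_soup(HTML)

# Text of the first <h1> element, with surrounding whitespace removed.
title = soup.find("h1").get_text(strip=True)

# The href attribute of every <a> tag that has one.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title)
print(links)
```

Either parser produces the same results for well-formed markup like this. The difference shows up mainly in speed on large pages and in how each parser repairs broken HTML.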
BeautifulSoup is also often used with the requests ...