O'Reilly Live Online Training

Advanced Web Scraping

Data

Max Humber

Scraping data from a website like Wikipedia or sports-reference.com is pretty easy. Everything is rendered with vanilla HTML/CSS, and the tag elements are predictable and well labeled.

But what if the data you need to scrape isn’t tagged properly? Or it’s locked behind a login page, requires clicking and scrolling to get at, or is rendered with JavaScript? What then? In the past you might have given up and moved on... No more!

In this live training, Max will help you take your web scraping skills to the next level so that you will be better equipped for the next pesky page that you have to scrape!

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • Why some websites are harder to scrape than others
  • How to scrape data that is rendered in-browser with JavaScript
  • How to automate some browser tasks (like clicking and scrolling)

And you’ll be able to:

  • Schedule scraping jobs on a server
  • Set up notifications and email triggers based on certain events

This training course is for you because...

  • You already have some web scraping experience, such as by taking Web Scraping in 60 Minutes (live online training course with Max Humber)
  • You want to scrape more difficult websites for personal and professional projects
  • You want to learn about the latest and greatest scraping tools

Prerequisites

  • Required: Experience with Python, and familiarity with BeautifulSoup
  • Optional: Take Web Scraping in 60 Minutes (live online training course with Max Humber)

Recommended preparation:

Recommended follow-up:

About your instructor

  • Max Humber is a lead instructor at General Assembly and the author of Personal Finance with Python. He was the first data scientist at Borrowell and the second data engineer at Wealthsimple.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (5 minutes)

  • Who am I, and who are you?
  • Poll:
  • Poll:
  • Learning Agenda

Basics (5 minutes)

  • A quick review of how to fetch HTML and parse it
  • How to target HTML element tags and attributes
  • Exercise: Scrape a “simple” website
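
As a rough illustration of targeting tags and attributes, here is a minimal sketch that uses only Python's standard library rather than BeautifulSoup (in BeautifulSoup, a call along the lines of `soup.find_all("td", class_="team")` would cover the same ground). The HTML snippet, the class names `team` and `wins`, and the helper `extract_cells` are all made up for this example and are not taken from the course materials.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a fetched page. A real page could be fetched
# with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<table id="standings">
  <tr><td class="team">Alpha</td><td class="wins">3</td></tr>
  <tr><td class="team">Beta</td><td class="wins">5</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute matches."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False  # True while we are inside a matching <td>
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == self.wanted_class:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.cells.append(data.strip())


def extract_cells(html, wanted_class):
    """Hypothetical helper: return the text of all <td class=wanted_class>."""
    parser = CellCollector(wanted_class)
    parser.feed(html)
    return parser.cells


teams = extract_cells(SAMPLE_HTML, "team")
wins = [int(w) for w in extract_cells(SAMPLE_HTML, "wins")]
print(dict(zip(teams, wins)))
```

The same idea applies to any tag and attribute pair: pick the tag name, pick the attribute value that labels the data, and collect the text in between.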

Pesky Pages (15 minutes)

  • How to scrape data locked behind a login page
  • How to scrape data rendered with JavaScript
  • Exercise: Scrape a website with login credentials
  • Q&A (5 minutes)
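
One common way to get past a login page is to submit the login form yourself and keep the session cookie for later requests. The sketch below shows that shape with the standard library only; the course may well use a different tool. The URL, the form field names `username` and `password`, and the helpers `make_session` and `build_login_request` are assumptions for illustration, and the network calls are left commented out.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request a browser would send from a login form."""
    # The field names are hypothetical; inspect the real form to find them.
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    )
    return urllib.request.Request(
        login_url, data=form.encode("utf-8"), method="POST"
    )


opener, jar = make_session()
request = build_login_request("https://example.com/login", "me", "secret")

# Against a real site, the next two lines would log in and then fetch a
# protected page using the cookie stored in `jar`:
# opener.open(request)
# page = opener.open("https://example.com/account").read()
print(request.get_method(), request.full_url)
```

Real login forms often include extra hidden fields (such as anti-forgery tokens), so the form usually needs to be inspected before a sketch like this will work.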

Scheduling (10 minutes)

  • How to put a scraper on a schedule
  • How to send emails with scraping results
  • Exercise: Schedule a scraper
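
To make the scheduling and email steps concrete, here is a minimal sketch using only Python's standard library. It is not the course's own setup: on a server, a tool such as cron is a common alternative to an in-process scheduler. The `scrape` function returns placeholder rows, the addresses are examples, and the helpers `build_email` and `run_every` are hypothetical names. Messages are collected in a list instead of being sent.

```python
import sched
import time
from email.message import EmailMessage


def scrape():
    """Placeholder for a real scraper; returns made-up rows."""
    return [{"item": "example", "price": 10}]


def build_email(rows, sender, recipient):
    """Summarise scraping results in an email message."""
    message = EmailMessage()
    message["Subject"] = f"Scrape finished: {len(rows)} row(s)"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        "\n".join(f"{row['item']}: {row['price']}" for row in rows)
    )
    return message


def run_every(scheduler, interval, job, repeats):
    """Queue `job` to run `repeats` times, `interval` seconds apart."""
    for i in range(repeats):
        scheduler.enter(interval * (i + 1), 1, job)


outbox = []  # stands in for an SMTP server in this sketch


def job():
    rows = scrape()
    outbox.append(build_email(rows, "scraper@example.com", "me@example.com"))
    # To send for real, something like:
    # smtplib.SMTP(host).send_message(outbox[-1])


scheduler = sched.scheduler(time.monotonic, time.sleep)
run_every(scheduler, interval=0.01, job=job, repeats=3)
scheduler.run()  # blocks until every queued job has run
print(len(outbox), outbox[0]["Subject"])
```

The tiny interval keeps the demonstration fast; a real job would use an interval measured in minutes or hours.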

Browser Automation (15 minutes)

  • How to replicate scrolling and browser clicks to get at hard-to-parse data
  • How to leverage Optical Character Recognition (OCR)
  • How to scrape images and other multimedia types
  • Exercise: Use OCR to parse text that is not stored as text (for example, text inside an image)

Conclusion + Q&A (5 minutes)