Live Online Training

Web Scraping in 60 Minutes

Max Humber

Websites contain lots of useful data. Extracting that data, however, is often difficult because websites are designed for humans (and not bots), each page is different, and some are made intentionally difficult to parse. Learning how to effectively parse HTML, therefore, is a crucial skill for the professional Python developer and Python hobbyist alike.

Web scraping is foundational for product review, price comparison, and reputation tracking applications. It can also strengthen projects built on internal data, because data becomes considerably more valuable when it is matched and combined with other sources.

In “Web Scraping in 60 Minutes,” Max will guide you through the web scraping process from start to finish so that you can supercharge your personal and professional projects.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to scrape nearly any website
  • How to structure requests to include query strings and headers
  • How to effectively manipulate text stored in an HTML document

And you’ll be able to:

  • Use tqdm (a progress bar library) to monitor the performance and speed of your scrapers
  • Save the results of a scraper for later use
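
To give a rough sense of the monitoring idea above, the sketch below wraps a scraping loop in tqdm so that progress and iteration rate are displayed while pages are processed. The `fetch_page` function and the page range are hypothetical stand-ins (no network request is made), and the fallback for a missing tqdm install is an assumption added so the sketch runs anywhere.

```python
import time

try:
    from tqdm import tqdm  # third-party: pip install tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: iterate with no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def fetch_page(page_number):
    """Hypothetical stand-in for a real HTTP request; returns fake HTML."""
    time.sleep(0.01)  # simulates latency and doubles as a polite delay
    return f"<html><body>Page {page_number}</body></html>"


def scrape(pages):
    results = []
    # tqdm wraps any iterable and prints a live progress bar with an items/sec rate.
    for page in tqdm(pages, desc="Scraping"):
        results.append(fetch_page(page))
    return results


pages_html = scrape(range(1, 6))
print(len(pages_html))
```

In a real scraper the sleep would sit between actual requests, and the rate shown by the progress bar gives a quick read on how long a full run is likely to take.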

This training course is for you because...

  • You are someone who uses Python regularly
  • You want to scrape websites for personal and professional projects
  • You want to learn about the latest and greatest scraping tools

Prerequisites

  • Some Python experience

Recommended follow-up:

About your instructor

  • Max is a Lead Instructor at General Assembly and the author of Personal Finance with Python. He was the first Data Scientist at Borrowell and the second Data Engineer at Wealthsimple.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction ( 7 minutes )

  • Discussion: Who am I, and who are you?
  • Presentation: HTML/CSS Basics
  • Presentation: Learning Agenda

Retrieve ( 8 minutes )

  • Presentation: Request and download the contents of a webpage
  • Presentation: Request and response types
  • Presentation: URL structure, param payloads, and headers
  • Exercise: Build a request URL
  • Q&A
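
As one possible take on the "build a request URL" exercise, the sketch below assembles a URL with an encoded query string and attaches a header, without sending anything over the network. It uses only the standard library, which may differ from the tooling used in class. The endpoint, parameter names, and user agent string are placeholders, not values from the course.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

BASE_URL = "https://example.com/search"  # placeholder endpoint, not from the course


def build_request(base_url, params, headers):
    """Return a Request whose URL carries the encoded query string."""
    query = urlencode(params)  # escapes spaces, ampersands, and other reserved characters
    return Request(f"{base_url}?{query}", headers=headers)


request = build_request(
    BASE_URL,
    params={"q": "web scraping", "page": 2},  # hypothetical query parameters
    headers={"User-Agent": "example-scraper/0.1"},  # hypothetical user agent string
)

# Inspect the result: the full URL, and the query string decoded back into a dict.
print(request.full_url)
print(parse_qs(urlsplit(request.full_url).query))
```

Passing the prepared object to `urllib.request.urlopen` would perform the actual download. The third-party requests library accepts the same two ingredients as the `params` and `headers` arguments of `requests.get`.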

Parse ( 25 minutes )

  • Presentation: Find and extract text based on HTML tag elements and attributes
  • Presentation: String manipulation techniques and list comprehensions for scraping
  • Presentation: Looping, sleeping, and monitoring
  • Presentation: Hack HTML tables with pandas
  • Exercise: Scrape a Wikipedia page
  • Q&A
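
The parsing steps above can be sketched roughly as follows: find elements by tag and attribute, pull out their text, then clean it up with string methods and comprehensions. This version uses the standard library's `html.parser` rather than a dedicated scraping library, so it is a swapped-in technique and not necessarily what the course teaches. The HTML fragment and the `TagTextExtractor` class are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up HTML fragment used only for illustration.
SAMPLE_HTML = """
<ul>
  <li class="item">  Widget A - $10.50 </li>
  <li class="item">Widget B - $7.25</li>
  <li class="note">Prices exclude tax</li>
</ul>
"""


class TagTextExtractor(HTMLParser):
    """Collect the text of every <tag> whose class attribute matches.

    Simplification: nested elements with the same tag name are not handled.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.css_class:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


extractor = TagTextExtractor("li", "item")
extractor.feed(SAMPLE_HTML)

# String methods plus comprehensions turn raw text into structured values.
rows = [text.strip().split(" - $") for text in extractor.texts]
prices = {name: float(price) for name, price in rows}
print(prices)
```

The same pattern of selecting, stripping, splitting, and converting carries over to whichever parsing library is used. Only the selection step changes.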

Store ( 5 minutes )

  • Presentation: Save results with context managers
  • Presentation: Save results with pandas
  • Presentation: Save results with SQLite
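
As a minimal sketch of two of the storage options listed above, the example below writes scraped rows to a CSV file inside a context manager and then to a SQLite database using the standard library's `sqlite3` module. The rows, file names, and the `products` table are invented for illustration, and a temporary directory is assumed so the sketch leaves nothing behind in the working directory.

```python
import csv
import os
import sqlite3
import tempfile

# Made-up rows standing in for scraper output.
results = [("Widget A", 10.5), ("Widget B", 7.25)]

out_dir = tempfile.mkdtemp()  # temporary directory (assumption for this sketch)
csv_path = os.path.join(out_dir, "results.csv")
db_path = os.path.join(out_dir, "results.db")

# The `with` block closes the file even if writing raises an exception.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(results)

# A sqlite3 connection used as a context manager commits on success and rolls
# back on error; it does not close the connection, so close it explicitly.
conn = sqlite3.connect(db_path)
try:
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
        conn.executemany("INSERT INTO products VALUES (?, ?)", results)
    stored = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
finally:
    conn.close()

print(stored)
```

A flat file is usually enough for a one-off scrape, while a database tends to pay off once a scraper runs repeatedly and results need to be appended or queried.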

Advanced Scraping ( 10 minutes )

  • Presentation: Use Mechanize to automate website authentication
  • Presentation: Parse JavaScript with Selenium
  • Presentation: Spotlight on API wrappers
  • Exercise: Create a Selenium Scraper

Conclusion + Q&A ( 5 minutes )