O'Reilly Live Online Training

Web scraping in 60 minutes

Retrieve, parse, and store data from any website with Python


Topic: Data
Max Humber

Websites contain lots of useful data. Extracting that data is often difficult because websites are designed for humans (not bots), each page is different, and some are intentionally difficult to interpret. Learning how to effectively parse HTML is a crucial skill for professional Python developers and Python hobbyists alike.

Web scraping is foundational for product review, price comparison, and reputation tracking applications. It also benefits projects built on internal data, because data becomes far more valuable when it’s matched and combined with other sources.

Expert Max Humber guides you through the web scraping process from start to finish. Join in to build the skills to supercharge your personal and professional projects.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • How to scrape nearly any website
  • How to structure requests to include query strings and headers
  • How to effectively manipulate text stored in an HTML document

And you’ll be able to:

  • Use the progress bar library tqdm to monitor the performance and speed of your scrapers
  • Save the results of a scraper for later use

This training course is for you because...

  • You use Python regularly.
  • You want to scrape websites for personal and professional projects.
  • You want to learn about the latest and greatest scraping tools.

Prerequisites

  • Experience with Python

Recommended follow-up:

  • Read Web Scraping with Python, second edition (book)
  • Read Learn Selenium (book)

About your instructor

  • Max Humber is a distinguished faculty member at General Assembly and the author of Personal Finance with Python. Previously, he was the first data scientist at Borrowell and the second data engineer at Wealthsimple.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (7 minutes)

  • Group discussion: Introductions; number of websites you’ve scraped before; professional or personal interest
  • Lecture: HTML and CSS basics; learning agenda

Retrieve (8 minutes)

  • Lecture: Requesting and downloading web page contents; request and response types; URL structure, param payloads, and headers
  • Hands-on exercise: Build a request URL
  • Q&A
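
To give a feel for the "build a request URL" exercise, here is a minimal sketch that encodes a parameter payload into a query string and attaches a header. It uses only the standard library, which is an assumption made so the snippet runs anywhere; the course may well use a different HTTP library. The endpoint, parameters, and user agent are hypothetical, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, headers):
    """Return a Request whose URL carries params as a query string."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers=headers)


req = build_request(
    "https://example.com/search",        # hypothetical endpoint
    {"q": "web scraping", "page": 2},    # hypothetical parameter payload
    {"User-Agent": "course-demo/0.1"},   # hypothetical user agent
)

# Spaces in the payload are encoded as "+" by urlencode.
print(req.full_url)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually send it; that step is left out here on purpose.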

Parse (25 minutes)

  • Lecture: Finding and extracting text based on HTML tag elements and attributes; string manipulation techniques and list comprehensions for scraping; looping, sleeping, and monitoring; hacking HTML tables with pandas
  • Hands-on exercise: Scrape a Wikipedia page
  • Q&A
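
As a rough illustration of extracting text by tag and attribute, followed by string cleanup in a list comprehension, the sketch below parses a small inline HTML snippet. It assumes the standard library's `html.parser` purely to stay dependency-free; the course itself may use a dedicated parsing library. The HTML, the `lang` class name, and the `LinkParser` class are made up for the example.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a downloaded page.
HTML = """
<ul>
  <li class="lang"><a href="/wiki/Python">Python</a></li>
  <li class="lang"><a href="/wiki/Rust"> Rust </a></li>
  <li class="other"><a href="/wiki/Main_Page">Main page</a></li>
</ul>
"""


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside <li class="lang">."""

    def __init__(self):
        super().__init__()
        self.in_lang = False   # True while inside a matching <li>
        self.href = None       # href of the <a> currently being read
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.in_lang = attrs.get("class") == "lang"
        elif tag == "a" and self.in_lang:
            self.href = attrs.get("href")

    def handle_data(self, data):
        # Only text that directly follows a matching <a> is kept.
        if self.href is not None:
            self.links.append((self.href, data.strip()))
            self.href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None
        elif tag == "li":
            self.in_lang = False


parser = LinkParser()
parser.feed(HTML)
links = parser.links

# String manipulation in a list comprehension: normalize the link text.
names = [text.lower() for _, text in links]
print(links)
print(names)
```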

Store (5 minutes)

  • Lecture: Saving results with context managers, pandas, and SQLite
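
One way to picture saving results with context managers and SQLite is the sketch below, which writes the same rows to a CSV file and to a SQLite table. The rows, file names, and `links` table are hypothetical, and pandas is left out so the example depends only on the standard library.

```python
import csv
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Made-up scraper output: (name, href) pairs.
rows = [("Python", "/wiki/Python"), ("Rust", "/wiki/Rust")]


def save_csv(path, rows):
    # The "with" block closes the file even if writing fails.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "href"])
        writer.writerows(rows)


def save_sqlite(path, rows):
    # closing() closes the connection; "with conn" commits or rolls back.
    with closing(sqlite3.connect(path)) as conn:
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS links (name TEXT, href TEXT)")
            conn.executemany("INSERT INTO links VALUES (?, ?)", rows)


def load_sqlite(path):
    with closing(sqlite3.connect(path)) as conn:
        return conn.execute("SELECT name, href FROM links ORDER BY name").fetchall()


# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "links.csv"
    db_path = Path(tmp) / "links.db"
    save_csv(csv_path, rows)
    save_sqlite(db_path, rows)
    csv_text = csv_path.read_text(encoding="utf-8")
    stored = load_sqlite(db_path)

print(csv_text)
print(stored)
```

In a real scraper the paths would point somewhere permanent so the results survive for later use.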

Advanced scraping (10 minutes)

  • Lecture: Using mechanize to automate website authentication; scraping JavaScript-rendered pages with Selenium; spotlight on API wrappers
  • Hands-on exercise: Create a Selenium scraper

Wrap-up and Q&A (5 minutes)