Advanced Web Scraping
Scraping data from a website like Wikipedia or sports-reference.com is pretty easy. Everything is rendered with vanilla HTML/CSS, and the tag elements are predictable and well labeled.
In this live training, Max will help you take your web scraping skills to the next level so that you will be better equipped for the next pesky page that you have to scrape!
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- Why some websites are harder to scrape than others
- How to automate some browser tasks (like clicking and scrolling)
And you’ll be able to:
- Schedule scraping jobs on a server
- Set up notification and email triggers based on certain events
This training course is for you because...
- You already have some web scraping experience, such as by taking Web Scraping in 60 Minutes (live online training course with Max Humber)
- You want to scrape more difficult websites for personal and professional projects
- You want to learn about the latest and greatest scraping tools
Prerequisites
- Required: Experience with Python and familiarity with BeautifulSoup
- Optional: Take Web Scraping in 60 Minutes (live online training course with Max Humber)
- Download and install Selenium
About your instructor
Max Humber is a lead instructor at General Assembly and the author of Personal Finance with Python. He was the first data scientist at Borrowell and the second data engineer at Wealthsimple.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction (5 minutes)
- Who am I, and who are you?
- Learning Agenda
Basics (5 minutes)
- A quick review of how to fetch and parse HTML
- How to target HTML element tags and attributes
- Exercise: Scrape a “simple” website
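As a rough illustration of targeting tags and attributes, the sketch below pulls link text and `href` values out of a small inline HTML string. It uses only Python's standard-library `html.parser` so that it runs without extra installs; the course itself assumes BeautifulSoup. The sample markup, the `LinkCollector` class, and the `extract_links` helper are invented for this example.

```python
from html.parser import HTMLParser

# Invented sample markup; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<table id="standings">
  <tr><td><a class="team" href="/teams/TOR">Toronto</a></td></tr>
  <tr><td><a class="team" href="/teams/BOS">Boston</a></td></tr>
  <tr><td><a href="/about">About</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._inside = False  # True while we are inside a matching <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append(("".join(self._text).strip(), self._href))
            self._inside = False


def extract_links(html, css_class):
    """Return every (text, href) pair for links with the given class."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links
```

Calling `extract_links(SAMPLE_HTML, "team")` returns only the two team links, because the "About" link has no `team` class.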
Pesky Pages (15 minutes)
- How to scrape data locked behind a login page
- Exercise: Scrape a website with login credentials
- Q&A (5 minutes)
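One common way to reach data behind a login form is to post the credentials once and reuse the resulting session cookies on later requests. A minimal standard-library sketch follows. The URLs, the `username` and `password` form field names, and the helper names are assumptions for illustration. Many real sites also require CSRF tokens or JavaScript, which this sketch does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build a POST request carrying form-encoded credentials."""
    # Field names are hypothetical; inspect the real login form to find them.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"), method="POST")


def fetch_behind_login(login_url, protected_url, username, password):
    """Log in, then fetch a page that requires the session cookie."""
    opener, _jar = make_session()
    request = build_login_request(login_url, username, password)
    with opener.open(request, timeout=10):
        pass  # any cookies set by the response are now stored in the jar
    with opener.open(protected_url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

The key design choice is that both requests go through the same opener, so whatever session cookie the login response sets is sent automatically with the second request.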
Scheduling (10 minutes)
- How to put a scraper on a schedule
- How to send emails with scraping results
- Exercise: Schedule a scraper
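The scheduling and email steps above can be sketched with the standard library alone. This is one possible approach, not necessarily the one used in class: on a server, a system scheduler such as cron is a common alternative to the in-process `sched` loop shown here. The function names, the SMTP host and port, and the email addresses in the usage note are assumptions.

```python
import sched
import smtplib
import time
from email.message import EmailMessage


def build_report(rows, sender, recipient):
    """Turn scraped rows into a plain-text email message."""
    message = EmailMessage()
    message["Subject"] = f"Scrape finished: {len(rows)} rows"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(rows) or "No rows scraped.")
    return message


def send_report(message, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


def run_on_schedule(job, interval_seconds, runs,
                    timefunc=time.monotonic, delayfunc=time.sleep):
    """Run `job` every `interval_seconds`, `runs` times, and collect its results."""
    scheduler = sched.scheduler(timefunc, delayfunc)
    results = []
    for i in range(runs):
        # Queue every run up front, offset from the time of scheduling.
        scheduler.enter(i * interval_seconds, 1, lambda: results.append(job()))
    scheduler.run()  # blocks until all queued runs have finished
    return results
```

A scraper function could be passed as `job`, with `send_report(build_report(rows, "bot@example.com", "me@example.com"))` called on its results. The `timefunc` and `delayfunc` parameters exist so the loop can be exercised with a fake clock instead of real waiting.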
Browser Automation (15 minutes)
- How to replicate scrolling and browser clicks to get at hard-to-parse data
- How to leverage Optical Character Recognition (OCR)
- How to scrape images and other multimedia types
- Exercise: Use OCR to parse text that is not stored as text, such as text inside images
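For the clicking and scrolling part of this module, a hedged sketch with Selenium might look like the following. It covers only browser automation, not OCR or image scraping. The helper names, the CSS selector argument, and the choice of Chrome are assumptions. The scrolling loop is written against plain callables so that it can be tried out without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, max_rounds=10, pause=0.0):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds


def scrape_with_selenium(url, button_selector):
    """Click a button, scroll to the end of the page, and return the rendered HTML."""
    # Imported lazily so the scrolling helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        driver.find_element(By.CSS_SELECTOR, button_selector).click()
        scroll_until_stable(
            get_height=lambda: driver.execute_script(
                "return document.body.scrollHeight"
            ),
            scroll_to_bottom=lambda: driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            ),
            pause=1.0,
        )
        return driver.page_source
    finally:
        driver.quit()
```

The HTML returned by `scrape_with_selenium` can then be handed to the same parsing code used for static pages, since the browser has already executed the JavaScript that produced the content.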
Conclusion + Q&A (5 minutes)