Web Scraping in 60 Minutes
Websites contain lots of useful data. Extracting that data, however, is often difficult because websites are designed for humans (and not bots), each page is different, and some are made intentionally difficult to parse. Learning how to effectively parse HTML, therefore, is a crucial skill for the professional Python developer and Python hobbyist alike.
Web scraping underpins product review, price comparison, and reputation tracking applications. It can also strengthen projects built on internal data, because data becomes far more valuable when it’s matched and combined with other sources.
In “Web Scraping in 60 Minutes,” Max will guide you through the web scraping process from start to finish, so that you can supercharge your personal and professional projects.
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- How to scrape nearly any website
- How to structure requests to include query strings and headers
- How to effectively manipulate text stored in an HTML document
And you’ll be able to:
- Use tqdm (a progress bar library) to monitor the performance and speed of your scrapers
- Save the results of a scraper for later use
This training course is for you because...
- You are someone who uses Python regularly
- You want to scrape websites for personal and professional projects
- You want to learn about the latest and greatest scraping tools
Prerequisites
- Some Python experience
About your instructor
Max is a Lead Instructor at General Assembly and the author of Personal Finance with Python. He was the first Data Scientist at Borrowell and the second Data Engineer at Wealthsimple.
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction ( 7 minutes )
- Discussion: Who am I, and who are you?
- Presentation: HTML/CSS Basics
- Presentation: Learning Agenda
Retrieve ( 8 minutes )
- Presentation: Request and download the contents of a webpage
- Presentation: Request and response types
- Presentation: URL structure, param payloads, and headers
- Exercise: Build a request URL
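As a rough illustration of the Retrieve topics, here is a minimal sketch of building a request URL with a query string and custom headers. It uses only Python's standard library and does not send anything over the network. The base URL, the parameter names, the User-Agent value, and the `build_request` helper are all assumptions made up for this example, not material taken from the course.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, headers):
    """Return a Request whose URL carries the encoded query string.

    Hypothetical helper for illustration; nothing is sent until the
    Request is passed to urllib.request.urlopen().
    """
    query = urlencode(params)  # encodes spaces and special characters
    url = f"{base_url}?{query}" if query else base_url
    return Request(url, headers=headers)


# Example values are placeholders, not a real search endpoint.
request = build_request(
    "https://example.com/search",
    {"q": "web scraping", "page": 2},
    {"User-Agent": "course-demo/0.1"},
)
print(request.full_url)
```

Keeping the parameters in a dictionary and letting `urlencode` build the query string avoids hand-escaping spaces and special characters.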
Parse ( 25 minutes )
- Presentation: Find and extract text based on HTML tag elements and attributes
- Presentation: String manipulation techniques and list comprehensions for scraping
- Presentation: Looping, sleeping, and monitoring
- Presentation: Hack HTML tables with pandas
- Exercise: Scrape a Wikipedia page
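The Parse topics can be sketched roughly as follows: pull text out of tags that match an attribute, then tidy it with string methods and a list comprehension. The outline does not name a parsing library for this step, so this sketch swaps in the standard library's `html.parser`. The sample HTML, the `result` class name, and the `LinkTextParser` class are invented for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._capturing = False
        self._href = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._capturing = True
            self._href = attrs.get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            text = "".join(self._chunks).strip()  # trim stray whitespace
            self.links.append((self._href, text))
            self._capturing = False


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a class="result" href="/wiki/Python">  Python </a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="result featured" href="/wiki/HTML">HTML</a></li>
</ul>
"""

parser = LinkTextParser("result")
parser.feed(SAMPLE_HTML)

# List comprehension plus a string method to post-process the extracted text.
titles = [text.upper() for _, text in parser.links]
print(parser.links, titles)
```

Splitting the `class` attribute on whitespace lets the match succeed for elements with several classes, such as `result featured`.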
Store ( 5 minutes )
- Presentation: Save results with context managers
- Presentation: Save results with pandas
- Presentation: Save results with sqlite
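For the Store topics, a minimal sketch of saving scraped rows with context managers, first to a CSV file and then to SQLite, might look like the following. The rows, the `pages` table, the file name, and the helper functions are all hypothetical. The pandas route is left out so that the sketch stays within the standard library.

```python
import csv
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Made-up results standing in for the output of a scraper.
rows = [("Python", "/wiki/Python"), ("HTML", "/wiki/HTML")]


def save_csv(path, rows):
    """Write rows to a CSV file; the with block closes the file for us."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "url"])
        writer.writerows(rows)


def save_sqlite(conn, rows):
    """Insert rows inside a transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
        conn.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", rows)


# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "pages.csv"
    save_csv(csv_path, rows)
    csv_lines = csv_path.read_text(encoding="utf-8").splitlines()

# closing() ensures the connection is closed; an in-memory database is used here.
with closing(sqlite3.connect(":memory:")) as conn:
    save_sqlite(conn, rows)
    stored = conn.execute("SELECT title, url FROM pages ORDER BY title").fetchall()

print(csv_lines, stored)
```

Note that `with conn:` on a SQLite connection manages the transaction only, which is why the connection itself is wrapped in `closing()`.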
Advanced Scraping ( 10 minutes )
- Presentation: Use Mechanize to automate website authentication
- Presentation: Spotlight on API wrappers
- Exercise: Create a Selenium Scraper
Conclusion + Q&A ( 5 minutes )