9

Using Regular Expressions and PDFs

So far, we have explored some of the core Python libraries for web communication, content reading, and browser automation, and used them to find and extract data.

Regular expressions (also written Regex, regex, or RegEx; we will use regex throughout the rest of this chapter) are patterns built from a predefined set of characters and used for searching and similar activities. In Chapters 3 and 4, when carrying out web scraping, we tested and applied various available features, such as CSS selectors, XPath, and PyQuery, to locate specific content. Regex helps us with pattern matching: we are knowingly or unknowingly using regex most of the time while ...
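As a minimal sketch of what building a pattern from a predefined set of characters can look like, the following example uses Python's standard `re` module. The sample text and both patterns are illustrative assumptions, not examples from this chapter, and the email pattern is deliberately simple rather than a complete validator.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Contact sales@example.com or support@example.org before 2024-05-17."

# Word characters, dots, plus signs, or hyphens, then "@", then a dotted domain name.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Four digits, two digits, two digits, separated by hyphens.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# findall() returns every non-overlapping match as a list of strings.
emails = email_pattern.findall(text)

# search() returns the first match object, or None if nothing matches.
date_match = date_pattern.search(text)

print(emails)  # ['sales@example.com', 'support@example.org']
print(date_match.group() if date_match else None)  # 2024-05-17
```

Unlike CSS selectors or XPath, which navigate the structure of a document, these patterns work on plain text and do not need the input to be parsed as HTML or XML first.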

