16 Extracting text from web pages
This section covers
- Rendering web pages with HTML
- The basic structure of HTML files
- Extracting text from HTML files with the Beautiful Soup library
- Downloading HTML files from online sources
The internet is a great resource for text data. Millions of web pages offer text content in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, and more. All these pages can be analyzed if we download their Hypertext Markup Language (HTML) files. A markup language is a system for annotating documents that distinguishes the annotations from the document text. In the case of HTML, these annotations are instructions ...
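To make the distinction between annotations and document text concrete, here is a minimal sketch that separates the two in a small HTML string. It relies only on `html.parser` from Python's standard library, not on Beautiful Soup, which this section covers separately. The sample page and the names `TextExtractor` and `extract_text` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects document text and records tag names separately."""

    def __init__(self):
        super().__init__()
        self.tags = []    # annotations: opening tag names, in order of appearance
        self.chunks = []  # document text found between the tags

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, such as <p> or <b>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only pieces.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html):
    """Return (tag names, text chunks) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.chunks


# Hypothetical sample page, made up for this illustration.
sample_html = """
<html>
  <body>
    <h1>Restaurant review</h1>
    <p>The soup was <b>excellent</b>.</p>
  </body>
</html>
"""

tags, chunks = extract_text(sample_html)
print(tags)    # the annotations
print(chunks)  # the document text
```

The first list holds only the annotations (`html`, `body`, `h1`, and so on), while the second holds only the human-readable text. A dedicated library such as Beautiful Soup typically offers a more convenient interface for the same task.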