16 Extracting text from web pages

This section covers

Rendering web pages with HTML
The basic structure of HTML files
Extracting text from HTML files with the Beautiful Soup library
Downloading HTML files from online sources

The internet is a great resource for text data. Millions of web pages offer text content in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, and more. All these pages can be analyzed if we download their Hypertext Markup Language (HTML) files. A markup language is a system for annotating documents that distinguishes the annotations from the document text. In the case of HTML, these annotations are instructions ...
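
To make the split between annotations and document text concrete, here is a minimal sketch that walks through a small HTML string and collects the tags separately from the text they annotate. It uses Python's standard-library html.parser module rather than Beautiful Soup so that it runs without extra installs. The HTML snippet and the TextAndTagCollector class name are made up for illustration and do not come from the book.

```python
from html.parser import HTMLParser


class TextAndTagCollector(HTMLParser):
    """Collect opening tags (the annotations) separately from the text."""

    def __init__(self):
        super().__init__()
        self.tags = []        # names of opening tags, in document order
        self.text_parts = []  # non-empty text chunks found between tags

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag such as <p> or <b>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Hypothetical HTML snippet, invented for this example.
html_doc = (
    "<html><body>"
    "<h1>Data Science</h1>"
    "<p>Web pages hold <b>text</b></p>"
    "</body></html>"
)

collector = TextAndTagCollector()
collector.feed(html_doc)
collector.close()

tags = collector.tags
text = " ".join(collector.text_parts)

print(tags)  # ['html', 'body', 'h1', 'p', 'b']
print(text)  # Data Science Web pages hold text
```

The tags list holds only the annotations, while the joined string holds only the document text, which is the part we usually want for analysis.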

Data Science Bookcamp by Leonard Apeltsin
