Data Science Bookcamp

by Leonard Apeltsin
November 2021
Beginner to intermediate
704 pages
20h 16m
English
Manning Publications

16 Extracting text from web pages

This section covers

  • Rendering web pages with HTML
  • The basic structure of HTML files
  • Extracting text from HTML files with the Beautiful Soup library
  • Downloading HTML files from online sources

The internet is a great resource for text data. Millions of web pages offer limitless text content in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, etc. All these pages can be analyzed if we download their Hypertext Markup Language (HTML) files. A markup language is a system for annotating documents that distinguishes the annotations from the document text. In the case of HTML, these annotations are instructions ...
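
To make the distinction between annotations and document text concrete, here is a minimal sketch that separates the two in a tiny page. It uses Python's standard-library `html.parser` as a stand-in, not the Beautiful Soup library this section goes on to cover. The `TextExtractor` class, the `sample_html` string, and the variable names are all hypothetical, invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect document text while recording the tags (annotations) separately."""

    def __init__(self):
        super().__init__()
        self.tags = []    # annotation names, in order of appearance
        self.chunks = []  # text found between the annotations

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag such as <p> or <b>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical sample page, made up for this example.
sample_html = (
    "<html><body>"
    "<h1>Data Science</h1>"
    "<p>Text data is <b>everywhere</b> online.</p>"
    "</body></html>"
)

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

page_tags = extractor.tags                   # the annotations
page_text = " ".join(extractor.chunks)       # the document text
print(page_tags)
print(page_text)
```

The parser hands tags and text to different callbacks, which is the separation a markup language is designed to allow. A dedicated library such as Beautiful Soup offers a more convenient interface for the same task on real pages.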

ISBN: 9781617296253