Web Scraping with Python

by Ryan Mitchell
July 2015
Intermediate to advanced
256 pages
6h 28m
English
O'Reilly Media, Inc.

Chapter 6. Reading Documents

It is tempting to think of the Internet primarily as a collection of text-based websites interspersed with newfangled web 2.0 multimedia content that can mostly be ignored for the purposes of web scraping. However, this ignores what the Internet most fundamentally is: a content-agnostic vehicle for transmitting files.

Although the Internet has been around in some form or another since the late 1960s, HTML didn’t debut until 1992. Until then, the Internet consisted mostly of email and file transmission; the concept of web pages as we know them today didn’t really exist. In other words, the Internet is not a collection of HTML files. It is a collection of information, with HTML files often being used as a frame to showcase it. Without being able to read a variety of document types, including text, PDF, images, video, email, and more, we are missing out on a huge part of the available data.

This chapter covers working with documents, whether you’re downloading them to a local folder or reading them and extracting data. We’ll also look at various types of text encoding, which makes it possible to read even foreign-language HTML pages.
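The two approaches mentioned here, saving a document to a local folder and reading it straight into memory, might be sketched roughly as follows. This is a minimal illustration using only the standard library, not the book's own code. The function names `download_document` and `read_document`, the file-naming rule, and the UTF-8 default are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download_document(url, folder):
    """Save the raw bytes at `url` into `folder` and return the local path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Assumed naming rule: reuse the last segment of the URL path.
    name = Path(urlparse(url).path).name or "download"
    target = folder / name
    with urlopen(url) as response, open(target, "wb") as out:
        # Copy in chunks so large files are never held fully in memory.
        shutil.copyfileobj(response, out)
    return target


def read_document(url, encoding="utf-8"):
    """Read the document at `url` into memory and decode it as text.

    The UTF-8 default is an assumption; pass the document's real encoding
    when it is known.
    """
    with urlopen(url) as response:
        return response.read().decode(encoding)
```

Downloading keeps the bytes untouched, so it suits any file type, while reading as text only makes sense once the encoding is known. Calling `read_document` with a different `encoding` argument, such as `"iso-8859-1"`, would decode a page that was saved in that encoding.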

Document Encoding

A document’s encoding tells applications, whether your computer’s operating system or your own Python code, how to read it. The encoding can usually be deduced from the file extension, although the encoding does not mandate any particular extension. I could, for example, save ...
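To make the gap between extension and actual content concrete, the sketch below compares a guess based on the filename with a check of the file's leading bytes. It is an illustration only. The helper names, the filename `report.txt`, and the short signature table are hypothetical, and real format detection covers far more cases.

```python
import mimetypes

# Well-known leading byte signatures for two formats (a deliberately tiny table).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
}


def guess_from_extension(filename):
    """Return the MIME type suggested by the filename alone."""
    return mimetypes.guess_type(filename)[0]


def sniff_from_content(data):
    """Return the MIME type suggested by the leading bytes, or None."""
    for signature, mime_type in SIGNATURES.items():
        if data.startswith(signature):
            return mime_type
    return None


# Hypothetical case: PDF content stored under a .txt name.
pdf_bytes = b"%PDF-1.4\n"
by_name = guess_from_extension("report.txt")
by_content = sniff_from_content(pdf_bytes)
print(by_name, by_content)
```

Here the extension suggests plain text while the content identifies itself as a PDF, which is why a scraper that cares about correctness may want to inspect the bytes rather than trust the name.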


Publisher Resources

ISBN: 9781491910283