The Python Book

by Rob Mastrodomenico

January 2022

Beginner to intermediate

272 pages

6h 51m

English

Wiley

19 Web Scraping in Python

The last chapter of the book covers the concept of web scraping. This is the programmatic process of obtaining information from a web page. To do this we need to get up to speed on a number of things:

HTML
obtaining a webpage
getting information from the webpage

To do this we will create our own website using Python that we will scrape with our own code.

19.1 An Introduction to HTML

HTML stands for HyperText Markup Language and is the standard markup language for creating web pages. It is essentially the language that makes up what you see on the internet. An HTML file tells a web browser what text, images, and other content appear on a webpage. The purpose of HTML is to describe how the content is structured, not how it will be styled within a web browser. Styling is handled by a cascading style sheet (CSS): an HTML page can link to a CSS file to get information on colours, fonts, and other aspects of how the page is presented.

HTML is a markup language, so in creating HTML content you embed the text to be displayed alongside markup that describes its structure. This is done using HTML tags, which can contain name‐value pairs known as attributes. An opening tag, its content, and the matching closing tag together form an HTML element. Well‐formed HTML should have an opening and a closing tag for each element, and a tag nested inside another should be closed before the tag that contains it.
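To make tags, attributes, and elements concrete, here is a minimal sketch that uses `html.parser` from Python's standard library to walk through a small page. The page content, the `style.css` file name, and the `TagLogger` class are all made up for illustration and do not come from the book's own example.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration. The <link> tag shows how
# an HTML page can point at a CSS file (style.css is an assumed name).
PAGE = """<html>
<head>
<title>My page</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1 id="top">Hello</h1>
<p class="intro">Some text</p>
</body>
</html>"""


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: the tag's attributes.
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))

    def handle_data(self, data):
        # Skip the whitespace between tags and keep only real text.
        text = data.strip()
        if text:
            self.events.append(("text", text, {}))


parser = TagLogger()
parser.feed(PAGE)
parser.close()

for kind, value, attributes in parser.events:
    print(kind, value, attributes)
```

Each element in the output appears as an "open" event, its text, and a matching "close" event. The `<link>` tag is one of the few HTML tags that never has a closing tag, so it only produces an "open" event.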

Now that we have described what HTML is, we will ...