19 Web Scraping in Python
The last chapter of the book covers web scraping: the programmatic process of obtaining information from a web page. To do this we need to get up to speed on a number of things:
- HTML
- obtaining a web page
- getting information from the web page
To practise these steps we will create our own website using Python and then scrape it with our own code.
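As a rough preview of that workflow, the sketch below serves one small page and then scrapes it, using only the Python standard library. It is an illustration under stated assumptions, not necessarily the approach taken later in the chapter: the page content and the names `PAGE`, `PageHandler`, and `HeadingParser` are all invented for this example.

```python
import threading
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical page content, invented for this example.
PAGE = b"<html><body><h1>Hello from our own site</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Serve the same small HTML page for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the console quiet


class HeadingParser(HTMLParser):
    """Collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.heading += data


# Step 1: create the website (port 0 asks the OS for any free port).
server = HTTPServer(("127.0.0.1", 0), PageHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 2: obtain the web page (an empty ProxyHandler bypasses any system proxy).
opener = build_opener(ProxyHandler({}))
try:
    with opener.open(f"http://127.0.0.1:{port}/") as response:
        html = response.read().decode("utf-8")
finally:
    server.shutdown()
    server.server_close()

# Step 3: get information from the web page.
heading_parser = HeadingParser()
heading_parser.feed(html)
print(heading_parser.heading)
```

Running the server in a background thread keeps the whole example in one script; in practice the website and the scraper would usually be separate programs.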
19.1 An Introduction to HTML
HTML stands for HyperText Markup Language and is the standard markup language for creating web pages. It is essentially the language that makes up what you see on the web. An HTML file tells a web browser what text, images, and other content a web page contains and how that content is organised. The purpose of HTML is to describe how the content is structured, not how it is styled. Styling is handled by Cascading Style Sheets (CSS): an HTML page can link to a CSS file that specifies colours, fonts, and other aspects of how the page is presented.
HTML is a markup language, so in creating HTML content you embed the text to be displayed alongside markup that describes its role in the page. This is done using HTML tags, which can contain name-value pairs known as attributes. An opening tag, its content, and the matching closing tag together form an HTML element. Well-formed HTML should have an opening and a closing tag for each element that has content, and a nested element should be closed before the element that contains it.
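To make tags, attributes, and elements concrete, here is a minimal sketch that walks through a small HTML snippet with `html.parser` from the Python standard library and records each opening tag (with its attributes), each piece of text, and each closing tag. The snippet in `SAMPLE_HTML` and the class name `TagLogger` are made up for illustration.

```python
from html.parser import HTMLParser

# A hypothetical HTML snippet, invented for illustration.
SAMPLE_HTML = '<p class="intro">Hello, <a href="https://example.com">world</a>!</p>'


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: the tag's attributes
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


parser = TagLogger()
parser.feed(SAMPLE_HTML)
parser.close()

for event in parser.events:
    print(event)
```

The recorded events show the `a` element being opened and closed inside the `p` element, which is the nesting order that well-formed HTML follows.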
Now that we have described what HTML is, we will ...