Chapter 1. How the Internet Works
I have met very few people in my life who truly know how the internet works, and I am certainly not one of them.
The vast majority of us are making do with a set of mental abstractions that allow us to use the internet just as much as we need to. Even for programmers, these abstractions might extend only as far as what was required for them to solve a particularly tricky problem once in their career.
Due to limitations in page count and the knowledge of the author, this chapter must also rely on these sorts of abstractions. It describes the mechanics of the internet and web applications, to the extent needed to scrape the web (and then, perhaps a little more).
This chapter, in a sense, describes the world in which web scrapers operate: the customs, practices, protocols, and standards that will be revisited throughout the book.
When you type a URL into the address bar of your web browser and hit Enter, interactive text, images, and media spring up as if by magic. This same magic is happening for billions of other people every day. They’re visiting the same websites, using the same applications—often getting media and text customized just for them.
And these billions of people are all using different types of devices and software applications, written by different developers at different (often competing!) companies.
Amazingly, there is no all-powerful governing body regulating the internet and coordinating its development with any sort of legal force. Instead, different parts of the internet are governed by several different organizations that evolved over time on a somewhat ad hoc and opt-in basis.
Of course, choosing not to opt into the standards that these organizations publish may result in your contributions to the internet simply...not working. If your website can’t be displayed in popular web browsers, people likely aren’t going to visit it. If the data your router is sending can’t be interpreted by any other router, that data will be ignored.
Web scraping is, essentially, the practice of substituting a web browser for an application of your own design. Because of this, it’s important to understand the standards and frameworks that web browsers are built on. As a web scraper, you must both mimic and, at times, subvert the expected internet customs and practices.
Networking
In the early days of the telephone system, each telephone was connected by a physical wire to a central switchboard. If you wanted to make a call to a nearby friend, you picked up the phone, asked the switchboard operator to connect you, and the switchboard operator physically created (via plugs and jacks) a dedicated connection between your phone and your friend’s phone.
Long-distance calls were expensive and could take minutes to connect. Placing a long-distance call from Boston to Seattle would result in the coordination of switchboard operators across the United States creating a single enormous length of wire directly connecting your phone to the recipient’s.
Today, rather than make a telephone call over a temporary dedicated connection, we can make a video call from our house to anywhere in the world across a persistent web of wires. The wire doesn’t tell the data where to go; the data guides itself, in a process called packet switching. Although many technologies over the years contributed to what we think of as “the internet,” packet switching is really the technology that single-handedly started it all.
In a packet-switched network, the message to be sent is divided into discrete ordered packets, each with its own sender and destination address. These packets are routed dynamically to any destination on the network, based on that address. Rather than being forced to blindly traverse a single dedicated connection from sender to receiver, the packets can take any path the network chooses. In fact, packets in the same message transmission might take different routes across the network and be reordered by the receiving computer when they arrive.
If the old phone networks were like a zip line—taking passengers from a single destination at the top of a hill to a single destination at the bottom—then packet-switched networks are like a highway system, where cars going to and from multiple destinations are all able to use the same roads.
A modern packet-switching network is usually described using the Open Systems Interconnection (OSI) model, which is composed of seven layers of routing, encoding, and error handling:
- Physical layer
- Data link layer
- Network layer
- Transport layer
- Session layer
- Presentation layer
- Application layer
Most web application developers spend their days entirely in layer 7, the application layer. This is also the layer where the most time is spent in this book. However, it is important to have at least conceptual knowledge of the other layers when scraping the web. For example, TLS fingerprinting, discussed in Chapter 17, is a web scraping detection method that involves the transport layer.
In addition, knowing about all of the layers of data encapsulation and transmission can help troubleshoot errors in your web applications and web scrapers.
Physical Layer
The physical layer specifies how information is physically transmitted with electricity over the Ethernet wire in your house (or on any local network). It defines things like the voltage levels that encode 1’s and 0’s, and how fast those voltages can be pulsed. It also defines how the radio waves used by Bluetooth and WiFi are transmitted and interpreted.
This layer does not involve any programming or digital instructions but is based purely on physics and electrical standards.
Data Link Layer
The data link layer specifies how information is transmitted between two nodes in a local network, for example, between your computer and a router. It defines the beginning and ending of a single transmission and provides for error correction if the transmission is lost or garbled.
At this layer, packets are wrapped in an additional “digital envelope” containing routing information and are referred to as frames. When the routing information in the frame is no longer needed, the frame is unwrapped and its contents continue across the network as a packet.
It’s important to note that, at the data link layer, all devices on a network are receiving the same data at all times—there’s no actual “switching” or control over where the data is going. However, devices that the data is not addressed to will generally ignore the data and wait until they get something that’s meant for them.
Network Layer
The network layer is where packet switching, and therefore “the internet,” happens. This is the layer that allows packets from your computer to be forwarded by a router and reach devices beyond their immediate network.
The network layer involves the Internet Protocol (IP) part of the Transmission Control Protocol/Internet Protocol (TCP/IP). IP is where we get IP addresses from. For instance, my IP address on the global internet is currently 173.48.178.92. This allows any computer in the world to send data to me and for me to send data to any other address from my own address.
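If you’re curious which IP address a given hostname resolves to, Python’s standard socket module can look it up. This is a minimal illustrative sketch, not an example from this chapter; the hostname example.com is an arbitrary choice:

import socket

# Resolve a hostname to the IPv4 address that packets will be routed to
ip_address = socket.gethostbyname('example.com')
print(ip_address)  # prints a dotted-quad IPv4 address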
Transport Layer
Layer 4, the transport layer, concerns itself with connecting a specific service or application running on a computer to a specific application running on another computer, rather than just connecting the computers themselves. It’s also responsible for any error correction or retrying needed in the stream of data.
TCP, for example, is very picky and will keep requesting any missing packets until all of them are correctly received. TCP is often used for file transfers, where all packets must be correctly received in the right order for the file to work.
In contrast, the User Datagram Protocol (UDP) will happily skip over missing packets in order to keep the data streaming in. It’s often used for videoconferencing or audioconferencing, where a temporary drop in transmission quality is preferable to a lag in the conversation.
Because different applications on your computer can have different data reliability needs at the same time (for instance, making a phone call while downloading a file), the transport layer is also where the port number comes in. The operating system assigns each application or service running on your computer to a specific port, which it uses to send and receive data.
This port is often written as a number after the IP address, separated by a colon. For example, 71.245.238.173:8080 refers to the application assigned by the operating system to port 8080 on the computer at IP address 71.245.238.173.
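To make the address-and-port idea concrete, here is a minimal sketch using Python’s standard socket module (the hostname and port are arbitrary choices, not values from the text). It opens a TCP connection to port 80 on a remote machine and prints both ends of the connection:

import socket

# Open a TCP (transport layer) connection to port 80 on a remote host
with socket.create_connection(('example.com', 80), timeout=10) as sock:
    # The operating system assigned our end of the connection an
    # ephemeral port as well, visible as (local_ip, local_port)
    print('local address: ', sock.getsockname())
    print('remote address:', sock.getpeername())

Each side of the conversation is identified by an IP address plus a port, which is exactly what the 71.245.238.173:8080 notation expresses.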
Session Layer
The session layer is responsible for opening and closing a session between two applications. This session keeps track of stateful information, such as what data has and hasn’t been sent and which computer is on the other end of the communication. The session generally stays open for as long as it takes to complete the data request, and then closes.
The session layer allows for retrying a transmission in case of a brief crash or disconnect.
Sessions Versus Sessions
Sessions in the session layer of the OSI model are different from sessions and session data that web developers usually talk about. Session variables in a web application are a concept in the application layer that are implemented by the web browser software.
Session variables, in the application layer, stay in the browser for as long as they need to or until the user closes the browser window. In the session layer of the OSI model, the session usually only lasts for as long as it takes to transmit a single file!
Presentation Layer
The presentation layer transforms incoming data from character strings into a format that the application can understand and use. It is also responsible for character encoding and data compression. The presentation layer cares about whether incoming data received by the application represents a PNG file or an HTML file, and hands this file to the application layer accordingly.
Application Layer
The application layer interprets the data encoded by the presentation layer and uses it appropriately for the application. I like to think of the presentation layer as being concerned with transforming and identifying things, while the application layer is concerned with “doing” things. For instance, HTTP with its methods and statuses is an application layer protocol. The more banal JSON and HTML (because they are file types that define how data is encoded) are presentation layer protocols.
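To make the distinction concrete, here is a minimal sketch, an illustration rather than an example from this chapter, using Python’s standard http.client module. The GET method and the response status belong to the application layer, while the Content-Type header describes how the body is encoded; the host example.com is arbitrary:

from http.client import HTTPSConnection

# Issue an HTTP GET request (an application layer exchange) and inspect
# the status, a header, and the start of the body
conn = HTTPSConnection('example.com')
conn.request('GET', '/')
response = conn.getresponse()

print(response.status, response.reason)     # e.g. 200 OK
print(response.getheader('Content-Type'))   # how the body is encoded
print(response.read()[:100])                # the first bytes of the HTML
conn.close()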
HTML
The primary function of a web browser is to display HTML (HyperText Markup Language) documents. HTML documents are files that end in .html or, less frequently, .htm.
Like text files, HTML files are encoded with plain-text characters, usually ASCII (see “Text Encoding and the Global Internet”). This means that they can be opened and read with any text editor.
This is an example of a simple HTML file:
<html>
  <head>
    <title>A Simple Webpage</title>
  </head>
  <body>
    <!-- This comment text is not displayed in the browser -->
    <h1>Hello, World!</h1>
  </body>
</html>
HTML files are a special type of XML (Extensible Markup Language) file. Each string beginning with a < and ending with a > is called a tag.
The XML standard defines the concept of opening or starting tags, like <html>, and closing or ending tags, which begin with a </, like </html>. Between the starting and ending tags is the content of the tags.
In cases where it’s unnecessary for tags to have any content at all, you may see a tag that acts as its own closing tag. This is called an empty element tag or a self-closing tag, and it looks like this:
<p />
Tags can also have attributes in the form of attributeKey="attribute value", for example:

<div class="content">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit
</div>

Here, the div tag has the attribute class, which has the value content.
An HTML element has a starting tag with some optional attributes, some content, and a closing tag. An element can also contain multiple other elements, in which case they are nested elements.
While XML defines these basic concepts of tags, content, attributes, and values, HTML defines what those tags can and can’t be, what they can and cannot contain, and how they must be interpreted and displayed by the browser.
For example, the HTML standard defines the usage of the class attribute and the id attribute, which are often used to organize and control the display of HTML elements:

<h1 id="main-title">Some Title</h1>
<div class="content">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit
</div>

As a rule, multiple elements on the page can contain the same class value; however, any value in the id field must be unique on that page. So multiple elements could have the class content, but there can be only one element with the id main-title.
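This distinction matters to web scrapers, because id and class attributes are the most common handles for locating data on a page. As a preview, here is a minimal sketch using BeautifulSoup, a popular third-party HTML parsing library; the markup repeats the example above, and this is an illustration rather than code from this chapter:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<h1 id="main-title">Some Title</h1>
<div class="content">Lorem ipsum dolor sit amet</div>
<div class="content">consectetur adipiscing elit</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# id values are unique, so find() returns at most one element
print(soup.find(id='main-title').get_text())

# class values can repeat, so find_all() may return several elements
for div in soup.find_all(class_='content'):
    print(div.get_text())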
How the elements in an HTML document are displayed in the web browser is entirely dependent on how the web browser, as a piece of software, is programmed. If one web browser is programmed to display an element differently than another web browser, this will result in inconsistent experiences for users of different web browsers.
For this reason, it’s important to coordinate exactly what the HTML tags are supposed to do and codify this into a single standard. The HTML standard is currently maintained as a living standard by the Web Hypertext Application Technology Working Group (WHATWG), whose specifications are endorsed by the World Wide Web Consortium (W3C). The current specification for all HTML tags can be found at https://html.spec.whatwg.org/multipage/.
However, the formal HTML standard is probably not the best place to learn HTML if you’ve never encountered it. A large part of web scraping involves reading and interpreting raw HTML files found on the web. If you’ve never dealt with HTML before, I highly recommend a book like HTML & CSS: The Good Parts to get familiar with some of the more common HTML tags.
CSS
Cascading Style Sheets (CSS) define the appearance of HTML elements on a web page. CSS defines things like layout, colors, position, size, and other properties that transform a boring HTML page with browser-defined default styles into something more appealing for a modern web viewer.
Using the HTML example from earlier:
<html>
  <head>
    <title>A Simple Webpage</title>
  </head>
  <body>
    <!-- This comment text is not displayed in the browser -->
    <h1>Hello, World!</h1>
  </body>
</html>
some corresponding CSS might be:
h1 {
  font-size: 20px;
  color: green;
}
This CSS will set the h1 tag’s content font size to 20 pixels and display it in green text.
The h1 part of this CSS is called the selector or the CSS selector. This selector indicates that the CSS inside the curly braces will be applied to the content of any h1 tags.
CSS selectors can also be written to apply only to elements with certain class or id attributes. For example, using the HTML:

<h1 id="main-title">Some Title</h1>
<div class="content">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit
</div>

the corresponding CSS might be:

h1#main-title {
  font-size: 20px;
}

div.content {
  color: green;
}

A # is used to indicate the value of an id attribute, and a . is used to indicate the value of a class attribute.
If the tag itself is unimportant, the tag name can be omitted entirely. For instance, this CSS would turn the contents of any element having the class content green:
.content {
  color: green;
}
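The same selector syntax is useful for scraping, not just styling. For instance, BeautifulSoup’s select method accepts CSS selectors directly; this is an illustrative sketch using that third-party library, assuming the HTML from the earlier example:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<h1 id="main-title">Some Title</h1>
<div class="content">Lorem ipsum dolor sit amet</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# The selectors used by CSS rules also work for locating elements
print(soup.select('h1#main-title'))   # elements with the id "main-title"
print(soup.select('.content'))        # elements with the class "content"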
CSS data can be contained either in the HTML itself or in a separate CSS file with a .css file extension. CSS in the HTML file is placed inside <style> tags in the head of the HTML document:

<html>
  <head>
    <style>
      .content {
        color: green;
      }
    </style>
...
More commonly, you’ll see CSS being imported in the head of the document using the link tag:

<html>
  <head>
    <link rel="stylesheet" href="mystyle.css">
...
As a web scraper, you won’t often find yourself writing style sheets to make the HTML pretty. However, it is important to be able to read and recognize how an HTML page is being transformed by the CSS in order to relate what you’re seeing in your web browser to what you’re seeing in code.
For instance, you may be confused when an HTML element doesn’t appear on the page. When you read the element’s applied CSS, you see:
.mystery-element {
  display: none;
}

This sets the element’s display property to none, hiding it from the page.
If you’ve never encountered CSS before, you likely won’t need to study it in any depth in order to scrape the web, but you should be comfortable with its syntax and note the CSS rules that are mentioned in this book.
JavaScript
When a client makes a request to a web server for a particular web page, the web server executes some code to create the web page that it sends back. This code, called server-side code, can be as simple as retrieving a static HTML file and sending it on. Or, it can be a complex application written in Python (the best language), Java, PHP, or any number of common server-side programming languages.
Ultimately, this server-side code creates some sort of stream of data that gets sent to the browser and displayed. But what if you want some type of interaction or behavior—a text change or a drag-and-drop element, for example—to happen without going back to the server to run more code? For this, you use client-side code.
Client-side code is any code that is sent over by a web server but actually executed by the client’s browser. In the olden days of the internet (pre-mid-2000s), client-side code was written in a number of languages. You may remember Java applets and Flash applications, for example. But JavaScript emerged as the lone option for client-side code for a simple reason: it was the only language supported by the browsers themselves, without the need to download and update separate software (like Adobe Flash Player) in order to run the programs.
JavaScript originated in the mid-90s as a new feature in Netscape Navigator. It was quickly adopted by Internet Explorer, making it the standard for both major web browsers at the time.
Despite the name, JavaScript has almost nothing to do with Java, the server-side programming language. Aside from a small handful of superficial syntactic similarities, they are extremely dissimilar languages.
In 1996, Netscape (the creator of JavaScript) and Sun Microsystems (the creator of Java) entered into a license agreement allowing Netscape to use the name “JavaScript,” anticipating further collaboration between the two languages. However, this collaboration never happened, and the name has been a confusing misnomer ever since.
Although it had an uncertain start as a scripting language for a now-defunct web browser, JavaScript is now the most popular programming language in the world. This popularity is boosted by the fact that it can also be used server-side, using Node.js. But its popularity is certainly cemented by the fact that it’s the only client-side programming language available.
JavaScript is embedded into HTML pages using the <script> tag. The JavaScript code can be inserted as the tag’s content:

<script>
  alert('Hello, world!');
</script>

Or it can be referenced in a separate file using the src attribute:

<script src="someprogram.js"></script>
Unlike HTML and CSS, you likely won’t need to read or write JavaScript while scraping the web, but it is handy to at least get a feel for what it looks like. It can sometimes contain useful data. For example:
<script>
  const data = '{"some": 1, "data": 2, "here": 3}';
</script>
Here, a JavaScript variable is being declared with the keyword const (which stands for “constant”) and is being set to a JSON-formatted string containing some data, which can be parsed by a web scraper directly.
JSON (JavaScript Object Notation) is a text format that contains human-readable data, is easily parsed by web scrapers, and is ubiquitous on the web. I will discuss it further in Chapter 15.
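To show what parsing that data directly might look like, here is a minimal sketch using Python’s standard re and json modules. The page_source string and the regular expression are illustrative assumptions written for this exact snippet, not a general-purpose pattern:

import json
import re

# A page source containing the JavaScript variable from the example above
page_source = '''
<script>
  const data = '{"some": 1, "data": 2, "here": 3}';
</script>
'''

# Pull the JSON string out of the JavaScript declaration and parse it
match = re.search(r"const data = '(\{.*?\})'", page_source)
if match:
    data = json.loads(match.group(1))
    print(data['here'])  # prints 3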
You may also see JavaScript making a request to a different source entirely for data:
<script>
  fetch('http://example.com/data.json')
    .then((response) => response.json())
    .then((data) => console.log(data));
</script>
Here, JavaScript is creating a request to http://example.com/data.json and, after the response is received, parsing it as JSON and logging the result to the console (more about the “console” in the next section).
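A web scraper often doesn’t need to execute this JavaScript at all; it can simply make the same request itself. Here is a minimal sketch using Python’s standard urllib and json modules. The URL is the placeholder from the example above, so this exact request won’t return real data:

import json
from urllib.request import urlopen

# Request the same JSON resource the page's JavaScript was fetching
with urlopen('http://example.com/data.json') as response:
    data = json.load(response)

print(data)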
JavaScript was originally created to provide dynamic interactivity and animation in an otherwise static web. However, today, not all dynamic behavior is created by JavaScript. HTML and CSS also have some features that allow them to change the content on the page.
For example, CSS keyframe animation can allow elements to move, change color, change size, or undergo other transformations when the user clicks on or hovers over that element.
Recognizing how the (often literally) moving parts of a website are put together can help you avoid wild goose chases when you’re trying to locate data.
Watching Websites with Developer Tools
Like a jeweler’s loupe or a cardiologist’s stethoscope, your browser’s developer tools are essential to the practice of web scraping. To collect data from a website, you have to know how it’s put together. The developer tools show you just that.
Throughout this book, I will use developer tools as shown in Google Chrome. However, the developer tools in Firefox, Microsoft Edge, and other browsers are all very similar to each other.
To access the developer tools in your browser’s menu, use the following instructions:
Chrome
View → Developer → Developer Tools
Safari
Safari → Preferences → Advanced → Check “Show Develop menu in menu bar”
Then, using the Develop menu: Develop → Show web inspector
Microsoft Edge
Using the menu: Tools → Developer → Developer Tools
Firefox
Tools → Browser Tools → Web Developer Tools
Across all browsers, the keyboard shortcut for opening the developer tools is the same and depends only on your operating system:
Mac
Option + Command + I
Windows
CTRL + Shift + I
When web scraping, you’ll likely spend most of your time in the Network tab (shown in Figure 1-1) and the Elements tab.
The Network tab shows all of the requests made by the page as the page is loading. If you’ve never used it before, you might be in for a surprise! It’s common for complex pages to make dozens or even hundreds of requests for assets as they’re loading. In some cases, the pages may even continue to make steady streams of requests for the duration of your stay on them. For instance, they may be sending data to action tracking software, or polling for updates.
Don’t See Anything in the Network Tab?
Note that the developer tools must be open while the page is making its requests in order for those requests to be captured. If you load a page without having the developer tab open, and then decide to inspect it by opening the developer tools, you may want to refresh the page to reload it and see the requests it is making.
If you click on a single network request in the Network tab, you’ll see all of the data associated with that request. The layout of this network request inspection tool differs slightly from browser to browser, but generally allows you to see:
- The URL the request was sent to
- The HTTP method used
- The response status
- All headers and cookies associated with the request
- The payload
- The response
This information is useful for writing web scrapers that replicate these requests in order to fetch the same data the page is fetching.
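For example, once the developer tools reveal a request’s URL, method, headers, and payload, those details can be plugged into a Python HTTP client. This is a hedged sketch using the third-party requests library; the URL, header, and parameter values below are placeholders standing in for whatever the Network tab shows:

import requests  # pip install requests

# Replicate a request observed in the Network tab, copying its details
response = requests.get(
    'https://example.com/api/data',
    headers={'User-Agent': 'Mozilla/5.0'},
    params={'page': 1},
)

print(response.status_code)                   # the response status
print(response.headers.get('Content-Type'))   # a response header
print(response.text[:200])                    # the start of the response body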
The Elements tab (see Figures 1-2 and 1-3) is used to examine the structure and contents of HTML files. It’s extremely handy for examining specific pieces of data on a page in order to locate the HTML tags surrounding that data and write scrapers to grab it.
As you hover over the text of each HTML element in the Elements tab, you’ll see the corresponding element on the page visually highlight in the browser. Using this tool is a great way to explore the pages and develop a deeper understanding of how they’re constructed (Figure 1-3).
You don’t need to be an expert on the internet, networking, or even programming to begin scraping the web. However, having a basic understanding of how the pieces fit together, and how your browser’s developer tools show those pieces, is essential.