Chapter 4. Web Servers

This chapter will first extend our experience with writing basic TCP servers to the construction of basic HTTP servers. With that context and understanding of the HTTP protocol in hand, we’ll then abandon the low-level API in favor of the high-level twisted.web APIs used for constructing sophisticated web servers.

Note

Twisted Web is the Twisted subproject focusing on HTTP communication. It has robust HTTP 1.1 and HTTPS client and server implementations, proxy support, WSGI integration, basic HTML templating, and more.

Responding to HTTP Requests: A Low-Level Review

The HyperText Transfer Protocol (HTTP) is a request/response application-layer protocol, where requests are initiated by a client to a server, which responds with the requested resource. It is text-based and newline-delimited, and thus easy for humans to read.

To experiment with the HTTP protocol we’ll create a subclass of protocol.Protocol, the same class we used to build our echo servers and clients in Chapter 2. Our protocol will know how to accept a connection, process the request, and send back an HTTP-formatted response.

This section is intended as both a glimpse under the hood and a refresher on the HTTP protocol. When building real web servers, you’ll almost certainly use the higher-level twisted.web APIs Twisted provides. If you’d prefer to skip to that content, head over to Handling GET Requests.

The Structure of an HTTP Request

Every HTTP request starts with a single line containing the HTTP method, the path to the desired resource, and the HTTP version. Following this line are an arbitrary number of header lines. A blank line indicates the end of the headers. The header section is optionally followed by additional data called the body of the request, such as data being posted from an HTML form.
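
The structure just described can be sketched as a small parser. This is a minimal illustration using only the standard library, not how Twisted parses requests; the function name parse_request and the sample request bytes are assumptions made for this example.

```python
def parse_request(raw):
    """Split a raw HTTP request (bytes) into its parts."""
    # The blank line (CRLF CRLF) separates the headers from the optional body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    # Request line: method, path, and HTTP version, separated by spaces.
    method, path, version = lines[0].split(b" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


request = b"GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
method, path, version, headers, body = parse_request(request)
print(method, path, version, headers, body)
```

A real parser also has to cope with malformed input and repeated headers, which this sketch ignores.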

Here’s an example of a minimal HTTP request. This request asks the server to perform the method GET on the root resource / using HTTP version 1.1:

GET / HTTP/1.1
Host: www.example.com

We can emulate a web browser and make this HTTP GET request manually using the telnet utility (taking care to remember the newline after the headers):

$ telnet www.google.com 80
Trying 74.125.131.99...
Connected to www.l.google.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.google.com

The server responds with a line containing the HTTP version used for the response and an HTTP status code. Like the request, the response contains header lines followed by a blank line and the message body. A minimal HTTP response might look like this:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 17
Connection: Close

Hello HTTP world!

www.google.com’s response is more complicated, since it is setting cookies and various security headers, but the format is the same.
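
To make the response format concrete, here is a minimal sketch that assembles the same kind of response by hand. The helper name build_response and its default arguments are hypothetical; the point is only that the status line, the headers, a blank line, and the body are joined with \r\n, and that Content-Length is the size of the body in bytes.

```python
def build_response(body, status=b"200 OK", content_type=b"text/plain"):
    """Assemble a minimal HTTP/1.1 response as bytes."""
    lines = [
        b"HTTP/1.1 " + status,
        b"Content-Type: " + content_type,
        # Content-Length counts the bytes in the body, not characters.
        b"Content-Length: " + str(len(body)).encode("ascii"),
        b"Connection: close",
        b"",  # joining adds the CRLF that terminates the last header
    ]
    # One more CRLF forms the blank line that ends the headers.
    return b"\r\n".join(lines) + b"\r\n" + body


response = build_response(b"Hello HTTP world!")
print(response.decode("ascii"))
```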

To write our own HTTP server, we can implement a Protocol that parses newline-delimited input, parses out the headers, and returns an HTTP-formatted response. Example 4-1 shows a simple HTTP implementation that echoes each request back to the client.

Example 4-1. webecho.py
from twisted.protocols import basic
from twisted.internet import protocol, reactor

class HTTPEchoProtocol(basic.LineReceiver):
    def __init__(self):
        self.lines = []

    def lineReceived(self, line):
        self.lines.append(line)
        if not line:
            self.sendResponse()

    def sendResponse(self):
        # Twisted transports send and receive bytes, so we use bytes literals.
        self.sendLine(b"HTTP/1.1 200 OK")
        self.sendLine(b"")
        responseBody = b"You said:\r\n\r\n" + b"\r\n".join(self.lines)
        self.transport.write(responseBody)
        self.transport.loseConnection()

class HTTPEchoFactory(protocol.ServerFactory):
    def buildProtocol(self, addr):
        return HTTPEchoProtocol()

reactor.listenTCP(8000, HTTPEchoFactory())
reactor.run()

As with our basic TCP servers from Chapter 2, we create a protocol factory, HTTPEchoFactory, inheriting from protocol.ServerFactory. It builds instances of our HTTPEchoProtocol, which inherits from basic.LineReceiver so we don’t have to write our own boilerplate code for handling newline-delimited protocols.

We keep track of lines as they are received in lineReceived until we reach an empty line, the carriage return and line feed (\r\n) marking the end of the headers sent by the client. We then echo back the request text and terminate the connection.
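
It can help to see what a line-oriented protocol has to do with the raw bytes it receives. The class below is a toy, hypothetical stand-in written for this illustration, not Twisted's LineReceiver: it buffers incoming chunks, splits them on \r\n, and treats the empty line as the end of the headers.

```python
class LineBuffer:
    """Toy line splitter; a simplified stand-in, not Twisted's LineReceiver."""

    delimiter = b"\r\n"

    def __init__(self):
        self._buffer = b""
        self.lines = []

    def data_received(self, data):
        # TCP delivers arbitrary chunks, so keep any partial line around.
        self._buffer += data
        while self.delimiter in self._buffer:
            line, self._buffer = self._buffer.split(self.delimiter, 1)
            self.line_received(line)

    def line_received(self, line):
        self.lines.append(line)
        if not line:
            print("end of headers after", len(self.lines) - 1, "lines")


buf = LineBuffer()
# The request arrives split across two chunks at an awkward spot.
buf.data_received(b"GET / HT")
buf.data_received(b"TP/1.1\r\nHost: localhost:8000\r\n\r\n")
print(buf.lines)
```

The first chunk contains no delimiter, so nothing is delivered until the second chunk completes the request line.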

HTTP uses TCP as its transport-layer protocol, so we use listenTCP to register callbacks with the reactor to get notified when TCP packets containing our HTTP data arrive on our designated port.

We can start this web server with python webecho.py, then interact with it through telnet or a web browser.

Using telnet, the communication will look something like:

$ telnet localhost 8000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: localhost:8000
X-Header: "My test header"

HTTP/1.1 200 OK

You said:

GET / HTTP/1.1
Host: localhost:8000
X-Header: "My test header"
Connection closed by foreign host.

It’s interesting to see what extra information your browser adds when making HTTP requests. To send a request to the server from a browser, visit http://localhost:8000.

Figure 4-1 shows what I get when I make this request from Chrome on my MacBook.

Figure 4-1. Browser GET request

By default, Chrome is telling websites about my operating system and browser and that I browse in English, as well as passing other headers specifying properties for the response.

Parsing HTTP Requests

The HTTPEchoProtocol class in Example 4-1 understands the structure of an HTTP request, but it doesn’t know how to parse the request and respond with the resource being requested. To do this, we’ll need to make our first foray into twisted.web.

An HTTP request is represented by twisted.web.http.Request. We can specify how requests are processed by subclassing http.Request and overriding its process method. Example 4-2 subclasses http.Request to serve one of three resources: an HTML page for the root resource /, a page for /about, and a 404 http.NOT_FOUND if any other path is specified.

Example 4-2. requesthandler.py
from twisted.internet import reactor
from twisted.web import http

class MyRequestHandler(http.Request):
    # Request paths arrive as bytes, so the keys and page bodies are bytes too.
    resources = {
        b'/': b'<h1>Home</h1>Home page',
        b'/about': b'<h1>About</h1>All about me',
        }

    def process(self):
        self.setHeader(b'Content-Type', b'text/html')
        if self.path in self.resources:
            self.write(self.resources[self.path])
        else:
            self.setResponseCode(http.NOT_FOUND)
            self.write(b"<h1>Not Found</h1>Sorry, no such resource.")
        self.finish()

class MyHTTP(http.HTTPChannel):
    requestFactory = MyRequestHandler

class MyHTTPFactory(http.HTTPFactory):
    def buildProtocol(self, addr):
        return MyHTTP()

reactor.listenTCP(8000, MyHTTPFactory())
reactor.run()

As always, we register a factory that generates instances of our protocol with the reactor. In this case, instead of subclassing protocol.Protocol directly, we are taking advantage of a higher-level API, http.HTTPChannel, which inherits from basic.LineReceiver and already understands the structure of an HTTP request and the numerous behaviors required by the HTTP RFCs.

Our MyHTTP protocol specifies how to process requests by setting its requestFactory class attribute to MyRequestHandler, which subclasses http.Request. Request’s process method is a no-op that must be overridden in subclasses, which we do here. The HTTP response code is 200 unless overridden with setResponseCode, as we do to send a 404 http.NOT_FOUND when an unknown resource is requested.

To test this server, run python requesthandler.py; this will start up the web server on port 8000. You can then test accessing the supported resources, http://localhost:8000/ and http://localhost:8000/about, and an unsupported resource like http://localhost:8000/foo.

Handling GET Requests

Now that we have a good grasp of the structure of the HTTP protocol and how the low-level APIs work, we can move up to the high-level APIs in twisted.web.server that facilitate the construction of more sophisticated web servers.

Serving Static Content

A common task for a web server is to be able to serve static content out of some directory. Example 4-3 shows a basic implementation.

Example 4-3. static_content.py
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.static import File

resource = File('/var/www/mysite')
factory = Site(resource)
reactor.listenTCP(8000, factory)
reactor.run()

At this level we no longer have to worry about HTTP protocol details. Instead, we use a Site, which subclasses http.HTTPFactory and manages HTTP sessions and dispatching to resources for us. A Site is initialized with the resource to which it is managing access.

A resource must provide the IResource interface, which describes how the resource gets rendered and how child resources in the resource hierarchy are added and accessed. In this case, we initialize our Site with a File resource representing a regular, non-interpreted file.

Tip

twisted.web contains implementations for many common resources. Besides File, available resources include a customizable DirectoryListing and ErrorPage, a ProxyResource that renders results retrieved from another server, and an XMLRPC implementation.

The Site is registered with the reactor, which will then listen for requests on port 8000.

After starting the web server with python static_content.py, we can visit http://localhost:8000 in a web browser. The server serves up a directory listing for all of the files in /var/www/mysite/ (replace that path with a valid path to a directory on your system).

Static URL dispatch

What if you’d like to serve different content at different URLs?

We can create a hierarchy of resources to serve at different URLs by registering Resources as children of the root resource using its putChild method. Example 4-4 demonstrates this static URL dispatch.

Example 4-4. static_dispatch.py
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.static import File

root = File('/var/www/mysite')
# Path segments are bytes in current Twisted releases.
root.putChild(b"doc", File("/usr/share/doc"))
root.putChild(b"logs", File("/var/log/mysitelogs"))
factory = Site(root)
reactor.listenTCP(8000, factory)
reactor.run()

Now, visiting http://localhost:8000/ in a web browser will serve content from /var/www/mysite, http://localhost:8000/doc will serve content from /usr/share/doc, and http://localhost:8000/logs/ will serve content from /var/log/mysitelogs.

These Resource hierarchies can be extended to arbitrary depths by registering child resources with existing resources in the hierarchy.

Serving Dynamic Content

Serving dynamic content looks very similar to serving static content—the big difference is that instead of using an existing Resource, like File, you’ll subclass Resource to define the new dynamic resource you want a Site to serve.

Example 4-5 implements a simple clock page that displays the local time when you visit any URL.

Example 4-5. dynamic_content.py
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

import time

class ClockPage(Resource):
    isLeaf = True
    def render_GET(self, request):
        # Render methods must return bytes, so encode the text before returning it.
        return ("The local time is %s" % (time.ctime(),)).encode("utf-8")

resource = ClockPage()
factory = Site(resource)
reactor.listenTCP(8000, factory)
reactor.run()

ClockPage is a subclass of Resource. We implement a render_ method for every HTTP method we want to support; in this case we only care about supporting GET requests, so render_GET is all we implement. If we were to POST to this web server, we’d get a 405 Method Not Allowed unless we also implemented render_POST.
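
The naming convention can be pictured as a lookup by method name. The classes below are hypothetical stand-ins written for this illustration, not Twisted's Resource; they only mimic the idea that a render_ method is found by name and that a missing one results in a 405.

```python
class ToyResource:
    """Hypothetical stand-in that mimics the render_<METHOD> naming convention."""

    def render(self, method):
        handler = getattr(self, "render_" + method, None)
        if handler is None:
            # No matching render_ method: answer 405 Method Not Allowed.
            return 405, b"Method Not Allowed"
        return 200, handler()


class ToyClockPage(ToyResource):
    def render_GET(self):
        return b"The local time is ..."


page = ToyClockPage()
print(page.render("GET"))
print(page.render("POST"))
```

Adding a render_POST method to ToyClockPage would make the second call succeed without touching the dispatch code.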

The rendering method is passed the request made by the client. This is not an instance of twisted.web.http.Request, as in Example 4-2; it is instead an instance of twisted.web.server.Request, which subclasses http.Request and understands application-layer ideas like session management and rendering.

render_GET returns whatever we want served as a response to a GET request. In this case, we return a string containing the local time. If we start our server with python dynamic_content.py, we can visit any URL on http://localhost:8000 with a web browser and see the local time displayed and updated as we reload.

The isLeaf attribute describes whether or not a resource will have children. Without more work on our part (as demonstrated in Example 4-6), only leaf resources get rendered; if we set isLeaf to False and restart the server, attempting to view any URL will produce a 404 No Such Resource.

Dynamic Dispatch

We know how to serve static and dynamic content. The next step is to be able to respond to requests dynamically, serving different resources based on the URL.

Example 4-6 demonstrates a calendar server that displays the calendar for the year provided in the URL. For example, visiting http://localhost:8000/2013 will display the calendar for 2013, as shown in Figure 4-2.

Example 4-6. dynamic_dispatch.py
from twisted.internet import reactor
from twisted.web.resource import Resource, NoResource
from twisted.web.server import Site

from calendar import calendar

class YearPage(Resource):
    def __init__(self, year):
        Resource.__init__(self)
        self.year = year

    def render_GET(self, request):
        page = "<html><body><pre>%s</pre></body></html>" % (calendar(self.year),)
        return page.encode("utf-8")  # render methods must return bytes

class CalendarHome(Resource):
    def getChild(self, name, request):
        # Path segments arrive as bytes; b'' is what remains when the URL ends in /.
        if name == b'':
            return self
        if name.isdigit():
            return YearPage(int(name))
        else:
            return NoResource()

    def render_GET(self, request):
        return b"<html><body>Welcome to the calendar server!</body></html>"

root = CalendarHome()
factory = Site(root)
reactor.listenTCP(8000, factory)
reactor.run()
Figure 4-2. Calendar

This example has the same structure as Example 4-3. A TCP server is started on port 8000, serving the content registered with a Site, which is a subclass of twisted.web.http.HTTPFactory and knows how to manage access to resources.

The root resource is CalendarHome, which subclasses Resource to specify how to look up child resources and how to render itself.

CalendarHome.getChild describes how to traverse a URL from left to right until we get a renderable resource. If there is no additional component to the requested URL (i.e., the request was for / ), CalendarHome returns itself to be rendered by invoking its render_GET method. If the URL has an additional path component that is an integer, an instance of YearPage is returned to be rendered. If that path component can’t be converted to a number, an instance of twisted.web.resource.NoResource is returned instead, which will render a generic 404 page.

There are a few subtle points to this example that deserve highlighting.

Creating resources that are both renderable and have children

Note that CalendarHome does not set isLeaf to True, and yet it is still rendered when we visit http://localhost:8000.

In general, only resources that are leaves are rendered; this can be because isLeaf is set to True or because when traversing the resource hierarchy, that resource is where we are when the URL runs out. However, when isLeaf is True for a resource, its getChild method is never called. Thus, for resources that have children, isLeaf cannot be set to True.

If we want CalendarHome to both get rendered and have children, we must override its getChild method to dictate resource generation.

In CalendarHome.getChild, if name == '' (i.e., if we are requesting the root resource), we return self to get rendered. Without that if condition, visiting http://localhost:8000 would produce a 404.

Similarly, YearPage does not have isLeaf set to True. That means that when we visit http://localhost:8000/2013, we get a rendered calendar because 2013 is at the end of the URL, but if we visit http://localhost:8000/2013/foo, we get a 404.

If we want http://localhost:8000/2013/foo to generate a calendar just like http://localhost:8000/2013, we need to set isLeaf to True or have YearPage override getChild to return itself, like we do in CalendarHome.
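
The traversal rules above can be modeled in a few lines. This is a simplified sketch of the idea, not Twisted's implementation: the Toy classes and the traverse function are hypothetical, and they use text path segments where Twisted passes bytes.

```python
class ToyNode:
    """Hypothetical mini-resource; models child lookup only, not rendering."""

    isLeaf = False

    def getChild(self, name):
        # Default behavior: unknown children resolve to a 404 resource.
        return ToyNotFound()


class ToyNotFound(ToyNode):
    label = "404"


class ToyYearPage(ToyNode):
    def __init__(self, year):
        self.label = "calendar for %d" % year


class ToyCalendarHome(ToyNode):
    label = "welcome page"

    def getChild(self, name):
        if name == "":
            return self
        if name.isdigit():
            return ToyYearPage(int(name))
        return ToyNotFound()


def traverse(root, path):
    """Walk path segments left to right, stopping early at a leaf."""
    segments = path[1:].split("/")  # "/" -> [""], "/2013/foo" -> ["2013", "foo"]
    resource = root
    while segments and not resource.isLeaf:
        resource = resource.getChild(segments.pop(0))
    return resource


for path in ["/", "/2013", "/2013/foo", "/about"]:
    print(path, "->", traverse(ToyCalendarHome(), path).label)
```

Setting isLeaf = True on ToyYearPage stops the walk at the year, so /2013/foo would then resolve to the calendar instead of the 404.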

Redirects

In Example 4-6, visiting http://localhost:8000 produced a welcome page. What if we wanted http://localhost:8000 to instead redirect to the calendar for the current year?

In the relevant render method (e.g., render_GET), instead of rendering the resource at a given URL, we need to construct a redirect with twisted.web.util.redirectTo. redirectTo takes as arguments the URL component to which to redirect, and the request, which still needs to be rendered.

Example 4-7 shows a revised CalendarHome.render_GET that redirects to the URL for the current year’s calendar (e.g., http://localhost:8000/2013) upon requesting the root resource at http://localhost:8000.

Example 4-7. redirectTo
from datetime import datetime
from twisted.web.util import redirectTo

def render_GET(self, request):
    # redirectTo expects the target URL as bytes, not an integer.
    return redirectTo(str(datetime.now().year).encode("ascii"), request)
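
The redirect target travels back to the client in a Location header, and a relative value such as 2013 is resolved by the client against the URL it originally requested. The sketch below uses the standard library's urljoin to show that resolution; the helper name redirect_target and the sample URLs are assumptions for this example.

```python
from urllib.parse import urljoin


def redirect_target(request_url, location):
    """Resolve a Location value against the requested URL, as a client would."""
    return urljoin(request_url, location)


# A relative target replaces the last path segment of the requested URL.
print(redirect_target("http://localhost:8000/", "2013"))
print(redirect_target("http://localhost:8000/2012/notes", "2013"))
# A target with a leading slash is resolved from the root of the site.
print(redirect_target("http://localhost:8000/2012/notes", "/2013"))
```

Redirecting from the root works with a bare year, but a redirect issued from a deeper URL would need the leading slash to land on the same page.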

Handling POST Requests

To handle POST requests, implement a render_POST method in your Resource.

A Minimal POST Example

Example 4-8 serves a page where users can fill out and submit to the web server the contents of a text box. The server will then display that text back to the user.

Example 4-8. handle_post.py
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

from html import escape

class FormPage(Resource):
    isLeaf = True
    def render_GET(self, request):
        return b"""
<html>
 <body>
  <form method="POST">
   <input name="form-field" type="text" />
   <input type="submit" />
  </form>
 </body>
</html>
"""

    def render_POST(self, request):
        # request.args maps bytes field names to lists of bytes values.
        submitted = request.args[b"form-field"][0].decode("utf-8")
        page = """
<html>
 <body>You submitted: %s</body>
</html>
""" % (escape(submitted),)
        return page.encode("utf-8")

factory = Site(FormPage())
reactor.listenTCP(8000, factory)
reactor.run()

The FormPage Resource in handle_post.py implements both render_GET and render_POST methods.

render_GET returns the HTML for a blank page with a text box called "form-field". When a visitor visits http://localhost:8000, she will see this form.

render_POST extracts the text entered by the user from request.args, sanitizes it with html.escape (the replacement for cgi.escape, which was removed in Python 3.8), and returns HTML displaying what the user submitted.
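
To see why the submitted value is read with [0] and why escaping matters, here is a small standard-library sketch of the same two steps. It does not use Twisted: parse_qs stands in for the parsing that produces request.args, html.escape is the current standard-library escaping function, and the sample form body is made up for this example.

```python
from html import escape
from urllib.parse import parse_qs

# Body of a form POST as a browser would encode it (hypothetical input).
body = "form-field=%3Cb%3Ehello%3C%2Fb%3E+world"

# Each field name maps to a list, because a field may appear more than once.
args = parse_qs(body)
print(args)

# Escaping turns markup characters into entities so they display as text.
submitted = args["form-field"][0]
print(escape(submitted))
```

Without the escaping step, the submitted tags would be interpreted by the browser rather than displayed.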

Asynchronous Responses

In all of the Twisted web server examples up to this point, we have assumed that the server can instantaneously respond to clients without having to first retrieve an expensive resource (say, from a database query) or do expensive computation. What happens when responding to a request blocks?

Example 4-9 implements a dummy BusyPage resource that sleeps for five seconds before returning a response to the request.

Example 4-9. blocking.py
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

import time

class BusyPage(Resource):
    isLeaf = True
    def render_GET(self, request):
        time.sleep(5)
        # Render methods must return bytes, so encode the text before returning it.
        return ("Finally done, at %s" % (time.asctime(),)).encode("utf-8")


factory = Site(BusyPage())
reactor.listenTCP(8000, factory)
reactor.run()

If you run this server and then load http://localhost:8000 in several browser tabs in quick succession, you’ll observe that the last page to load will load N*5 seconds after the first page request, where N is the number of requests to the server. In other words, the requests are processed serially.

This is terrible performance! We need our web server to be responding to other requests while an expensive resource is being processed.

One of the great properties of this asynchronous framework is that we can achieve the responsiveness that we want without introducing threads by using the Deferred API we already know and love.

Example 4-10 demonstrates how to use a Deferred instead of blocking on an expensive resource. deferLater replaces the blocking time.sleep(5) with a Deferred that will fire after five seconds, with a callback to _delayedRender to finish the request when the fake resource becomes available. Then, instead of waiting on that resource, render_GET returns NOT_DONE_YET immediately, freeing up the web server to process other requests.

Example 4-10. non_blocking.py
from twisted.internet import reactor
from twisted.internet.task import deferLater
from twisted.web.resource import Resource
from twisted.web.server import Site, NOT_DONE_YET

import time

class BusyPage(Resource):
    isLeaf = True

    def _delayedRender(self, request):
        # request.write expects bytes, so encode the text first.
        request.write(("Finally done, at %s" % (time.asctime(),)).encode("utf-8"))
        request.finish()

    def render_GET(self, request):
        d = deferLater(reactor, 5, lambda: request)
        d.addCallback(self._delayedRender)
        return NOT_DONE_YET

factory = Site(BusyPage())
reactor.listenTCP(8000, factory)
reactor.run()

Tip

If you run Example 4-10 and then load multiple instances of http://localhost:8000 in a browser, you may still find that the requests are processed serially. This is not Twisted’s fault: some browsers, notably Chrome, serialize requests to the same resource. You can verify that the web server isn’t blocking by issuing several simultaneous requests through cURL or a quick Python script.
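
One way to write such a quick script is sketched below, using only the standard library. The function names fetch and time_concurrent are hypothetical, and the URL assumes the server from this chapter is listening on port 8000. If the server is not blocking, the total time for several simultaneous requests should be close to the time for one.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Fetch url and return how many seconds the round trip took."""
    start = time.perf_counter()
    with urlopen(url) as response:
        response.read()
    return time.perf_counter() - start


def time_concurrent(url, count=3, fetcher=fetch):
    """Issue count requests at once; return (per-request times, total time)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=count) as pool:
        durations = list(pool.map(fetcher, [url] * count))
    return durations, time.perf_counter() - start


# Usage, with one of the BusyPage servers running:
# durations, total = time_concurrent("http://localhost:8000/")
# print(durations, total)
```

Against the blocking server the total should grow with the number of requests; against the non-blocking one it should stay near five seconds.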

More Practice and Next Steps

This chapter introduced Twisted HTTP servers, from the lowest-level APIs up through twisted.web.server. We saw examples of serving static and dynamic content, handling GET and POST requests, and how to keep our servers responsive with asynchronous responses using Deferreds.

The Twisted Web HOWTO index has several in-depth tutorials related to HTTP servers, including on deployment and templating. This page is an excellent series of short, self-contained examples of Twisted Web concepts.

The Twisted Web examples directory has a variety of server examples, including examples for proxies, an XML-RPC server, and rendering the output of a server process.

Twisted is not a “web framework” like Django, web.py, or Flask. However, one of its many roles is as a framework for building frameworks! An example of this is the Klein micro-web framework, which you can browse and download on GitHub.
