Grabbing a Document from the Web

Credit: Gisle Aas

Problem

You need to grab a document from a URL on the Web.

Solution

urllib.urlopen returns a file-like object, and you can call read on it:

from urllib import urlopen

# Fetch the page and read its whole body into one string
doc = urlopen("http://www.python.org").read()
print doc

Discussion

Once you obtain a file-like object from urlopen, you can read it all at once into one big string by calling its read method, as I do in this recipe. Alternatively, you can read it as a list of lines by calling its readlines method or, for special purposes, just get one line at a time by calling its readline method in a loop. In addition to these file-like operations, the object that urlopen returns offers a few other useful features. For example, the following snippet gives you the headers of the document:

doc = urlopen("http://www.python.org")
print doc.info()

The output includes headers such as Content-Type (text/html in this case), which defines the MIME type of the document. doc.info returns a mimetools.Message instance, so you can query it directly without printing it or otherwise turning it into a string. For example, doc.info().getheader('Content-Type') returns the string 'text/html'. The maintype attribute of the mimetools.Message object is 'text', subtype is 'html', and type is the full 'text/html' string. If you need to perform sophisticated analysis and processing, all the tools you need are right there. At the same time, if your needs are simpler, you can meet them in very simple ways, as this recipe shows.
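
The three ways of reading and the header lookup described above can be tried side by side. The following is only a sketch under some assumptions that are not part of the recipe: a temporary file served through a file: URL stands in for a real web page so the code runs offline, SAMPLE is a made-up document, and the try/except import lets the same code run on Python 3, where urlopen lives in urllib.request. It uses the get method rather than getheader because, as far as I know, both mimetools.Message and the message object that Python 3 returns accept it.

```python
import os
import tempfile
from contextlib import closing

try:
    from urllib import urlopen, pathname2url          # Python 2 layout, as in the recipe
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 layout

# Hypothetical stand-in for a web page (an assumption, not from the recipe)
SAMPLE = b"<html>\n<body>hello</body>\n</html>\n"

# Write the sample to a temporary .html file and build a file: URL for it
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(SAMPLE)
url = "file:" + pathname2url(path)

try:
    # 1. Read the whole document at once
    with closing(urlopen(url)) as doc:
        whole = doc.read()

    # 2. Read the document as a list of lines
    with closing(urlopen(url)) as doc:
        lines = doc.readlines()

    # 3. Read a single line, then look up one header without printing them all
    with closing(urlopen(url)) as doc:
        first_line = doc.readline()
        content_type = doc.info().get("Content-Type")
finally:
    os.remove(path)

print("read() returned %d bytes" % len(whole))
print("readlines() returned %d lines" % len(lines))
print("Content-Type is %s" % content_type)
```

For a file: URL the Content-Type is guessed from the file extension, which is why the temporary file is given an .html suffix. Against a real HTTP server the same calls apply, but the header values come from the server's response.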

See Also

Documentation for the standard library modules urllib and mimetools in the Library Reference.
