August 2014
Beginner to intermediate
304 pages
7h 10m
English
A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.
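As a rough illustration of the kind of helper meant here, the sketch below uses two methods from the lxml.html module, make_links_absolute() and iterlinks(), to pull every link out of a small document. The sample markup, the base URL, and the extract_links name are assumptions made up for this example; they are not taken from the book.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
SAMPLE = (
    '<html><body><p>See <a href="/docs">the docs</a> and '
    '<a href="http://example.com/faq">the FAQ</a>.</p></body></html>'
)


def extract_links(markup, base_url):
    """Return every link found in markup, resolved against base_url."""
    doc = html.fromstring(markup)
    # Rewrite relative links such as "/docs" into absolute URLs in place.
    doc.make_links_absolute(base_url)
    # iterlinks() yields (element, attribute, link, pos) tuples;
    # only the link string is kept here.
    return [link for element, attribute, link, pos in doc.iterlinks()]


links = extract_links(SAMPLE, "http://example.com/")
print(links)
```

A crawler would typically pass the URL of the fetched page as base_url, so that relative links resolve against the page they were found on.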
lxml is a Python binding for the C libraries libxml2 and libxslt. This makes it a very fast XML and HTML parsing library, while still being Pythonic. However, it also means you need to install the C libraries for it to work. Installation instructions are available at http://lxml.de/installation.html. If you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml. You ...