Parsing HTML with lxml

The lxml library (https://lxml.de) is a Python binding for the libxml2 and libxslt C libraries, and it is one of the main modules for parsing and analyzing XML and HTML documents.

Its main features are as follows:

  • Support for XML and HTML
  • An API based on ElementTree
  • Support for selecting elements of the document through XPath expressions
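
The features listed above can be seen together in a short sketch. The markup strings and variable names here are made up for illustration and do not come from the book:

```python
from lxml import etree

# Hypothetical inline documents used only for this illustration.
xml_doc = "<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"
html_doc = "<html><body><p class='intro'>Hello</p><p>World</p></body></html>"

# XML support: fromstring() parses the string and returns the root element.
root = etree.fromstring(xml_doc)

# ElementTree-style API: iterate over elements, read text and attributes.
book_titles = [book.text for book in root.iter("book")]
first_id = root.find("book").get("id")

# HTML support: etree.HTML() uses the more lenient HTML parser.
page = etree.HTML(html_doc)

# XPath selection: xpath() returns a list of matching nodes or strings.
intro_text = page.xpath("//p[@class='intro']/text()")

print(book_titles, first_id, intro_text)
```

Both parsers return the same kind of element object, so the ElementTree-style calls and the XPath calls can be mixed freely on either document.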

lxml can be installed from the Python Package Index (PyPI) with pip:

pip install lxml

lxml.etree is a submodule of the lxml library that provides tools such as the XPath class and the xpath() method on parsed elements, both of which evaluate expressions written in XPath selector syntax. The following example uses the parser to read an HTML file and extract the text of the title tag through an XPath expression:

from lxml import html, etree
simple_page = open('data/simple.html').read()
parser = etree.HTML(simple_page)
...
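
Because the listing above is cut short, here is a self-contained sketch of the same idea. It is not the book's own continuation. It assumes a small inline HTML string in place of data/simple.html, and the names sample_html, find_title, and title are illustrative:

```python
from lxml import etree

# Stand-in for the contents of data/simple.html (assumed, not from the source).
sample_html = """
<html>
  <head><title>A simple page</title></head>
  <body><h1>Heading</h1><p>Some text.</p></body>
</html>
"""

# etree.HTML() parses the string and returns the root <html> element.
tree = etree.HTML(sample_html)

# Compile the expression once with etree.XPath(); the result is callable.
find_title = etree.XPath("//title/text()")

# Calling it on an element returns a list of matching text nodes.
matches = find_title(tree)
title = matches[0] if matches else None

print(title)
```

Compiling the expression with etree.XPath() is mainly worthwhile when the same query runs against many documents. For a single lookup, tree.xpath("//title/text()") gives the same result.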
