August 2014
Beginner to intermediate
304 pages
7h 10m
English
Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.
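You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions. Note that from lxml 5.2 onward, the lxml.html.clean module is distributed as a separate package, lxml_html_clean, which must be installed alongside lxml.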
We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:
>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'
The result is much cleaner and easier ...
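To make it concrete what this kind of cleaning involves, here is a minimal standard-library sketch that drops script and style blocks and any attribute whose name starts with "on". It is not how lxml implements clean_html(), and its output differs from lxml's (for instance, it does not wrap the result in a div). The names TextAndTagsOnly and strip_scripts_and_styles are hypothetical, chosen only for this illustration, and the sketch is not a security-grade sanitizer.

```python
import html
from html.parser import HTMLParser


class TextAndTagsOnly(HTMLParser):
    """Hypothetical helper: re-emit HTML without script/style blocks or on* attributes."""

    SKIPPED = {"script", "style"}  # elements whose content is dropped entirely

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default
        self.parts = []      # pieces of the cleaned output
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Crude assumption: any attribute starting with "on" is an event handler.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{html.escape(v, quote=True)}"'
            for k, v in kept
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again before re-emitting.
        if not self.skip_depth:
            self.parts.append(html.escape(data, quote=False))


def strip_scripts_and_styles(html_string):
    """Return html_string with scripts, styles, and on* attributes removed."""
    parser = TextAndTagsOnly()
    parser.feed(html_string)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = (
    "<html><head><script>alert(1)</script><style>p {color: red}</style></head>"
    "<body onload=loadfunc()>my text</body></html>"
)
print(strip_scripts_and_styles(sample))
```

For real documents, clean_html() remains the better choice, since lxml repairs malformed markup and covers many more cases than this sketch does.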