Converting HTML to ASCII
Problem
You want to convert an HTML file into formatted plain ASCII.
Solution
If you have an external formatter like lynx, call it as an external program:
$ascii = `lynx -dump $filename`;
If you want to do it within your program and can live without the things that the HTML::FormatText formatter doesn’t yet handle (tables and frames):
use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($filename);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);
Discussion
Both examples assume the HTML text is in a file. If your HTML is in a
variable, you need to write it to a file for lynx to read. If you are
using HTML::FormatText, parse the variable with the HTML::TreeBuilder
module instead of HTML::Parse:
use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof();    # tell the parser the document is complete

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);
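For the lynx route, the step of writing the variable to a file can be sketched as follows. This is a minimal sketch, not the only way to do it: the helper names html_to_tempfile and lynx_dump_string are hypothetical, File::Temp is assumed to be available (it ships with modern Perl), and lynx is assumed to be on your PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write an HTML string to a temporary file and
# return the file's path. The file is removed when the program exits.
sub html_to_tempfile {
    my ($document) = @_;
    # Assumption: the .html suffix is what lets lynx treat the local
    # file as HTML rather than as plain text.
    my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $document;
    close($fh) or die "Can't close $path: $!";
    return $path;
}

# Hypothetical helper: format an HTML string by handing it to lynx.
sub lynx_dump_string {
    my ($document) = @_;
    my $path = html_to_tempfile($document);
    # The list form of a pipe open runs lynx directly, so the path is
    # never interpreted by a shell.
    open(my $lynx, "-|", "lynx", "-dump", $path)
        or die "Can't run lynx: $!";
    my $ascii = do { local $/; <$lynx> };    # slurp all of lynx's output
    close($lynx) or die "lynx failed: exit status $?";
    return $ascii;
}
```

With these in place, $ascii = lynx_dump_string($document) plays the same role as the backtick call in the Solution, but starts from a variable instead of a filename.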
If you use Netscape, its “Save as” option with the type set to “Text” does the best job with tables.
See Also
The documentation for the CPAN modules HTML::Parse,
HTML::TreeBuilder, and HTML::FormatText; your system’s
lynx(1) manpage; Section 20.6