Perl Cookbook
by Tom Christiansen, Nathan Torkington
August 1998
Intermediate to advanced
800 pages
39h 20m
English
O'Reilly Media, Inc.
Converting HTML to ASCII

Problem

You want to convert an HTML file into formatted plain ASCII.

Solution

If you have an external formatter like lynx, call it as an external program:

$ascii = `lynx -dump $filename`;
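
Backticks hand the whole command line to the shell, so a filename containing spaces or shell metacharacters can break the call or run something unintended. One way around that is sketched below, using Perl's list-form pipe open, which starts the program directly with no shell in between. The helper name capture_output is made up for this illustration and is not part of any module.

```perl
use strict;
use warnings;

# Run a command without a shell and return everything it prints.
# capture_output() is a hypothetical helper name, not part of any module.
sub capture_output {
    my @command = @_;

    # List-form pipe open: each argument reaches the program verbatim,
    # so spaces or shell metacharacters in a filename are harmless.
    open(my $pipe, '-|', @command)
        or die "cannot run $command[0]: $!";
    local $/;                       # slurp mode: read all output at once
    my $output = <$pipe>;
    close($pipe)
        or die "$command[0] failed (wait status $?)";
    return $output;
}

# Usage, assuming lynx is installed and on the PATH:
#   my $ascii = capture_output('lynx', '-dump', $filename);
```

The trade-off is a few more lines than backticks in exchange for not having to quote $filename by hand.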

If you want to do it within your program and don’t care about the things that the HTML::FormatText formatter doesn’t yet handle (tables and frames):

use HTML::FormatText;
use HTML::Parse;

# Parse the file into a syntax tree, then render that tree as plain text.
$html = parse_htmlfile($filename);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);

Discussion

These examples both assume you have the HTML text in a file. If your HTML is in a variable, you need to write it to a file for lynx to read. If you are using HTML::FormatText, use the HTML::TreeBuilder module to parse the string directly:

use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof();               # signal end of input so the tree is completed

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);
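
For the lynx route, the HTML held in a variable first has to land in a file. The following is a minimal sketch of that step using the File::Temp module. The helper name html_to_tempfile is hypothetical, and the sketch assumes $document holds bytes rather than wide characters, and that lynx picks up the file type from the .html suffix.

```perl
use strict;
use warnings;
use File::Temp ();

# Write an HTML string to a temporary file and return the File::Temp
# object. html_to_tempfile() is a hypothetical helper name.
sub html_to_tempfile {
    my ($document) = @_;

    # The .html suffix is meant to let lynx recognize the file type;
    # the file is removed automatically when the object is destroyed.
    my $tmp = File::Temp->new(SUFFIX => '.html');
    print {$tmp} $document or die "cannot write to $tmp: $!";
    close($tmp)            or die "cannot close $tmp: $!";
    return $tmp;
}

# Usage, assuming lynx is installed and on the PATH:
#   my $tmp   = html_to_tempfile($document);
#   my $ascii = `lynx -dump $tmp`;    # $tmp stringifies to the file name
```

Keep the returned object alive until lynx has finished, since the temporary file disappears as soon as the object goes out of scope.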

If you use Netscape, its “Save as” option with the type set to “Text” does the best job with tables.

See Also

The documentation for the CPAN modules HTML::Parse, HTML::TreeBuilder, and HTML::FormatText; your system’s lynx(1) manpage; Section 20.6

ISBN: 1565922433