November 2002
You need to convert HTML to readable, formatted ASCII text.
If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
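The backtick call gives no indication of whether the formatter actually ran. As a minimal sketch, assuming a Unix-like shell, the call can be wrapped so that a failed command is reported as null. The helper name format_html_file() and its $command parameter are hypothetical and exist only for illustration:

```php
<?php
// Hypothetical helper (not from the recipe): runs an external formatter
// on a file and returns its output, or null if the command fails.
function format_html_file($file, $command = 'lynx -dump') {
    $output = array();
    $status = 0;
    // escapeshellarg() quotes the filename so the shell sees one argument.
    // 2>/dev/null discards error messages and assumes a Unix-like shell.
    exec($command . ' ' . escapeshellarg($file) . ' 2>/dev/null', $output, $status);
    if ($status !== 0) {
        return null; // formatter missing, file unreadable, or other failure
    }
    // exec() returns the output as an array of lines without trailing whitespace
    return implode("\n", $output);
}
```

A caller could then fall back to a pure-PHP conversion whenever format_html_file() returns null.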
If you can’t use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).
Example 11-4. pc_html2ascii( )
function pc_html2ascii($s) {
    // convert links
    $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                      '$2 ($1)', $s);
    // convert <br>, <hr>, <p>, <div> to line breaks
    $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
    $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
    $s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);
    // convert bold and italic
    $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
    $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);
    // decode named entities
    $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));
    // decode numbered entities such as &#65;
    // (a callback is used because the /e modifier was removed in PHP 7)
    $s = preg_replace_callback('/&#(\d+);/',
                               function ($m) { return chr((int) $m[1]); }, $s);
    // remove any remaining tags
    $s = strip_tags($s);
    return $s;
}
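On newer PHP versions, the two entity-decoding steps might be combined into one call to the built-in html_entity_decode( ), which handles named, decimal, and hexadecimal entities. The following is a sketch under that assumption, not part of Example 11-4; the wrapper name decode_entities_utf8() is hypothetical, and the ENT_HTML5 flag assumes PHP 5.4 or later:

```php
<?php
// Hypothetical helper; the name is illustrative and not part of Example 11-4.
function decode_entities_utf8($s) {
    // ENT_QUOTES decodes both single- and double-quote entities;
    // ENT_HTML5 selects the HTML5 entity table; output is UTF-8.
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// prints: Café & bar AB
echo decode_entities_utf8('Caf&eacute; &amp; bar &#65;&#x42;'), "\n";
```

Unlike chr( ), which only produces single bytes, this approach returns multibyte UTF-8 characters, so the result is not strictly ASCII when the input contains entities outside that range.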
See Recipe 9.9 for more on get_html_translation_table( ); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.