If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
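The call above can be wrapped so that a missing formatter or empty output is detected rather than silently ignored. What follows is only a sketch under stated assumptions: the function name html_file_to_ascii( ), its $formatter parameter, and the page.html path are hypothetical and not part of the recipe. It relies on shell_exec( ), which is the function form of the backtick operator.

```php
<?php
// Hypothetical helper, not part of the recipe: runs a dump-style formatter
// on a file and returns its output, or false if nothing usable came back.
function html_file_to_ascii($file, $formatter = 'lynx -dump')
{
    // Escape only the filename; $formatter is assumed to be trusted,
    // developer-supplied text and is never built from user input.
    $output = shell_exec($formatter . ' ' . escapeshellarg($file));

    // shell_exec() gives a non-string (null or false) when the command
    // cannot be run or prints nothing to standard output.
    return (is_string($output) && $output !== '') ? $output : false;
}

$ascii = html_file_to_ascii('page.html');   // 'page.html' is a placeholder path
if ($ascii === false) {
    echo "No output from formatter\n";
} else {
    echo $ascii;
}
```

Keeping the command in a parameter makes it possible to substitute another text-mode formatter without touching the escaping logic.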
If you can't use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).
function pc_html2ascii($s) {
    // convert links
    $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                      '$2 ($1)', $s);

    // convert <br>, <hr>, <p>, <div> to line breaks
    $s = preg_replace('@<(b|h)r[^>]*>@i', "\n", $s);
    $s = preg_replace('@<p[^>]*>@i', "\n\n", $s);
    $s = preg_replace('@<div[^>]*>(.*)</div>@i', "\n".'$1'."\n", $s);

    // convert bold and italic
    $s = preg_replace('@<b[^>]*>(.*?)</b>@i', '*$1*', $s);
    $s = preg_replace('@<i[^>]*>(.*?)</i>@i', '/$1/', $s);

    // decode named entities
    $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));

    // decode numbered entities such as &#65; (a callback is used because
    // the /e pattern modifier is not available in PHP 7 and later)
    $s = preg_replace_callback('/&#(\d+);/',
                               function ($m) { return chr((int) $m[1]); },
                               $s);

    // remove any remaining tags
    $s = strip_tags($s);

    return $s;
}
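To see what the individual substitutions in pc_html2ascii( ) produce, the following sketch applies the link, bold, and italic patterns from the function to a short fragment. The sample HTML and the variable names are made up for illustration.

```php
<?php
// Sample input invented for illustration; it does not come from the recipe.
$html = 'Visit <a href="http://www.example.com/">our <b>new</b> site</a> for <i>details</i>.';

// Same patterns as in pc_html2ascii(): links first, then bold and italic.
$text = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i', '$2 ($1)', $html);
$text = preg_replace('@<b[^>]*>(.*?)</b>@i', '*$1*', $text);
$text = preg_replace('@<i[^>]*>(.*?)</i>@i', '/$1/', $text);

echo $text, "\n";
// prints: Visit our *new* site (http://www.example.com/) for /details/.
```

The link is rewritten before the inline formatting, so the URL ends up in parentheses after the link text and the markup inside the link text is still converted by the later steps.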
See Recipe 9.9 for more on get_html_translation_table( ); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.
Copyright © 2003 O'Reilly & Associates. All rights reserved.