You want to capture text inside HTML tags. For example, you want to find all the headings in an HTML document.
Read the HTML file into a string and use nongreedy matching in your pattern:
$html = join('', file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
In this example, $matches[2] contains an array of captured headings.
True parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it's significantly easier to validate and parse.
For instance, the pattern in the Solution is smart enough to find only matching headings, so <h1>Dr. Strangelove</h1> is okay, because the text is wrapped inside a pair of <h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> is not, because the opening tag is an <h2> while the closing tag is an </h3>. The backreference \1 in the closing tag requires the same digit that the opening tag captured.
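The effect of the \1 backreference can be checked with a short script. This is a minimal sketch: the sample string is made up for illustration, and it uses preg_match_all() so that every match is collected rather than only the first.

```php
<?php
// Hypothetical sample input: one well-formed heading followed by a
// heading whose closing tag does not match its opening tag.
$html = '<h1>Dr. Strangelove</h1>'
      . '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>';

// \1 must repeat whatever digit the first group captured, so
// <h2>...</h3> cannot match.
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

// $matches[1] holds the heading levels, $matches[2] the heading text.
// Only the well-formed <h1> heading is captured.
print_r($matches[1]);
print_r($matches[2]);
```

Because the mismatched heading is silently skipped rather than reported, this approach is better suited to extracting headings from markup you already trust than to validating it.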
This technique also works for finding all text inside bold and italic tags:
$html = join('', file($file));
preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);
However, it breaks on nested tags. Using that regular expression on:
<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>
doesn't capture the text inside the <i> tags as a separate item.
This wasn't a problem earlier: because headings are block-level elements, it's illegal to nest them. Bold and italic tags, however, are inline elements, and nesting them is valid.
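The nesting limitation is easy to reproduce. The sketch below runs the bold/italic pattern with preg_match_all() on the nested example, then shows one possible workaround (an assumption of this sketch, not part of the recipe): applying the same pattern a second time to the text captured by the outer match.

```php
<?php
$html = '<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying '
      . 'and Love the Bomb</i></b>';

$pattern = '#<([bi])>(.+?)</\1>#is';

// The outer <b>...</b> match consumes the whole string, so the <i>
// element is never matched on its own.
preg_match_all($pattern, $html, $matches);
print count($matches[0]) . "\n";   // a single match, for the <b> tag
print $matches[2][0] . "\n";       // captured text still contains <i>...</i>

// Possible workaround: search the captured text again to pick up
// one further level of nesting.
preg_match_all($pattern, $matches[2][0], $inner);
print $inner[2][0] . "\n";         // text inside the <i> tags
```

This second pass only reaches one level deeper; arbitrarily deep nesting would need a loop or a real HTML parser.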
Captured text can be processed by looping through the array of matches. For example, this code parses a document for its headings and pretty-prints them with indentation according to the heading level:
$html = join('', file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
    print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}
So, with one representation of this recipe in HTML:
$html =<<<_END_
<h1>PHP Cookbook</h1>
Other Chapters
<h2>Regular Expressions</h2>
Other Recipes
<h3>Capturing Text Inside of HTML Tags</h3>
<h4>Problem</h4>
<h4>Solution</h4>
<h4>Discussion</h4>
<h4>See Also</h4>
_END_;
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
    print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}
You get:
PHP Cookbook
  Regular Expressions
    Capturing Text Inside of HTML Tags
      Problem
      Solution
      Discussion
      See Also
By capturing the heading level and heading text separately, you can directly access the level and treat it as an integer when calculating the indentation size. To avoid a two-space indent for all lines, subtract 1 from the level.
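The same indentation logic can also be written with the PREG_SET_ORDER flag of preg_match_all(), which groups the results by match instead of by capture group. The sketch below is one possible arrangement; heading_outline() is a hypothetical helper name introduced here for illustration, not a PHP built-in.

```php
<?php
// heading_outline() is a hypothetical helper, not part of PHP.
// It returns the headings found in $html as an indented outline.
function heading_outline($html)
{
    $out = '';
    // With PREG_SET_ORDER, each element of $matches describes one
    // heading: $m[1] is the level, $m[2] is the heading text.
    preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $level = (int) $m[1];
        // Subtract 1 so that <h1> headings are not indented.
        $out .= str_repeat(' ', 2 * ($level - 1)) . $m[2] . "\n";
    }
    return $out;
}

print heading_outline('<h1>PHP Cookbook</h1><h2>Regular Expressions</h2>');
```

Returning a string instead of printing directly makes the outline easier to reuse, for example when writing it to a file or embedding it in a larger page.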
See Recipe 11.8 for information on marking up a web page and Recipe 11.9 for extracting links from an HTML file; see also the documentation on preg_match( ) at http://www.php.net/preg-match and str_repeat( ) at http://www.php.net/str-repeat.
Copyright © 2003 O'Reilly & Associates. All rights reserved.