Capturing Text Inside HTML Tags (PHP Cookbook)

13.8.3. Discussion

True parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it's significantly easier to validate and parse.

For instance, the pattern in the Solution matches only headings whose opening and closing tags agree. So <h1>Dr. Strangelove</h1> matches, because the text is wrapped in a pair of <h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> does not, because the opening tag is an <h2> while the closing tag is an <h3>.
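The matching behavior can be checked with a short sketch. It assumes the Solution's pattern is the same heading pattern used later in this Discussion; the variable names are illustrative only.

```php
<?php
// Assumed to be the Solution's pattern (it appears later in this Discussion).
// \1 must repeat whatever digit the first group captured.
$pattern = '#<h([1-6])>(.+?)</h\1>#is';

$good = '<h1>Dr. Strangelove</h1>';
$bad  = '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>';

// preg_match_all() returns the number of full pattern matches.
$goodCount = preg_match_all($pattern, $good, $goodMatches);
$badCount  = preg_match_all($pattern, $bad, $badMatches);

print "Well-formed heading: $goodCount match\n";
print "Mismatched heading: $badCount matches\n";
```

The backreference is what rejects the second string: after <h2> opens, only </h2> can close the match, and that tag never appears.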

This technique also works for finding all text inside bold and italic tags:

// Read the whole file into one string
$html = file_get_contents($file);
preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);

However, it breaks on nested tags. Using that regular expression on:

<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>

doesn't capture the text inside the <i> tags as a separate item.

This wasn't a problem earlier; because headings are block level elements, it's illegal to nest them. However, as inline elements, nested bold and italic tags are valid.
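To see the limitation concretely, the sketch below runs the bold/italic pattern on the nested example and then, as one possible alternative that is not part of this recipe, walks the same markup with PHP's DOM extension. It assumes the DOM extension is available; the variable names are illustrative.

```php
<?php
$html = '<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>';

// Regex approach: the <b> match swallows the inner <i> element,
// so only one match is reported and the <i> markup stays embedded in it.
preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);
print count($matches[0]) . " regex match\n";

// DOM approach (an alternative technique, assuming ext-dom is installed):
// each element is visited on its own, nested or not.
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query('//b | //i') as $node) {
    $texts[] = $node->nodeName . ': ' . $node->textContent;
}
print implode("\n", $texts) . "\n";
```

The regular expression reports a single match for the <b> element, while the XPath query returns the <b> and <i> elements as separate nodes.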

Captured text can be processed by looping through the array of matches. For example, this code parses a document for its headings and pretty-prints them with indentation according to the heading level:

// Read the whole file into one string
$html = file_get_contents($file);
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
  print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}

So, with one representation of this recipe in HTML:

$html =<<<_END_
<h1>PHP Cookbook</h1>

Other Chapters
<h2>Regular Expressions</h2>

Other Recipes
<h3>Capturing Text Inside of HTML Tags</h3>

<h4>Problem</h4>
<h4>Solution</h4>
<h4>Discussion</h4>
<h4>See Also</h4>

_END_;

preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
  print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}

You get:

PHP Cookbook
  Regular Expressions
    Capturing Text Inside of HTML Tags
      Problem
      Solution
      Discussion
      See Also

By capturing the heading level and heading text separately, you can directly access the level and treat it as an integer when calculating the indentation size. To avoid a two-space indent for all lines, subtract 1 from the level.
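The same loop can also be written with the PREG_SET_ORDER flag, which groups the captures per match instead of per subpattern. This is a minimal sketch, assuming PHP 7 or later for the type declarations; outline_headings() is a hypothetical helper name, not something defined by the recipe.

```php
<?php
// Hypothetical helper: returns the indented outline as a string.
function outline_headings(string $html): string
{
    $out = '';
    // With PREG_SET_ORDER, each $m is one match: $m[1] is the level, $m[2] the text.
    preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $level = (int) $m[1];
        // Level 1 gets no indent; each deeper level adds two spaces.
        $out .= str_repeat(' ', 2 * ($level - 1)) . $m[2] . "\n";
    }
    return $out;
}

$sample = "<h1>PHP Cookbook</h1>\n<h2>Regular Expressions</h2>\n"
        . "<h3>Capturing Text Inside of HTML Tags</h3>";
$outline = outline_headings($sample);
print $outline;
```

Returning a string rather than printing inside the loop makes the indentation logic easier to reuse and to check.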
