Perl Cookbook

Perl CookbookSearch this book
Previous: 6.4.  Commenting Regular ExpressionsChapter 6
Pattern Matching
Next: 6.6. Matching Multiple Lines
 

6.5. Finding the Nth Occurrence of a Match

Problem

You want to find the N th match in a string, not just the first one. For example, you'd like to find the word preceding the third occurrence of "fish":

One fish two fish red fish blue fish

Solution

Use the /g modifier in a while loop, keeping count of matches:

$WANT = 3;
$count = 0;
while (/(\w+)\s+fish\b/gi) {
    if (++$count == $WANT) {
        print "The third fish is a $1 one.\n";
        # Warning: don't `last' out of this loop
    }
}
The third fish is a red one.

Or use a repetition count and repeated pattern like this:

/(?:\w+\s+fish\s+){2}(\w+)\s+fish/i;

Discussion

As explained in the chapter introduction, using the /g modifier in scalar context creates something of a progressive match, useful in while loops. This is commonly used to count the number of times a pattern matches in a string:

# simple way with while loop
$count = 0;
while ($string =~ /PAT/g) {
    $count++;               # or whatever you'd like to do here
}

# same thing with trailing while
$count = 0;
$count++ while $string =~ /PAT/g;

# or with for loop
for ($count = 0; $string =~ /PAT/g; $count++) { }
    
# Similar, but this time count overlapping matches
$count++ while $string =~ /(?=PAT)/g;

To find the Nth match, it's easiest to keep your own counter. When you reach the appropriate N, do whatever you care to. A similar technique could be used to find every Nth match by checking for multiples of N using the modulus operator. For example, (++$count % 3) == 0 would be every third match.

If this is too much bother, you can always extract all matches and then hunt for the ones you'd like.

$pond  = 'One fish two fish red fish blue fish';

# using a temporary
@colors = ($pond =~ /(\w+)\s+fish\b/gi);      # get all matches
$color  = $colors[2];                         # then the one we want

# or without a temporary array
$color = ( $pond =~ /(\w+)\s+fish\b/gi )[2];  # just grab element 3

print "The third fish in the pond is $color.\n";
The third fish in the pond is red.

Or finding all even-numbered fish:

$count = 0;
$_ = 'One fish two fish red fish blue fish';
@evens = grep { $count++ % 2 == 1 } /(\w+)\s+fish\b/gi;
print "Even numbered fish are @evens.\n";
Even numbered fish are two blue.

For substitution, the replacement value should be a code expression that returns the proper string. Make sure to return the original as a replacement string for the cases you aren't interested in changing. Here we fish out the fourth specimen and turn it into a snack:

$count = 0;
s{
   \b               # makes next \w more efficient
   ( \w+ )          # this is what we'll be changing
   (
     \s+ fish \b
   )
}{
    if (++$count == 4) {
        "sushi" . $2;
    } else {
         $1   . $2;
    }
}gex;
One fish two fish red fish sushi fish

Picking out the last match instead of the first one is a fairly common task. The easiest way is to skip the beginning part greedily. After /.*\b(\w+)\s+fish\b/, for example, the $1 variable would have the last fish.

Another way to get arbitrary counts is to make a global match in list context to produce all hits, then extract the desired element of that list:

$pond = 'One fish two fish red fish blue fish swim here.';
$color = ( $pond =~ /\b(\w+)\s+fish\b/gi )[-1];
print "Last fish is $color.\n";
Last fish is blue.

If you need to express this same notion of finding the last match in a single pattern without /g, you can do so with the negative lookahead assertion (?!THING). When you want the last match of arbitrary pattern A, you find A followed by any amount of not A through the end of the string. The general construct is A(?!.*A)*$, which can be broken up for legibility:

m{
    A               # find some pattern A
    (?!             # mustn't be able to find
        .*          # something
        A           # and A
    )
    $               # through the end of the string
}x

That leaves us with this approach for selecting the last fish:

$pond = 'One fish two fish red fish blue fish swim here.';
if ($pond =~ m{
                    \b  (  \w+) \s+ fish \b
                (?! .* \b fish \b )
            }six )
{
    print "Last fish is $1.\n";
} else {
    print "Failed!\n";
}
Last fish is blue.

This approach has the advantage that it can fit in just one pattern, which makes it suitable for similar situations as shown in Recipe 6.17. It has its disadvantages, though. It's obviously much harder to read and understand, although once you learn the formula, it's not too bad. But it also runs more slowly though  - around twice as slowly on the data set tested above.

See Also

The behavior of m//g in scalar context is given in the "Regexp Quote-like Operators" section of perlop (1), and in the "Pattern Matching Operators" section of Chapter 2 of Programming Perl; zero-width positive lookahead assertions are shown in the "Regular Expressions" section of perlre (1), and in the "rules of regular expression matching" section of Chapter 2 of Programming Perl


Previous: 6.4.  Commenting Regular ExpressionsPerl CookbookNext: 6.6. Matching Multiple Lines
6.4. Commenting Regular ExpressionsBook Index6.6. Matching Multiple Lines