Advanced Perl Programming

Chapter 1
Data References and Anonymous Storage

1.2 Using References

References are absolutely essential for creating complex data structures. Since the next chapter is devoted solely to this topic, we will not say more here. This section lists the other advantages of Perl's support for indirection and memory management.

1.2.1 Passing Arrays and Hashes to Subroutines

When you pass more than one array or hash to a subroutine, Perl merges all of them into the @_ array available within the subroutine. The only way to avoid this merger is to pass references to the input arrays or hashes. Here's an example that adds elements of one array to the corresponding elements of the other:

@array1 = (1, 2, 3); @array2 = (4, 5, 6, 7);
AddArrays (\@array1, \@array2); # Passing the arrays by reference.
print "@array1 \n";             # prints "5 7 9 7"

sub AddArrays {
    my ($rarray1, $rarray2) = @_;
    my $len2 = @$rarray2;  # Length of array2
    for (my $i = 0; $i < $len2; $i++) {
        $rarray1->[$i] += $rarray2->[$i];
    }
}

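In this example, two array references are passed to AddArrays, which dereferences them, determines the length of the second array, and adds each of its elements to the corresponding element of the first. Because @array2 is one element longer than @array1, @array1 grows to four elements.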
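To see the merger that references avoid, the following minimal sketch counts the arguments a subroutine receives in each case. The helper count_args is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how many arguments arrived in @_.
sub count_args {
    return scalar @_;
}

my @array1 = (1, 2, 3);
my @array2 = (4, 5, 6, 7);

# Passed directly, both arrays are flattened into one seven-element list.
my $flat = count_args(@array1, @array2);

# Passed by reference, the subroutine receives exactly two scalars.
my $refs = count_args(\@array1, \@array2);

print "flattened: $flat, by reference: $refs\n";
```

In the flattened call the subroutine has no way to tell where the first array ends and the second begins.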

1.2.2 Performance Efficiency

Using references, you can efficiently pass large amounts of data to and from a subroutine.

However, passing references to scalars typically turns out not to be an optimization at all. I have often seen code like this, in which the programmer has intended to minimize copying while reading lines from a file:

while ($ref_line = GetNextLine()) {
    .....
    .....
}

sub GetNextLine {
    my $line = <F>;
    exit(0) unless defined($line);
    .....
    return \$line;    # Return by reference, to avoid copying
}

GetNextLine returns the line by reference to avoid copying.

You might be surprised how little effect this strategy has on overall performance, because most of the time is taken by reading the file and subsequently working on $line. Meanwhile, the user of GetNextLine is forced to deal with indirections ($$ref_line) instead of the more straightforward buffer $line.[4]

[4] The operative word here is "typically." Most applications deal with lines 60-70 bytes long.

Incidentally, you can use the standard library module called Benchmark to time and compare different code implementations, like this:

use Benchmark;
timethis (100, "GetNextLine()"); # Call GetNextLine 100 times, and 
                                 # time it

The module defines a subroutine called timethis that takes a piece of code, runs it as many times as you tell it to, and prints out the elapsed time. We'll cover the use statement in Chapter 6, Modules.
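As a rough sketch of how such a comparison might look, the following uses timethese, another subroutine exported by Benchmark on request, to time two variants side by side. The subroutines line_by_value and line_by_reference are hypothetical stand-ins for the two styles of GetNextLine; they copy a fixed string instead of reading a file, so the numbers say nothing about real I/O.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# A line of typical length (an assumption made for this illustration).
my $text = "x" x 65 . "\n";

# Hypothetical stand-ins for the two return styles.
sub line_by_value     { my $line = $text; return $line;  }
sub line_by_reference { my $line = $text; return \$line; }

# Run each variant 100,000 times and print the elapsed times.
timethese(100_000, {
    by_value     => sub { my $l = line_by_value();     length $l  },
    by_reference => sub { my $r = line_by_reference(); length $$r },
});
```

The timings vary by machine and Perl version, so measure on your own workload before drawing conclusions.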

1.2.3 References to Anonymous Storage

So far, we have created references to previously existing variables. Now we will learn to create references to "anonymous" data structures - that is, values that are not associated with a variable.

To create an anonymous array, use square brackets instead of parentheses:

$ra = [ ];         # Creates an empty, anonymous array
                   # and returns a reference to it
$ra = [1,"hello"]; # Creates an initialized anonymous array 
                   # and returns a reference to it

This notation not only allocates anonymous storage, it also returns a reference to it, much as malloc(3) returns a pointer in C.

What happens if you use parentheses instead of square brackets? Recall that, in scalar context, Perl evaluates the right side as a comma-separated expression and returns the value of the last element; $ra contains the value "hello", which is likely not what you are looking for.
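The difference is easy to check. In this small sketch the variable name $last is chosen for illustration, and warnings are switched off only because Perl would otherwise complain about the discarded constant.

```perl
use strict;
no warnings;   # silence "Useless use of a constant" for the demonstration

my $ra   = [1, "hello"];   # reference to an anonymous two-element array
my $last = (1, "hello");   # comma operator: $last is just "hello"

print ref($ra), " with ", scalar(@$ra), " elements\n";
print "parentheses gave: $last\n";
```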

To create an anonymous hash, use braces instead of square brackets:

$rh = { };                       # Creates an empty hash and returns a
                                 # reference to it
$rh = {"k1", "v1", "k2", "v2"};  # A populated anonymous hash

Both these notations are easy to remember since they represent the bracketing characters used by the two datatypes - brackets for arrays and braces for hashes. Contrast this with the way you'd normally create a named hash:

# An ordinary hash uses the % prefix and is initialized with a list
# within parentheses
%hash = ("flock" => "birds", "pride" => "lions");

# An anonymous hash is a list contained within curly braces. 
# The result of the expression is a scalar reference to that hash.
$rhash = {"flock" => "birds", "pride" => "lions"};
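The following sketch shows how the two forms are used side by side. The extra "school" pair is invented for the illustration.

```perl
use strict;
use warnings;

my %hash  = ("flock" => "birds", "pride" => "lions");
my $rhash = {"flock" => "birds", "pride" => "lions"};

print "named:     $hash{flock}\n";       # direct access
print "anonymous: $rhash->{flock}\n";    # access through the reference
print "anonymous: $$rhash{pride}\n";     # equivalent prefix form

$rhash->{school} = "fish";               # add a pair through the reference
print join(", ", sort keys %$rhash), "\n";
```

The named hash is unaffected by the last assignment, since the two hashes occupy separate storage.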

What about dynamically allocated scalars? It turns out that Perl doesn't have any notation for doing something like this, presumably because you almost never need it. If you really do, you can use the following trick: Create a reference to an existing variable, and then let the variable pass out of scope.

{
    my $a = "hello world";  # 1
    $ra = \$a;              # 2 
}
print "$$ra \n";            # 3

The my operator tags a variable as private (or localizes it, in Perl-speak). You can use the local operator instead, but there is a subtle yet very important difference between the two that we will clarify in Chapter 3. For this example, both work equally well.

Now, $ra is a global variable that refers to the local variable $a (not the keyword local). Normally, $a would be deleted at the end of the block, but since $ra continues to refer to it, the memory allocated for $a is not thrown away. Of course, if you reassign $ra to some other value, this space is deallocated before $ra is prepared to accept the new value.
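One way to package this trick, offered only as a sketch, is a small subroutine whose private variable goes out of scope on return. The name new_scalar is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: each call creates a fresh $value and
# returns a reference to it.
sub new_scalar {
    my $value = shift;
    return \$value;
}

my $r1 = new_scalar("hello world");
my $r2 = new_scalar("hello world");

$$r1 .= "!";                      # changes only the first scalar
print "$$r1\n";                   # hello world!
print "$$r2\n";                   # hello world
print "distinct storage\n" if $r1 != $r2;
```

Each call yields separate storage, which lives for as long as some reference to it remains.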

You can create references to constant scalars like this:

$r = \10;  $rs = \"hello";

Constants are statically allocated and anonymous.

A reference variable does not care to know or remember whether it points to an anonymous value or to an existing variable's value. This is identical to the way pointers behave in C.
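A short sketch makes the point concrete: ref reports the same type for both kinds of reference. The one practical difference lies in the value rather than the reference, because a literal constant is read-only. The variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $x  = 10;
my $r1 = \$x;     # reference to a named variable
my $r2 = \10;     # reference to an anonymous constant

print ref($r1), " ", ref($r2), "\n";   # both are SCALAR

$$r1 = 11;                             # fine: $x is now 11
my $ok = eval { $$r2 = 11; 1 };        # constants are read-only
print "could not modify constant: $@" unless $ok;
```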

1.2.4 Dereferencing Multiple Levels of Indirection

We have seen how a reference refers to some other entity, including other references (which are just ordinary scalars). This means that we can have multiple levels of references, like this:

$a    = 10;
$ra   = \$a;     # reference to $a's value.
$rra  = \$ra;    # reference to a reference to $a's value
$rrra = \$rra;   # reference to a reference to a reference ...

Now we'll dereference these. The following statements all yield the same value (that of $a):

print $a;        # prints 10. The following statements print the same.
print $$ra;      # $a seen from one level of indirection. 
print $$$rra;    # replace ra with {$rra} : still referring
                 # to $a's value
print $$$$rrra;  # ... and so on.
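The following sketch repeats the chain with explicit braces and shows that assignment works through several levels too. It uses $value rather than $a purely for illustration, since $a has a special role in sort blocks.

```perl
use strict;
use warnings;

my $value = 10;
my $ra    = \$value;
my $rra   = \$ra;
my $rrra  = \$rra;

print ref($ra),   "\n";   # SCALAR: refers to a plain scalar
print ref($rra),  "\n";   # REF:    refers to another reference
print ref($rrra), "\n";   # REF

# Braces make each step of the dereference explicit.
print ${ ${ ${ $rrra } } }, "\n";   # 10

$$$$rrra = 20;            # assignment through three levels changes $value
print "$value\n";         # 20
```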

Incidentally, this example illustrates a convention known to Microsoft Windows programmers as "Hungarian notation."[5] Each variable name is prefixed by its type ("r" for reference, "rh" for reference to a hash, "i" for integer, "d" for double, and so on). Something like the following would immediately trigger some suspicion:

$$rh_collections[0] = 10;     # RED FLAG : 'rh' being used as an array?

You have a variable called $rh_collections, which is presumably a reference to a hash because of its naming convention (the prefix rh), but you are using it instead as a reference to an array. Sure, Perl will alert you to this by raising a run-time exception ("Not an ARRAY reference at - line 2."). But it is easier to check the code while you are writing it than to painstakingly exercise all the code paths during the testing phase to rule out the possibility of run-time errors.
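The run-time exception can be observed safely by wrapping the offending statement in eval, as in this minimal sketch. The file name and line number in the message depend on where the code runs.

```perl
use strict;
use warnings;

my $rh_collections = {};                      # really a hash reference

# The mismatch is detected only when this statement executes.
my $ok = eval { $$rh_collections[0] = 10; 1 };
print "caught: $@" unless $ok;
```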

[5] After Charles Simonyi who started this convention at Microsoft. This convention is a topic of raging debates on the Internet; people either love it or hate it. Apparently, even at Microsoft, the systems folks use it, while the application folks don't. In a language without enforced type checking such as Perl, I recommend using it where convenient.

1.2.5 A More General Rule

Earlier, while discussing precedence, we showed that $$rarray[1] is actually the same as ${$rarray}[1]. It wasn't entirely by accident that we chose braces to denote the grouping. It so happens that there is a more general rule.

The braces signify a block of code, and Perl doesn't care what you put in there as long as it yields a reference of the required type. Something like {$rarray} is a straightforward expression that yields a reference readily. By contrast, the following example calls a subroutine within the block, which in turn returns a reference:

sub test {
    return \$a;      # returns a reference to a scalar variable
}
$a = 10;
$b = ${test()};      # Calls a subroutine test within the block, which 
                     # yields a reference to $a
                     # This reference is dereferenced
print $b;            # prints "10"

To summarize, a block that yields a reference can occur wherever the name of a variable can occur. Instead of $a, you can have ${$ra} or ${$array[1]} (assuming $array[1] has a reference to $a), for example.
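As a sketch of the rule applied to arrays, the block below calls a subroutine that returns one of two array references. The names pick, @evens, and @odds are hypothetical.

```perl
use strict;
use warnings;

my @evens = (2, 4, 6);
my @odds  = (1, 3, 5);

# Hypothetical helper: returns a reference to one of the two arrays.
sub pick {
    my $want_even = shift;
    return $want_even ? \@evens : \@odds;
}

print "@{ pick(1) }\n";            # 2 4 6

my $second = ${ pick(0) }[1];      # element 1 of @odds
print "$second\n";                 # 3

push @{ pick(0) }, 7;              # the block can be modified through, too
print scalar(@odds), "\n";         # 4
```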

Recall that a block can have any number of statements inside it, and the last expression evaluated inside that block represents its result value. Unless you want to be a serious contender for the Obfuscated Perl contest, avoid using blocks containing more than two expressions while using the general dereferencing rule stated above.

1.2.5.1 Trojan horses

While we are talking about obfuscation, it is worth talking about a very insidious way of including executable code within strings. Normally, when Perl sees a string such as "$a", it does variable interpolation. But you now know that "a" can be replaced by a block as long as it returns a reference to a scalar, so something like this is completely acceptable, even within a string:

print "${foo()}"; 

Replace foo() by system ('/bin/rm *') and you have an unpleasant Trojan horse:

print "${system('/bin/rm *')}" 

Perl treats it like any other function and trusts system to return a reference to a scalar. The parameters given to system do their damage before Perl has a chance to figure out that system doesn't return a scalar reference.
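The same mechanism can be observed harmlessly. In this sketch the hypothetical subroutine side_effect stands in for the dangerous call: it merely counts its invocations and returns a scalar reference so that the dereference succeeds.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for a dangerous call: it only counts invocations,
# then returns a scalar reference so the dereference succeeds.
sub side_effect {
    $calls++;
    my $text = "interpolated";
    return \$text;
}

my $string = "result: ${side_effect()}";     # the call runs while the string is built
print "$string\n";                           # result: interpolated
print "side_effect ran $calls time(s)\n";    # 1
```

The subroutine runs as soon as the string is evaluated, before anything is printed.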

Moral of the story: Be very careful of strings that you get from untrusted sources. Use the taint-mode option (invoke Perl as perl -T) or the Safe module that comes with the Perl distribution. Please see the Perl documentation for taint checking, and see the index for some pointers to the Safe module.

