This chapter, more than any other in this book, is about Laziness, Impatience, and Hubris - because this chapter is about good software design.
We've all fallen into the trap of using cut-and-paste when we should have chosen to define a higher-level abstraction, if only just a loop or subroutine.[1] To be sure, some folks have gone to the opposite extreme of defining ever-growing mounds of higher-level abstractions when they should have used cut-and-paste.[2] Generally, though, most of us need to think about using more abstraction rather than less.
[1] This is a form of False Laziness.
[2] This is a form of False Hubris.
(Caught somewhere in the middle are the people who have a balanced view of how much abstraction is good, but who jump the gun on writing their own abstractions when they should be reusing existing code.)[3]
[3] You guessed it, this is False Impatience. But if you're determined to reinvent the wheel, at least try to invent a better one.
Whenever you're tempted to do any of these things, you need to sit back and think about what will do the most good for you and your neighbor over the long haul. If you're going to pour your creative energies into a lump of code, why not make the world a better place while you're at it? (Even if you're only aiming for the program to succeed, you need to make sure it fits its ecological niche.)
The first step toward ecologically sustainable programming is simply: don't litter in the park. When you write a chunk of code, think about giving the code its own namespace, so that your variables and functions don't clobber anyone else's, or vice versa. A namespace is a bit like your home, where you're allowed to be as messy as you like, as long as you keep your external interface to other citizens moderately civil. In Perl, a namespace is called a package. Packages provide the fundamental building block upon which the higher-level concepts of modules and classes are constructed.
Like the notion of "home", the notion of "package" is a bit nebulous. Packages are independent of files. You can have many packages in a single file, or a single package that spans several files, just as your home could be one part of a larger building, if you live in an apartment, or could comprise several buildings, if your name happens to be Queen Elizabeth. But the usual size of a home is one building, and the usual size of a package is one file. Perl has some special help for people who want to put one package in one file, as long as you're willing to name the file with the same name as the package and give your file an extension of ".pm", which is short for "perl module". The module is the unit of reusability in Perl. Indeed, the way you use a module is with the use command, which is a compiler directive that controls the importation of functions and variables from a module. Every example of use you've seen until now has been an example of module reuse.
Object classes are another concept built on the package concept. The concept of classes therefore cuts across the concepts of files and modules. But the typical class is nevertheless implemented with a module. (If you're starting to get the feeling that much of Perl culture is governed by mere convention, then you're starting to get the right feeling, civilly speaking. The trend over the last 20 years or so has been to design computer languages that enforce a state of paranoia. You're expected to program every module as if it were in a state of siege. Certainly there are some feudal cultures where this is appropriate, but not all cultures are like this. In Perl culture, by contrast, you're expected to stay out of someone's home because you weren't invited in, not because there are bars[4] on the windows.)
[4] But Perl provides some bars if you want them, too. See the Safe module in Chapter 7, The Standard Perl Library, for instance.
Anyway, back to classes. When you use a module that implements a class, you're benefiting from the direct reuse of the software that implements that module. But with object classes you can get the additional benefits of indirect software reuse when the class you're using turns around and reuses other classes that it gets some characteristics from. But this is not primarily a book about object-oriented methodology, and we're not here to convert you into a raving object-oriented zealot, even if you want to be converted. There are already plenty of books out there for that. Perl's philosophy of object-oriented design fits right in with Perl's philosophy of everything else: use object-oriented design where it makes sense, and avoid it where it doesn't. Your call.
As we mentioned in the previous chapter, object-oriented programming in Perl is accomplished through use of references that happen to refer to thingies that know which class they're associated with. In fact, now that you know about references, you know almost everything hard about objects. The rest of it just "lies under the fingers", as a violinist would say. You will need to practice a little, though.
In this chapter we will discuss creation and use of packages, modules, and classes. Then we will review some of the essentials of object-oriented programming, explain how references become objects, and illustrate how these objects are manipulated as members of one or more classes. We'll also tell you how to tie ordinary variables into object classes to turn them into magical variables.
Perl provides a mechanism to protect different sections of code from inadvertently tampering with each other's variables. In fact, apart from certain magical variables, there's really no such thing as a global variable in Perl. Code is always compiled in the current package. The initial current package is package main, but at any time you can switch the current package to another one using the package declaration. The current package determines which symbol table is used for name lookups (for names that aren't otherwise package-qualified). The notion of "current package" is both a compile-time and run-time concept. Most name lookups happen at compile-time, but run-time lookups happen when symbolic references are dereferenced, and also when new bits of code are parsed under eval. In particular, eval operations know which package they were invoked in, and propagate that package inward as the current package of the evaluated code. (You can always switch to a different package within the eval string, of course, since an eval string counts as a block, as does a file loaded in with do, require, or use.)
The scope of a package declaration is from the declaration itself through the end of the innermost enclosing block (or until another package declaration at the same level, which hides the earlier one). All subsequent identifiers (except those declared with my, or those qualified with a different package name) will be placed in the symbol table belonging to the package. Typically, you would put a package declaration as the first declaration in a file to be included by require or use. But again, that's by convention. You can put a package declaration anywhere you can put a statement. You could even put it at the end of a block, in which case it would have no effect whatsoever. You can switch into a package in more than one place; it merely influences which symbol table is used by the compiler for the rest of that block. (This is how a given package can span more than one file.)
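Here's a small sketch of how that plays out; the package names Red and Blue are invented for the example:
package Red;
$color = "red";        # a package variable belonging to Red

{
    package Blue;
    $color = "blue";   # a different variable, belonging to Blue
}

# The inner declaration ended with its block, so we're back in
# package Red here, and $color is Red's "red" again.
print "$color\n";      # prints "red"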
You can refer to identifiers[5] in other packages by prefixing ("qualifying") the identifier with the package name and a double colon: $Package::Variable. If the package name is null, the main package is assumed. That is, $::sail is equivalent to $main::sail.[6] (The old package delimiter was a single quote, which produced things like $main'sail and $'sail. But a double colon is now the preferred delimiter, in part because it's more readable to humans, and in part because it's more readable to emacs macros. It also gives C++ programmers a warm feeling.)
[5] By identifiers, we mean the names used as symbol table keys to access scalar variables, array variables, hash variables, functions, file or directory handles, and formats. Syntactically speaking, labels are also identifiers, but they aren't put into a particular symbol table; rather, they are attached directly to the statements in your program. Labels may not be package qualified.
[6] To clear up another bit of potential confusion, in a variable name like $main::sail, we use the term "identifier" to talk about main and sail, but not main::sail. We call that a variable name instead, because an identifier may not contain a colon. The definition of an identifier is lexical, in that an identifier is a token that matches the pattern /^[A-Za-z_][A-Za-z_0-9]*$/.
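A short, contrived illustration (the package name Boat and the variable $sail are just for the example):
package Boat;
$sail = "mainsail";       # really $Boat::sail

package main;
$sail = "jib";            # really $main::sail

print "$Boat::sail\n";    # prints "mainsail"
print "$::sail\n";        # prints "jib", same as $main::sail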
Packages may be nested inside other packages: $OUTER::INNER::var. This implies nothing about the order of name lookups, however. There are no fallback symbol tables. All undeclared symbols are either local to the current package, or must be fully qualified from the outer package name down. For instance, there is nowhere within package OUTER that $INNER::var refers to $OUTER::INNER::var. It would treat package INNER as a totally separate global package. Similarly, every package declaration must declare a complete package name. No package name ever assumes any kind of implied "prefix", even if (seemingly) declared within the scope of some other package declaration.
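To make the point concrete, using the same made-up OUTER and INNER names:
package OUTER::INNER;
$var = "nested";                 # really $OUTER::INNER::var

package OUTER;
print "$OUTER::INNER::var\n";    # prints "nested"
print "$INNER::var\n";           # prints only a blank line: INNER is a
                                 # completely separate package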
Only identifiers (names starting with letters or underscore) are stored in the current package's symbol table. All other symbols are kept in package main, including all the magical punctuation-only variables like $! and $_. In addition, the identifiers STDIN, STDOUT, STDERR, ARGV, ARGVOUT, ENV, INC, and SIG are forced to be in package main even when used for purposes other than their built-in ones. Furthermore, if you have a package called m, s, y, or tr, then you can't use the qualified form of an identifier as a filehandle because it will be interpreted instead as a pattern match, a substitution, or a translation. Using uppercase package names avoids this problem.
Assignment of a string to %SIG assumes the signal handler specified is in the main package, if the name assigned is unqualified. Qualify the signal handler name if you want to have a signal handler in a package, or don't use a string at all: assign a typeglob or a function reference instead:
$SIG{QUIT} = "quit_catcher"; # implies "main::quit_catcher" $SIG{QUIT} = *quit_catcher; # forces current package's sub $SIG{QUIT} = \&quit_catcher; # forces current package's sub $SIG{QUIT} = sub { print "Caught SIGQUIT\n" }; # anonymous sub
See my and local in Chapter 3, Functions, for other scoping issues. See the "Signals" section in Chapter 6, Social Engineering, for more on signal handlers.
The symbol table for a package happens to be stored in a hash whose name is the same as the package name with two colons appended. The main symbol table's name is thus %main::, or %:: for short, since package main is the default. Likewise, the symbol table for the nested package we mentioned earlier is named %OUTER::INNER::. As it happens, the main symbol table contains all other top-level symbol tables, including itself, so %OUTER::INNER:: is also %main::OUTER::INNER::.
When we say that a symbol table "contains" another symbol table, we mean that it contains a reference to the other symbol table. Since package main is a top-level package, it contains a reference to itself, with the result that %main:: is the same as %main::main::, and %main::main::main::, and so on, ad infinitum. It's important to check for this special case if you write code to traverse all symbol tables.
The keys in a symbol table hash are the identifiers of the symbols in the symbol table. The values in a symbol table hash are the corresponding typeglob values. So when you use the *name typeglob notation, you're really just accessing a value in the hash that holds the current package's symbol table. In fact, the following have the same effect, although the first is potentially more efficient because it does the symbol table lookup at compile time:
local *somesym = *main::variable;
local *somesym = $main::{"variable"};
Since a package's symbol table is a hash, you can look up its keys, and hence all the variables of the package. Try this:
foreach $symname (sort keys %main::) {
    local *sym = $main::{$symname};
    print "\$$symname is defined\n" if defined $sym;
    print "\@$symname is defined\n" if defined @sym;
    print "\%$symname is defined\n" if defined %sym;
}
Since all packages are accessible (directly or indirectly) through package main, you can visit every package variable in the program, using code written in Perl. The Perl debugger does precisely that when you ask it to dump all your variables.
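Here's a rough sketch of such a traversal; the subroutine name walk_symtable is our own invention, it relies on symbolic references (hence the no strict line), and the test against "main::" is exactly the self-reference check mentioned above:
sub walk_symtable {
    my $pkg = shift;                    # a name like "main::" or "Foo::Bar::"
    no strict 'refs';                   # we look up symbol tables by name
    foreach my $sym (sort keys %{$pkg}) {
        if ($sym =~ /::$/) {            # a nested symbol table
            next if $sym eq "main::";   # %main:: contains itself; don't recurse forever
            walk_symtable("$pkg$sym");
        }
        else {
            print "$pkg$sym\n";         # one symbol table entry per line
        }
    }
}

walk_symtable("main::");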
Assignment to a typeglob performs an aliasing operation; that is,
*dick = *richard;
causes everything accessible via the identifier richard to also be accessible via the symbol dick. If you only want to alias a particular variable or subroutine, assign a reference instead:
*dick = \$richard;
This makes $richard and $dick the same variable, but leaves @richard and @dick as separate arrays. Tricky, eh?
This mechanism may be used to pass and return cheap references into or from subroutines if you don't want to copy the whole thing:
%some_hash = ();
*some_hash = fn( \%another_hash );
sub fn {
    local *hashsym = shift;
    # now use %hashsym normally, and you
    # will affect the caller's %another_hash
    my %nhash = ();    # populate this hash at will
    return \%nhash;
}
On return, the reference will overwrite the hash slot in the symbol table specified by the *some_hash typeglob. This is a somewhat sneaky way of passing around references cheaply when you don't want to have to remember to dereference variables explicitly. It only works on package variables, though, which is why we had to use local there instead of my.
Another use of symbol tables is for making "constant" scalars:
*PI = \3.14159265358979;
Now you cannot alter $PI, which is probably a good thing, all in all.
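Any attempt to assign to $PI now draws a run-time error at the point of the assignment; a quick check (the exact message wording may vary between releases):
$PI = 3;    # dies: Modification of a read-only value attempted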
When you do the assignment to *PI, you're just replacing one reference within the typeglob. If you think about it sideways, the typeglob itself can be viewed as a kind of hash, with entries for the different variable types in it. In this case, the keys are fixed, since a typeglob can contain exactly one scalar, one array, one hash, and so on. But you can pull out the individual references, like this:
*pkg::sym{SCALAR}      # same as \$pkg::sym
*pkg::sym{ARRAY}       # same as \@pkg::sym
*pkg::sym{HASH}        # same as \%pkg::sym
*pkg::sym{CODE}        # same as \&pkg::sym
*pkg::sym{GLOB}        # same as \*pkg::sym
*pkg::sym{FILEHANDLE}  # internal filehandle, no direct equivalent
*pkg::sym{NAME}        # "sym" (not a reference)
*pkg::sym{PACKAGE}     # "pkg" (not a reference)
This is primarily used to get at the internal filehandle reference, since the other internal references are already accessible in other ways. But we thought we'd generalize it because it looks kind of pretty. Sort of. You probably don't need to remember all this unless you're planning to write a Perl debugger. So let's get back to the topic of writing good software.
Two special subroutine definitions that function as package constructors and destructors[7] are the BEGIN and END routines. The sub is optional for these routines.
[7] Strictly speaking, these aren't constructors and destructors, but initializers and finalizers. And strictly speaking, packages aren't objects. But strictly speaking, we don't speak strictly around here too often.
A BEGIN subroutine is executed as soon as possible, that is, the moment it is completely defined, even before the rest of the containing file is parsed. You may have multiple BEGIN blocks within a file - they will execute in order of definition. Because a BEGIN block executes immediately, it can pull in definitions of subroutines and such from other files in time to be visible during compilation of the rest of the file.
This is important because subroutine declarations change how the rest of the file will be parsed. At the very least, declaring a subroutine allows it to be used as a list operator, without parentheses. And if the subroutine is declared with a prototype, then calls to that subroutine may be parsed like any of several built-in functions (depending on which prototype is used).
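Here's a sketch of that effect; the library filename mylib.pl and the subroutine sum_of are invented for the example, and this is essentially what use arranges for you automatically:
# In the file mylib.pl (made-up name):
sub sum_of {
    my $total = 0;
    foreach my $n (@_) { $total += $n }
    return $total;
}
1;

# In the main program:
BEGIN {
    require "mylib.pl";    # runs at compile time, so sum_of is
}                          # already declared by the next line

$total = sum_of 1, 2, 3;   # parses as a list operator: no parentheses
print "$total\n";          # prints 6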
An END subroutine, by contrast, is executed as late as possible, that is, when the interpreter is being exited, even if it is exiting as a result of a die function, or from an internally generated exception such as you'd get when you try to call an undefined function. (But not if it's being blown out of the water by a signal - you have to trap that yourself (if you can).)[8] You may have multiple END blocks within a file - they will execute in reverse order of definition; that is: last in, first out (LIFO). That is so that related BEGINs and ENDs will nest the way you'd expect, if you pair them up.
When you use the -n and -p switches to Perl, BEGIN and END work just as they do in awk(1), as a degenerate case.
For example, the output order of colors if you run the following program is red, green, and blue:
die "green\n"; END { print "blue\n" } BEGIN { print "red\n" }
Just as eval provides a way to get compilation behavior during run-time, so too BEGIN provides a way to get run-time behavior during compilation. But note that the compiler must execute BEGIN blocks even if you're just checking syntax with the -c switch. By symmetry, END blocks are also executed when syntax checking. Your END blocks should not assume that any or all of your main code ran. (They shouldn't do this in any event, since the interpreter might exit early from an exception.) This is not a bad problem in general. At worst, it means you should test the "definedness" of a variable before doing anything rash with it. In particular, before saying something like:
system "rm -rf '$dir'"
you should always check that $dir contains something meaningful, whether or not you're doing it in an END block. Caveat destructor.
Normally you can't call a subroutine that isn't defined. However, if there is a subroutine named AUTOLOAD in the undefined subroutine's package (or in the case of an object method, in the package of any of the object's base classes), then the AUTOLOAD subroutine is called with the same arguments as would have been passed to the original subroutine. The fully qualified name of the original subroutine magically appears in the package-global $AUTOLOAD variable, in the same package as the AUTOLOAD routine.
Most AUTOLOAD routines will load a definition for the undefined subroutine in question using eval or require, then execute that subroutine using a special form of goto that erases the stack frame of the AUTOLOAD routine without a trace.
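A bare-bones sketch of that pattern follows; the auto/$name.al filename convention here merely gestures at the sort of thing AutoLoader does, and isn't its exact behavior:
sub AUTOLOAD {
    my $name = $AUTOLOAD;        # e.g. "main::pretty_print"
    $name =~ s/.*:://;           # strip the package part
    require "auto/$name.al";     # that file is expected to define the sub
    goto &$AUTOLOAD;             # call it; AUTOLOAD's own frame disappears
}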
The standard AutoSplit module is a tool used by module writers to help split their modules into separate files (with filenames ending in .al), each holding one routine. The files are placed in the auto/ directory of the Perl library. These files can then be loaded on demand by the standard AutoLoader module. A similar approach is taken by the SelfLoader module, except that it autoloads functions from the file's own DATA area (which is less efficient in some ways and more efficient in others). Autoloading of Perl functions is analogous to dynamic loading of compiled C functions, except that autoloading (as practiced by AutoLoader and SelfLoader) is done at the granularity of the function call, whereas dynamic loading (as practiced by the DynaLoader module) is done at the granularity of the complete module, and will usually link in many C or C++ functions all at once. (See also the AutoLoader, SelfLoader, and DynaLoader modules in Chapter 7.)
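To give a feel for the SelfLoader style, a module might be laid out roughly like this (the package name Whale and its subroutines are invented; subroutines after the __DATA__ token aren't compiled until first called):
package Whale;
use SelfLoader;

sub spout {        # compiled at load time, as usual
    print "Thar she blows!\n";
}

1;
__DATA__

sub breach {       # compiled on demand, the first time it's called
    print "A mighty leap!\n";
}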
But an AUTOLOAD routine can also just emulate the routine and never define it. For example, let's pretend that any function that isn't defined should just call system with its arguments. All you'd do is this:
sub AUTOLOAD {
    my $program = $AUTOLOAD;
    $program =~ s/.*:://;    # trim package name
    system($program, @_);
}
date();
who('am', 'i');
ls('-l');
In fact, if you predeclare the functions you want to call that way, you don't even need the parentheses:
use subs qw(date who ls);
date;
who "am", "i";
ls "-l";
A more complete example of this is the standard Shell module described in Chapter 7, which can treat undefined subroutine calls as calls to programs.