Perl & XML

1.6. XML Gotchas

This section introduces topics we think you should keep in mind as you read the book. They are the source of many of the problems you'll encounter when working with XML.

Well-formedness

XML has built-in quality control. A document has to pass a minimal set of syntax rules before it can be blessed as well-formed XML. The XML specification tells a conforming parser to stop normal processing as soon as it hits a well-formedness error, so most parsers will simply refuse a document that breaks any of these rules. Make sure any data you feed them is clean.
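As a minimal sketch of what "refusing a document" looks like in practice, the following assumes the XML::LibXML module from CPAN is installed (this section does not name a particular parser, so that choice is ours). The helper name is_well_formed and the sample memo markup are hypothetical. The parser throws an exception on ill-formed input, so we trap it with eval.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return 1 if the string parses as well-formed XML, 0 otherwise.
# load_xml dies on a well-formedness error, so we trap the exception.
sub is_well_formed {
    my ($xml) = @_;
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return defined $doc ? 1 : 0;
}

my $good = '<memo><to>Jane</to></memo>';
my $bad  = '<memo><to>Jane</memo>';    # <to> is never closed

for my $xml ($good, $bad) {
    printf "%-28s %s\n", $xml,
        is_well_formed($xml) ? 'well-formed' : 'rejected';
}
```

Checking input this way before handing it to the rest of a program keeps a parse failure from surfacing somewhere less convenient.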

Character encodings

Now that we're in the 21st century, we have to pay attention to things like character encodings. Gone are the days when you could be content knowing only about ASCII, the little character set that could. Unicode is the new king, presiding over all major character sets of the world. XML prefers to work with Unicode, but there are many ways to represent it, including Perl's favorite Unicode encoding, UTF-8. You usually won't have to think about it, but you should still be aware of the potential for trouble when the encoding a program assumes is not the one the document actually uses.
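To make "many ways to represent it" concrete, here is a small sketch using Perl's core Encode module. It takes one Unicode character, encodes it two different ways, and then shows what happens when bytes are decoded with the wrong encoding. The variable names are ours, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{E9}";    # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# The same character, serialized as bytes in two encodings.
my $utf8   = encode('UTF-8',      $char);
my $latin1 = encode('ISO-8859-1', $char);

printf "characters:    %d\n", length $char;
printf "UTF-8 bytes:   %d\n", length $utf8;
printf "Latin-1 bytes: %d\n", length $latin1;

# Decoding UTF-8 bytes as if they were Latin-1 turns each byte
# into its own character, which is the classic garbled-text symptom.
my $wrong = decode('ISO-8859-1', $utf8);
printf "misread as:    %d characters\n", length $wrong;
```

One character becomes two bytes in UTF-8 but one byte in Latin-1, and reading the UTF-8 bytes as Latin-1 yields two characters instead of one.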

Namespaces

Not everyone works with or even knows about namespaces. They are a feature of XML whose usefulness is not immediately obvious, yet they are creeping into our reality slowly but surely. Namespaces categorize markup by declaring which vocabulary each tag comes from. With them, you can mix and match document types, blurring the distinctions between them. Equations in HTML? Markup as data in XSLT? Yes, and namespaces are the reason. Older modules don't have special support for namespaces, but the newer generation will. Keep it in mind.
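The "equations in HTML" case can be sketched as follows, again assuming XML::LibXML from CPAN as the namespace-aware parser (our choice, not one this section prescribes). The sample document and the helper name element_namespaces are hypothetical. The helper lists every element with the namespace it belongs to.

```perl
use strict;
use warnings;
use XML::LibXML;

# A document that mixes XHTML (default namespace) with MathML (prefix m).
my $xml = <<'XML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:m="http://www.w3.org/1998/Math/MathML">
  <body>
    <p>Area: <m:math><m:mi>x</m:mi></m:math></p>
  </body>
</html>
XML

# Return a list of [local name, namespace URI] pairs, in document order.
sub element_namespaces {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    return map { [ $_->localname, $_->namespaceURI // '' ] }
           $doc->documentElement->findnodes('//*');
}

printf "%-6s %s\n", @$_ for element_namespaces($xml);
```

Note that the prefix (m) is only shorthand. What identifies the vocabulary is the URI, which is why the code reports localname and namespaceURI rather than the raw tag name.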

Declarations

Declarations aren't part of the document per se; they just define pieces of it. That makes them weird, and easy to overlook. Remember that documents often use DTDs and have declarations for such things as entities and attributes. If you forget them, say by discarding the DTD while processing a document, entity references may no longer resolve and default attribute values can vanish.

Entities

Entities and entity references seem simple enough: they stand in for content that you'd rather not type in at that moment. Maybe the content is in another file, or maybe it contains characters that are difficult to type. The concept is simple, but the execution can be a royal pain. Sometimes you want to resolve references and sometimes you'd rather keep them there. Sometimes a parser wants to see the declarations; at other times it doesn't care. Entities can contain other entities to an arbitrary depth. They're tricky little beasties and we guarantee that if you don't give careful thought to how you're going to handle them, they will haunt you.
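The choice between resolving references and keeping them can be sketched with a parser option. This example assumes XML::LibXML from CPAN and its expand_entities option; the entity name co, the sample memo, and the helper serialize_root are hypothetical. The same document is parsed twice and the root element is serialized each time.

```perl
use strict;
use warnings;
use XML::LibXML;

# A document that declares one internal entity and refers to it once.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY co "Acme Widgets">
]>
<memo>From &co;</memo>
XML

# Parse with the given options and return the root element as a string.
sub serialize_root {
    my ($xml, %opts) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml, %opts);
    return $doc->documentElement->toString;
}

# With expansion on, the reference is replaced by the entity's text.
print serialize_root($xml, expand_entities => 1), "\n";

# With expansion off, the reference should survive in the output.
print serialize_root($xml, expand_entities => 0), "\n";
```

Which setting is right depends on the job: a program that extracts text usually wants references resolved, while one that edits a document and writes it back may want to leave them as the author typed them.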

Whitespace

According to XML, anything that isn't markup is character data, and all of it is potentially significant. This fact can lead to some surprising results. For example, it isn't always clear what should happen with whitespace. By default, an XML processor will preserve all of it, even the newlines you put after tags to make them more readable or the spaces you use to indent text. Some parsers will give you options to ignore space in certain circumstances, but there are no hard and fast rules.
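Here is a sketch of how indentation shows up in a parsed tree, assuming XML::LibXML from CPAN and its no_blanks option (one example of the parser-specific switches mentioned above). The sample list and the helper child_count are hypothetical. The helper counts the direct children of the root element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two <item> elements, indented and separated by newlines for readability.
my $xml = <<'XML';
<list>
  <item>one</item>
  <item>two</item>
</list>
XML

# Parse with the given options and count the root element's child nodes.
sub child_count {
    my ($xml, %opts) = @_;
    my $doc  = XML::LibXML->load_xml(string => $xml, %opts);
    my @kids = $doc->documentElement->childNodes;
    return scalar @kids;
}

printf "default parse:  %d child nodes\n", child_count($xml);
printf "no_blanks => 1: %d child nodes\n", child_count($xml, no_blanks => 1);
```

In the default parse, the whitespace before, between, and after the two items arrives as text nodes of its own, so code that expects only elements among the children has to skip them. With no_blanks the parser drops whitespace-only nodes where it judges them ignorable, which is a heuristic rather than a guarantee.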

In the end, Perl and XML are well suited for each other. There may be a few traps and pitfalls along the way, but with the generosity of various module developers, your path toward Perl/XML enlightenment should be well lit.




Copyright © 2002 O'Reilly & Associates. All rights reserved.