23.1 An Overview of XML Parsing
When your application must parse XML documents, your first, fundamental
choice is what kind of parsing to use. You can use
event-driven parsing, where the parser reads the
document sequentially and calls back to your application each time it
parses a significant aspect of the document (such as an element). Or
you can use object-based parsing, where the
parser reads the whole document and builds in-memory data structures,
representing the document, that you can then navigate. SAX is the
standard way to perform event-driven parsing, and DOM is the
standard way to perform object-based parsing. In each case there
are alternatives, such as direct use of expat for event-driven
parsing and pyRXP for object-based parsing, but I do not cover these
alternatives in this book. Another interesting possibility is offered
by pulldom, which is covered later in this
chapter.
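To make the contrast concrete, here is a minimal sketch that counts the same elements both ways. It assumes Python 3 and the standard library modules xml.sax and xml.dom.minidom; the sample document and the names DOC and ItemCounter are hypothetical, chosen only for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
DOC = b"<root><item>a</item><item>b</item><other/></root>"


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called back by the parser once per opening tag.
        if name == "item":
            self.count += 1


# Event-driven: the parser reads sequentially and calls the handler.
handler = ItemCounter()
xml.sax.parseString(DOC, handler)
sax_count = handler.count

# Object-based: the parser builds the whole tree, which is then navigated.
dom = minidom.parseString(DOC)
dom_count = len(dom.getElementsByTagName("item"))
```

Both approaches find the same elements. The difference is that the SAX version never holds the document in memory, while the DOM version leaves a complete tree available for further navigation.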
Event-driven parsing requires fewer resources than object-based
parsing, since it never needs to hold the whole document in memory.
This makes it particularly suitable when you need to parse very large
documents.
However, event-driven parsing requires you to structure your
application accordingly, performing your processing (and typically
building auxiliary data structures) in the callback methods that the
parser invokes. Object-based parsing gives you more flexibility about
the ways in which you can structure your application. It may be more
suitable when you need to perform very complicated processing, as
long as you can afford the extra resources needed for object-based
parsing (typically, this means that you are not dealing with very
large documents). Object-based approaches also support programs that
need to modify or create XML documents, as covered later in this
chapter.
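The following sketch illustrates both halves of this trade-off under stated assumptions: Python 3, the standard xml.sax and xml.dom.minidom modules, and a hypothetical sample document. The handler class TitleCollector is an illustrative name, not part of any library. The first part shows how an event-driven handler must keep its own state between callbacks; the second shows how a DOM tree can be modified in place.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
DOC = (b"<library><book><title>Dune</title></book>"
       b"<book><title>Emma</title></book></library>")


class TitleCollector(xml.sax.ContentHandler):
    """Builds an auxiliary list of title strings during parsing."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


# Event-driven: all processing happens inside the handler's methods.
collector = TitleCollector()
xml.sax.parseString(DOC, collector)
titles = collector.titles

# Object-based: the in-memory tree can be modified and written back out.
dom = minidom.parseString(DOC)
new_book = dom.createElement("book")
new_title = dom.createElement("title")
new_title.appendChild(dom.createTextNode("Ulysses"))
new_book.appendChild(new_title)
dom.documentElement.appendChild(new_book)
modified_xml = dom.documentElement.toxml()
```

Note how the handler needs the _buffer attribute just to remember where it is in the document, whereas the DOM code can simply reach for the nodes it wants.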
As a general guideline, when you are still undecided after studying
the various trade-offs, I suggest you try event-driven parsing when
you can see a reasonably direct way to perform your
program's tasks through this approach. Event-driven
parsing is more scalable; therefore, if your program can perform its
task via event-driven parsing, it will be applicable to larger
documents than it would be able to handle otherwise. If event-driven
parsing is too confining, try pulldom instead. I
suggest you consider (non-pull) DOM only when you
think DOM is the only way to perform your program's
tasks without excessive contortions. In that case DOM may be best, as
long as you can accept the resulting limitations, in terms of the
maximum size of documents that your program is able to support and
the costs in time and memory for processing.
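As a rough illustration of the middle option in this guideline, the sketch below uses pulldom to stream through a document and expand only the elements of interest into small DOM subtrees. It assumes Python 3 and the standard xml.dom.pulldom module; the sample document, the element names, and the threshold of 10 are hypothetical.

```python
from xml.dom import pulldom

# Hypothetical sample document, used only for illustration.
DOC = ("<orders>"
       "<order id='1'><total>5</total></order>"
       "<order id='2'><total>50</total></order>"
       "</orders>")

large_orders = []
events = pulldom.parseString(DOC)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "order":
        # Build a DOM subtree for this one element only.
        events.expandNode(node)
        total = node.getElementsByTagName("total")[0]
        # Join child text nodes in case the text arrived in pieces.
        text = "".join(child.data for child in total.childNodes)
        if int(text) > 10:
            large_orders.append(node.getAttribute("id"))
```

The application drives the loop itself, rather than being called back, and pays the cost of DOM construction only for the subtrees it chooses to expand.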