Chapter 10
Information retrieval and descriptive metadata

Information discovery

A core service provided by digital libraries is helping users find information. This chapter is the first of two on this subject. It begins with a discussion of catalogs, indexes, and other summary information used to describe objects in a digital library; the general name for this topic is descriptive metadata. This is followed by a section on the methods used to search bodies of text for specific information, the subject known as information retrieval. Chapter 11 extends these concepts to distributed searching: how to discover information that is spread across separate and different collections, or scattered over many computer systems.

These two chapters concentrate on methods used to search for specific information, but direct searching is just one of the strategies that people use to discover information. Browsing is the general term for the unstructured exploration of a body of information; it is a popular and effective method for discovering the unexpected. Most traditional libraries arrange their collections by subject classification to help browsing. Classification schemes, such as the Dewey Decimal Classification or the Library of Congress classification, provide both subject information and a hierarchical structure that can be used to organize collections. The most widely used web information service, Yahoo, is fundamentally a classification of web resources, augmented by searching. Digital libraries with their hyperlinks lend themselves to strategies that combine searching and browsing.

Users seek information in digital libraries for many reasons. The range of these needs illustrates why information discovery is such a complex topic, and why no single approach satisfies all users or fits all materials.

Descriptive metadata

Many methods of information discovery do not search the actual objects in the collections, but work from descriptive metadata about the objects. The metadata typically consists of a catalog or indexing record, or an abstract, one record for each object. Usually it is stored separately from the objects that it describes, but sometimes it is embedded in the objects.

Descriptive metadata is usually expressed as text, but it can describe materials in any medium or format: images, sound recordings, maps, computer programs, and other non-text materials, as well as textual documents. A single catalog can combine records for every variety of genre, media, and format. This enables users of digital libraries to discover materials in all media by searching textual records about the materials.

Descriptive metadata is usually created by professionals. Library catalogs and scientific indexes represent huge investments by skilled people, sustained over decades or even centuries. This economic fact is crucial to understanding current trends. On one hand, it is vital to build on the investments and the expertise behind them. On the other, there is great incentive to find cheaper and faster ways to create metadata, either by automatic indexing or with computer tools that enhance human expertise.

Catalogs

Catalog records are short records that provide summary information about a library object. The word catalog is applied to records that have a consistent structure, organized according to systematic rules. An abstract is a free text record that summarizes a longer document. Other types of indexing records are less formal than a catalog record, but have more structure than a simple abstract.

Library catalogs serve many functions, not only information retrieval. Some catalogs provide comprehensive bibliographic information that cannot be derived directly from the objects. This includes information about authors or the provenance of museum artifacts. For managing collections, catalogs contain administrative information, such as where items are stored, either online or on library shelves. Catalogs are usually much smaller than the collections that they represent; in conventional libraries, materials that are stored on miles of shelving are described by records that can be contained in a group of card drawers at one location or in an online database. Indexes to digital libraries can be mirrored for performance and reliability.

Information in catalog records is divided into fields and sub-fields with tags that identify them. Thus, there might be a field for an author, with a sub-field for a surname. Chapter 3 introduced cataloguing using the Anglo American Cataloguing Rules and the MARC format. MARC cataloguing is used for many types of material, including monographs, serials, and archives. Because of the labor required to create a detailed catalog record, materials are catalogued once, often by a national library such as the Library of Congress, and the records are distributed to other libraries through utilities such as OCLC. In digital libraries, the role of MARC and the related cataloguing rules is a source of debate. How far can traditional methods of cataloguing migrate to support new formats, media types, and methods of publishing? Currently, MARC cataloguing retains its importance for conventional materials; librarians have extended it to some of the newer types of object found in digital libraries, but MARC has not been adopted by organizations other than traditional libraries.

Abstracting and indexing services

The sciences and other technical fields rely on abstracting and indexing services more than catalogs. Each scientific discipline has a service to help users find information in journal articles. The services include Medline for medicine and biology, Chemical Abstracts for chemistry, and Inspec for physics, computing, and related fields. Each service indexes the articles from a large set of journals. The record for an article includes basic bibliographic information (authors, title, date, etc.), supplemented by subject information, organized for information retrieval. The details differ, but the services have many similarities. Since abstracting and indexing services emerged at a time when computers were slower and more expensive than today, the information is structured to support simple textual searches, but the records have proved to be useful in more flexible systems.

Scientific users frequently want information on a specific subject. Because of the subtleties of language, subject searching is unreliable unless there is indexing information that describes the subject of each object. The subject information can be an abstract, keywords, subject terms, or other information. Some services ask authors to provide keywords or an abstract, but this leads to gross inconsistencies. More effective methods have a professional indexer assign subject information to each item.

An effective but expensive approach is to use a controlled vocabulary. Where several terms could be used to describe a concept, one is used exclusively. Thus the indexer has a list of approved subject terms and rules for applying them. No other terms are permitted. This is the approach used in the Library of Congress subject headings and the National Library of Medicine's MeSH headings (see Panel 10.1).

Panel 10.1
MeSH - medical subject headings

The National Library of Medicine has provided information retrieval services for medicine and related fields since the 1960s. Medicine is a huge field with complex terminology. Since the same concept may be described by scientific terms or by a variety of terms in common use, the library has developed a controlled vocabulary thesaurus, known as MeSH. The library provides MeSH subject headings for each of the 400,000 articles that it indexes every year and for every book acquired by the library. These subject terms can then be used for information retrieval.

MeSH is a set of subject terms, with about 18,000 primary headings. In addition, there is a thesaurus of about 80,000 chemical terms. The terms are organized in a hierarchy. At the top are general terms, such as anatomy, organisms, and diseases. Going down the hierarchy, anatomy, for example, is divided into sixteen topics, beginning with body regions and musculoskeletal system; body regions is further divided into sections, such as abdomen, axilla, back; some of these are sub-divided until the bottom of the hierarchy is reached. To help the user, MeSH provides thousands of cross-references.

The success of MeSH depends upon the professional staff who maintain the thesaurus and the indexers who assign subject terms to documents. It also requires users or reference librarians who understand the field and are able to formulate queries using MeSH terms and the MeSH structures.

Controlled vocabulary requires trained indexers. It also requires skilled users, with tools to assist the users, because the terms used in a search query must be consistent with the terms assigned by the indexer. Medicine in the United States is especially fortunate in having a cadre of reference librarians who can support the users. In digital libraries, the trend is to provide users with tools that permit them to find information directly without the help of a reference librarian. A thesaurus, such as MeSH or the Art and Architecture Thesaurus (Panel 10.2), can be used to relate the terminology that a user provides to the controlled terms that have been used for indexing.
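As a simple illustration of this idea, the following sketch (in Python) maps common terms in a user's query to preferred headings before searching. The thesaurus entries shown are a tiny invented sample; a real thesaurus such as MeSH contains tens of thousands of headings maintained by specialists.

     # A minimal sketch of thesaurus-based query mapping.
     # The entries below are invented for illustration only.
     thesaurus = {
         "heart attack": "myocardial infarction",
         "high blood pressure": "hypertension",
         "cancer": "neoplasms",
     }

     def to_controlled_terms(query):
         """Replace each common phrase with its preferred heading."""
         mapped = query.lower()
         for common, preferred in thesaurus.items():
             mapped = mapped.replace(common, preferred)
         return mapped

     print(to_controlled_terms("Treatment after a heart attack"))
     # -> "treatment after a myocardial infarction"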

Panel 10.2
The Art and Architecture Thesaurus

The Art and Architecture Thesaurus was developed by the J. Paul Getty Trust as a controlled vocabulary for describing and retrieving information on fine art, architecture, decorative art, and material culture. It has almost 120,000 terms for objects, textual materials, images, architecture, and culture from all periods and all cultures, with an emphasis on Western civilization. The thesaurus can be used by archives, museums, and libraries to describe items in their collections. It can also be used to search for materials.

Serious work on the thesaurus began in the early 1980s, when the Internet was still an embryo, but the data was created in a flexible format which has allowed production of many versions, including an open-access version on the web, a printed book, and various computer formats. The Getty Trust has explicitly organized the thesaurus so that it can be used by computer programs, for information retrieval, and natural language processing.

The Art and Architecture Thesaurus is arranged into seven categories, each containing a hierarchy of terms. The categories are associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects. A single concept is represented by a cluster of terms, one of which is established as the preferred term, or descriptor. The thesaurus provides not only the terminology for objects, but also the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories or criticism.

The costs of developing and maintaining a large, specialized thesaurus are huge. Even in a mature field such as art and architecture, terminology changes continually, and the technical staff must support new technology. The Getty Trust is extremely rich but, even so, developing the thesaurus was a major project spread over many years.

The Dublin Core

Since 1995, an international group, led by Stuart Weibel of OCLC, has been working to devise a set of simple metadata elements that can be applied to a wide variety of digital library materials. This is known as the Dublin Core. The name comes from Dublin, Ohio, the home of OCLC, where the first meeting was held. Several hundred people have participated in the Dublin Core workshops and discussed the design by electronic mail. Their spirit of cooperation is a model of how people with diverse interests can work together. They have selected fifteen elements, which are summarized in Panel 10.3.

Panel 10.3
Dublin Core elements

The following fifteen elements form the Dublin Core metadata set. All elements are optional and all can be repeated. The descriptions given below are condensed from the official Dublin Core definitions, with permission from the design team.

  1. Title. The name given to the resource by the creator or publisher.
  2. Creator. The person or organization primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.
  3. Subject. The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.
  4. Description. A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
  5. Publisher. The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.
  6. Contributor. A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element (for example, editor, transcriber, and illustrator).
  7. Date. A date associated with the creation or availability of the resource.
  8. Type. The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary.
  9. Format. The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource.
  10. Identifier. A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs.
  11. Source. Information about a second resource from which the present resource is derived.
  12. Language. The language of the intellectual content of the resource.
  13. Relation. An identifier of a second resource and its relationship to the present resource. This element permits links between related resources and resource descriptions to be indicated. Examples include an edition of a work (IsVersionOf), or a chapter of a book (IsPartOf).
  14. Coverage. The spatial locations and temporal durations characteristic of the resource.
  15. Rights. A rights management statement, an identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource.

Simplicity is both the strength and the weakness of the Dublin Core. Whereas traditional cataloguing rules are long and complicated, requiring professional training to apply effectively, the Dublin Core can be described simply, but simplicity conflicts with precision. The team has struggled with this tension. Initially the aim was to create a single set of metadata elements, suitable for untrained people who publish electronic materials to describe their work. Some people continue to hold this minimalist view. They would like to see a simple set of rules that anybody can apply.

Other people prefer the benefits that come from more tightly controlled cataloguing rules and would accept the additional labor and cost. They point out that extra structure in the elements results in extra precision in the metadata records. For example, if entries in a subject field are drawn from the Dewey Decimal Classification, it is helpful to record that fact in the metadata. To further enhance the effectiveness of the metadata for information retrieval, several of the elements will have recommended lists of values. Thus, there might be a specified set of types and indexers would be recommended to select from the list.

The current strategy is to have two options, "minimalist" and "structuralist". The minimalist will meet the original criterion of being usable by people who have no formal training. The structured option will be more complex, requiring fuller guidelines and trained staff to apply them.

Automatic indexing

Cataloguing and indexing are expensive when carried out by skilled professionals. A rule of thumb is that each record costs about fifty dollars to create and distribute. In certain fields, such as medicine and chemistry, the demand for information is great enough to justify the expense of comprehensive indexing, but these disciplines are the exceptions. Even monograph cataloguing is usually restricted to an overall record of the monograph rather than detailed cataloguing of individual topics within a book. Most items in museums, archives, and library special collections are not catalogued or indexed individually.

In digital libraries, many items are worth collecting but the costs of cataloguing them individually cannot be justified. The numbers of items in the collections can be very large, and the manner in which digital library objects change continually inhibits long-term investments in catalogs. Each item may go through several versions in quick succession. A single object may be composed of many other objects, each changing independently. New categories of object are continually being devised, while others are discarded. Frequently, the user's perception of an object is the result of executing a computer program and is different with each interaction. These factors increase the complexity and cost of cataloguing digital library materials.

For all these reasons, professional cataloguing and indexing is likely to be less central to digital libraries than it is in traditional libraries. The alternative is to use computer programs to create index records automatically. Records created by automatic indexing are normally of poor quality, but they are inexpensive. A powerful search system will go a long way towards compensating for the low quality of individual records. The web search programs prove this point. They build their indexes automatically. The records are not very good, but the success of the search services shows that the indexes are useful. At least, they are better than the alternative, which is to have nothing. Panel 10.4 gives two examples of records that were created by automatic indexing.

Panel 10.4
Examples of automatic indexing

The two following records are typical of the indexing records that are created automatically by web search programs. They are lightly edited versions of records that were created by the Altavista system in 1997.

Digital library concepts. Key Concepts in the Architecture of the Digital Library. William Y. Arms Corporation for National Research Initiatives Reston, Virginia... http://www.dlib.org/dlib/July95/07arms.html - size 16K - 7-Oct-96 - English

Repository References. Notice: HyperNews at union.ncsa.uiuc.edu will be moving to a new machine and domain very soon. Expect interruptions. Repository References. This is a page. http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html - size 5K - 12-May-95 - English

The first of these examples shows automatic indexing at its best. It includes the author, title, date, and location of an article in an electronic journal. For many purposes, it is an adequate substitute for a record created by a professional indexer.

The second example shows some of the problems with automatic indexing. Nobody who understood the content would bother to index this web page. The information about location and date is probably all right, but the title is strange and the body of the record is simply the first few words of the page.

Much of the development that led to automatic indexing came out of research in text skimming. A typical problem in this field is how to organize electronic mail. A user has a large volume of electronic mail messages and wants to file them by subject. A computer program is expected to read through them and assign them to subject areas. This is difficult for people to do consistently and very difficult for a computer program, but steady progress has been made. The programs look for clues within the document. These clues may be structural elements, such as the subject field of an electronic mail message, they may be linguistic clues, or the program may simply recognize key words.

Automatic indexing also depends upon clues to be found in a document. The first of the examples in Panel 10.4 is a success, because the underlying web document provides useful clues. The Altavista indexing program was able to identify the title and author. For example, the page includes the tagged element:

     <title>Digital library concepts</title>

The author inserted these tags to guide web browsers in displaying the article. They are equally useful in providing guidance to automatic indexing programs.

One of the potential uses of mark-up languages, such as SGML or XML, is that the structural tags can be used by automatic indexing programs to build records for information retrieval. Within the text of a document, the string "Marie Celeste" might be the name of a person, a book, a song, a ship, a publisher, a play, or might not even be a name. With structural mark-up, the string can be identified and labeled for what it is. Thus, information provided by the mark-up can be used to distinguish specific categories of information, such as author, title, or date.
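As an illustration, the sketch below (in Python, using the standard html.parser module; the record layout is invented for the example) builds a crude index record from the <title> element and the first words of the text, roughly in the spirit of the records shown in Panel 10.4.

     # A minimal sketch of automatic indexing driven by HTML structure.
     from html.parser import HTMLParser

     class CrudeIndexer(HTMLParser):
         def __init__(self):
             super().__init__()
             self.in_title = False
             self.title = ""
             self.words = []

         def handle_starttag(self, tag, attrs):
             if tag == "title":
                 self.in_title = True

         def handle_endtag(self, tag):
             if tag == "title":
                 self.in_title = False

         def handle_data(self, data):
             if self.in_title:
                 self.title += data
             else:
                 self.words.extend(data.split())

     page = ("<html><head><title>Digital library concepts</title></head>"
             "<body>Key Concepts in the Architecture of the Digital Library."
             "</body></html>")
     indexer = CrudeIndexer()
     indexer.feed(page)
     record = {"title": indexer.title,
               "description": " ".join(indexer.words[:10])}
     print(record)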

Automatic indexing is fast and cheap. The exact costs are commercial secrets, but they are a tiny fraction of one cent per record. For the cost of a single record created by a professional cataloguer or indexer, computer programs can generate a hundred thousand or more records. It is economically feasible to index huge numbers of items on the Internet and even to index them again at frequent intervals.

Creators of catalogs and indexes can balance costs against perceived benefits. The most expensive forms of descriptive metadata are those created by the traditional methods of library catalogs and of indexing and abstracting services; structuralist Dublin Core will be moderately expensive, keeping most of the benefits while saving some costs; minimalist Dublin Core will be cheaper, but not free; automatic indexing has the poorest quality at a tiny cost.

Attaching metadata to content

Descriptive metadata needs to be associated with the material that it describes. In the past, descriptive metadata has usually been stored separately, as an external catalog or index. This has many advantages, but requires links between the metadata and the object it references. Some digital libraries are moving in the other direction, storing the metadata and the data together, either by embedding the metadata in the object itself or by having two tightly linked objects. This approach is convenient in distributed systems and for long-term archiving, since it guarantees that computer programs have access to both the data and the metadata at the same time.

Mechanisms for associating metadata with web pages have been a subject of considerable debate. For an HTML page, a simple approach is to embed the metadata in the page, using the special HTML <meta> tag, as in Table 10.1. These are the meta tags from an HTML description of the Dublin Core Element Set. Note that the choice of tags is a system design decision. The Dublin Core itself does not specify how the metadata is associated with the material.

Table 10.1
Metadata represented with HTML <meta> tags

<meta name="DC.subject"
content="dublin core metadata element set">

<meta name="DC.subject"
content="networked object description">

<meta name="DC.publisher"
content="OCLC Online Computer Library Center, Inc.">

<meta name="DC.creator"
content="Weibel, Stuart L., weibel@oclc.org.">

<meta name="DC.creator"
content="Miller, Eric J., emiller@oclc.org.">

<meta name="DC.title"
content="Dublin Core Element Set Reference Page">

<meta name="DC.date"
content="1996-05-28">

<meta name="DC.form" scheme="IMT"
content="text/html">

<meta name="DC.language" scheme="ISO639"
content="en">

<meta name="DC.identifier" scheme="URL"
content="http://purl.oclc.org/metadata/dublin_core">

Since meta tags cannot be used with file types other than HTML and rapidly become cumbersome, a number of organizations working through the World Wide Web Consortium have developed a more general structure known as the Resource Description Framework (RDF). RDF is described in Panel 10.5.

Panel 10.5
The Resource Description Framework (RDF)

The Resource Description Framework (RDF) is a method that has been developed for the exchange of metadata. It was developed by the World Wide Web Consortium, drawing together concepts from several other efforts, including the PICS format, which was developed to provide rating labels that identify violence, pornography, and similar characteristics of web pages. The Dublin Core team is working closely with the RDF designers.

A metadata scheme, such as Dublin Core, can be considered as having three aspects: semantics, syntax, and structure. The semantics describes how to interpret concepts such as date or creator. The syntax specifies how the metadata is expressed. The structure defines the relationships between the metadata elements, such as the concepts of day, month, and year as components of a date. RDF provides a simple but general structural model; it does not stipulate the semantics used by a metadata scheme. XML supplies the syntax, and is used to describe a metadata scheme and to exchange information between computer systems and among schemes.

The structural model consists of resources, property-types, and values. Consider the simple statement that Shakespeare is the author of the play Hamlet. In the Dublin Core metadata scheme, this can be represented as:

Resource              Property-type            Value
Hamlet   ----------->  creator    ----------->  Shakespeare
Hamlet   ----------->  type       ----------->  play

A different metadata scheme might use the term author in place of creator, and might use the term type with a completely different meaning. Therefore, the RDF mark-up would make explicit that this metadata is expressed in the Dublin Core scheme:

     <DC:creator> Shakespeare</DC:creator>
     <DC:type> play</DC:type>

To complete this example, Hamlet needs to be identified more precisely. Suppose that it is referenced by the (imaginary) URL, "http://hamlet.org/". Then the full RDF record, with XML mark-up, is:

     <RDF:RDF>
        <RDF:description RDF:about = "http://hamlet.org/">
           <DC:creator> Shakespeare</DC:creator>
           <DC:type> play</DC:type>
        </RDF:description>
     </RDF:RDF>

The mark-up in this record makes explicit that the terms description and about are defined in the RDF scheme, while creator and type are terms defined in the Dublin Core (DC). One more step is needed to complete this record: the schemes RDF and DC must be defined as XML namespaces.

The RDF structural model permits resources to have property-types that refer to other resources. For example, a database might include a record about Shakespeare with metadata about him, such as when and where he lived, and the various ways that he spelled his name. The DC:Creator property-type could reference this record as follows:

   <DC:creator RDF:about = "http://people.net/WS/">

In this manner, arbitrarily complex metadata descriptions can be built up from simple components. By using the RDF framework for the syntax and structure, combined with XML representation, computer systems can associate metadata with digital objects and exchange metadata from different schemes.
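As a sketch of how a program might assemble such a record, the following Python fragment uses the standard xml.etree.ElementTree module to generate the Hamlet description with explicit namespace declarations. The namespace URIs shown are given only for illustration; the authoritative URIs are defined by the schemes themselves.

     # A minimal sketch that generates the Hamlet record as RDF with XML
     # mark-up. The namespace URIs are illustrative; the schemes publish
     # the authoritative values.
     import xml.etree.ElementTree as ET

     RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     DC_NS = "http://purl.org/dc/elements/1.1/"
     ET.register_namespace("RDF", RDF_NS)
     ET.register_namespace("DC", DC_NS)

     rdf = ET.Element("{%s}RDF" % RDF_NS)
     description = ET.SubElement(rdf, "{%s}description" % RDF_NS,
                                 {"{%s}about" % RDF_NS: "http://hamlet.org/"})
     ET.SubElement(description, "{%s}creator" % DC_NS).text = "Shakespeare"
     ET.SubElement(description, "{%s}type" % DC_NS).text = "play"

     print(ET.tostring(rdf, encoding="unicode"))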

Techniques of information retrieval

The rest of this chapter is a brief introduction to information retrieval. Information retrieval is a field in which computer scientists and information professionals have worked together for many years. It remains an active area of research and is one of the few areas of digital libraries to have a systematic methodology for measuring the performance of various methods.

Basic concepts and terminology

The various methods of information retrieval build on some simple concepts to search large bodies of information. A query is a string of text, describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.

Some methods of information retrieval compare the query with every word in the entire text, without distinguishing the function of the various words. This is called full text searching. Other methods identify bibliographic or structural fields, such as author or heading, and allow searching on specified fields, such as "author = Gibbon". This is called fielded searching. Full text and fielded searching are both powerful tools, and modern methods of information retrieval often use the techniques in combination. Fielded searching requires some method of identifying the fields; full text searching does not. By taking advantage of the power of modern computers, full text searching can be effective even on unprocessed text, but heterogeneous texts of varying length, style, and content are difficult to search effectively, and the results can be inconsistent. The legal information systems Westlaw and Lexis are based on full text searching; they are the exceptions. When descriptive metadata is available, most services prefer either fielded searching or free text searching of abstracts or other metadata.

Some words occur so frequently that they are of little value for retrieval. Examples include common pronouns, conjunctions, and auxiliary verbs, such as "he", "and", "be", and so on. Most systems have a list of common words that are ignored both in building inverted files and in queries. This is called a stop list. The selection of stop words is difficult. The choice clearly depends upon the language of the text and may also be related to the subject matter. For this reason, instead of having a predetermined stop list, some systems use statistical methods to identify the most commonly used words and reject them. Even then, no system is perfect. There is always the danger that a perfectly sensible query might be rejected because every word is in the stop list, as with the quotation, "To be or not to be?"
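The following sketch (Python, with a tiny invented stop list) shows the idea; note that the quotation in the last line loses every one of its words.

     # A minimal sketch of applying a stop list to a query. The stop list
     # here is a tiny invented sample; real systems use longer lists or
     # derive one statistically from the collection.
     STOP_WORDS = {"to", "be", "or", "not", "and", "he", "the", "a", "of"}

     def remove_stop_words(query):
         return [word for word in query.lower().split()
                 if word not in STOP_WORDS]

     print(remove_stop_words("The architecture of the digital library"))
     # -> ['architecture', 'digital', 'library']
     print(remove_stop_words("To be or not to be"))
     # -> []  (every word is in the stop list)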

Panel 10.6
Inverted files

An inverted file is a list of the words in a set of documents and their locations within those documents. Here is a small part of an inverted file.

Word        Document    Location
abacus          3           94
               19            7
               19          212
actor           2           66
               19          200
               29           45
aspen           5           43
atoll          11            3
               34           40

This inverted file shows that the word "abacus" is word 94 in document 3, and words 7 and 212 in document 19; the word "actor" is word 66 in document 2, word 200 in document 19, and word 45 in document 29; and so on. The list of locations for a given word is called an inverted list.

An inverted file can be used to search a set of documents to find every occurrence of a single search term. In the example above, a search for the word "actor" would look in the inverted file and find that the word appears in documents 2, 19, and 29. A simple reference to an inverted file is typically a fast operation for a computer.

Most inverted lists contain the location of the word within the document. This is important for displaying the result of searches, particularly with long documents. The section of the document can be displayed prominently with the search terms highlighted.

Since inverted files contain every word in a set of documents, except stop words, they are large. For typical digital library materials, the inverted file may approach half the total size of all the documents, even after compression. Thus, at the cost of storage space, an inverted file provides a fast way to find every occurrence of a single word in a collection of documents. Most methods of information retrieval use inverted files.
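A minimal sketch of how such a file might be built in memory is given below (Python; stop words are skipped only if a list is supplied, and compression and the other refinements of a production system are omitted). Each word maps to an inverted list of (document, word position) pairs.

     # A minimal sketch of building an inverted file in memory.
     # Each word maps to an inverted list of (document id, position) pairs.
     def build_inverted_file(documents, stop_words=frozenset()):
         inverted = {}
         for doc_id, text in documents.items():
             for position, word in enumerate(text.lower().split(), start=1):
                 if word in stop_words:
                     continue
                 inverted.setdefault(word, []).append((doc_id, position))
         return inverted

     docs = {2: "the actor bowed",
             19: "an abacus for an actor",
             29: "one more actor"}
     index = build_inverted_file(docs, stop_words={"the", "an", "for"})
     print(index["actor"])     # -> [(2, 2), (19, 5), (29, 3)]
     print(index["abacus"])    # -> [(19, 2)]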

Boolean searching

Panel 10.6 describes inverted files, the basic computational method that is used to compare the search terms against a collection of textual documents. Boolean queries consist of two or more search terms, related by logical operators, such as and, or, or not. Consider the query "abacus and actor" applied to the inverted file in Panel 10.6. The query includes two search terms separated by a Boolean operator. The first stage in carrying out this query is to read the inverted lists for "abacus" (documents 3 and 19) and for "actor" (documents 2, 19, and 29). The next stage is to compare the two lists for documents that are in both lists. Both words are in document 19, which is the only document that satisfies the query. When the inverted lists are short, Boolean searches with a few search terms are almost as fast as simple queries, but the computational requirements increase dramatically with large collections of information and complex queries.

Inverted files can be used to extend the basic concepts of Boolean searching. Since the locations of words within documents are recorded in the inverted lists, they can be used for searches that specify the relative position of two words, such as a search for the word "West" followed by "Virginia". They can also be used for truncation, to search for words that begin with certain letters. In many search systems, a search for "comp?" will find all words that begin with the four letters "comp". This will find the related words "compute", "computing", "computer", "computers", and "computation". Unfortunately, this approach will not distinguish unrelated words that begin with the same letters, such as "company".
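Using an inverted file of the kind sketched in Panel 10.6, an and query can be carried out by intersecting the sets of documents in the two inverted lists, and truncation by matching word prefixes. The following sketch shows both (the function names are invented for the example).

     # A minimal sketch of a Boolean "and" query and of truncation over an
     # inverted file (word -> list of (document id, position) pairs).
     def docs_for(inverted, term):
         return {doc for doc, _ in inverted.get(term, [])}

     def boolean_and(inverted, term1, term2):
         return docs_for(inverted, term1) & docs_for(inverted, term2)

     def truncated(inverted, prefix):
         """Documents containing any word that begins with the given letters."""
         result = set()
         for word in inverted:
             if word.startswith(prefix):
                 result |= docs_for(inverted, word)
         return result

     inverted = {"abacus": [(3, 94), (19, 7), (19, 212)],
                 "actor": [(2, 66), (19, 200), (29, 45)]}
     print(boolean_and(inverted, "abacus", "actor"))    # -> {19}
     print(truncated(inverted, "a"))                    # -> {2, 3, 19, 29}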

Ranking closeness of match

Boolean searching is a powerful tool, but it finds exact matches only. A search for "library" will miss "libraries"; "John Smith" and "J. Smith" are not treated as the same name. Yet everybody recognizes that these are similar. A range of techniques address such difficulties.

The modern approach is not to attempt to match documents exactly against a query but to define some measure of similarity between a query and each document. Suppose that the total number of different words in a set of documents is n. A given document can be represented by a vector in n-dimensional space. If the document contains a given word, the vector has value 1 in the corresponding dimension, otherwise 0. A query can also be represented by a vector in the same space. The closeness with which a document matches a query is measured by how close these two vectors are to each other, for example by the angle between them in n-dimensional space. Once these measures have been calculated for every document, the results can be ranked from the best match to the least good. Several ranking techniques are variants of the same general concepts. A variety of probabilistic methods make use of the statistical distribution of the words in the collection. These methods derive from the observation that the exact words chosen by an author to describe a topic or by a user to express a query were chosen from a set of possibilities, but that other words might be equally appropriate.
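The sketch below shows the simplest, unweighted form of this approach: the query and each document are reduced to sets of distinct words (in effect, 0/1 vectors over the vocabulary), and the cosine of the angle between the two vectors is used to rank the documents. Production systems weight the terms, for example by frequency and rarity, which this sketch omits.

     # A minimal sketch of vector-space ranking with unweighted (0/1) vectors.
     # For binary vectors, the cosine of the angle between query and document
     # is the size of their overlap divided by the geometric mean of their sizes.
     import math

     def cosine(query_words, doc_words):
         q, d = set(query_words), set(doc_words)
         if not q or not d:
             return 0.0
         return len(q & d) / math.sqrt(len(q) * len(d))

     documents = {
         "doc1": "ranking methods for text retrieval".split(),
         "doc2": "a play about a prince of denmark".split(),
     }
     query = "methods for ranking text".split()
     ranked = sorted(documents,
                     key=lambda name: cosine(query, documents[name]),
                     reverse=True)
     print(ranked)     # -> ['doc1', 'doc2']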

Natural language processing and computational linguistics

The words in a document are not simply random strings of characters. They are words in a language, such as English, arranged into phrases, sentences, and paragraphs. Natural language processing is the branch of computer science that uses computers to interpret and manipulate words as part of a language. The spelling checkers that are used with word processors are a well-known application. They use methods of natural language processing to suggest alternative spellings for words that they do not recognize.

Computational linguistics deals with grammar and linguistics. One of the achievements of computational linguistics has been to develop computer programs that can parse almost any sentence with good accuracy. A parser analyzes the structure of sentences. It categorizes words by part of speech (verb, noun, adjective, etc.), groups them into phrases and clauses, and identifies the structural elements (subject, verb, object, etc.). For this purpose, linguists have been required to refine their understanding of grammar, recognizing far more subtleties than were contained in traditional grammars. Considerable research in information retrieval has been carried out using noun phrases. In many contexts, the content of a sentence can be found by extracting the nouns and noun phrases and searching on them. This work has not been restricted to English, but has been carried out for many languages.

Parsing requires an understanding of the morphology of words, that is, variants derived from the same stem, such as plurals (library, libraries) and verb forms (look, looks, looked). For information retrieval, it is often effective to reduce morphological variants to a common stem and to use the stem as a search term. This is called stemming. Stemming is more effective than truncation since it separates words with totally different meanings, such as "computer" from "company", while recognizing that "computer" and "computing" are morphological variants from the same stem. In English, where the stem is almost always at the beginning of the word, stemming can be carried out by truncating words and perhaps making adjustments to the final few letters. In other languages, such as German, it is also necessary to trim at the beginning of words.
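A full stemmer for English, such as Porter's algorithm, applies a long sequence of rules and conditions; the fragment below is only a toy illustration of suffix stripping, with a handful of invented rules, to show the flavor of the approach.

     # A toy illustration of stemming by suffix stripping. The rules below
     # are invented for the example; real stemmers use many more rules.
     def crude_stem(word):
         word = word.lower()
         if word.endswith("ies"):
             return word[:-3] + "y"           # libraries -> library
         if word.endswith("ing") and len(word) > 5:
             return word[:-3]                 # looking -> look
         if word.endswith("ed") and len(word) > 4:
             return word[:-2]                 # looked -> look
         if word.endswith("s") and not word.endswith("ss"):
             return word[:-1]                 # looks -> look
         return word

     for w in ["library", "libraries", "look", "looks", "looked", "looking"]:
         print(w, "->", crude_stem(w))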

Computational linguists have developed a range of dictionaries and other tools, such as lexicons and thesauruses, that are designed for natural language processing. A lexicon contains information about words, their morphological variants, and their grammatical usage. A thesaurus relates words by meaning. Some of these tools are general purpose; others are tied to specific disciplines. Two were described earlier in this chapter: the Art and Architecture Thesaurus and the MeSH headings for medicine. Linguistic methods can greatly augment information retrieval. By recognizing words as more than random strings of characters, they can recognize synonyms ("car" and "automobile"), relate a general term and a particular instance ("science" and "chemistry"), or relate a technical term and the vernacular ("cranium" and "skull"). The creation of a lexicon or thesaurus is a major undertaking and is never complete. Languages change continually, especially the terminology of fields in which there is active research.

User interfaces and information retrieval systems

Information retrieval systems depend for their effectiveness on the user making best use of the tools provided. When the user is a trained medical librarian or a lawyer whose legal education included training in search systems, these objectives are usually met. Untrained users typically do much less well at formulating queries and understanding the results.

A feature of the vector-space and probabilistic methods of information retrieval is that they are most effective with long queries. An interesting experiment is to use a very long query, perhaps an abstract from a document. Using this as a query is equivalent to asking the system to find documents that match the abstract. Many modern search systems are remarkably effective when given such an opportunity, but methods that are based on vector space or linguistic techniques require a worthwhile query to display their full power.

Statistics of the queries that are used in practice show that most queries consist of a single word, which is a disappointment to the developers of powerful retrieval systems. One reason for these short queries is that many users made their first searches on Boolean systems, where the only results found are exact matches, so that a long query usually finds no matches. These early systems had another characteristic that encouraged short queries: when faced with a long or complex query, their performance deteriorated terribly. Users learned to keep their queries short. Habits that were developed with these systems have been retained even with more modern systems.

However, the tendency of users to supply short queries is more entrenched than can be explained by these historical factors, or by the tiny input boxes sometimes provided. The pattern is repeated almost everywhere. People appear to be inhibited from using long queries. Another unfortunate characteristic of users, which is widely observed, is that few people read even the simplest instructions. Digital libraries are failing to train their users in effective searching, and users do not take advantage of the potential of the systems that they use.

Evaluation

Information retrieval has a long tradition of performance evaluation. Two long-standing criteria are precision and recall. Each refers to the results from carrying out a single search on a given body of information. The result of such a search is a set of hits. Ideally every hit would be relevant to the original query, and every relevant item in the body of information would be found. In practice, it usually happens that some of the hits are irrelevant and that some relevant items are missed by the search.

Suppose that, in a collection of 10,000 documents, 50 are on a specific topic. An ideal search would find these 50 documents and reject all others. An actual search identifies 25 documents of which 20 prove to be relevant but 5 were on other topics. In this instance, the precision is 20 out of 25, or 0.8. The recall is 20 out of 50, or 0.4.
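The arithmetic for this example is simple; written out as code, it is:

     # Precision and recall for the example in the text: a collection of
     # 10,000 documents, 50 of them relevant; the search returns 25 hits,
     # of which 20 are relevant.
     relevant_in_collection = 50
     hits_returned = 25
     relevant_hits = 20

     precision = relevant_hits / hits_returned          # 20 / 25 = 0.8
     recall = relevant_hits / relevant_in_collection    # 20 / 50 = 0.4
     print(precision, recall)                           # -> 0.8 0.4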

Precision is much easier to measure than recall. To calculate precision, a knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined. Recall is difficult to measure, since there is no way to know all the items in a collection that satisfy a specific query other than to go systematically through the entire collection, looking at every object to decide whether it fits the criteria. In this example, all 10,000 documents must be examined. With large numbers of documents, this is an imposing task.

Tipster and TREC

Tipster

Tipster was a long-running project sponsored by DARPA to improve the quality of text processing methods. The focus was on several problems, all of which are important in digital libraries. Research on document detection combines information retrieval on stored documents with identifying relevant documents from a stream of new text. Information extraction is the capability to locate specified information within a text, and summarization is the capability to condense a document or collection.

Over the years, Tipster has moved its emphasis from standard information retrieval to the development of components that can tackle specific tasks and architectures for sharing these components. The architecture is an ambitious effort, based on concepts from object-oriented programming, to define standard classes for the basic components of textual materials, notably documents, collections, attributes, and annotations.

The TREC conferences

TREC is the acronym for the annual series of Text Retrieval Conferences, where researchers can demonstrate the effectiveness of their methods on standard bodies of text. The TREC conferences are an outstanding example of quantitative research in digital libraries. They are the creation of the National Institute of Standards and Technology, with the help of many other organizations. The organizers of the conferences have created a corpus of several million textual documents, a total of more than five gigabytes of data. Researchers evaluate their work by attempting a standard set of tasks. One task is to search the corpus for topics provided by a group of twenty surrogate users. Another task evaluates systems that match a stream of incoming documents against standard queries. Participants at TREC conferences include large commercial companies, small information retrieval vendors, and university research groups. By evaluating their different methods on the same large collection they are able to gauge their strengths and enhance their systems.

The TREC conferences provide an opportunity to compare the performance of different techniques, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced machine learning. In later years, TREC has introduced a number of smaller tracks to evaluate other aspects of information retrieval, including Chinese texts, spoken documents, and cross-language retrieval. There is also a track that is experimenting with methods for evaluating interactive retrieval.

The TREC conferences have had enormous impact on research in information retrieval. This is an impressive program and a model for all areas of digital libraries research.

The criteria of precision and recall have been of fundamental importance in the development of information retrieval since they permit comparison of different methods, but they were devised in days when computers were slower and more expensive than today. Information retrieval then consisted of a single search of a large set of data. Success or failure was a one-time event. Today, searching is usually interactive. A user will formulate a query, carry out an initial search, examine the results, and repeat the process with a modified query. The effective precision and recall should be judged by the overall result of a session, not of a single search. In the jargon of the field, this is called searching "with a human in the loop".

Performance criteria, such as precision and recall, measure technical properties of computer systems. They do not measure how a user interacts with a system, or what constitutes an adequate result of a search. Many newer search programs have a strategy of ranking all possible hits. This creates a high level of recall at the expense of many irrelevant hits. Ideally, the higher-ranking hits will have high precision, with a long tail of spurious hits further down the ranking. Criteria are needed that measure the effectiveness of the ranking in giving high ranks to the most relevant items. This chapter began by recognizing that users look for information for many different reasons and use many strategies to seek information. Sometimes they are looking for specific facts; sometimes they are exploring a topic; only rarely are they faced with the standard information retrieval problem, which is to find every item relevant to a well-defined topic, with the minimal number of extraneous items. With interactive computing, users do not carry out a single search. They iterate through a series of steps, combining searching, browsing, interpretation, and filtering of results. The effectiveness of information discovery depends upon the users' objectives and how well the digital library meets them.


