Digital Libraries: Chapter 12 (1999)

Chapter 12
Object models, identifiers, and structural metadata

Materials in digital library collections

Information comes in many forms and formats, each of which must be captured, stored, retrieved, and manipulated. Much of the early development of digital libraries concentrated on material that has a direct analog to some physical format. These materials can usually be represented as simply structured computer files. Digital libraries can go far beyond such simple digital objects; they include anything that can be represented in a digital format. The digital medium allows for new types of library objects such as software, simulations, animations, movies, slide shows, and sound tracks, with new ways to structure material. Computing has introduced its own types of object: spread sheets, databases, symbolic mathematics, hypertext, and more. Increasingly, computers and networks support continuous streams of digital information, notably speech, music, and video. Even the simplest digital objects may come in many versions and be replicated many times.

Methods for managing this complexity fall into several categories: identifiers for digital objects, data types which specify what the data represents, and structural metadata to represent the relationship between digital objects and their component parts. In combination, these techniques create an object model, a description of some category of information that enables computer systems to store and provide access to complex information. In looking at these topics, interoperability and long term persistence are constant themes. Information in today's digital libraries must be usable many years from now, using computer systems that have not yet been imagined.

Works, expressions, manifestations, and items

Users of a digital library usually want to refer to items at a higher level of abstraction than a file. Common English terms, such as "report", "computer program", or "musical work" often refer to many digital objects that can be grouped together. The individual objects may have different formats, minor differences of content, different usage restrictions, and so on, but usually the user considers them as equivalent. This requires a conceptual model that is able to describe content at various levels of abstraction.

Chapter 1 mentioned the importance of distinguishing between the underlying intellectual work and the individual items in a digital library, and the challenges of describing such differences in a manner that makes sense for all types of content. In a 1998 report, an IFLA study of the requirements for bibliographic records proposed the following four levels for describing content.

Work. A work is the underlying abstraction, such as The Iliad, Beethoven's Fifth Symphony, or the Unix operating system.
Expression. A work is realized through an expression. Thus, The Iliad was first expressed orally, then it was written down as a fixed sequence of words. A musical work can be expressed as a printed score or by any one of many performances. Computer software, such as Unix, has separate expressions as source code and machine code.
Manifestation. An expression is given form in one or more manifestations. The text of The Iliad has been manifest in numerous manuscripts and printed books. A musical performance can be distributed on CD, or broadcast on television. Software is manifest as files, which may be stored or transmitted in any digital medium.
Item. When many copies are made of a manifestation, each is a separate item, such as a specific copy of a book or computer file.
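
The four levels can be pictured as a nested data structure. The following is a minimal sketch, not part of the IFLA report; the class names, field names, and file locations are assumptions chosen for illustration, loosely following the Unix example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One concrete copy, e.g. a single file or a single printed book."""
    location: str


@dataclass
class Manifestation:
    """A form given to an expression, e.g. a printed edition or a file format."""
    form: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One realization of a work, e.g. source code versus machine code."""
    description: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The underlying abstraction, e.g. The Iliad."""
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def all_items(self) -> List[Item]:
        """Collect every item that ultimately derives from this work."""
        return [
            item
            for expression in self.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items
        ]


# Hypothetical example data; the locations are invented for illustration.
unix = Work(
    title="Unix operating system",
    expressions=[
        Expression("source code", [
            Manifestation("tar archive", [
                Item("server-a:/src/unix.tar"),
                Item("server-b:/mirror/unix.tar"),
            ]),
        ]),
        Expression("machine code", [
            Manifestation("executable file", [Item("server-a:/bin/unix")]),
        ]),
    ],
)

print(len(unix.all_items()), "items derive from the work", unix.title)
```

A user who asks for "Unix" is referring to the work; the library has to decide which of the items underneath it to deliver.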

Clearly, there are many subtleties buried in these four levels. The distinctions between versions, editions, translations, and other variants are a matter for judgment, not for hard rules. Some works are hard to fit into the model, such as jazz music where each performance is a new creation, or paintings where the item is the work. Overall, however, the model holds great promise as a framework for describing this complicated subject.

Expressions

Multimedia

Chapters 9, 10, and 11 paid particular attention to works that are expressed as texts. Textual materials have special importance in digital libraries, but object models have to support all media and all types of object. Some non-textual objects are files of fixed information. They include the digital equivalents of familiar objects, such as maps, audio recordings and video, and other objects where the user is provided with a direct rendering of the stored form of a digital object. Even such apparently straightforward materials have many subtleties when the information is in some medium other than text. This section looks at three examples. The first is Panel 12.1, which describes the Alexandria Digital Library of geospatial information.

Panel 12.1
Geospatial collections: the Alexandria library

The Alexandria Digital Library at the University of California, Santa Barbara is led by Terence Smith. It was one of the six projects funded by the Digital Libraries Initiative from 1994 to 1998. The collections in Alexandria cover any data that is referenced by a geographical footprint. This includes all classes of terrestrial maps, aerial and satellite photographs, astronomical maps, databases, and related textual information. The project combines a broad program of research with practical implementation at the university's map library.

These collections have proved to be a fertile area for digital libraries research. Geospatial information is of interest in many fields: cartography, urban planning, navigation, environmental studies, agriculture, astronomy, and more. The data comes from many sources: survey data, digital photographs, printed maps, and analog images. A digital library of these collections faces many of the same issues as a digital library of textual documents, but forces the researchers to examine every topic again to see which standard techniques can be used and which need to be modified.

Information retrieval and metadata

With geospatial data, information retrieval concentrates on coverage or scope, in contrast with other categories of material, where the focus is on subject matter or bibliographic information such as author or title. Coverage defines the geographical area covered, such as the city of Santa Barbara or the Pacific Ocean. Scope describes the varieties of information, such as topographical features, political boundaries, or population density. Alexandria provides several methods for capturing such information and using it for retrieval.

Coordinates of latitude and longitude provide basic metadata for maps and for geographical features. Systematic procedures have been developed for capturing such data from existing maps. A large gazetteer has been created which contains millions of place names from around the world; it forms a database and a set of procedures that translate between different representations of geospatial references, e.g., place names, geographic features, coordinates, postal codes, and census tracts. The Alexandria search engine is tailored to the peculiarities of searching for place names. Researchers are making steady progress at the difficult task of feature extraction, using automatic programs to identify objects in aerial photographs or printed maps, but this is a topic for long-term research.
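
Retrieval by coverage can be sketched as a two-step lookup: a gazetteer translates a place name into a footprint, and the footprint is compared with the footprints recorded as metadata. The sketch below is an illustration only, not Alexandria's implementation; the bounding-box representation, the names Footprint, GAZETTEER, CATALOG, and search_by_place, and the approximate coordinates are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Footprint:
    """A latitude/longitude bounding box in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other: "Footprint") -> bool:
        """True if the two boxes share any area (ignores the date line)."""
        return (
            self.south <= other.north
            and other.south <= self.north
            and self.west <= other.east
            and other.west <= self.east
        )


# Hypothetical gazetteer: place name -> approximate footprint.
GAZETTEER: Dict[str, Footprint] = {
    "santa barbara": Footprint(34.3, -119.9, 34.5, -119.6),
}

# Hypothetical catalog: item title -> footprint recorded as metadata.
CATALOG: Dict[str, Footprint] = {
    "Aerial photograph, Santa Barbara harbor": Footprint(34.39, -119.70, 34.42, -119.67),
    "Street map of Pittsburgh": Footprint(40.36, -80.10, 40.50, -79.86),
}


def search_by_place(name: str) -> List[str]:
    """Translate a place name to a footprint, then return overlapping items."""
    query = GAZETTEER[name.lower()]
    return [title for title, fp in CATALOG.items() if fp.overlaps(query)]


print(search_by_place("Santa Barbara"))
```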

Computer systems and user interfaces

Digitized maps and other geospatial information create large files of data. Alexandria has carried out research in applying methods of high-performance computing to digital libraries. Wavelets are a method that the library has exploited both for storage and in user interfaces. They provide a multi-level decomposition of an image, in which the first level is a small coarse image that can be used as a thumbnail. Extra levels provide greater detail at the expense of larger volumes of data.
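
The idea of a multi-level decomposition can be illustrated with the Haar wavelet, the simplest member of the family. This is a sketch under stated assumptions: the text does not say which wavelets Alexandria uses, the example works on a single row of pixel values rather than a two-dimensional image, and the function names are invented for illustration.

```python
from typing import List, Tuple


def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Split a signal of even length into pairwise averages and differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def decompose(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Return the coarsest approximation plus detail layers, coarsest first."""
    coarse: List[float] = list(signal)
    detail_layers: List[List[float]] = []
    for _ in range(levels):
        coarse, details = haar_step(coarse)
        detail_layers.insert(0, details)
    return coarse, detail_layers


def refine(coarse: List[float], details: List[float]) -> List[float]:
    """Add one detail layer to double the resolution."""
    finer: List[float] = []
    for average, detail in zip(coarse, details):
        finer.extend([average + detail, average - detail])
    return finer


row = [9.0, 7.0, 3.0, 5.0, 6.0, 10.0, 2.0, 6.0]  # one row of pixel values
thumbnail, layers = decompose(row, levels=2)
print("thumbnail:", thumbnail)

# Sending further layers restores the detail, at the cost of more data.
restored = thumbnail
for layer in layers:
    restored = refine(restored, layer)
print("restored: ", restored)
```

The coarse values can be sent first and displayed at once; each further layer doubles the resolution, so no information is transmitted twice.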

Alexandria has developed several user interfaces, each building on earlier experience. Common themes have been the constraints of the small size of computer displays and the slow performance of the Internet in delivering large files to users. Good design helps to mitigate the first, though it is impossible to fit a large image and a comprehensive search interface onto a single screen. To enhance performance, every attempt is made never to transmit the same information twice. The user interfaces retain state throughout a session, so that a user can leave the system and return to the same place without having to repeat any steps.

The next example of a format other than text looks at searching video collections. Panel 12.2 describes how the Informedia digital library has combined several methods of information retrieval to build indexes automatically and search video segments. Individually, each method is imprecise, but in combination results are achieved for indexing and retrieval that are substantially better than could be obtained from any single method. The team use the term "multi-modal" to describe this combination of methods.

Panel 12.2
Informedia: multi-modal information retrieval

Chapter 8 introduced the Informedia digital library of segments of digitized video and described some of the user interface concepts. Much of the research work of the project aims at indexing and searching video segments, with no assistance from human indexers or catalogers.

The multi-modal approach to information retrieval

The key word in Informedia is "multi-modal". Many of the techniques used, such as identifying changes of scene, use computer programs to analyze the video materials for clues. They analyze the video track, the sound track, the closed captioning if present, and any other information. Individually, the analysis of each mode gives imperfect information but combining the evidence from all can be surprisingly effective.

Informedia builds on a number of methods from artificial intelligence, such as speech recognition, natural language processing, and image recognition. Research in these fields has been carried out by separate research projects; Informedia brings them together to create something that is greater than the sum of its parts.

Adding material to the library

The first stage in adding new materials to the Informedia collection is to take the incoming video material and to divide it into segments by topics. The computer program uses a variety of techniques of image and sound processing to look for clues as to when one topic ends and another begins. For example, with materials from broadcast television, the gap intended for advertisements often coincides with a change of topic.

The next stage is to identify any text associated with the segment. This is obtained by speech recognition on the sound track, by identifying any captions within the video stream, and from closed captioning if present. Each of these inputs is prone to error. The next phase is to process the raw text with a variety of tools from natural language processing to create an approximate index record that is loaded into the search system.

Speech recognition

The methods of information discovery discussed in Chapters 10 and 11 can be applied to audio material, such as audio tapes and the sound track of video, if the spoken word can be converted to computer text. This conversion proves to be a tough computing problem, but steady progress has been made over the years, helped by ever increasing computer power.

Informedia has to tackle some of the hardest problems in speech recognition, including speaker independence, indistinct speech, noise, unlimited vocabulary, and accented speech. The computer program has to be independent of who is speaking. The speech on the sound track may be indistinct, perhaps because of background noise or music. It may contain any word in the English language including proper nouns, slang, and even foreign words. Under these circumstances, even a human listener misses some words. Informedia successfully recognizes about 50 to 80 percent of the words, depending on the characteristics of the specific video segment.

Searching the Informedia collection

To search the Informedia collection, a user provides a query either by typing or by speaking it aloud to be processed by the speech recognition system. Since there may be errors in the recognition of a spoken query and since the index is known to be built from inexact data, the information retrieval uses a ranking method that identifies the best apparent matches. The actual algorithm is based on the same research as the Lycos web search program and the index uses the same underlying retrieval system.

The final example in this section looks at the problems of delivering real-time information, such as sound recordings, to users. Digitized sound recordings are an example of continuous streams of data, requiring a special method of dissemination, so that the data is presented to the user at the appropriate pace. Sound recordings are on the boundary of what can reasonably be transmitted over the Internet as it exists today. User interfaces have a choice between real-time transmission, usually of indifferent quality, and batch delivery, requiring the user to wait for higher quality sound to be transmitted more slowly. Panel 12.3 describes RealAudio, one way to disseminate low-quality sound recordings within the constraints of today's Internet.

Panel 12.3
RealAudio

One hour of digitized sound of CD quality requires 635 megabytes of storage if uncompressed. This poses problems for digital libraries. The storage requirements for any substantial collection are huge and transmission needs high-speed networks. Uncompressed sound of this quality challenges even links that run at 155 megabits/second. Since most local area networks share Ethernets that run at less than a tenth of this speed and dial-up links are much slower, some form of compression is needed.
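
The storage figure can be checked with a short calculation. The parameters below are the usual ones for compact disc audio (44,100 samples per second, 16 bits per sample, two channels); they are not stated in the text and are supplied here as assumptions.

```python
SAMPLE_RATE_HZ = 44_100    # samples per second per channel (assumed CD rate)
BYTES_PER_SAMPLE = 2       # 16-bit samples (assumed)
CHANNELS = 2               # stereo (assumed)
SECONDS_PER_HOUR = 3_600

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
bytes_per_hour = bytes_per_second * SECONDS_PER_HOUR

print(f"{bytes_per_hour / 1_000_000:.0f} megabytes per hour")
```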

RealAudio is a method of compression and an associated protocol for transmitting digitized sound. In RealAudio format, one hour of sound requires about 5 megabytes of storage. Transmission uses a streaming protocol between the repository where the information is stored and a program running on the user's computer. When the user is ready, the repository transmits a steady sequence of sound packets. As they arrive at the user's computer, they are converted to sound and played by the computer. This is carried out at strict time intervals. There is no error checking. If a packet has not arrived when the time to play it is reached, it is ignored and the user hears a short gap in the sound.
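
The playing rule described above, a strict schedule with no retransmission, can be sketched as a small simulation. This is not the RealAudio protocol itself; the packet interval, the function name play_stream, and the arrival times are hypothetical values chosen to show one late packet and one lost packet.

```python
from typing import Dict, List

PACKET_INTERVAL_MS = 100  # hypothetical playing time of one packet


def play_stream(arrival_ms: Dict[int, int], packet_count: int, start_ms: int) -> List[str]:
    """Play packets on a fixed schedule; a late or lost packet becomes a gap.

    arrival_ms maps a packet's sequence number to the time it arrived.
    Packets missing from the mapping never arrived at all.
    """
    played: List[str] = []
    for seq in range(packet_count):
        deadline = start_ms + seq * PACKET_INTERVAL_MS
        arrived = arrival_ms.get(seq)
        if arrived is not None and arrived <= deadline:
            played.append(f"packet {seq}")
        else:
            played.append("gap")  # no retransmission is requested
    return played


# Hypothetical arrivals: packet 2 is late and packet 3 is lost.
arrivals = {0: 10, 1: 120, 2: 450, 4: 430}
timeline = play_stream(arrivals, packet_count=5, start_ms=200)
print(timeline)
```

Starting playback a little after the first packet arrives (here 200 milliseconds) gives later packets some slack, which is why a reasonably clear connection produces few gaps.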

This process seems crude, but, if the network connection is reasonably clear, the transmission of spoken sounds in RealAudio is quite acceptable when transmitted over dial-up lines at 28.8 thousand bits per second. An early experiment with RealAudio was to provide a collection of broadcast segments from the programs of National Public Radio.

The service uses completely standard web methods, except in two particulars, both of which are needed to transmit audio signals over the Internet in real time. The first is that the user's browser must accept a stream of audio data in RealAudio format. This requires adding a special player to the browser, which can be downloaded over the Internet. The second is that, to achieve a steady flow of data, the library sends data using the UDP protocol instead of TCP. Since some network security systems do not accept UDP data packets, RealAudio can not be delivered everywhere.

Dynamic and complex objects

Many of the digital objects that are now being considered for digital library collections can not be represented as static files of data.

Dynamic objects. Dynamic or active library objects include computer programs, Java applets, simulations, data from scientific sensors, or video games. With these types of object, what is presented to the user depends upon the execution of computer programs or other external activities, so that the user gets different results every time the object is accessed.
Complex objects. Library objects can be made up from many inter-related elements. These elements can have various relationships to each other. They can be complementary elements of content, such as the audio and picture channels of a video recording. They can be alternative manifestations, such as a high-resolution or low-resolution satellite image, or they can be surrogates, such as data and metadata. In practice these distinctions are often blurred. Is a thumbnail photograph an alternative manifestation, or is it metadata about a larger image?
Alternate disseminations. Digital objects may offer the user a choice of access methods. Thus a library object might provide the weather conditions at San Francisco Airport. When the user accesses this object, the information returned might be data, such as the temperature, precipitation, wind speed and direction, and humidity, or it might be a photograph to show cloud cover. Notice that this information might be read directly from sensors, when requested, or from tables that are updated at regular intervals.
Databases. A database comprises many alternative records, with different individuals selected each time the database is accessed. Some databases can be best thought of as complete digital library collections, with the individual records as digital objects within the collections. Other databases, such as directories, are library objects in their own right.

The methods for managing these more general objects are still subjects for debate. Whereas the web provides a unifying framework that most people use for static files, there is no widely accepted framework for general objects. Even the terminology is rife with dispute. A set of conventions that relate the intellectual view of library materials to the internal structure is sometimes called a "document model", but, since it applies to all aspects of digital libraries, "object model" seems a better term.

Identification

The first stage in building an object model is to have a method to identify the materials. The identifier is used to refer to objects in catalogs and citations, to store and access them, to provide access management, and to archive them for the long term. This sounds simple, but identifiers must meet requirements that overlap and frequently contradict each other. Few topics in digital libraries cause as much heated discussion as names and identifiers.

One controversy is whether semantic information should be embedded in a name. Some people advocate completely semantic names. An example is the Serial Item and Contribution Identifier standard (SICI). By a precisely defined set of rules, a SICI identifies either an issue of a serial or an article contained within a serial. It is possible to derive the SICI directly from a journal article or citation. This is a daunting objective and the SICI succeeds only because there is already a standard for identifying serial titles uniquely. The following is a typical SICI; it identifies a journal article published by John Wiley & Sons:

0002-8231(199601)47:1<23:TDOMII>2.0.TX;2-2

Fully semantic names, such as SICIs, are inevitably restricted to narrow classes of information; they tend to be long and ugly because of the complexity of the rules that are used to generate them. Because of the difficulties in creating semantic identifiers for more general classes of objects, compounded by arguments over trademarks and other names, some people advocate the opposite: random identifiers that contain no semantic information about who assigned the name and what it references. Random strings used as identifiers can be shorter, but without any embedded information they are hard for people to remember and may be difficult for computers to process.
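
Because a SICI is built by rule, a program can take one apart again. The sketch below splits the SICI quoted above into its main segments and checks the ISSN on which it depends. The field names are this sketch's own labels, and the pattern handles only SICIs of the shape shown; it is not a full implementation of the standard.

```python
import re
from typing import Dict

SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"   # ISSN of the serial title
    r"\((?P<chronology>[^)]*)\)"     # date of the issue, e.g. 199601
    r"(?P<enumeration>[^<]*)"        # volume and issue, e.g. 47:1
    r"<(?P<contribution>[^>]*)>"     # location and title code of the article
    r"(?P<control>.*)$"              # trailing control segment
)


def parse_sici(sici: str) -> Dict[str, str]:
    """Split a SICI into named segments (labels are illustrative)."""
    match = SICI_PATTERN.match(sici)
    if match is None:
        raise ValueError(f"not recognized as a SICI: {sici!r}")
    return match.groupdict()


def issn_check_digit_ok(issn: str) -> bool:
    """Verify the final character of an ISSN (weights 8 down to 2, modulus 11)."""
    digits = issn.replace("-", "")
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    return digits[7] == ("X" if expected == 10 else str(expected))


parts = parse_sici("0002-8231(199601)47:1<23:TDOMII>2.0.TX;2-2")
for name, value in parts.items():
    print(f"{name:13} {value}")
print("ISSN valid:", issn_check_digit_ok(parts["issn"]))
```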

In practice, many names are mnemonic; they contain information that makes them easy to remember. Consider the name "www.apple.com". At first glance this appears to be a semantic name, the web site of a commercial organization called Apple, but this is just an informed guess. The prefix "www" is conventionally used for web sites, but this is merely a convention. There are several commercial organizations called Apple and the name gives no hint whether this web site is managed by Apple Computer or some other company.

Another difficulty is to decide what a name refers to: work, expression, manifestation, or item. As an example, consider the International Standard Book Number (ISBN). This was developed by publishers and the book trade for their own use. Therefore ISBNs distinguish separate products that can be bought or sold; a hard back book will usually have a different ISBN from a paper back version, even if the contents are identical. Libraries, however, may find this distinction to be unhelpful. For bibliographic purposes, the natural distinction is between versions where the content differs, not the format. For managing a collection or in a rare book library, each individual copy is distinct and needs its own identifier. There is no universal approach to naming that satisfies every need.

Domain names and Uniform Resource Locators (URL)

The most widely used identifiers on the Internet are domain names and Uniform Resource Locators (URLs). They were introduced in Chapter 2. Panel 12.4 gives more information about domain names and how they are allocated.

Panel 12.4
Domain names

The basic purpose of domain names is to identify computers on the Internet by name, rather than all references being by IP address. An advantage of this approach is that, if a computer system is changed, the name need not change. Thus the domain name "library.dartmouth.edu" was assigned to a series of computers over the years, with different IP addresses, but the users were not aware of the changes.

Over the years, additional flexibility has been added to domain names. A domain name need no longer refer to a specific computer. Several domain names can refer to the same computer, or one domain name can refer to a service that is spread over a set of computers.

The allocation of domain names forms a hierarchy. At the top are the root domain names. One set of root names is based on types of organization, such as:

     .com   commercial
     .edu   educational
     .gov   government
     .net   network services
     .org   other organizations

There is a second series of root domains, based on geography. Typical examples are:

     .ca   Canada
     .jp   Japan
     .nl   Netherlands

Organizations are assigned domain names under one of these root domains. Typical examples are:

     cmu.edu   Carnegie Mellon University
     elsevier.nl   Elsevier Science
     loc.gov   Library of Congress
     dlib.org   D-Lib Magazine

Historically, in the United States, domain names have been assigned on a first-come, first-served basis. There is a small annual fee. Minimal controls were placed on who could receive a domain name, and what the name might be. Thus anybody could register the name "pittsburgh.net", without any affiliation to the City of Pittsburgh. This lack of control has led to a proliferation of inappropriate names, resulting in trademark disputes and other arguments.

URLs extend the concept of domain names in several directions, but all are expansions of the basic concept of providing a name for a location on the Internet. Panel 12.5 describes some of the information that can be contained in a URL.

Panel 12.5
Information contained in a Uniform Resource Locator (URL)

The string of characters that comprises a URL is highly structured. A URL combines the specification of a protocol, a file name, and options that will be used to access the file. It can contain the following.

Protocols. The first part of a full URL is the name of a protocol or service ending with a colon. Typical examples are http:, mailto:, and ftp:.

Absolute and relative URLs. A URL can refer to a file by its domain name or its location relative to another file. If the protocol is followed by "//", the URL contains a full domain name, e.g.,

http://www.dlib.org/figure.jpg

Otherwise, the address is relative to the current directory. For example, within an HTML page, the anchor,

<a href="figure.jpg">

refers to a file "figure.jpg" in the same directory.

Files. A URL identifies a specific file on a specified computer system. Thus, in the URL,

http://www.dlib.org/contents.html

"www.dlib.org" is a domain name that identifies a computer on the Internet; "contents.html" is a file on that computer.

Port. A server on the Internet may provide several services running concurrently. The TCP protocol provides a "port" which identifies which service to use. The port is specified as a colon followed by a number at the end of the domain name. Thus, the URL,

http://www.dlib.org:80/index.html

references the port number 80. Port 80 is the default port for the HTTP protocol and therefore could be omitted from this particular URL.

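Parameters. A variety of parameters can be appended to a URL, following either a "#" or "?" sign. Parameters that follow a "?" are passed to the server when the file is accessed; the part that follows a "#" identifies a location within the file and is interpreted by the user's browser.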
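
The structure of a URL can be examined with the functions in Python's standard urllib.parse module. The sketch below reuses the dlib.org examples from this panel; the query "issue=5" and the fragment "top" are hypothetical additions so that every component appears.

```python
from urllib.parse import urljoin, urlsplit

# The query and fragment are invented for illustration.
parts = urlsplit("http://www.dlib.org:80/index.html?issue=5#top")

print("protocol:   ", parts.scheme)
print("domain name:", parts.hostname)
print("port:       ", parts.port)
print("file:       ", parts.path)
print("after '?':  ", parts.query)
print("after '#':  ", parts.fragment)

# A relative URL is interpreted against the URL of the page that contains it.
absolute = urljoin("http://www.dlib.org/contents.html", "figure.jpg")
print("relative URL resolves to:", absolute)
```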

URLs have proved extremely successful. They permit any number of versatile applications to be built on the Internet, but they pose a long-term problem for digital libraries. The problem is the need for persistence. Users of digital libraries wish to be able to access material consistently over long periods of time. URLs identify resources by a location derived from a domain name. If the domain name no longer exists, or if the resource moves to a different location, the URL is no longer valid.

A famous example of the problem comes from the early days of the web. At the beginning, the definitive documentation was maintained at CERN in Geneva. When the World Wide Web Consortium was established at M.I.T. in 1994, this documentation was transferred. Every hyperlink that pointed to CERN was made invalid. In such instances, the convention is to leave behind a web page stating that the site has moved, but forwarding addresses tend to disappear with time or become long chains. If a domain name is canceled, perhaps because a company goes out of business, all URLs based on that domain name are broken for ever. There are various forms of aliases that can be used with domain names and URLs to ameliorate this problem, but they are tricks not solutions.

Persistent Names and Uniform Resource Names (URN)

To address this problem, the digital library community and publishers of electronic materials have become interested in persistent names. These are sometimes called Uniform Resource Names (URNs). The idea is simple. Names should be globally unique and persist for all time. The objective is to have names that can last longer than any software system that exists today, longer even than the Internet itself.

A persistent name should be able to reference any Internet resource or set of resources. One application of URNs is to reference the current locations of copies of an object, defined by a list of URLs. Another application is to provide electronic mail addresses that do not need to be changed when a person changes jobs or moves to a different Internet service provider. Another possibility is to provide the public keys of named services. In each of these applications, the URN is linked to data, which needs to have an associated data type, so that computer systems can interpret and process the data automatically. Panel 12.6 describes the handle system, which is a system to create and manage persistent names, and its use by publishers for digital object identifiers (DOIs).

Panel 12.6
Handles and Digital Object Identifiers

Handles are a naming system developed at CNRI as part of a general framework proposed by Robert Kahn of CNRI and Robert Wilensky of the University of California, Berkeley. Although developed independently from the ideas of URNs, the two concepts are compatible and handles can be considered the first URN system to be used in digital libraries. The handle system has three parts:

A name scheme that allows independent authorities to create handle names with confidence that they are unique.
A distributed computer system that stores handles along with data that they reference, e.g., the locations where material is stored. A handle is resolved by sending it to the computer system and receiving back the stored data.
Administrative procedures to ensure high-quality information over long periods of time.

Syntax

Here are two typical handles:

hdl:cnri.dlib/magazine
hdl:loc.music/musdi.139

These strings have three parts. The first indicates that the string is of type hdl:. The next, "cnri.dlib" or "loc.music", is a naming authority. The naming authority is assigned hierarchically. The first part of the name, "cnri" or "loc", is assigned by the central authority. Subsequent naming authorities, such as "cnri.dlib", are assigned locally. The final part of the handle, following the "/" separator, is any string of characters that is unique to the naming authority.
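
The three parts can be separated mechanically. The following sketch follows only the syntax described in this panel; the names Handle and parse_handle are invented for illustration and are not part of the handle system software.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    naming_authority: str  # e.g. "cnri.dlib"
    top_level: str         # e.g. "cnri", the part assigned centrally
    local_name: str        # unique within the naming authority


def parse_handle(text: str) -> Handle:
    """Split a string such as 'hdl:cnri.dlib/magazine' into its parts."""
    prefix, colon, rest = text.partition(":")
    if prefix != "hdl" or not colon:
        raise ValueError(f"not of type hdl: {text!r}")
    authority, slash, local_name = rest.partition("/")
    if not slash or not authority or not local_name:
        raise ValueError(f"expected authority/name in {text!r}")
    return Handle(authority, authority.split(".")[0], local_name)


for example in ("hdl:cnri.dlib/magazine", "hdl:loc.music/musdi.139"):
    print(parse_handle(example))
```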

Computer system

The handle system offers a central computer system, known as the global handle registry, or permits an organization to set up a local handle service on its own computers, to maintain handles and provide resolution services. The only requirements are that the top-level naming authorities must be assigned centrally and that all naming authorities must be registered in the central service. For performance and reliability, each of these services can be spread over several computers and the data can be automatically replicated. A variety of caching services are provided as are plug-ins for web browsers, so that they can resolve handles.

Digital Object Identifiers

In 1996, an initiative of the Association of American Publishers adopted the handle system to identify materials that are published electronically. These handles are called Digital Object Identifiers (DOI). This led to the creation of an international foundation which is developing DOIs further. Numeric naming authorities are assigned to publishers, such as "10.1006" which is assigned to Academic Press. The following is the DOI of a book published by Academic Press:

doi:10.1006/0121585328

The use of numbers for naming authorities reflects a wish to minimize the semantic information. Publishers frequently reorganize, merge, or transfer works to other publishers. Since the DOIs persist through such changes, they should not contain the name of the original publisher in a manner that might be confusing.

Computer systems for resolving names

Whatever system of identifiers is used, there must be a fast and efficient method for a computer on the Internet to discover what the name refers to. This is known as resolving the name. Resolving a domain name provides the IP address or addresses of the computer system with that name. Resolving a URN provides the data associated with it.

Since almost every computer on the Internet has the software needed to resolve domain names and to manipulate URLs, several groups have attempted to build systems for identifying materials in digital libraries that use these existing mechanisms. One approach is OCLC's PURL system. A PURL is a URL, such as:

http://purl.oclc.org/catalog/item1

In this identifier, "purl.oclc.org" is the domain name of a computer that is expected to be persistent. On this computer, the file "catalog/item1" holds a URL to the location where the item is currently stored. If the item is moved, this URL must be changed, but the PURL, which is the external name, is unaltered.
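
The mechanism amounts to one level of indirection: a table, held on the persistent computer, from the external name to the current location. The sketch below imitates that table in memory. It is not OCLC's software; the class PurlServer, its methods, and the repository addresses (which use the reserved ".example" domain) are hypothetical.

```python
from typing import Dict


class PurlServer:
    """A toy stand-in for a PURL server: a table from paths to current URLs."""

    def __init__(self, domain: str) -> None:
        self.domain = domain
        self._targets: Dict[str, str] = {}

    def register(self, path: str, target_url: str) -> str:
        """Record where the item is stored and return its external name."""
        self._targets[path] = target_url
        return f"http://{self.domain}/{path}"

    def move(self, path: str, new_target_url: str) -> None:
        """Update the stored location; the external name does not change."""
        if path not in self._targets:
            raise KeyError(path)
        self._targets[path] = new_target_url

    def resolve(self, purl: str) -> str:
        """Return the URL where the item is currently stored."""
        prefix = f"http://{self.domain}/"
        if not purl.startswith(prefix):
            raise KeyError(purl)
        return self._targets[purl[len(prefix):]]


server = PurlServer("purl.oclc.org")
purl = server.register("catalog/item1", "http://repository-a.example/items/1")
print(purl, "->", server.resolve(purl))

# The item moves; citations that use the PURL remain valid.
server.move("catalog/item1", "http://repository-b.example/archive/1")
print(purl, "->", server.resolve(purl))
```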

PURLs add an interesting twist to how names are managed. Other naming systems set out to have a single coordinated set of names for a large community, perhaps the entire world. This can be considered a top-down approach. PURLs are bottom-up. Since each PURL server is separate, there is no need to coordinate the allocation of names among them. Names can be repeated or used with different semantics, depending entirely on the decisions made locally. This contrasts with the Digital Object Identifiers, where the publishers are building a single set of names that are guaranteed to be unique.

Structural metadata and object models

Data types

Data types are structural metadata used to describe the different types of object in a digital library. The web object model consists of hyperlinked files of data, each with a data type which tells the user interface how to render the file for presentation to the user. The standard method of rendering is to copy the entire object and render it on the user's computer. Chapter 2 introduced the concept of data type and discussed the importance of MIME as a standard for defining the type of files that are exchanged by electronic mail or used in the web. As described in Panel 12.7, MIME is a brilliant example of a standard that is flexible enough to cover a wide range of applications, yet simple enough to be easily incorporated into computer systems.

Panel 12.7. MIME

The name MIME was originally an abbreviation for "Multipurpose Internet Mail Extensions". It was developed by Nathaniel Borenstein and Ned Freed explicitly for electronic mail, but the approach that they developed has proved to be useful in a wide range of Internet applications. In particular, it is one of the simple but flexible building blocks that led to the success of the web.

The full MIME specification is complicated by the need to fit with a wide variety of electronic mail systems, but for digital libraries the core is the concept that MIME calls "Content-Type". MIME describes a data type as three parts, as in the following example:

Content-Type: text/plain; charset = "US-ASCII"

The structure of the data type is a type ("text"), a subtype ("plain"), and one or more optional parameters. This example defines plain text using the ASCII character set. Here are some commonly used types:

     text/plain
     text/html

     image/gif
     image/jpeg
     image/tiff

     audio/basic
     audio/wav

     video/mpeg
     video/quicktime

The application type provides a data type for information that is to be used with some application program. Here are the MIME types for files in PDF, Microsoft Word, and PowerPoint formats:

     application/pdf
     application/msword
     application/ppt

Notice that application/msword is not considered to be a text format, since a Microsoft Word file may contain information other than text and requires a specific computer program to interpret it.

These examples are official MIME types, approved by the formal process for registering MIME types. It is also possible to create unofficial types and subtypes with names beginning "x-", such as "audio/x-pn-realaudio".
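The three-part structure of a Content-Type can be taken apart with Python's standard email package, as in the following sketch. The function name describe_content_type and the keys of the returned dictionary are chosen here for illustration; the "unofficial" test simply applies the "x-" naming convention mentioned above.

```python
from email.message import Message


def describe_content_type(header_value: str) -> dict:
    """Break a Content-Type value into type, subtype, and charset."""
    msg = Message()
    msg["Content-Type"] = header_value
    maintype = msg.get_content_maintype()
    subtype = msg.get_content_subtype()
    return {
        "type": maintype,
        "subtype": subtype,
        # Optional parameter; None when the header does not supply it.
        "charset": msg.get_param("charset"),
        # Names beginning "x-" mark unofficial types and subtypes.
        "unofficial": maintype.startswith("x-") or subtype.startswith("x-"),
    }


print(describe_content_type('text/plain; charset="US-ASCII"'))
print(describe_content_type("audio/x-pn-realaudio"))
```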

Lessons from MIME

MIME's success is a lesson on how to turn a good concept into a widely adopted system. The specification goes to great lengths to be compatible with the systems that preceded it. Existing Internet mail systems needed no alterations to handle MIME messages. The processes for checking MIME versions, for registering new types and subtypes, and for changing the system were designed to fit naturally within standard Internet procedures. Most importantly, MIME does not attempt to solve all problems of data types and is not tightly integrated into any particular applications. It provides a flexible set of services that can be used in many different contexts. Thus a specification that was designed for electronic mail has had its greatest triumph in the web, an application that did not exist when MIME was first introduced.

Complex objects

Materials in digital libraries are frequently more complex than files that can be represented by simple MIME types. They may be made of several elements with different data types, such as images within a text, or separate audio and video tracks; they may be related to other materials by relationships such as part/whole, sequence, and so on. For example, a digitized text may consist of pages, chapters, front matter, an index, illustrations, and so on. An article in an online periodical might be stored on a computer system as several files containing text and images, with complex links among them. Because digital materials are easy to change, different versions are created continually.

A single item may be stored in several alternate digital formats. When existing material is converted to digital form, the same physical item may be converted several times. For example, a scanned photograph may have a high-resolution archival version, a medium quality version, and a thumbnail. Sometimes, these formats are exactly equivalent and it is possible to convert from one to the other (e.g., an uncompressed image and the same image stored with a lossless compression). At other times, the different formats contain different information (e.g., differing representations of a page of text in SGML and PostScript formats).
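The difference between equivalent and non-equivalent formats can be made concrete with a short sketch. The pixel data below is a made-up stand-in for a scanned image, and the thumbnail is produced by crude subsampling purely for illustration.

```python
import zlib

# Hypothetical uncompressed image: 64 x 64 grey-scale pixels, one byte each.
WIDTH, HEIGHT = 64, 64
pixels = bytes((x * y) % 256 for y in range(HEIGHT) for x in range(WIDTH))

# Lossless compression: the stored form differs, but decompressing it
# gives back exactly the original bytes, so the formats are equivalent.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)
print(restored == pixels)

# A thumbnail keeps every 8th pixel of every 8th row. The discarded
# pixels cannot be recovered, so this version holds different information.
thumbnail = bytes(
    pixels[y * WIDTH + x]
    for y in range(0, HEIGHT, 8)
    for x in range(0, WIDTH, 8)
)
print(len(pixels), len(thumbnail))
```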

Structural types

To the user, an item appears as a single entity and the internal representation is unimportant. A bibliography or index will normally refer to it as a single object in a digital library. Yet, the internal representation as stored within the digital library may be complex. Structural metadata is used to represent the various components and the relationships among them. The choice of structural metadata for a specific category of material creates an object model.

Different categories of object need different object models: e.g., text with SGML mark-up, web objects, computer programs, or digitized sound recordings. Within each category, rules and conventions describe how to organize the information as sets of digital objects. For example, specific rules describe how to represent a digitized sound recording. For each category, the rules describe how to represent material in the library, how the components are grouped as a set of digital objects, the internal structure of each, the associated metadata, and the conventions for naming the digital objects. The categories are distinguished by a structural type.

Object models for computer programs have been a standard part of computing for many years. A large computer program consists of many files of programs and data, with complex structure and inter-relations. This relationship is described in a separate data structure that is used to compile and build the computer system. For a Unix program, this is called a "make file".

Structural types need to be distinguished from genres. For information retrieval, it is convenient to provide descriptive metadata that describes the genre. This is the category of material considered as an intellectual work. Thus, genres of popular music include jazz, blues, rap, rock, and so on. Genre is a natural and useful way to describe materials for searching and other bibliographic purposes, but another object model is required for managing distributed digital libraries. While feature films, documentaries, and training videos are clearly different genres, their digitized equivalents may be encoded in precisely the same manner and processed identically; they are the same structural type. Conversely, two texts might be the same genre - perhaps they are both exhibition catalogs - but, if one is represented with SGML mark-up, and the other in PDF format, then they have different structural types and object models.

Panel 12.8 describes an object model for a scanned image. This model was developed to represent digitized photographs, but the same structural type can be used for any bit-mapped image, including maps, posters, playbills, technical diagrams, or even baseball cards. They represent different content, but are stored and manipulated in a computer with the same structure. Current thinking suggests that even complicated digital library collections can be represented by a small number of structural types. Fewer than ten structural types have been suggested as adequate for all the categories of material being converted by the Library of Congress. They include: digitized image, a set of page images, a set of page images with associated SGML text, digitized sound recording, and digitized video recording.

Panel 12.8. An object model for scanned images

In joint work with the Library of Congress, a group from CNRI developed object models for some types of material digitized by the National Digital Library Program. The first model was for digitized images, such as scanned photographs. When each image is converted, several digital versions are made. Typically, there is a high-resolution archival version, a medium resolution version for access via the web, and one or more thumbnails. Each of these versions has its own metadata and there is other metadata that is shared by all the versions.

The first decision made in developing the object model was to keep separate the bibliographic metadata that describes the content. This is stored in separate catalog records. Other metadata is stored with the scanned images and is part of the object model. This includes structural metadata, which relates the various versions, and administrative metadata, which is used for access management.

Since there are occasions on which the individual versions may be used separately, each version is stored as a separate digital object. Each of these digital objects is identified by a handle and has two elements: a digitized image and metadata. Another digital object, known as a meta-object, is used to bring the versions together. The main purpose of the meta-object is to provide a list of the versions. It also has a data type field that describes the function of each image, which can be for reference, access, or as a thumbnail. The meta-object has its own identifier, a handle, and contains metadata that is shared by all the versions of the image.
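The arrangement of version objects and a meta-object can be sketched with a few data classes. This is an illustrative reading of the description above, not the CNRI design itself: the class names, field names, and the handles beginning "example.lib/" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The functions an image version can serve, as listed in the text.
ROLES = ("reference", "access", "thumbnail")


@dataclass
class ImageVersion:
    """One digital object: a digitized image plus its own metadata."""
    handle: str
    image: bytes
    metadata: Dict[str, str]


@dataclass
class VersionEntry:
    """One line in the meta-object's list of versions."""
    role: str
    handle: str


@dataclass
class MetaObject:
    """Brings the versions together and holds the shared metadata."""
    handle: str
    shared_metadata: Dict[str, str]
    versions: List[VersionEntry] = field(default_factory=list)

    def add_version(self, role: str, version: ImageVersion) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        # The meta-object stores only the handle, not the image itself.
        self.versions.append(VersionEntry(role, version.handle))

    def handles_for(self, role: str) -> List[str]:
        return [v.handle for v in self.versions if v.role == role]


archival = ImageVersion("example.lib/photo1-ref", b"...", {"resolution": "high"})
thumb = ImageVersion("example.lib/photo1-thumb", b"...", {"resolution": "low"})
meta = MetaObject("example.lib/photo1", {"access": "unrestricted"})
meta.add_version("reference", archival)
meta.add_version("thumbnail", thumb)
print(meta.handles_for("thumbnail"))
```

Keeping only handles in the meta-object reflects the point that each version remains a separate digital object that can be used on its own.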

Object models for interoperability

Object models are evolving slowly. How the various parts should be grouped together can rarely be specified in a few dogmatic rules. The decision depends upon the context, the specific objects, their type of content and sometimes the actual content. After the introduction of printing, the structure of a book took decades to develop to the form that we know today, with its front matter, chapters, figures, and index. Not surprisingly, few conventions for object models and structural metadata have yet emerged in digital libraries. Structural metadata is still at the stage where every digital library is experimenting with its own specifications, making frequent modifications as experience suggests improvements.

This might appear to pose a problem for interoperability, since a client program will not know the structural metadata used to store a digital object in an independent repository. However, clients do not need to know the internal details of how repositories store objects; they need to know the functions that the repository can provide.

Consider the storage of printed materials that have been converted to digital formats. A single item in a digital library collection consists of a set of page images. If the material is also converted to SGML, there will be the SGML mark-up, DTD, style sheets, and related information. Within the repository, structural metadata defines the relationship between the components, but a client program does not need to know these details. Chapter 8 looked at a user interface program that manipulates a sequence of page images. The functions include displaying a specified page, going to the page with a specific page number, and so forth. A user interface that is aware of the functions supported for this structural type of information is able to present the material to the user, without knowing how it is stored in the repository.
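The separation between functions and storage can be sketched as follows. The interface PageImageObject, the two storage classes, and the client function show are hypothetical names invented for this illustration; the point is only that the client calls the same two functions whichever way the repository holds the pages.

```python
from abc import ABC, abstractmethod
from typing import List


class PageImageObject(ABC):
    """Functions offered for the 'set of page images' structural type."""

    @abstractmethod
    def display_page(self, position: int) -> bytes:
        """Return the image at a 1-based position in the sequence."""

    @abstractmethod
    def go_to_page_number(self, printed_number: str) -> int:
        """Return the position of the page with this printed number."""


class OneFilePerPage(PageImageObject):
    """Storage choice A: each page image is held separately."""

    def __init__(self, images: List[bytes], printed_numbers: List[str]):
        self._images = images
        self._positions = {n: i + 1 for i, n in enumerate(printed_numbers)}

    def display_page(self, position: int) -> bytes:
        if not 1 <= position <= len(self._images):
            raise IndexError(position)
        return self._images[position - 1]

    def go_to_page_number(self, printed_number: str) -> int:
        return self._positions[printed_number]


class SingleFileWithOffsets(PageImageObject):
    """Storage choice B: one file; offsets[i] starts page i + 1 and a
    final entry marks the end of the file."""

    def __init__(self, blob: bytes, offsets: List[int], printed_numbers: List[str]):
        self._blob = blob
        self._offsets = offsets
        self._positions = {n: i + 1 for i, n in enumerate(printed_numbers)}

    def display_page(self, position: int) -> bytes:
        if not 1 <= position < len(self._offsets):
            raise IndexError(position)
        return self._blob[self._offsets[position - 1]:self._offsets[position]]

    def go_to_page_number(self, printed_number: str) -> int:
        return self._positions[printed_number]


def show(obj: PageImageObject, printed_number: str) -> bytes:
    # The client knows the functions, not the storage layout.
    return obj.display_page(obj.go_to_page_number(printed_number))


a = OneFilePerPage([b"cover", b"p1", b"p2"], ["i", "1", "2"])
b = SingleFileWithOffsets(b"coverp1p2", [0, 5, 7, 9], ["i", "1", "2"])
print(show(a, "2") == show(b, "2"))
```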

Disseminations

In a digital library, the stored form of information is rarely the same as the form that is delivered to the user. In the web model of access, information is copied from the server to the user's computer where it is rendered for use by the user. This rendering typically takes the form of converting the data in the file into a screen image, using suitable fonts and colors, and embedding that image in a display with windows, menus, and icons.

Getting stored information and presenting it to the user of a digital library can be much more complicated than the web model. At the very least, a general architecture must allow that the stored information will be processed by computer programs on the server before being sent to the client or on the client before being presented to the user. The data that is rendered need not have been stored explicitly on the server as a file or files. Many servers run a computer program on a collection of stored data, extract certain information and provide it to the user. This information may be fully formatted by the server, but is often transmitted as a file that has been formatted in HTML, or some other intermediate format that can be recognized by the user's computer.

Two important types of dissemination are direct interaction between the client and the stored digital objects, and continuous streams of data. When the user interacts with the information, access by a user is not a single discrete event. A well-known example is a video game, but the same situation applies with any interactive material, such as a simulation of some physical situation. Access to such information consists of a series of interactions, which are guided by the user's control and the structural metadata of the individual objects.

Disseminations controlled by the client

Frequently, a client program may have a choice of disseminations. The selection might be determined by the equipment that a user has; when connected over a low-speed network, users sometimes choose to turn off images and receive only the text of web pages. The selection might be determined by the software on the user's computer; if a text is available in both HTML and PostScript versions, a user whose computer does not have software to view PostScript will want the HTML dissemination. The selection may be determined on non-technical grounds, by the user's wishes or convenience; a user may prefer to see a shortened version of a video program, rather than the full length version.
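A client's choice among disseminations can be sketched as a simple selection rule. The function choose_dissemination and its parameters are hypothetical, and the sketch assumes that disseminations are named by MIME type and that the repository lists its default first.

```python
from typing import Iterable, List, Optional, Set


def choose_dissemination(
    available: List[str],
    can_render: Set[str],
    preference: Iterable[str] = (),
) -> Optional[str]:
    """Pick a dissemination the client can use.

    available  -- formats offered by the repository, default first
    can_render -- formats the user's software is able to display
    preference -- formats the user would rather have, most wanted first
    """
    usable = [fmt for fmt in available if fmt in can_render]
    if not usable:
        return None
    for wanted in preference:
        if wanted in usable:
            return wanted
    # No stated preference applies: take the first usable format.
    return usable[0]


offered = ["application/postscript", "text/html"]

# A computer with no PostScript viewer receives the HTML dissemination.
print(choose_dissemination(offered, {"text/html", "text/plain"}))
```

The same rule covers the non-technical case: a user who can render everything but prefers one form supplies it in the preference argument.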

One of the purposes of object models is to provide the user with a variety of dissemination options. The aim of several research projects is to provide methods so that the client computer can automatically discover the range of disseminations that are available and select the one most useful at any given time. Typically, one option will be a default dissemination, which is what the user receives unless a special request is made. Another option is a short summary. This is often a single line of text, perhaps an author and title, but it can be more elaborate, such as a thumbnail image or a short segment of video. Whatever method is chosen, the aim of the summary is to provide the user with a small amount of information that helps to identify and describe the object. This topic is still a subject of research, and there are few good examples of such systems in practical use.

Last revision of content: January 1999
Formatted for the Web: December 2002
(c) Copyright The MIT Press 2000