Content Management

part of Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005

Parco dei Mostri (park of monsters), below the town of Bomarzo, Italy (1.5 hours north of Rome). This was the park of the 16th century Villa Orsini and is filled with grotesque sculptures.

There are two fundamental elements to content management: (1) storing stuff in a content repository, and (2) supporting the workflow of a group of people engaged in putting stuff into that repository. This chapter will treat the storage problem first and then the workflow support problem. We'll also look at version control for both content and software, at look and feel design for individual pages, and at navigation design and information architecture.

Part of the art of content management for an online learning community is reducing the number of types of content. For example, consider a community where the publisher says "I want articles [magnet content], comments from users on articles, news from the publisher, comments on news from users, questions from users, and answers to questions." A naive implementation from these specifications would result in the creation of six database tables: articles, comments_on_articles, news, comments_on_news, questions, answers. From the RDBMS's perspective, there is nothing overwhelming about six tables. But consider that every new table defined in the RDBMS implies roughly twenty Web scripts. Ten of these scripts will constitute a user experience: view a directory of content in Table A, view one category, view one item, view the newest items, grab a form to insert an item, confirm insertion, request an email alert of comments on an item. Ten of these scripts will constitute an administrator's experience: view a directory of content in Table A, view one category, view one item, view the newest items, approve an item, disapprove an item, delete an item, confirm deletion of an item, etc. It will be a bit tough to code these twenty scripts in a general fashion because the SQL statements will differ in at least the table names used.

Consider further that to offer a complete index of site content, you'll have to write a program that pulls text from at least six tables into a single index.

How different are these six kinds of content, really? We'll look at the tables that we need to define for storing articles, then proceed to the other types of content.

A Simple Data Model for Articles

Here's a very basic data model for storing articles:

create table articles (
	article_id		integer primary key,
	-- who contributed this and when
	creation_user		not null references users,
	creation_date		not null date,
	-- what language is this in?
	-- visit http://www.w3.org/International/O-charset-lang
	-- to see the allowable 2-character codes (en is English, ja is Japanese)
	language		char(2) references language_codes,
	-- could be text/html or text/plain or some sort of XML document
	mime_type		varchar(100) not null,
	-- will hold the title in most cases
	one_line_summary	varchar(200) not null,
	-- the entire article; 4 GB limit
	body			clob
);

Should all articles in the database be shown to all users? Perhaps it would be nice to have the ability to store an article and hold it for editorial examination:

create table articles (
	article_id		integer primary key,
	creation_user		not null references users,
	creation_date		not null date,
	language		char(2) references language_codes,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

Do you trust all the programmers in your organization to remember to include a where editorial_status = 'approved' clause in every script on the site? If not, perhaps it would be better to rename the table altogether and build a view for use by application programmers:

create table articles_raw (
	article_id		integer primary key,
	...
	editorial_status	varchar(30)
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

create view articles_approved 
as
select * 
from articles_raw
where editorial_status = 'approved';

If you change your mind about how to represent approval status, you won't need to update dozens of Web scripts; you need only change the definition of the articles_approved view. (See the views chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/views for more on this idea of using SQL views as a means of programming abstraction.)

Comments on Articles

Recall the six required elements of online community:

magnet content authored by experts
means of collaboration
powerful facilities for browsing and searching both magnet content and contributed content
means of delegation of moderation
means of identifying members who are imposing an undue burden on the community and ways of changing their behavior and/or excluding them from the community without them realizing it
means of software extension by community members themselves

A facility that lets a user post an alternative perspective to a published article is a means of collaboration that distinguishes a one-way publishing site from an online community. More interestingly, the facility lifts the Internet application out of the constraints of the literate culture within which Western culture has operated ever since Gutenberg (1452). A literate culture produces such works as the Michelin Green Guide to Italy: "Extending below the town is the park of the 16th-century Villa Orsini (Parco dei Mostri) which is a Mannerist creation with a series of fantastically shaped sculptures." Compare that description to these photos showing just a tiny portion of the Parco dei Mostri ("Park of Monsters"):

If a friend of yours came back from this place and showed these slides, you'd expect to hear something much richer and more interesting than the Michelin Guide's sentence. A literate culture operates with the implicit assumption that knowledge is closed, that Italian tourism can fit into a book. Perhaps the 350 pages of the Green Guide aren't enough, but some quantity of writers and pages would suffice to encapsulate everything worth knowing about Italy.

Comments are often the most interesting material on a site. Here's one from http://philip.greenspun.com/humor/bill-gates:

"I must say, that all of you who do not recognize the absolute genius of Bill Gates are stupid. You say that bill gates stole this operating system. Hmm.. i find this interesting. If he stole it from steve jobs, why hasn't Mr. Jobs relentlessly sued him and such. Because Mr. Jobs has no basis to support this. Macintosh operates NOTHING like Windows 3.1 or Win 95/NT/98. Now for the mac dissing. Mac's are good for 1 thing. Graphics. Thats all. Anything else a mac sucks at. You look in all the elementary schools of america.. You wont see a PC. Youll see a mac. Why? Because Mac's are only used by people with undeveloped brains."

-- Allen (chuggie@geocities.com), August 10, 1998

Oral cultures do not share this belief. Knowledge is open-ended. People may hold differing opinions without one person being wrong. There is not necessarily one truth; there may be many truths. Though he didn't grow up in an oral culture, Shakespeare knew this. Watch Troilus and Cressida and its five perspectives on the nature of a woman's love and try to figure out which perspective Shakespeare thinks is correct.

Feminists, chauvinists, warmongers, pacifists, Jew-haters, inclusivists, cautious people, heedless people, misers, doctors, medical malpractice lawyers, atheists, and the pious are all able to quote Shakespeare in support of their beliefs. That's because Shakespeare uses the multiple characters in each of his plays to show his culture's multiple truths.

In the 400 years since Shakespeare we've become much more literate. There is usually one dominant truth. Sometimes this is because we've truly figured something out. It is tough to argue that a physics textbook on Newtonian mechanics should be an open-ended discussion (though a user comment facility might still be very useful in providing clarifying explanations for confusing sections). Yet even in the natural sciences, one can find many examples in which the culture of literacy distorts discourse.

Academic journals of taxonomic botany reveal disagreement on whether Specimen 947 collected from a particular field in Montana is a member of species X or species Y. But the journals imply agreement on the taxonomy, i.e., on how to build a categorization tree for the various species. If you were to eavesdrop on a cocktail party in a university's department of botany, you'd discover that even this agreement is illusory. There is widespread disagreement on what constitutes the correct taxonomy. Hardly anyone believes that the taxonomy used in journals is correct, but botanists have to stick with it for publication because otherwise older journal articles would be rendered incomprehensible. Taxonomic botany based on an oral culture or a computer system capable of showing multiple views would look completely different.

The Internet and computers, used competently and creatively, make it much easier and cheaper to collect and present multiple truths than in the old world of print, telephone, and snail mail. Multiple-truth Web sites are much more interesting than single-truth Web sites and, per unit of effort and money invested, much more effective at educating users.

Implementing Comments

Comments on articles will be represented in a separate table:

create table comments_on_articles_raw (
	comment_id		integer primary key,
	-- on what article is this a comment?
	refers_to		not null references articles,
	creation_user		not null references users,
	creation_date		not null date,
	language		char(2) references language_codes,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

create view comments_on_articles_approved 
as
select * 
from comments_on_articles_raw
where editorial_status = 'approved';

This table differs from the articles table only in a single column: refers_to. How about combining the two:

create table content_raw (
	content_id		integer primary key,
	-- if not NULL, this row represents a comment
	refers_to		references content_raw,
	-- who contributed this and when
	creation_user		not null references users,
	creation_date		not null date,
	-- what language is this in?
	-- visit http://www.w3.org/International/O-charset-lang
	-- to see the allowable 2-character codes (en is English, ja is Japanese)
	language		char(2) references language_codes,
	-- could be text/html or text/plain or some sort of XML document
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	-- the entire article; 4 GB limit
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

-- if we want to be able to write some scripts without having to think
-- about the fact that different content types are merged

create view articles_approved 
as
select * 
from content_raw
where refers_to is null
and editorial_status = 'approved';


create view comments_on_articles_approved 
as
select * 
from content_raw
where refers_to is not null
and editorial_status = 'approved';

-- let's build a single full-text index on both articles and comments
-- using Oracle Intermedia Text (formerly known as "Context")

create index content_ctx on content_raw (body)
indextype is ctxsys.context;

What is Different about News?

What is so different about news that we need to have a separate table? Oftentimes news has an expiration date, after which it is no longer interesting and should be pushed into an archive. "Pushing into an archive" does not necessarily mean that the item must be moved into a different table. It might be enough to program the presentation scripts so that unexpired news items are on the first page and expired items are available by clicking on "archives".

Often a company's press release will be tagged "for release Monday, April 17." If a publisher wants to continue receiving press releases from this company, it will respect these dates. This implies the need for a release_time column in the news data model.

Other than these two columns (expiration_time and release_time), it would seem that a news story needs more or less the same columns as articles: a place for a one-line summary, a place for the body of the story, a way to indicate authorship, a way to indicate approval within the editorial workflow.

Upon further reflection, however, perhaps these columns could be useful for all site content. An article on upgrading from Windows 2000 to Windows XP probably should be set to expire in 2006. If a bunch of authors and editors are working on a major site update, perhaps it would be nice to synchronize the release of the new content for Tuesday at midnight. Let's go back to content_raw:

create table content_raw (
	content_id		integer primary key,
	refers_to		references content_raw,
	creation_user		not null references users,
	creation_date		not null date,
	release_time		date,	-- NULL means "immediate"
	expiration_time		date,	-- NULL means "never expires"
	language		char(2) references language_codes,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

How do we find news stories amongst all the content rows? What distinguishes a news story with a scheduled release time and expiration date from an article on the Windows 2003 operating system with a scheduled release time and expiration date? We'll need one more column:

create table content_raw (
	content_id		integer primary key,
	content_type		varchar(100) not null,
	refers_to		references content,
	creation_user		not null references users,
	creation_date		not null date,
	release_time		date,
	expiration_time		date,
	language		char(2) references language_codes,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

create view news_current_and_approved
as
select *
from content_raw 
where content_type = 'news'
and (release_time is null or sysdate >= release_time)
and (expiration_time is null or sysdate  <= expiration_time)
and editorial_status = 'approved';

Notice the explicit checks for NULL in the view definition above. You'd think that something simpler such as

and sysdate between release_time and expiration_time

would work. The problem here is SQL's three-valued logic. For the RDBMS to return a row, all of the AND clauses must return true. NULL is not true. Any expression or calculation including a NULL evaluates to NULL. Thus

where sysdate >= release_time

will exclude any rows where release_time is NULL.

What is Different about Discussion?

It seems that we've managed to treat four of the six required content types with one table. What's more, we've done it without having a long list of NULLed columns for a typical item. For an article, refers_to will be NULL. For content that is not temporal, the release and expiration times will be NULL. Otherwise, most of the columns will be filled most of the time.

What about questions and answers in a discussion forum? If there is only one forum on the site, we can simply add rows to the content_raw table with a content_type of "forum_posting" and query for the questions by checking refers_to is null. On a site with multiple forums, we'd need to add a parent_id column to indicate under which topic a given question falls. Within a forum with many archived posts, we'll also need some way of storing categorization, e.g., "this is a Darkroom question". See http://www.photo.net/bboard/ for a running example of a multi-forum system in which older postings are categorized. The "Discussion" chapter of this book treats this subject in more detail.

Why Not Use the File System?

Let's step back for a moment and ask ourselves why we aren't making more use of the hierarchical file system on our server. What would be wrong with having articles stored as .html files in directories? This is the way that most Web sites were built in the 1990s and it is certainly impossible to argue with the performance and reliability of this approach.

One good thing about the file system is that there are a lot of tools for users with different levels of skill to add, update, remove, and rename files. Programmers can use text editors. Designers can use Web design tools and FTP the results. Page authors can use HTML editors such as Microsoft Front Page.

One bad thing about giving many people access to the file system is the potential for chaos. A designer is supposed to upload a template, but ends up removing a script by mistake. Now users can't log into the site anymore. The standard Windows and Unix file systems aren't versioned. It isn't possible to go back and ask "What did this file look like six months ago?" The file system does not by itself support any workflow (see below). You authorize someone to modify a file or not. You can't say "User 37 is authorized to update this article on aquarium filters, but the members shouldn't see that update until it is approved by an editor."

The deepest problem with using the file system as a cornerstone of your content management system is that files are outside of the database. You will need to store a lot of references to content in the database, e.g., "User 960 is the author of Article 231", "Comment 912 is a comment on Article 529", etc. It is very difficult to keep a set of consistent references to things outside the RDBMS. Suppose that your RDBMS tables are referring to file system files by file name. Someone renames a file. The database doesn't know. The database's referential integrity constraint mechanisms cannot be invoked to protect against this circumstance. It is much easier to keep a set of data structures consistent if they are all within the RDBMS.

Static .html files also have the problem of being, well, static. Suppose that you want a standard header and footer on every page. You can cut and paste these into every .html file on the system. But what if you want to change "Copyright 2003" to "Copyright 2006" in the site-wide footer? You may have to update thousands of files. Suppose that you want the header to include a "Login" link if the request comes in with no user authorization cookie and a "Logout" link if the request comes in from a registered user.

Some of the problems with publisher maintenance of static .html files can be solved by periodically writing and running clever Perl scripts. Deeper problems with the user experience remain, however. First and foremost is the fact that with a static .html file every person who views the page thinks that he or she might be the only person ever to have viewed the page. This makes for a very lonely Internet experience and, generally speaking, not a very profitable one for the publisher.

A sustainable online business will typically offer some sort of online community interaction anchored by its content and will offer a consistently personalized user experience. These requirements entail some sort of computer program executing on every page load. So you might as well take this to its logical conclusion and build every URL in your application the same way: script in the file system executes and pulls content from the RDBMS.

Exercise 1

Develop a data model for the content that you'll be storing on your site. Note that at a bare minimum your content repository needs to be capable of handling a discussion forum since we'll be building that in a later chapter.

You might find that, in making the data model precise with SQL table definitions, questions for the client arise. You realize that your earlier discussions with the client were too vague in some areas. This is a natural consequence of building a SQL data model. Pick up the phone and call your client to get clarifications. Email with several alternative concrete scenarios. Get your client accustomed to fielding questions in a timely manner.

Show the draft data model to your teaching assistant and discuss with other students before proceeding.

How the Workflow Problem Arises

It is easy to build and maintain a Web site if

one person is publisher, author, and programmer
the site comprises only a few pages
nobody cares whether these few pages are formatted consistently
nobody cares about retrieving old versions or figuring out how a version got to be the way that it is

Fortunately for companies and programmers that hope to make a nice living from providing content management "solutions", the preceding conditions seldom obtain at better-financed Web sites. What is more typical are the following conditions:

labor is divided among publishers, information designers, graphic designers, authors, and programmers
the site contains thousands of pages
pages must be consistent within sections and sections must have a unifying theme
version control is critical

The publisher decides what major content sections are available, when a content section goes live, and the relative prominence to be assigned each content section.

The information designer decides what navigational links are available from every document on the page, how to present the available content sections, and what graphic design elements are required.

The graphic designer contributes drawings, logos, and other artwork in service of the information designer's objectives. The graphic designer also produces mock-up templates (static HTML files) in which these artwork elements are used.

The programmer builds production templates and computer programs that reflect the instructions of publisher, information designer, and graphic designer.

Editors approve content and decide when specific pages go live. Editors assign relative prominence among pages within sections.

In keeping with their relative financial compensation, we consider the needs and contributions of authors second to last. Authors stuff fragments of HTML, plain text, photographs, music, and sound, into the database. These authored entities will be viewed by users only through the templates developed by the programmers.

Below is an example workflow that we used to assign to students at MIT:

Your "practice project" will be a content management system to support a guide to Boston, along the lines of the AOL City Guide at http://home.digitalcity.com/boston/. You will need to produce a design document and a prototype implementation. The prototype implementation should be able to support the following scenario:

log in as publisher and visit /admin/content-sections/
build a section called "movies" at /movies
build a section called "dining" at /dining
build a section called "news" at /news
log out
log in as information designer and visit /cm and specify navigation. From anywhere in dining, readers should be able to get to movies. From movies, readers should be able to get to dining or news.
log out
log in as programmer and visit /cm
make two templates for the movie section, one called movie_review and one called actor_profile; make one template for the dining section called restaurant_review
log out
log in as author and visit /cm
add two movie reviews and two actor profiles to the movies section and a review of your favorite restaurant to the dining section
log out
log in as editor and visit /cm
approve two of the movie reviews, one of the actor profiles, and the restaurant review
log out
without logging in (i.e., you're just a regular public Web surfer now), visit the /movies section and, ideally, you should see that the approved content has gone live
follow a hyperlink from a movie review to the dining section and note that you can find your restaurant review
log in as author and visit /cm
edit the restaurant review to reflect a new and exciting dessert
log out
visit the /dining section and note that the old (approved) version of the restaurant review is still live
log in as editor and visit /cm and approve the edited restaurant review
log out
visit the /dining section and check that the new (with dessert) version of the restaurant review is being served

A Workflow Problem without Any Work

The preceding section dealt with the problem of supporting the standard publishing world. You know all the authors. They know what they're supposed to write. In an online learning community, especially a non-commercial one, the workflow problem arises before any work has been done or assigned. For example, suppose that the publishers behind the photo.net community decide that they want the following articles:

Basic black and white darkroom photography
Basic color darkroom (color negative)
Making Ilfochrome prints
Hardcore black and white printmaking
Platinum prints

Among the 300,000 people who visit photo.net every month, surely there are people capable of writing each of the preceding articles. We want a system where

Joe User can transactionally sign up to write "Platinum prints", thus marking the article "assignment requested pending editorial approval", supplying a brief outline and committing to completing a draft by July 1.
Jane Editor can approve the outline and schedule, thus generating an email alert back to Joe.
Joe User gets periodic email reminders of what he has signed up to do and by when.
Jane Editor is alerted when Joe's first draft is submitted on July 17 (Joe is unlikely to be the first author in the history of the world to submit work on time).
Joe User gets an email alert asking him to review Jane's corrected version and sign off his approval.
The platinum printing article shows up at the top of Jane Editor's workspace page as "signed off by author" and she clicks to push it live.

Notice the intricacies of the workflow and also the idiosyncracies. The New York Times and the Boston Globe put out very similar-looking products. They are owned by the same corporation. What do you think the chances are that software that supports one newspaper's workflow will be adequate to support the other's?

Exercise 2

Lay out the workflow for each content item that will be user-visible in your online learning community. For each workflow step, specify (1) who needs to give approval, (2) what email alerts are generated, (3) what happens if approval is given, and (4) what happens if approval is denied.

Tip: we recommend modeling workflow as a finite-state machine in which a content item can be in only one state at a time and that single state tells you everything that you need to know about the item. In other words, your software can take action without ever needing to go back and look to see what states the article was in previously.

Version Control (for Content)

Anyone involved in the administration and editing of an online learning community ought to be able to fetch an old version of a content item. If an author complains that a paragraph was dropped, the editors should be able to retrieve the first draft of the article from the content management system. Old versions are sometimes useful for public users as well. For example, on photo.net in the mid-1990s we had a lot of classified ads whose subject lines were of the form "Reduced to $395!" A check through the server logs revealed that the ad had been posted earlier that day with a price of $400, then edited a few hours later. So technically the subject line was true, but it was misleading. Instead of hiring additional administrators to notice this kind of problem, we changed the software to store all previous versions of a classified ad. When presenting an ad that had been edited, the new scripts offered a link to view old versions of the ad. The practice of screaming "Reduced!" stopped.

Version control becomes critical for preventing lost updates when people are working together. Here's how a lost update can happen:

Ira grabs Version A of a document at 9:00 am from the Web site in order to fix a typo. He fixes it at 9:01 am, but forgets to write the document back to the Web site.
Shoshana grabs Version A at 10:00 am and spends six hours adding a chapter of text, writing it back at 4:00 pm (call this Version B).
Ira notices that he forgot to write his typo correction back to the server and does so at 5:00 pm (call this Version C).

Unfortunately, Version C (the typo fix) is what future users will see; all of Shoshana's work was wasted.

Programmers and technical writers at large companies are familiar with the problem of lost updates when multiple people are editing the same document. File-system based version control systems were developed to help coordinate multiple contributors. These systems include the original Walter Tichy's Revision Control System (RCS; early 1980s), Dick Grune and Brian Berliner's Concurrent Versions System (CVS; 1986), and Marc Rochkind's Source Code Control System (SCCS; 1972). These systems require more training than is practical for casual users. For example, RCS mandates explicit check-out and check-in. While a file is checked out by User A it is locked and nobody but User A can check it back in. Suppose that User A goes out to lunch, but there is some important news that absolutely must be put on the site. What if User A leaves for a two-week vacation and forgets to check a bunch of files back in? These problems can be worked around manually, but it becomes a challenge when the collaborators are on opposite sides of the globe and cannot see "Oh, Schlomo's coat is still on the back of his chair so he's not yet left for the day."

For distributed authorship of Web content by geographically distributed casually connected users, the most practical system turns out to be one in which check-in is allowed at any time by any authorized person. However, all versions of every document are kept in the database so that one can always revert to an earlier version or pull a section out of an earlier version. This implies that your content management system will have an audit trail: a record of past values held by row-column intersections in a database table, who was responsible for any changes in those values, and when the values were changed.

There are two classical ways to implement an audit trail in an RDBMS. The first is to set up separate audit tables, one for each production table. Every time an update is made to a production table, the old row is written out to an audit table, with a time stamp. This can be accomplished transparently via RDBMS triggers, which are described in the "Triggers" chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/triggers and demonstrated in practice in an open-source audit trail package documented at http://philip.greenspun.com/seia/examples-content-management/audit-acs-doc. The second classical approach is to keep current and archived information in the same table. This is more expensive in terms of computing resources required because the information that you want for the live site is interspersed with seldom-retrieved archived information. But it is easier if you want to program in the capability to show the site as it was on a particular day. Your templates won't have to query a different table, they will merely need a different WHERE clause.

Michael Stonebraker, a professor at University of California Berkeley, looked at this problem around 1990 and decided to build an RDBMS with, among other advanced features, native support for versioning. This became the PostgreSQL open-source RDBMS. The original PostgreSQL had a "no-overwrite architecture" in which a change to a row resulted in a complete new version of that row being written out to the disk. Thus the hard disk drive contained all previous versions of every row in the table. A programmer could select * from content_table['epoch','1995-01-01'] ... to get all versions from the beginning of time ("epoch") until January 1, 1995. This innovation made for some nice articles in academic journals, but execrable transaction processing performance. The modern PostgreSQL scrapped this idea in favor of Oracle-style write-ahead logging in which only updates are written to the hard drive (see the "Write-Ahead Logging" chapter of the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/wal.html).

Second Normal Form

Suppose that you decide to keep multiple versions in a single content repository table:

create table content_raw (
	content_id		integer primary key,
	content_type		varchar(100) not null,
	refers_to		references content_raw,
	creation_user		not null references users,
	creation_date		not null date,
	release_time		date,
	expiration_time		date,
	-- some of our content is geographically specific 
	zip_code		varchar(5),
	-- a lot of our readers will appreciate Spanish versions
	language		char(2) references language_codes,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	-- let's use BLOB in case this is a Microsoft Word doc or JPEG
	-- a BLOB can also hold HTML or plain text
	body			blob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

If this table were to contain seven versions of an article with a Content ID of 5657 that would violate the primary key constraint on the content_id column. What if we remove the primary key constraint? In Oracle this prevents us from establishing referential integrity constraints pointing to this ID. With no integrity constraints, we will be running the risk, for example, that our database will contain comments on content items that have been deleted. With multiple rows for each content item, our pointers become ambiguous. The statement "User 739 has read Article 5657" points from a specific row in the users table into a set of rows in the content_raw. Should we try to be more specific? Do we want a comment on an article to refer to a specific version of that article? Do we want to know that a reader has read a specific version of an article? Do we want to know that an editor has approved a specific version of an article? It depends. For some purposes, we probably do want to point to a version, e.g., for approval, and at other times we want to point to the article in the abstract. If we add a version_number column, this becomes relatively straightforward.

create table content_raw (
	-- the combination of these two is the key
	content_id		integer,
	version_number		integer,
	...
	primary key (content_id, version_number)

);

Retrieving information for a specific version is easy. Retrieving information that is the same across multiple versions of a content item becomes clumsy and requires a GROUP BY, since we want to collapse information from several rows into a one-row report:

-- note the use of MAX on VARCHAR column; this works just fine

select content_id, max(zip_code)
from content_raw
where content_id = 5657
group by content_id

We're not really interested in the largest ZIP code for a particular content item version. In fact, unless there has been some kind of mistake in our application code, we assume that all ZIP codes for multiple versions of the same content item are the same. However, GROUP BY is a mechanism for collapsing information from multiple rows. The SELECT list can contain column names only for those columns that are being GROUPed BY. Anything else in the SELECT list must be the result of aggregating the multiple values for columns that aren't GROUPed. The choices with most RDBMSes are pretty limited: MAX, MIN, AVERAGE, SUM. There is no "pick any" function. So we use MAX.

Updates are similarly problematic. The U.S. Postal Service periodically redraws the ZIP code maps. Updating one piece of information, e.g., "20016" to "20816", will touch more than one row per content item.

This data model is in First Normal Form. Every value is available at the intersection of a table name, column name, and key (the composite primary key of content_id and version_number). However, it is not in Second Normal Form, which is why our queries and updates appear strange.

In Second Normal Form, all columns are functionally dependent on the whole key. Less formally, a Second Normal Form table is one that is in First Normal Form with a key that determines all non-key column values. Even less formally, a Second Normal Form table contains statements about only one kind of thing.

Our current content_raw table contains some information that depends on the whole key of content_id and version_number, e.g., the body and the language code. But much of the information depends only on the content_id portion of the key: author, creation time, release time, ZIP code.

When we need to store statements about two different kinds of things, it makes sense to create two different tables, i.e., to use Second Formal Form:

-- stuff about an item that doesn't change from version to version
create table content_raw (
	content_id		integer primary key,
	content_type		varchar(100) not null,
	refers_to		references content_raw,
	creation_user		not null references users,
	creation_date		not null date,
	release_time		date,
	expiration_time		date,
	mime_type		varchar(100) not null,
	zip_code		varchar(5)
);

-- stuff about a version of an item
create table content_versions (
	version_id		integer primary key,
	content_id		not null references content_raw,
	version_date		date not null,
	language		char(2) references language_codes,
	one_line_summary	varchar(200) not null,
	body			blob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired')),
	-- audit the person who made the last change to editorial status
	editor_id		references users,
	editorial_status_date	date
);

How does one query into the versions table and find the latest version? A first try might look something like the following:

select * 
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and version_date = (select max(version_date)
                    from content_versions
                    where content_id = 5657
                    and editorial_status = 'approved')

Is this guaranteed to return only one row? No! There is no unique constraint on content_id, version_date. In theory, two editors or authors could submit new versions of an item within the same second. Remember that the date datatype in Oracle is precise only to within one second. Even more likely is that an editor doing a revision might click on an editing form submit button twice with the mouse or perhaps use the Reload command impatiently. Here's a slight improvement:

select * 
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and version_id = (select max(version_id)
                  from content_versions
                  where content_id = 5657
                  and editorial_status = 'approved')

The version_id column is constrained unique, but we're relying on unstated knowledge of our application code, i.e., that version_id will be larger for later versions.

Some RDBMS implementations have extended the SQL language so that you can ask for the first row returned by a query. A brief look at the Oracle manual would lead one to try

select * 
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and rownum = 1
order by version_date desc

but a deeper reading of the manual would reveal that the rownum pseudo-column is set before the ORDER BY clause is processed. An accepted way to do this in one query is the nested SELECT:

select *
from (select * 
      from content_versions
      where content_id = 5657
      and editorial_status = 'approved'
      order by version_date desc)
where rownum = 1;

Another common style of programming in SQL that may seem surprising is taking the following steps:

open a cursor for the SQL statement

select * 
from content_versions
where content_id = 5657
and editorial_status = 'approved'
order by version_date desc

fetch one row from the cursor (this will be the one with the max value in version_date)
close the cursor

Third Normal Form

An efficiency-minded programmer might look at the preceding queries and observe that a content version is updated at most ten times per year, whereas the public pages may be querying for and delivering the latest version ten times per second. Wouldn't it make more sense to compute and tag the most current approved version at insertion/update time?

create table content_versions (
	version_id		integer primary key,
	content_id		not null references content_raw,
	version_date		date not null,
	...
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired')),

	current_version_p	char(1) check(current_version_p in ('t','f')),
	...
);

The new current_version_p column can be maintained via a trigger that runs after insert or update and examines the version_date and editorial_status columns.

Querying for user pages can be simplified with the following view:

create view live_versions 
as
select * 
from content_versions
where current_version_p = 't';

Modern commercial RDBMS implementations offer a feature via which rows in a table can be spread across different tablespaces, each of which is located on a physically separate disk drive. In Oracle, this is referred to as partitioning:

create table content_versions (
	version_id		integer primary key,
	content_id		not null references content_raw,
	version_date		date not null,
	...
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired')),

	current_version_p	char(1) check(current_version_p in ('t','f')),
	...
)
partition by range 
  (current_version_p)
  (partition old_crud values less than 's'
   tablespace slow_extra_disk_tablespace
   partition live_site values less than(maxvalue)
   tablespace fast_new_disk_tablespace)
;

All of the rows for the live site will be kept together in relatively compact blocks. Even if the ratio of old versions to live content is 99:1 it won't affect performance or the amount of RAM consumed for caching database blocks from the disk. As soon as Oracle sees a "WHERE CURRENT_VERSION_P =" clause it knows that it can safely ignore an entire tablespace and won't bother checking any of the irrelevant blocks.

Have we reached Nirvana? Not according to the database eggheads, whose relational calculus formulae do not embrace such factors as how data are spread among physical disk drives. The database theoretician would note that our data model is in Second Normal Form but not in Third Normal Form. In a table that is part of a Third Normal Form data model, all columns are directly dependent on the whole key. The column current_version_p is not dependent on the table key, but rather on two other non-key columns (editorial_status and version_date). SQL programmers refer to this kind of performance-enhancing storage of derivable data as "denormalization".

If you want to serve ten million requests per day directly from an RDBMS running on a server of modest capacity, you may need to break some rules. However, the most maintainable production data models usually result from beginning with Third Normal Form and adding a handful of modest and judicious denormalizations that are documented and justified.

Note that any data model in Third Normal Form is also in Second Normal Form. A data model in Second Normal Form is in First Normal Form.

Version Control (for Computer Programs)

Note that a solution to the version control problem for site content (stuff in the database) still leaves you, as an engineer, with the problem of version control for the computer programs that implement the site. These are most likely in the operating system file system and are edited by a handful of professional software developers. During this class you may decide that it is not worth the effort to set up and use version control, in which case your de facto version control system becomes backup tapes, so make sure that you've got daily backups. However, in the long run you need to learn about approaches to version control for Internet application development.

Throughout this section, keep in mind that a project with a very clear publishing objective, specs that never change, and one very smart developer, does not need version control. A project with evolving objectives, changing specifications, and multiple contributors needs version control.

Classical Solution: one development area per developer

Classically, version control is used by C developers with each C programmer working from his or her own directory. This makes sense because there is no persistence in the C world. Code is compiled. A binary runs that builds data structures in RAM. When the program terminates, it doesn't leave anything behind. The entire "tree" of software is checked out from a version control repository into the file system of the development computer. Changed files are checked back into the repository when the programmer is satisfied.

A shallow objection to this development method in the world of database-backed Internet applications is that it becomes very tedious to make a small change. The programmer checks out the tree onto a development server. The programmer installs an RDBMS, then creates an RDBMS user and a tablespace. The programmer exports the RDBMS from the production site into a dump file, transfers that dump file over the network to the development machine, and imports it into the RDBMS installation on the development server. Keep in mind that for many Internet applications the database may approach one terabyte in size, and therefore it could take hours or days to transfer and import the dump file. Finally, the programmer finds a free IP address or port and sets up an HTTP server rooted at the development tree. Ready to code!

A deeper objection to applying this development method to our world is that it is an obstacle to collaboration. In the Internet application business, developers always work with the publisher and users. Those collaborators need to know, at all times, where to find the latest running version of the software so that they can offer criticism and advice. If there are ten software developers on a service it is not reasonable to ask the publishers and users to check ten separate development sites.

A Solution for Our Times

three HTTP servers (they can be on one physical computer)
two or three RDBMS users/tablespaces (they can be in one RDBMS instance)
one version control repository

Let's go through these item by item.

Item 1: Three HTTP Servers

Suppose that a publisher's overall objective is to serve an Internet application accessible at "foobar.com". This requires a production server, rooted in the file system at /web/foobar/ (Server 1). It is too risky to have programmers making changes on the live production site. This requires a development server, rooted at /web/foobar-dev/ (Server 2). Perhaps this is enough. When everyone is happy with the way that the dev server is functioning, declare a code freeze, test a bit, then copy the dev code over to the production directory and restart.

What's wrong with the two-server plan? Nothing if the development and testing teams are the same, in which case there is no possibility of simultaneous development and testing. For a complex site, however, the publisher may wish to spend a week testing before launching a revision. It isn't acceptable to idle authors and developers while a handful of testers bangs away at the development server. The addition of a staging server, rooted at /web/foobar-staging/ (Server 3) allows development to proceed while testers are preparing for the public launch of a new version.

Here's how the three servers are used:

developers work continuously in /web/foobar-dev/
when the publisher is mostly happy with the development site, a named version or branch is created and installed at /web/foobar-staging/
the testers bang away at the /web/foobar-staging/ server, checking fixes back into the version control repository but only into the staging branch
when the testers and publishers sign off on the staging server's performance, the site is released to /web/foobar/ (production)
any fixes made to the staging branch of the code that have not already been fixed by the development team are merged back into the development branch in the version control repository

Item 2: Two or Three RDBMS Users/Tablespaces

Suppose that the publisher has a working production site running version 1.0 of the software. One could connect the development server rooted at /web/foobar-dev/ to the production database. After all, the raison d'être of the RDBMS is concurrency control. It will be happy to handle eight simultaneous connections from a production Web server plus two or three from a development server. The fly in this ointment is that one of the developers might get sloppy and write a program that sends drop table users rather than drop table users_experimental_extra_table to the database. Or, less dramatically, a junior developer might leave out a WHERE clause in an SQL statement and inadvertently request a result set of 10⁹ rows, thus slowing down the production site.

So it would seem that this publisher will need at least one new database. Here are the steps:

create a new database user and tablespace; if this is on a separate physical computer from your production RDBMS server it will protect your production server's performance from inadvertent denial-of-service attacks by sloppy development SQL statements
export the production database into a file system file, which is a good periodic practice in any case as it will verify the integrity of the database
import the database export into the new development database
every time that a developer alters a table, adds a table, or populates a new table, record the operation in a "patches.sql" file
when ready to move code from staging to production, hastily apply all the data model modifications from patches.sql to the production RDBMS

Should there be three databases, i.e., one for dev, one for staging, and one for production? Not necessarily. Unless one expects radical data model evolution it may be acceptable to use the same database for development and staging. Keep in mind that adding a column to a relational database table seldom breaks old queries. This was one of the objectives set forth by E.F. Codd in 1970 in "A Relational Model of Data for Large Shared Data Banks" (http://www.acm.org/classics/nov95/toc.html) and certainly modern implementations of the relational model have lived up to Codd's hopes in this respect.

Item 3: One Version Control Repository

The function of the version control repository is to

remember what all the previous checked-in versions of a file contained
show the difference between what's in a checked-out tree and what's in the repository
help merge changes made simultaneously by multiple authors who might have been unaware of each other's work
group a snapshot of currently checked-in versions of files as "Release 2.1" or "JuneIssue"

An example of a system that meets the preceding requirements is Concurrent Versions System (CVS), which is free and open-source. CVS uses a single file system directory as its repository or "CVS root". CVS can run over the Internet so that the repository is on Computer A and dev, staging, and prod servers are on Computers B, C, and D. Alternatively, you can run everything in separate file system directories on one physical computer.

Good things about this solution

Let's summarize the good things about the version control (for computer programs) solution proposed here:

if something is screwy with the production server, one can easily revert to a known and tested version of the software
programmers can protect and comment their changes by explicitly checking files in after significant changes
teams of programmers and testers can work indepently

Further reading: Open Source Development With CVS (Fogel and Bar 2001; Coriolis), a portion of which is available online at http://cvsbook.red-bean.com/cvsbook.html.

Exercise 3: Version Control

Write down your answers to the following questions:

What is your system for versioning content?
What is your system for versioning the software behind your application, including data model and page scripts?
What kind of answer can your system produce to the question "Who is responsible for the content on this current user-visible page?"

Note that generally most teams must write some additional SQL code to complete this exercise, augmenting the data model that they built in Exercise 1.

Exercise 4: Skeletal Implementation

Build enough of the pages so that a group of users can cooperate to put a few pieces of content live on your server. Focus your efforts on the primary kinds of publisher-authored content that you expect to have in your online learning community. For most projects, this will be articles and navigation pages to those articles.

After you've got a few articles in, step back and ask the following questions:

Is this data model working?
Is it taking a reasonable number of clicks to get some content live?
Do the people who need to approve new content have an easy way of figuring out what needs approval and what has been approved or rejected already? Must those editors come to the site every few hours and check or will they get email alerts when new content needs review?

A skeletal implementation should have stable and consistent URLs, i.e., the home page should be just the hostname of the server and filenames should be consistent. If you haven't had a chance to make abstract URLs work (see the "Basics" chapter), this is a good time to do it. Every page should have a descriptive title so that the browser's Back button and bookmarks ("favorites") are fully functional. Every page should have a "View Source" link at the bottom and a way to contact the persons responsible for page function and content. Some sort of consistent navigation system should be in place (also see below). The look and feel of a skeletal implementation will be plain, but it need not be ugly or inconsistent. Look to Google for inspiration, not the personal home pages of fellow students at your university.

Look and Feel

At this point you have some content on your server. It is thus possible to begin experimenting with the look and feel of HTML pages. A good place to start is with the following issues:

space
time
words
color

Screen Space

In the 1960s a computer user could tap into a 1/100th share of a computer with 1 MB of memory and capable of executing 1 million instructions per second, viewing the results on a 19-inch monitor. In 2005, a computer user gets a full share of a computer with 2000 MB of memory (2 GB) and capable of executing 4 billion instructions per second. This is roughly a 400,000-fold improvement in available computing capability. How does our modern computer user view the results of his or her computations? On a 19-inch monitor.

Programmers of most applications no longer need concern themselves too much with processor and memory efficiency, which were obsessions in the 1960s. CPU and RAM are available in abundance. But screen real estate is as precious as ever. Look at your page designs. Is the most important information available without scrolling? (In the newspaper business, the term for this is "above the fold".) Are you making the best use of the screen space that you have? Are there large swaths of empty space on the page? Could you be using HTML tables to present two or three columns of information at the same time?

One particularly egregious waste of screen space is the use of icons. Typically, users can't understand what the icons mean so they need to be supplemented with plain language annotation. Generally the best policy is to let the information be the interface, e.g., display a list of article categories (the information) where clicking on a category is the way to navigate to a page showing articles within that category.

Time

Most people prefer fast to slow. Most people prefer consistent service time to inconsistent service time. These two preferences contribute substantially to the popularity of McDonald's restaurants worldwide. When people are done with their lunch they bring those same preferences to computer applications: fast is better than slow; response time should be consistent from session to session.

Computer and network speeds will change over the years but human beings will evolve much more slowly. Thus we should start by considering limits derived from the humanity of our users. The experimental psychologists will tell us that short-term memory is good for remembering only about seven things at once ("The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information", George A. Miller, The Psychological Review 1956, 63:81-97; http://www.well.com/user/smalin/miller.html) and that this memory is good for only about twenty seconds. It is thus unwise to build any computer application in which users are required to remember too much from one page to another. It is also unwise to build any computer application where the interpage delay is more than twenty seconds. People might forget what task they were trying to accomplish!

IBM Corporation carried out some studies around 1970 and discovered the following required computer response times:

0.1 seconds for direct manipulation, e.g., moving objects around on a screen with a pointer
1 second for maximum productivity in screen-click-screen systems such as they had on the IBM 3270 terminal back in 1970 and we have on the Web in 2005
less than 10 seconds to hold the full attention of a user; when response times extended beyond 10 seconds users would try to engage in another task, such as reading a magazine, while also using the computer application

A reasonable goal to strive for in an Internet application is sub-second response time. This goal is based partly on IBM's research, partly on the inability to achieve (in 2005) the 0.1-second mark at which direct manipulation becomes possible, and partly on what is being achieved by the best practitioners. Your users will have used Amazon and Yahoo! and eBay. Any service that is slower than these is going to set off alarm bells in the user's mind: maybe this site is going to fail altogether? Maybe I should try to find a competitive site that does the same job but is faster?

One factor that affects page-loading time is end-to-end bandwidth between your server and the user. You can't do much about this except measure and average. Some Web servers can be configured or reprogrammed to log the total time spent serving a page. By looking at the times spent serving large photographs, for example, you can infer average bandwidth available between your server and the users. If the tenth percentile users are getting 50 Kbits per second, you know that, even if your server were infinitely fast at preparing pages, you should try to make sure that your pages, with graphics, are either no larger than 50 Kbits in size or that the HTML is designed such that the page will render incrementally. (A page that is one big TABLE is bad; a page in which any images have WIDTH and HEIGHT tags is good because the text will be rendered immediately with blank spaces that will be gradually filled in as the images are loaded.)

You can verify your decisions about page layout and graphics heaviness by comparing your pages to those of the most successful Internet service operators such as eBay, Yahoo!, and Amazon.

Remember that in the book and magazine world every page design loads at the same speed, which means that page design is primarily a question of aesthetics. In the Internet world page design and application speed are inextricably linked, which makes page design an engineering problem.

Words

As a programmer, there are two kinds of text that you will be putting into the services that you build: instructions and error messages.

For instructions, you can choose active or passive voice and first, second, or third person. Instructions should be second person imperative. Leave out the pronouns, e.g., "Enter departure date" rather than "Enter your departure date".

Oftentimes you can build a system such that error messages are unnecessary. The best user interfaces are those where the user can't make a mistake. For example, suppose that an application needs to prompt for a date. One could do this with a blank text entry box and no hint, expecting the user to type MM/DD/YYYY, e.g., 09/28/1963 for September 28, 1963. If the user's input did not match this pattern or the date did not exist, e.g., 02/30/2002, the application returns a page explaining the requirements. A minor improvement would be to add a note next to the box: "MM/DD/YYYY". If the application logs showed that the number of error pages served was reduced, but not eliminated, perhaps defaulting the text entry box to today's date in MM/DD/YYYY format would be better. Surf over to your favorite travel site, however, and you'll probably find that they've chosen "none of the above". Users are asked to pick a date from a JavaScript calendar widget or pull down month and day from HTML menus.

Figure 6.1: Different ways of asking the user to specify a date. Generally it is best to ask in such a way that the user cannot possibly make a mistake and necessitate the serving of an error page reading "date not properly formatted", "invalid date", or "date in the past".

Sadly, you won't be able to eliminate the need for all error messages. Thus you'll have to make a choice between terse or verbose and between lazy or energetic. A lazy system will respond "syntax error" to any user input that won't work. An energetic system will try to autocorrect the user's input or at least figure out what is likely to be wrong.

Studies have shown that it is worthwhile to develop sophisticated error-handling pages, e.g., ones that correct the user's input and serve a confirmation page. At the very least, it is worth running some regular expressions against the user's offending input to see if its defects fall into a common pattern that can be explained on an error page. It is best to avoid anthropomorphism—the computer shouldn't say "I didn't understand what you typed".

Color

"The natural world is too green and badly lit."
-- Francois Boucher, 18th century painter

Text is most readable when it is black against a white or off-white background. It is best to avoid using color as part of your interface with the exception of sticking with conventions such as "blue text = hyperlink; purple text = visited hyperlink". If you limit your creativity to <body bgcolor=white text=black>, the browser will treat your users kindly with familiar link colors. By this sparing use of color in your interface you'll have most of the color spectrum available for presenting information: charts, graphs, photos. Compare www.britneyspears.com and http://britneyspears.ac/physics/basics/basics.htm, for example, to see these principles at work.

Be a bit careful with medium gray tones at the very top of Web pages. Many Web browsers use various shades of gray for the backgrounds of menu and button bars at the top of windows. If a Web page has a solid gray area at the top, a user may have trouble distinguishing where the browser software ends and the page content begins. Notice that pages on Yahoo! and Amazon include a bit of extra white space at the top to separate their page content from the browser location and menu bars.

Whatever scheme you choose, keep it consistent site-wide. In 1876 MIT agreed on cardinal and gray for school colors. See how the agreement is holding up by visiting www.mit.edu, click on "Administration" and then look at the subsites for four departments: IS, Medical, Arts, Disabilities Service.

For an excellent discussion of the use of color, see Macintosh Human Interface Guidelines, available online at http://developer.apple.com/documentation/mac/HIGuidelines/HIGuidelines-2.html. Basically the messages are the following: (1) use color sparingly, (2) make sure that a colorblind person can make full use of the application, and (3) avoid red because of its association with alerts and danger.

Navigation

As with page design, the best strategy for navigation is to copy the most successful and therefore familiar-to-your-users Internet applications. Best practice for a site home-page circa 2005 seems to boil down to the following elements:

a navigation directory to the rest of the site
news and events
a single text input box for site-wide search
a quick form targeting the most frequently requested service on the site, e.g., on an airline site, a quick fare/schedule finder with form inputs for cities and dates

In building the navigation directory, look at www.yahoo.com. Note that Yahoo! does not use icons for category navigation. To get to the photography category, underneath Arts & Humanities, you click on the word "Photography". The information is the interface. This principle is articulated in Edward Tufte's classic Visual Explanations (Graphics Press, 1997). Tufte notes that if you were to have icons you'd also need a text explanation underneath. Why not let the text alone be the interface? Tufte also argues for broad and flat presentation of information; a user shouldn't have to click through eight screens each with only a handful of choices.

On interior pages, it is important to answer the following questions:

Where am I?
Where have I been?
Where can I go?

To answer "Where am I?" relative to other sites on the Internet, you can include a logo graphic or font-distinguished site name in the upper left corner of each page, hyperlinked to the site home-page. See the interior pages at amazon.com for how this works. To answer "Where am I?" relative to other pages on the same site, you can include a site map with the current page highlighted. On a complex site, this won't scale very well: better to use the Yahoo-style navigation bar, also known as "hierarchical path" or "bread crumbs". For example, http://dir.yahoo.com/Arts/Visual_Arts/Photography/Panoramic/ contains the following navigation bar:

Home > Arts > Visual Arts > Photography > Panoramic

Note that this bar grows in size as O[log N] where N is the number of pages on the site. Showing a full site map or top tabs results in linear growth.

To answer "Where have I been?", start by not instructing the browser to change the standard link colors. The user will thus be cued by the browser for any links that have already been visited. If you're careful with your programming and consistent with your page titles, the user will be able to right-click on the Back button and optionally return to any previous place on your service. Note further that the Yahoo-style navigation bar is effective at answering "Where have I been?" for users who have actually clicked down from the home page.

To answer "Where can I go?" you need ... links! Let the browser default to standard colors so that users will perceive the links as links. It is generally a bad idea to use rollovers, select boxes, or graphics. These controls won't work the same from site to site and therefore users may not understand how to use them. These controls don't have the property that visited links turn a different color; they generally can't or don't tap into the browser's history database. Finally, these controls aren't effective at showing the user where he or she can go because many of the choices are hidden.

Exercise 5: Criticism

Take or get a tour of the other projects being built by your classmates in this course. For each project make sure that you familiarize yourself with the overall service objectives and the data model. Then register as a user and author an article. (If you get stuck on any of these steps, contact the team members behind the project by phone and email and ask them to add links or hints to their server.)

Working with your project team members, write a plain-text critique of each project that you review. Look for situations in which the client's requirements, as expressed in the planning exercise solutions, can't be fulfilled with the data model that you see. Look for opportunities to provide constructive criticism. Remember that your classmates don't need a self-esteem boost; they need the benefit of your engineering skills.

Here are some suggested areas where it might be easy to find improvements:

page flows in user registration and content authoring—could the number of clicks to accomplish a task be reduced?
look, feel, and navigation referenced to the standards outlined above
version control and audit trail
where do/should attachments go, e.g., is there a place to store a JPEG photo attached to a comment on an article?
categorization and presentation hints—can the content be presented within a clear information architecture?
is there a place to store keywords, i.e., hand-authored collections of words associated with a content item (to aid full-text search)
can the content repository store an arbitrary data type, e.g., a video, an audio clip, or a photograph?

Sign the critique with the name of your project team and also the names of all team members.

Email your critique to the team members whose work you've just reviewed. Archive these in a file and make them available at http://yourservername/doc/critiques/cm-sent.txt. Watch your own inbox for critiques coming in from the rest of the class. Please assemble these into one file and make them available at http://yourservername/doc/critiques/cm-received.txt

Information Architecture: Implicit or Explicit?

Suppose that there are 1000 content items on a site. The manner of organizing, labeling, and presenting these 1000 items to a user is referred to as the information architecture of the site. For the sake of simplicity, let's start by assuming that we will be presenting all 1000 items on one page. For the sake of concreteness, we'll assume that all the content is related to photography. Even this degenerate one-page user experience requires some information architecture decisions. Here are a few possibilities:

sort from newest to oldest (good for experienced users)
sort from highest quality to lowest quality (might be good for first-time users)
categorize by what's in front of the camera and present the items separated by subheadlines, e.g., "Portraits", "Architecture", "Wedding", "Family", "Animals"
categorize by type of camera used and present items separated by subheadlines such as "Digital point and shoot", "Digital SLR", "35mm point and shoot", "35mm SLR", "Medium Format", "Large Format"

Information architecture decisions have a strong effect on the percentage of users who say "I got my questioned answered." Most studies of corporate Web sites, all of which owe their tested form to hundreds of thousands of dollars in design work, find that users have less than a 50 percent chance of finding the answer to questions that are in fact answerable from documents present on the site. We redid the information architecture on the photo.net site, a change that touched only about six top-level pages, and the number of new users registering each day doubled.

One reason that the information architecture on a typical site is so ill-suited to the user is that the architecture is implicit in scripts and HTML pages. To test an alternative would involve expensive hand-manipulation of computer programs and HTML markup. To offer an individual user or class of user a custom information architecture would be impossible.

What if we represented information architecture explicitly in database tables? These tables would hold the following information:

information about information architectures: who made them, when, which ones are current and for whom
whether items underneath a category or subcategory, within a given information architecture (IA), should be presented in-line on one page or merely summarized with links down to separate pages for each item
where a content item fits in a given IA: what subcategory (category can be inferred from the subcategory), what presentation order ("sort key") compared to other items at the same level
how a content item or category should be described

With such a large part of the user experience driven from database tables, testing an alternative is as easy as inserting some rows into the database from the information architecture admin pages. If during a site's conceptualization people can't agree on the best categorization of content, it becomes possible to launch with two alternatives. Half the users see IA 1 and half see IA 2. If users who've experienced IA 1 are more likely to register and return, we can assume that IA 1 is superior, at least for first-time users.

For the application that you build in this course, it is acceptable to take the expedient path of pounding out scripts with an implicit information architecture. However, we'd like you to be aware of the power for development and testing that can be gained from an explicit information architecture.

Exercise 6: The Lived-In Look

A skeletal prototype has one big limitation: it is skeletal. Incorporating the feedback that you've gotten from other students (in Exercise 5) and from instructors, beef up your content management system while simultaneously pouring in enough content that your application has a "lived-in" look. This will ensure that your system truly is powerful enough to handle the users' basic needs.

If your client has an existing site, use that as a source of content and minimum requirements. Also look at a couple of sites run by organizations with comparable missions and sizes. For example, if you're building something for an academic group you might look at Harvard University's Department of Molecular and Cellular Biology's Web site at http://www.mcb.harvard.edu/. This site illustrates the basic requirements for a medium-sized organization's Web site. An "overview" section describes the department's purpose and history. A "news" section offers press releases. A "faculty" section explains who works there and what their specialties are. There are also sections for prospective undergraduates and graduate students, i.e., the potential customers for this organization. If you're building something for a small non-profit organization, look at the Web sites for Sustainable Harvest (www.sustainableharvest.org) and the Southern Animal Rescue Association (www.sarasanctuary.org). If you're working for a small manufacturing company, look at www.cirrusdesign.com, the Web site for Cirrus Design Corporation, a Duluth, Minnesota maker of small airplanes.

What if you can't reach your client in time to complete the assignment? Or if you can't get content from your client? Use content from their existing site or a site operated by a similar organization. Make sure that at a minimum there is a lived-in look for a reader who comes to see the "About", "News", and "Contact Us" sections. During the remainder of the course you'll have an opportunity to replace the placeholder content with content from your client.

Note that before embarking on this you may want to read at least the "Separating the Designers and the Programmers" section on templates in the "Software Modularity" chapter.

Exercise 7: Client Signoff

Ask your client to register as a user and try out the "lived-in" site. Most people have a difficult time designing on a blank sheet of paper. You'll get new and different insights from your client by showing them a partially finished site than you did at the beginning of the project.

Record your client's answers to the following questions:

What changes would you like to see in the plan, now that you've tried out the prototype?
What will be the fastest way to fill this site with real content?
Are we collecting the right amount of information on initial user registration?

Presenting Your Work

If you're enrolled in a course using this textbook, you'll probably be asked at this point to give a four-minute presentation of your work on the content management system and skeletal implementation of the site.

Four minutes isn't very long so you'll need to rehearse and you'll want to make sure that all team members know what they're supposed to do. As a general rule, the person speaking should be addressing the audience, not typing at a computer. Team Member A talks; Team Member B drives. Perhaps at some point in the presentation they switch, but nobody is ever talking and driving a computer at the same time.

Open with an "elevator pitch", i.e., the kind of thirty-second explanation that you'd give to someone you met during an elevator ride. The pitch should explain what problem you're solving and why your system will be better than existing mechanisms available to people.

Create one or more users ahead of time so that you don't have to show your user registration pages. Everyone who has used the Internet has registered at sites. They'll assume that you copied the best practices from amazon.com and other popular sites. If you did, the audience will be bored. If you didn't, the audience will be appalled by your sloppiness. Either way it is best to log in as already-registered users. In fact, sometimes you can arrange to prepare two browsers, e.g., Mozilla and MSIE, one of which is logged in as a new user of the service and one of which is logged in as a site administrator or some other role that you want to demonstrate.

It is best not to refer to "users" during your talk. Instead talk about the roles by name. If, for example, you are building a service around flying, you could say "A student pilot logs in [your teammate logs in], finds an article on flight schools in San Francisco [your teammate navigates to this article], and posts a comment at the bottom about how much he likes his particular instructor." Then perhaps swap positions and your teammate comes up to say "The site editor [you switch browsers to the one logged in as a site admin], clicks on the new content page [you click], sees that there are some new comments pending approval, reads this one from a student pilot, and approves it [you click]." You return the browser to the public page where the comment may now be seen in the live site.

Close by parking the browser at a page that reveals as much of the site's overall structure as possible. Don't despair if you weren't able to show every feature of what you've built. Computer applications are all about the tasks that can be accomplished. If you've made the audience believe that it will be easy to complete a few clearly important tasks, you will have instilled confidence in them.

Exercise 8 (For the Instructor)

Call up each team's clients and ask how strongly they agree with the following statements:

I believe that my student team understands my problem.
I understand what my student team is planning to accomplish and by what dates, right through the end of the course.
My student team has been well-prepared for our meetings.
My student team is responsive.
I believe that the content management system my student team has built will be adequate to support the types of documents on my site and the workflow required for publishing those documents.
I think it is easy for users to register at my site, to recover a lost password, and that users are being asked all the required personal information.
I like the user administration pages that my student team has built.
My student team has made it easy for me to check on their progress myself.
My student team has kept me well informed of their progress.
I am impressed by the clarity and thoroughness of the documentation prepared so far.

Score this exercise by adding scores from each question: 0 for "disagree" or wishy-washy agreement (clients won't want to say bad things about young volunteers), 1 for "agree", 2 for "strongly agree".

Time and Motion

The data modeling, workflow, and version control exercises are intended to be done by the entire team working together. They should take about three hours. Many projects will need to do little more than adapt data models and policies from this chapter and put them in their own server's /doc directory.

The skeletal implementation may be challenging depending on how ambitious the goals of the content management system are, but perhaps 10 to 20 programmer-hours of work.

Criticizing other teams' work should take about 15 minutes per project criticized or about two hours total in a class with 8 to 10 projects. This could be done as a group or divided and conquered.

Achieving a lived-in look by pouring in real content shouldn't take more than two hours and ought to be divisible among team members.

Talking to the client will probably take about one hour.

Bad	Date:
Better	Date (MM/DD/YYYY):
Best	Date: