A more modern example of how these core databases might become split up would be in the world of Web services. Microsoft Hailstorm, for example, offers to provide user database services to the rest of the Internet. A university might set up complementary communities, one for high school students and one for colleagues at other schools, both anchored by the same database of genomics content. The genomics content database might be running on a physically separate computer from the online communities and advertise its services via WSDL and provide those services via SOAP.
To contribute to an accountable and identified environment, the user database should be able to store a personal URL for each user. If this is a Yahoo! GeoCities page, it won't contribute much to accountability and identification. On the other hand, if the URL starts with "http://research.hp.com/personal/", it will give other users some confidence. Since one of the sad features of the Web as architected in 1990 is that URLs rot, a user database needs an extra field to keep track of what has happened when a robot tries to visit the recorded URL. If a URL has not been reachable on several separate occasions over a one-week period, it is probably safe for a computer program to assume that the URL is out of date and stop displaying it publicly.
The user database should record privacy and contact preferences. Is Jane User willing to let you show her email address to the public? To other registered users? Is Joe User willing to let you spam him with news of the site?
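Here's a sketch of how these ideas might be represented in the data model; the table and column names below are invented for illustration and aren't taken from any particular toolkit:

create table users (
    user_id                  integer primary key,
    first_names              varchar(100),
    last_name                varchar(100) not null,
    email                    varchar(100) not null unique,
    -- accountability: a personal page elsewhere on the Internet
    url                      varchar(200),
    -- maintained by a robot that periodically tries to fetch url
    url_last_verified        date,
    url_failure_count        integer default 0,
    -- privacy and contact preferences
    show_email_to_public_p   char(1) default 'f' check (show_email_to_public_p in ('t','f')),
    show_email_to_members_p  char(1) default 't' check (show_email_to_members_p in ('t','f')),
    wants_email_news_p       char(1) default 'f' check (wants_email_news_p in ('t','f'))
);

The URL-checking robot would increment url_failure_count after each failed visit and reset it after a successful one; once several failures have accumulated over a one-week period, the application stops displaying the URL.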
Is a pattern emerging here? We distinguish a question in the discussion forum table because it is an item of content that is not a response to any other discussion forum posting. We distinguish articles from comments because an article is an item of content that is not a response to any other content item. Perhaps the representation of articles, comments on articles, questions, answers, etc. should be unified to the maximum extent possible. Each is a content item. Each has one or more authors. Each may optionally be a response to another content item.
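As a sketch of that unification (again with invented names), everything could boil down to a single content table plus a mapping table for authors:

create table content (
    content_id     integer primary key,
    -- null for a top-level item such as an article or a question;
    -- otherwise the item to which this one responds
    parent_id      integer references content,
    content_type   varchar(30),   -- 'article', 'comment', 'question', 'answer', ...
    title          varchar(200),
    body           clob,
    creation_date  date default sysdate not null
);

-- a content item may have one or more authors
create table content_authors (
    content_id  integer not null references content,
    user_id     integer not null references users,
    primary key (content_id, user_id)
);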
Here are some services that would be nice to centralize in a single content repository within the content database:
Why record the fact that a particular user has read, or at least downloaded, a particular document? Consider an online learning community of professors and students at a university. It is necessary to record readership if one wishes to write a robot that sends out messages like the following:
To: Sam Student
From: Community Nag Robot
Date: Friday, 4:30 pm
Subject: Your Lazy Bones
Sam,
I notice that you have four assignments due on Monday and that you
have not even looked at two of them. I hope that you aren't planning
to go to a fraternity party tonight instead of studying.
Very truly yours,
Some SQL Code
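Behind the scenes, the robot's query might look something like the sketch below, which assumes a hypothetical content_readership table (one row per user per document downloaded) and a hypothetical assignments table, building on the users and content sketches above:

-- one row per user per document downloaded
create table content_readership (
    content_id  integer not null references content,
    user_id     integer not null references users,
    first_read  date default sysdate not null,
    primary key (content_id, user_id)
);

-- assignments due within the next three days that user 4617 has not even downloaded
select a.content_id, c.title
  from assignments a, content c
 where a.content_id = c.content_id
   and a.due_date <= sysdate + 3
   and not exists (select 1
                     from content_readership r
                    where r.content_id = a.content_id
                      and r.user_id = 4617);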
Once an online learning community is recording the act of readership, it is natural to consider recording whether or not the act of reading proved worthwhile. In general, collaborative filtering is the last refuge of those too cowardly to edit. However, recording "User #7241 really liked Article #2451" opens up interesting possibilities for personalization.
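Recording those judgments takes only one more table; a minimal sketch, again with invented names:

create table content_ratings (
    content_id   integer not null references content,
    user_id      integer not null references users,
    liked_p      char(1) not null check (liked_p in ('t','f')),
    rating_date  date default sysdate not null,
    primary key (content_id, user_id)
);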
Consider a corporate knowledge-management system. At the beginning the database is empty and there are only a few users. Scanning the titles of all contributed content would take only a few minutes. After five years, however, the database contains 100,000 documents and the 10,000 active users are contributing several hundred new documents every day (keep in mind that a question or answer in a discussion forum is a "document" for the purpose of this discussion). If Jane User wants to see what her coworkers have been up to in the last 24 hours, it might take her 30 minutes to scan the titles of the new content. Jane User may well abandon an online learning community that, when smaller, was very useful to her.
Suppose now that the database contains 100 entries of the form "Jane liked this article" and 100 entries of the form "Jane did not like this article". Before Jane has arrived at work, a batch job can compare every new article in the system to the 100 articles that Jane liked and the 100 articles that Jane said she did not like. This comparison can be done using most standard full-text search software, which will take two documents and score them for similarity based on words used. Each new document is given a score of the form
avg(similarity(:new_doc, all_docs_marked_as_liked_by_user(:user_id)))
-
avg(similarity(:new_doc, all_docs_marked_as_disliked_by_user(:user_id)))
The new documents are then presented to Jane ranked by descending
score. If you're an Intel stockholder you'll be pleased to consider
the computational implications of this personalization scheme. Every
new document must be compared to every document previously marked by a
user. Perhaps that is 200 comparisons. If there are 10,000 users,
this scoring operation must be repeated 10,000 times. So that is
2,000,000 comparisons per day per new document in the system.
Full-text comparisons generally are quite slow as they rely on looking up
each word in a document to find its occurrence frequency in
standard written English. A comparison of two documents can take
1/10th of a second of CPU time. We're thus looking at about 200,000 seconds of
CPU time per new document added to the system, plus the insertion of
10,000 rows in the database, each row containing the personalization
score of that document for a particular user. There are 86,400
seconds in a day. Thus we're talking roughly about enough work to
keep a 3-CPU multiprocessor machine busy for an entire day. What if
500 documents are uploaded every day? We'll need 1500 CPUs to compute
personalization scores.
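The 10,000 rows per new document mentioned above would land in something like the following table (names invented), filled by the nightly batch job and read by the page scripts that build Jane's personalized view:

create table personalization_scores (
    user_id     integer not null references users,
    content_id  integer not null references content,
    score       number not null,
    primary key (user_id, content_id)
);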
Grouping of users is the most fundamental operation within the user database. In a collaborative medical records system, you need to be able to say "All of these users work at the same hospital and can have access to records for patients at that hospital." In a corporate knowledge-sharing system, you need to be able to say "All of these users work in the same department and therefore should have access to private departmental documents and a private discussion forum devoted to departmental issues, and should receive email notifications of departmental events."
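A minimal sketch of the grouping machinery, once more with invented names:

create table user_groups (
    group_id    integer primary key,
    group_name  varchar(100) not null,
    group_type  varchar(30)   -- e.g., 'hospital', 'department', 'class'
);

create table user_group_map (
    group_id    integer not null references user_groups,
    user_id     integer not null references users,
    group_role  varchar(30),  -- e.g., 'member', 'administrator'
    primary key (group_id, user_id)
);

Questions such as "may this user see records for patients at this hospital?" then reduce to asking whether an appropriate row exists in user_group_map.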
Let's move on from the core data model to some tips for the software that you're soon to be building on top of the database...
Here's a C# example stolen from one of our students:
string cmd = "Insert into quotations(author_name, category, quote)
values ('" + txtAuthor.Text.Replace("'", "''") + "',
'" + ctg.Replace("'", "''") + "',
'" + txtQuotation.Text.Replace("'", "''") + "')";
UpdateDB(cmd); // ship that big string to SQL Server
There are several minor things wrong with this approach, which mixes SQL and string literals obtained from the user. Things get far worse, however, when the programmer skips the quoting and escaping altogether:
string EventQuery = "select *
from events
where event_id = " + EventIDfromBrowser;
Expecting a numeric event ID and knowing that numbers do not need to be wrapped in quotes like a string literal, the programmer does no processing on EventIDfromBrowser, a variable read from the open Internet.
Suppose that an evil-minded person submits a form with EventIDfromBrowser set to "42; select * from user_passwords". The semicolon near the beginning of this string could potentially terminate the first SELECT and the unauthorized "select * from user_passwords" query might then be executed. If the unauthorized query is well-crafted, the information resulting from it might be presented in a browser window. Another scary construct would be "42; delete from customers".
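To make the mechanics concrete, here is roughly what the RDBMS would be handed in each case; whether the extra statement actually executes depends on whether the database API permits more than one statement per call, but it is not a risk worth taking:

-- what the programmer expected to send
select * from events where event_id = 42

-- what gets sent when EventIDfromBrowser is "42; delete from customers"
select * from events where event_id = 42; delete from customers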
You can solve all of these problems by separating SQL code and variable data. Here's a pseudo-code example of how it has been done using standard libraries going back to the late 1970s:
// associate the name "event_query" with a string of SQL
PrepareStatement("event_query","select * from events where event_id = :event_id");
// associate the bind variable :event_id with the particular value for this page
BindVar("event_query",":event_id",3722);
// ask the RDBMS to execute the completed query
ExecuteStatement("event_query");
... fetch results ...
Note that the structure of the SQL seen by the RDBMS is fixed as "select * from events where event_id = :event_id", regardless of what input is received in the form. Only the value of :event_id changes. This is an example of using bind variables, which is standard practice in most software that talks to an RDBMS. Here's a complete C# example that uses ADO.NET's named parameters:
using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;

namespace ExecuteScalar
{
    /// <summary>
    /// An example of how to use named parameters in ADO.NET.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            object objResult = null;
            string strResult = null;
            string strEmployeeID = "PMA42628M";

            //Initialize the database connection, command and parameter objects.
            SqlConnection conn = new SqlConnection(
                ConfigurationSettings.AppSettings["connStr"]);
            SqlCommand cmd = new SqlCommand(
                "select fname from employee where emp_id = @emp_id");
            SqlParameter param = new SqlParameter("@emp_id", strEmployeeID);

            //Associate the connection with the command.
            cmd.Connection = conn;

            //Bind the parameter value to the command.
            cmd.Parameters.Add(param);

            //Connect to the database and run the command.
            try
            {
                conn.Open();
                objResult = cmd.ExecuteScalar();
            }
            catch (Exception e)
            {
                Console.WriteLine("Database error: {0}", e.ToString());
            }
            finally
            {
                //Clean up.
                if (!conn.State.Equals(ConnectionState.Closed))
                {
                    conn.Close();
                }
            }

            //Convert the query result to a string.
            if (objResult == null)
            {
                strResult = "[NULL]";
            }
            else
            {
                strResult = objResult.ToString();
            }

            Console.WriteLine("Employee #{0}'s first name is: '{1}'", strEmployeeID, strResult);
            Console.ReadLine();
        }
    }
}
Not too much to note here except that Microsoft seems to like @emp_id rather than Oracle's :emp_id, i.e., they use the at-sign rather than the colon to indicate that something is a bind variable. Here's the equivalent idea in Java, using a JDBC PreparedStatement:
PreparedStatement updateSales = con.prepareStatement(
"UPDATE COFFEES SET SALES = ? WHERE COF_NAME LIKE ? ");
updateSales.setInt(1, 75);
updateSales.setString(2, "Colombian");
updateSales.executeUpdate();
Source: The Java Tutorial at http://java.sun.com/docs/books/tutorial/jdbc/basics/prepared.html.
Note that JDBC, the Java database connectivity library, uses "?" as a bind variable. It is up to the programmer to count question marks in the SQL string and bind a value to each by its ordinal position. As you can imagine, this process becomes error-prone if the SQL statement contains fifteen or more variables, a very common situation in real applications. You can also imagine the possibilities for introducing subtle bugs if the SQL query is changed and the bind variable sequence numbers are not properly updated.
Supposedly this situation has been improved in JDBC 3.0, but this example of Java's inferiority to C interfaces from twenty years earlier should remind you to be skeptical of vendor claims for the advantages of new languages and development tools.
Here's a pseudo-code example of how an individual page script might open its own connection to the database:

dbconn = OpenDBConn("sysid=local,username=joestest,password=joerocks");

There are several problems with this approach to connecting to an RDBMS, starting with the fact that the database username and password end up hard-coded into every page script. Often a good way to get the database username and password out of page scripts is to use the Web server's database connection pooling system.
The work of figuring out bind variables, hiding the database password, and generalizing the view source code may be split among the team members. This ought to be doable within six to eight programmer-hours.