A more modern example of how these core databases might become split up would be in the world of Web services. Microsoft Hailstorm, for example, offers to provide user database services to the rest of the Internet. A university might set up complementary communities, one for high school students and one for colleagues at other schools, both anchored by the same database of genomics content. The genomics content database might be running on a physically separate computer from the online communities and advertise its services via WSDL and provide those services via SOAP.
To contribute to an accountable and identified environment, the user database should be able to store a personal URL for each user. If this is a Yahoo! GeoCities page, it won't contribute much to accountability and identification. On the other hand, if the URL starts with "http://research.hp.com/personal/", it will give other users some confidence. Since one of the sad features of the Web as architected in 1990 is that URLs rot, a user database needs an extra field to keep track of what has happened when a robot tries to visit the recorded URL. If a URL has not been reachable on several separate occasions over a one-week period, it is probably safe for a computer program to assume that the URL is out of date and stop displaying it publicly.
The user database should record privacy and contact preferences. Is Jane User willing to let you show her email address to the public? To other registered users? Is Joe User willing to let you spam him with news of the site?
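Here's a sketch of how these ideas might be represented in the data model; the table and column names below are invented for illustration and aren't taken from any particular toolkit:

create table users (
    user_id                  integer primary key,
    first_names              varchar(100),
    last_name                varchar(100) not null,
    email                    varchar(100) not null unique,
    -- accountability: a personal page elsewhere on the Internet
    url                      varchar(200),
    -- maintained by a robot that periodically tries to fetch url
    url_last_verified        date,
    url_failure_count        integer default 0,
    -- privacy and contact preferences
    show_email_to_public_p   char(1) default 'f' check (show_email_to_public_p in ('t','f')),
    show_email_to_members_p  char(1) default 't' check (show_email_to_members_p in ('t','f')),
    wants_email_news_p       char(1) default 'f' check (wants_email_news_p in ('t','f'))
);

The URL-checking robot would increment url_failure_count after each failed visit and reset it after a successful one; once several failures have accumulated over a one-week period, the application stops displaying the URL.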
Is a pattern emerging here? We distinguish a question in the discussion forum table because it is an item of content that is not a response to any other discussion forum posting. We distinguish articles from comments because an article is an item of content that is not a response to any other content item. Perhaps the representation of articles, comments on articles, questions, answers, etc. should be unified to the maximum extent possible. Each is a content item. Each has one or more authors. Each may optionally be a response to another content item.
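As a sketch of that unification (again with invented names), everything could boil down to a single content table plus a mapping table for authors:

create table content (
    content_id     integer primary key,
    -- null for a top-level item such as an article or a question;
    -- otherwise the item to which this one responds
    parent_id      integer references content,
    content_type   varchar(30),   -- 'article', 'comment', 'question', 'answer', ...
    title          varchar(200),
    body           clob,
    creation_date  date default sysdate not null
);

-- a content item may have one or more authors
create table content_authors (
    content_id  integer not null references content,
    user_id     integer not null references users,
    primary key (content_id, user_id)
);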
Here are some services that would be nice to centralize in a single content repository within the content database:
Why record the fact that a particular user has read, or at least downloaded, a particular document? Consider an online learning community of professors and students at a university. It is necessary to record readership if one wishes to write a robot that sends out messages like the following:
To: Sam Student
From: Community Nag Robot
Date: Friday, 4:30 pm
Subject: Your Lazy Bones
Sam,
I notice that you have four assignments due on Monday and that you
have not even looked at two of them. I hope that you aren't planning
to go to a fraternity party tonight instead of studying.
Very truly yours,
Some SQL Code
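Behind the scenes, the robot's query might look something like the sketch below, which assumes a hypothetical content_readership table (one row per user per document downloaded) and a hypothetical assignments table, building on the users and content sketches above:

-- one row per user per document downloaded
create table content_readership (
    content_id  integer not null references content,
    user_id     integer not null references users,
    first_read  date default sysdate not null,
    primary key (content_id, user_id)
);

-- assignments due within the next three days that user 4617 has not even downloaded
select a.content_id, c.title
  from assignments a, content c
 where a.content_id = c.content_id
   and a.due_date <= sysdate + 3
   and not exists (select 1
                     from content_readership r
                    where r.content_id = a.content_id
                      and r.user_id = 4617);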
Once an online learning community is recording the act of readership, it is natural to consider recording whether or not the act of reading proved worthwhile. In general, collaborative filtering is the last refuge of those too cowardly to edit. However, recording "User #7241 really liked Article #2451" opens up interesting possibilities for personalization.
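Recording those judgments takes only one more table; a minimal sketch, again with invented names:

create table content_ratings (
    content_id   integer not null references content,
    user_id      integer not null references users,
    liked_p      char(1) not null check (liked_p in ('t','f')),
    rating_date  date default sysdate not null,
    primary key (content_id, user_id)
);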
Consider a corporate knowledge-management system. At the beginning the database is empty and there are only a few users. Scanning the titles of all contributed content would take only a few minutes. After five years, however, the database contains 100,000 documents and the 10,000 active users are contributing several hundred new documents every day (keep in mind that a question or answer in a discussion forum is a "document" for the purpose of this discussion). If Jane User wants to see what her coworkers have been up to in the last 24 hours, it might take her 30 minutes to scan the titles of the new content. Jane User may well abandon an online learning community that, when smaller, was very useful to her.
Suppose now that the database contains 100 entries of the form "Jane liked this article" and 100 entries of the form "Jane did not like this article". Before Jane has arrived at work, a batch job can compare every new article in the system to the 100 articles that Jane liked and the 100 articles that Jane said she did not like. This comparison can be done using most standard full-text search software, which will take two documents and score them for similarity based on words used. Each new document is given a score of the form
avg(similarity(:new_doc, all_docs_marked_as_liked_by_user(:user_id)))
-
avg(similarity(:new_doc, all_docs_marked_as_disliked_by_user(:user_id)))
The new documents are then presented to Jane ranked by descending
score. If you're an Intel stockholder you'll be pleased to consider
the computational implications of this personalization scheme. Every
new document must be compared to every document previously marked by a
user. Perhaps that is 200 comparisons. If there are 10,000 users,
this scoring operation must be repeated 10,000 times. So that is
2,000,000 comparisons per day per new document in the system.
Full-text comparisons generally are quite slow as they rely on looking up
each word in a document to find its occurrence frequency in
standard written English. A comparison of two documents can take
1/10th of a second of CPU time. We're thus looking at about 200,000 seconds of
CPU time per new document added to the system, plus the insertion of
10,000 rows in the database, each row containing the personalization
score of that document for a particular user. There are 86,400
seconds in a day. Thus we're talking roughly about enough work to
keep a 3-CPU multiprocessor machine busy for an entire day. What if
500 documents are uploaded every day? We'll need 1500 CPUs to compute
personalization scores.
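The 10,000 rows per new document mentioned above would land in something like the following table (names invented), filled by the nightly batch job and read by the page scripts that build Jane's personalized view:

create table personalization_scores (
    user_id     integer not null references users,
    content_id  integer not null references content,
    score       number not null,
    primary key (user_id, content_id)
);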
Grouping of users is the most fundamental operation within the user database. In a collaborative medical records system, you need to be able to say "All of these users work at the same hospital and can have access to records for patients at that hospital." In a corporate knowledge-sharing system, you need to be able to say "All of these users work in the same department and therefore should have access to private departmental documents and a private discussion forum devoted to departmental issues, and should receive email notifications of departmental events."
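A minimal sketch of the grouping machinery, once more with invented names:

create table user_groups (
    group_id    integer primary key,
    group_name  varchar(100) not null,
    group_type  varchar(30)   -- e.g., 'hospital', 'department', 'class'
);

create table user_group_map (
    group_id    integer not null references user_groups,
    user_id     integer not null references users,
    group_role  varchar(30),  -- e.g., 'member', 'administrator'
    primary key (group_id, user_id)
);

Questions such as "may this user see records for patients at this hospital?" then reduce to asking whether an appropriate row exists in user_group_map.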
Let's move on from the core data model to some tips for the software that you're soon to be building on top of the database...
Here's a C# example stolen from one of our students:
string cmd = "Insert into quotations(author_name, category, quote)
values ('" + txtAuthor.Text.Replace("'", "''") + "',
'" + ctg.Replace("'", "''") + "',
'" + txtQuotation.Text.Replace("'", "''") + "')";
UpdateDB(cmd); // ship that big string to SQL Server
There are several minor things wrong with this approach, which mixes SQL and string literals obtained from the user. Things get far worse, however, when the programmer skips the quoting and escaping altogether:
string EventQuery = "select *
from events
where event_id = " + EventIDfromBrowser;
Expecting a numeric event ID and knowing that numbers do not need to be wrapped in quotes like a string literal, the programmer does no processing on EventIDfromBrowser, a variable read from the open Internet.
Suppose that an evil-minded person submits a form with EventIDfromBrowser set to "42; select * from user_passwords". The semicolon near the beginning of this string could potentially terminate the first SELECT and the unauthorized "select * from user_passwords" query might then be executed. If the unauthorized query is well-crafted, the information resulting from it might be presented in a browser window. Another scary construct would be "42; delete from customers".
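To make the mechanics concrete, here is roughly what the RDBMS would be handed in each case; whether the extra statement actually executes depends on whether the database API permits more than one statement per call, but it is not a risk worth taking:

-- what the programmer expected to send
select * from events where event_id = 42

-- what gets sent when EventIDfromBrowser is "42; delete from customers"
select * from events where event_id = 42; delete from customers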
You can solve all of these problems by separating SQL code and variable data. Here's a pseudo-code example of how it has been done using standard libraries going back to the late 1970s:
// associate the name "event_query" with a string of SQL
PrepareStatement("event_query","select * from events where event_id = :event_id");
// associate the bind variable :event_id with the particular value for this page
BindVar("event_query",":event_id",3722);
// ask the RDBMS to execute the completed query
ExecuteStatement("event_query");
... fetch results ...
Note that the structure of the SQL seen by the RDBMS is fixed as "select * from events where event_id = :event_id", regardless of what input is received in the form. Only the value of :event_id changes. This is an example of using bind variables, which is standard practice in most software that talks to an RDBMS. Here's a complete C# example that uses ADO.NET's named parameters:
using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;

namespace ExecuteScalar
{
    /// <summary>
    /// An example of how to use named parameters in ADO.NET.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            object objResult = null;
            string strResult = null;
            string strEmployeeID = "PMA42628M";

            //Initialize the database connection, command and parameter objects.
            SqlConnection conn = new SqlConnection(
                ConfigurationSettings.AppSettings["connStr"]);
            SqlCommand cmd = new SqlCommand(
                "select fname from employee where emp_id = @emp_id");
            SqlParameter param = new SqlParameter("@emp_id", strEmployeeID);

            //Associate the connection with the command.
            cmd.Connection = conn;

            //Bind the parameter value to the command.
            cmd.Parameters.Add(param);

            //Connect to the database and run the command.
            try
            {
                conn.Open();
                objResult = cmd.ExecuteScalar();
            }
            catch (Exception e)
            {
                Console.WriteLine("Database error: {0}", e.ToString());
            }
            finally
            {
                //Clean up.
                if (!conn.State.Equals(ConnectionState.Closed))
                {
                    conn.Close();
                }
            }

            //Convert the query result to a string.
            if (objResult == null)
            {
                strResult = "[NULL]";
            }
            else
            {
                strResult = objResult.ToString();
            }

            Console.WriteLine("Employee #{0}'s first name is: '{1}'", strEmployeeID, strResult);
            Console.ReadLine();
        }
    }
}
Not too much to note here except that Microsoft seems to like @emp_id rather than Oracle's :emp_id, i.e., they use the at-sign rather than the colon to indicate that something is a bind variable. Here's the equivalent idea in Java, using a JDBC PreparedStatement:
PreparedStatement updateSales = con.prepareStatement(
"UPDATE COFFEES SET SALES = ? WHERE COF_NAME LIKE ? ");
updateSales.setInt(1, 75);
updateSales.setString(2, "Colombian");
updateSales.executeUpdate();
Source: The Java Tutorial at http://java.sun.com/docs/books/tutorial/jdbc/basics/prepared.html.
Note that JDBC, the Java database connectivity library, uses "?" as a bind variable. It is up to the programmer to count question marks in the SQL string and bind a value to each by its ordinal position. As you can imagine, this process becomes error-prone if the SQL statement contains fifteen or more variables, a very common situation in real applications. You can also imagine the possibilities for introducing subtle bugs if the SQL query is changed and the bind variable sequence numbers are not properly updated.
Supposedly this situation has been improved in JDBC 3.0, but this example of Java's inferiority to C interfaces from twenty years earlier should remind you to be skeptical of vendor claims for the advantages of new languages and development tools.
Here's a pseudo-code example of how an individual page script might open its own connection to the database:

dbconn = OpenDBConn("sysid=local,username=joestest,password=joerocks");

There are several problems with this approach to connecting to an RDBMS, starting with the fact that the database username and password end up hard-coded into every page script. Often a good way to get the database username and password out of page scripts is to use the Web server's database connection pooling system.
The work of figuring out bind variables, hiding the database password, and generalizing the view source code may be split among the team members. This ought to be doable within six to eight programmer-hours.