Are users visiting /foobar/, our newest and most important section? (answer leads to action: maybe add more pointers to the new section from other areas of the site)
Below is a snippet from the file http://philip.greenspun.com/seia/examples-user-activity-analysis/2003-03-06.log.gz, which records one day of activity on this server (philip.greenspun.com). Notice that the name of the log file, "2003-03-06", is arranged so that chronological order coincides with lexicographic sort order; when viewing files in a directory listing, you'll therefore see a continuous progression from oldest to newest. The file itself is in the "Common Logfile Format", a standard developed in 1995.
193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] "GET /dogs/george HTTP/1.1" 200 0 "http://www.photo.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] "GET /dogs/sky-and-philip.jpg HTTP/1.1" 200 9596 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] "GET /dogs/george-28.jpg HTTP/1.1" 200 10154 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] "GET /dogs/nika-36.jpg HTTP/1.1" 200 8627 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] "GET /dogs/george-nika-provoke.jpg HTTP/1.1" 200 11949 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
152.31.2.221 - - [06/Mar/2003:09:11:59 -0500] "GET /comments/attachment/36106/bmwz81.jpg HTTP/1.1" 200 38751 "http://philip.greenspun.com/materialism/cars/nsx.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
193.2.79.250 - - [06/Mar/2003:09:12:00 -0500] "GET /dogs/george-nika-grapple.jpg HTTP/1.1" 200 7887 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:12:00 -0500] "GET /dogs/george-nika-bite.jpg HTTP/1.1" 200 10977 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:12:00 -0500] "GET /dogs/george-29.jpg HTTP/1.1" 200 10763 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
193.2.79.250 - - [06/Mar/2003:09:12:00 -0500] "GET /dogs/philip-and-george-sm.jpg HTTP/1.1" 200 9574 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
152.31.2.221 - - [06/Mar/2003:09:12:00 -0500] "GET /comments/attachment/44949/FriendsProjectCar.jpg HTTP/1.1" 200 36340 "http://philip.greenspun.com/materialism/cars/nsx.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
193.2.79.250 - - [06/Mar/2003:09:12:00 -0500] "GET /comments/attachment/35069/muffin.jpg HTTP/1.1" 200 15017 "http://philip.greenspun.com/dogs/george" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
152.31.2.221 - - [06/Mar/2003:09:12:01 -0500] "GET /comments/attachment/77819/z06.jpg HTTP/1.1" 200 46996 "http://philip.greenspun.com/materialism/cars/nsx.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
151.199.192.112 - - [06/Mar/2003:09:12:01 -0500] "GET /comments/attachment/137758/GT%20NSX%202.jpg HTTP/1.1" 200 12656 "http://philip.greenspun.com/materialism/cars/nsx" "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"
152.31.2.221 - - [06/Mar/2003:09:12:02 -0500] "GET /comments/attachment/171519/photo_002.jpg HTTP/1.1" 200 45618 "http://philip.greenspun.com/materialism/cars/nsx.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
151.199.192.112 - - [06/Mar/2003:09:12:27 -0500] "GET /comments/attachment/143336/Veil%20Side%20Skyline%20GTR2.jpg HTTP/1.1" 200 40372 "http://philip.greenspun.com/materialism/cars/nsx" "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"
147.102.16.28 - - [06/Mar/2003:09:12:29 -0500] "GET /photo/pcd1253/canal-street-43.1.jpg HTTP/1.1" 302 336 "http://philip.greenspun.com/wtr/application-servers.html" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
147.102.16.28 - - [06/Mar/2003:09:12:29 -0500] "GET /photo/pcd2388/john-harvard-statue-7.1.jpg HTTP/1.1" 302 342 "http://philip.greenspun.com/wtr/application-servers.html" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
147.102.16.28 - - [06/Mar/2003:09:12:31 -0500] "GET /wtr/application-servers.html HTTP/1.1" 200 0 "http://www.google.com/search?q=application+servers&ie=ISO-8859-7&hl=el&lr=" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
The first line can be decoded as follows:
A user on a computer at the IP address 193.2.79.250, who is not telling us his login name on that computer nor supplying an HTTP authentication login name to the Web server (- -), on March 6, 2003 at 9 hours, 11 minutes, 59 seconds past midnight in a timezone 5 hours behind Greenwich Mean Time (06/Mar/2003:09:11:59 -0500), requested the file /dogs/george using the GET method of the HTTP/1.1 protocol. The file was found by the server and returned normally (status code of 200), but it was returned by an ill-behaved script that did not give the server information about how many bytes were written, hence the 0 after the status code. This user followed a link to this URL from http://www.photo.net/ (the referer header) and is using a browser that first falsely identifies itself as Netscape 4.0 (Mozilla 4.0), but then explains that it is actually merely compatible with Netscape and is really Microsoft Internet Explorer 5.0 on Windows NT (MSIE 5.0; Windows NT). On a lightly used service we might have configured the server to use nslookup and log the hostname of stargate.fs.uni-lj.si rather than the IP address, in which case we'd have been able to glance at the log and see that it was someone at a university in Slovenia.
That's a lot of information in one line, but consider what is missing. If this user previously logged in and presented a user_id cookie, we can't tell and we don't have that user ID. On an e-commerce site we might be able to infer that the user purchased something by the presence of a line showing a successful request for a "complete-purchase" URL. However, we won't see the dollar amount of that purchase, and surely a $1000 purchase is much more interesting than a $10 purchase.
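As an illustration (not from the original text), here is a minimal Python sketch that pulls these fields out of one such line. The regular expression assumes the extended format shown above, with the referer and user-agent quoted at the end:

    import re

    # Common Logfile Format plus quoted referer and user-agent fields.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
        r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    line = ('193.2.79.250 - - [06/Mar/2003:09:11:59 -0500] '
            '"GET /dogs/george HTTP/1.1" 200 0 '
            '"http://www.photo.net/" '
            '"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"')

    m = LOG_PATTERN.match(line)
    if m:
        print(m.group('host'), m.group('status'), m.group('referer'))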
When that isn't sufficient, you can start adding columns to database tables. In a sense you've already started this process. You probably have a registration_date column in your users table, for example. This information could be derived from the access logs, but if you need it to show a "member since 2001" annotation as part of a user profile, it makes more sense to keep it in the RDBMS. If you want to offer members a page of "new items since your last visit", you'll probably add last_login and second_to_last_login columns to the users table. Note that you need second_to_last_login because as soon as User #345 returns to the site your software will update last_login. When he or she clicks the "new since last visit" page, it might be only thirty seconds since the timestamp in the last_login column. What User #345 will more likely expect is new content since the preceding Monday, his or her previous session with the service.
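A sketch of the bookkeeping at login time, using Python's DB-API (the table and column names are the ones discussed above; everything else, including the %s paramstyle, is an assumption about your driver):

    def record_login(cursor, user_id):
        # Shift the previous login time down before overwriting it, so
        # the "new since last visit" page can compare against the visit
        # before this one rather than against thirty seconds ago.
        cursor.execute(
            """UPDATE users
                  SET second_to_last_login = last_login,
                      last_login = CURRENT_TIMESTAMP
                WHERE user_id = %s""",
            (user_id,),
        )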
Suppose the marketing department starts running ad campaigns on ten
different sites with the goal of attracting new members. They'll want
a report of how many people registered who came from each of those ten
foreign sites. Each ad would be a hyperlink to an encoded URL on your
server. This would set a session cookie saying "source=nytimes" ("I
came from an ad on the New York Times Web site"). If that person
eventually registered as a member, the token "nytimes" would be written
into a source column in the users table.
After a month you'll be asked to write an admin page that queries the database and displays a histogram of registrations by day, by month, by source, etc.
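A sketch of the sort of query behind such a page, assuming the registration_date and source columns described above (trunc() is Oracle-style; PostgreSQL would use date_trunc('day', ...)):

    # Hypothetical admin-page query: registrations per day per source.
    REGISTRATIONS_BY_DAY_AND_SOURCE = """
    SELECT trunc(registration_date) AS reg_day,
           source,
           count(*) AS n_registrations
      FROM users
     GROUP BY trunc(registration_date), source
     ORDER BY reg_day, source
    """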
The road of adding columns to transaction-processing tables and building ad hoc SQL queries to answer questions is a long and tortuous one. The traditional way back to a manageable information system with users getting the answers they need is the dimensional data warehouse, discussed at some length in the data warehousing chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/data-warehousing. A data warehouse is a heavily denormalized copy of the information in the transaction-processing tables, arranged so as to facilitate queries rather than updates.
The exercises in this chapter will walk you through these three alternatives, each of which has its place.
An acceptable solution to this exercise will involve linking the most recent report from the site administration pages so that the publisher can view it. A better solution will involve placing a "prepare current report" link in the admin pages that will invoke the log analyzer on demand and display the report. An exhaustive (exhausting?) solution will consist of a scheduled process ("cron job" in Unix parlance, "at command" or "scheduled task" on Windows) that runs the log analyzer every day, updating cumulative reports and preparing a new daily report, all of which are accessible from the site admin pages.
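For the scheduled-process route on Unix, the crontab entry might look something like the following; the path and script name are assumptions, not part of the exercise:

    # run at 00:05 every day: analyze yesterday's access log and
    # regenerate the cumulative and daily reports
    5 0 * * * /web/main-server/bin/run-log-analyzer.sh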
Make sure that your report clearly shows "404 Not Found" requests (any standard log analyzer can be configured to display these) and that the referer header is displayed so that you can figure out where the bad link is likely to be.
In the Unix world the most effective solution to this challenge is chroot, short for "change root". This command changes the file system root of the Web server, and of any program started by the Web server, to some other place in the file system, e.g., /web/main-server/. A program in the directory /usr/local/bin/ can't be executed by the chrooted Web server because the Web server can't even describe a file unless its path begins with /web/main-server/. The root directory, /, is now /web/main-server/. One downside of this approach is that if the Web server needs to run a program in the directory /usr/local/bin/ it can't. The solution is to take all of the utilities, server log analyzers, and other required programs and move them underneath /web/main-server/, e.g., to /web/main-server/bin/.
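The underlying system call is available from most languages; here is a minimal sketch in Python (the process must be running as root, and a real server would drop its privileges immediately afterward):

    import os

    os.chroot('/web/main-server/')  # the jail; everything outside disappears
    os.chdir('/')                   # '/' now means /web/main-server/
    # From this point the process cannot even name /usr/local/bin/; any
    # helper programs must have been copied to /web/main-server/bin/.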
Sadly, there does not seem to be a Windows equivalent to chroot, though there are other ways to lock down a Web server in Windows so that its process can't execute programs.
First make sure that your server is configured to log as much information as possible about each error. At the very least you need the server to log the URL where the error occurred and the error message from the procedure that raised the error. Better Web development environments will also log a stack backtrace.
Second, provide a hyperlink from the site-wide administration pages to a page that shows the most recent 500 lines of the error log, with an option to go back a further 500 lines, etc.
Third, write a procedure that runs periodically, either as a separate process or as part of the HTTP server program itself, and scans the error log for new entries since the preceding run of the procedure. If any of those new entries are actual errors, the procedure emails them to the programmers maintaining the site. You might want to start with an interval of one hour.
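Here is one way the periodic scan might look, as a hedged Python sketch. The log path, the offset file, and the notion that error lines contain the string "ERROR" are all assumptions about your environment:

    import os

    ERROR_LOG = '/web/main-server/log/error.log'           # assumed path
    OFFSET_FILE = '/web/main-server/log/error.log.offset'  # remembers last run

    def scan_for_new_errors():
        # Pick up where the previous run left off.
        offset = 0
        if os.path.exists(OFFSET_FILE):
            offset = int(open(OFFSET_FILE).read())
        with open(ERROR_LOG) as f:
            f.seek(offset)
            new_lines = f.readlines()
            with open(OFFSET_FILE, 'w') as state:
                state.write(str(f.tell()))
        # Site-specific filter; log rotation is deliberately ignored here.
        errors = [line for line in new_lines if 'ERROR' in line]
        if errors:
            notify_the_maintainers(''.join(errors))  # sketched below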
Fourth, write a notify_the_maintainers procedure that sends email. This might be worth including, for example, in a centralized facility that allows page scripts to connect to the relational database management system (RDBMS). If the RDBMS is unavailable, the sysadmins, dbadmins, and programmers ought to be notified immediately so that they can figure out what went wrong and bring the system back up.
Suppose that an RDBMS failure were combined with a naive implementation of notify_the_maintainers on a site that
gets 10 requests per second. Suppose further that all of the people
on the email notification list have gone out for lunch together for
one hour. Upon their return, they will find 60x60x10 = 36,000
identical email messages in their inbox.
To avoid this kind of debacle, it is probably best to have notify_the_maintainers record a last_notification_sent timestamp in the HTTP server's memory or on disk and use it to ignore or accumulate requests for notification that come in, say, within 15 minutes of a previous request. A reasonable assumption is that a programmer, once alerted, will visit the server and start looking at the full error logs. Thus notify_the_maintainers need not actually send out information about every problem encountered.
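A sketch of such a procedure, keeping the timestamp in server memory (the addresses and the 15-minute quiet period are assumptions; a multi-process server would need the timestamp on disk or in shared storage instead of a module-level variable):

    import smtplib
    import time
    from email.message import EmailMessage

    QUIET_PERIOD = 15 * 60       # seconds between notifications
    _last_notification_sent = 0  # kept in the HTTP server's memory

    def notify_the_maintainers(report):
        global _last_notification_sent
        now = time.time()
        # Swallow requests arriving within the quiet period; once alerted,
        # a programmer will be reading the full error log anyway.
        if now - _last_notification_sent < QUIET_PERIOD:
            return
        _last_notification_sent = now
        msg = EmailMessage()
        msg['Subject'] = 'Problem report from the Web server'
        msg['From'] = 'webserver@example.com'
        msg['To'] = 'maintainers@example.com'
        msg.set_content(report)
        smtplib.SMTP('localhost').send_message(msg)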
Write down the questions that your client says are the most important.
The biggest design decision that you'll face during this exercise is the granularity of the fact table. If you're interested in how users get from page to page within a site, the granularity of the fact table must be "one request". On a site such as the national "don't call me" registry, www.donotcall.gov, launched in 2003, one would expect a person to visit only once. Therefore the user activity data warehouse might store just one row per registered user, summarizing their appearance at the site and completion of registration, a fact table granularity of "one user". For many services, an intermediate granularity of "one session" will be appropriate.
With a "one session" granularity and appropriate dimensions it is possible to ask questions such as "What percentage of the sessions were initiated in response to an ad at Google.com?" (source field added to the fact table) "Compare the likelihood that a purchase was made by users on their fourth versus fifth sessions with the service?" (nth-session field added to the fact table) "Compare the value of purchases made in sessions by foreign versus domestic customers" (purchase amount field added to the fact table plus a customer dimension).