Logo Data Structures and Algorithms with Object-Oriented Design Patterns in C++
next up previous contents index

Character String Keys

Strings of characters are not a built-in C++ type. However, the standard C++ library provides the class called string  which provides support for character string variables. A character string is simply a null-terminated sequence  of characters. A null-terminated sequence of characters is comprised of zero or more non-null characters followed by the null character. Since such a sequence may be arbitrarily long, we have the problem again of mapping an unbounded domain into the finite range of unsigned int.

We can view a character string, s, as a sequence of n characters,

displaymath62650

where n is the length of the string not counting the null terminator. One very simple way to hash such a string would be to simply sum the numeric values associated with each character:

  equation11427

As it turns out, this is not a particularly good way to hash character strings. Given that a char is universally one byte, tex2html_wrap_inline62658 where tex2html_wrap_inline62660, for all tex2html_wrap_inline62662. As a result, tex2html_wrap_inline62664. For example, given a string of length n=5, the value of f(s) falls between zero and 1275. In fact, the situation is even worse, the ASCII  character set only uses seven bits. If the string is comprised of only ASCII characters, the result falls in the range between zero and 640.

Essentially the problem with a function f which produces a result in a relatively small interval is the situation which arises when that function is composed with the function tex2html_wrap_inline62672. If the size of the range of the function f is less than M, then tex2html_wrap_inline62574 does not spread its values uniformly on the interval [0,M-1]. For example, if M=1031 only the first 640 values (62% of the range) are used!

Alternatively, suppose we have a priori knowledge that character strings are limited to length n=4. Then, we can construct an integer by concatenating the binary representations of each of the characters. I.e., given tex2html_wrap_inline62686, we can construct an integer with the function

  equation11436

Since tex2html_wrap_inline62660 is a power of two, this function is easy to write in C++:

HashValue Hash (string const& s)
    { return s[0] << 24 | s[1] << 16 | s[2] << 8 | s [3]; }
While this function certainly has a larger range, it still has two problems. First, it does not take into account the fact that the characters are usually ASCII values, and therefore do not require all eight bits. However, the more significant deficiency of this function is that it cannot deal strings of arbitrary length.

Equation gif can be generalized to deal with strings of arbitrary length as follows:

equation11440

This function produces a unique integer for every possible string. Unfortunately, the range of f(s) is unbounded. A simple modification of this algorithm suffices to bound the range:

  equation11445

where tex2html_wrap_inline62294 such that w is word size of the machine. Unfortunately, since W and B are both powers of two, the value computed by this hash function depends only on the last W/B characters in the character string. E.g., for tex2html_wrap_inline62396 and tex2html_wrap_inline62660, this result depends only on the last four characters in the string--all character strings having exactly the same last four characters collide!

We can improve matters somewhat by exploiting some of the characteristics of the ASCII code. For example, if we assume that the character string only contains 7-bit ASCII characters, then we may use tex2html_wrap_inline62706 in Equation gif. However, this results in only a very slight improvement--for tex2html_wrap_inline62396, the result now depends on the last four characters plus four bits of the fifth-last character.

Writing the code to compute Equation gif is actually quite straightforward if we realize that f(s) can be viewed as a polynomial in B, the coefficients of which are tex2html_wrap_inline62714, tex2html_wrap_inline61127, ..., tex2html_wrap_inline61133. Therefore, we can use Horner's rule  (see Section gif) to compute f(s) as follows:

HashValue result = 0;
for (unsigned int i = 0; s [i] != 0; ++i)
    result = result * B + s [i];
This implementation can be simplified even further if we make use of the fact that tex2html_wrap_inline62722, where b=7. Since B is a power of two, in order to multiply the variable result by B all we need to do is to shift it left by b bits. Furthermore, having just shifted result left by b bits, we know that the least significant b bits of the result are zero. And since s[i] is a character which has no more than b=7 bits, we can replace the addition operation with an exclusive or  operation.
HashValue result = 0;
for (unsigned int i = 0; s [i] != 0; ++i)
    result = result << b ^ s [i];

Of the 128 characters in the 7-bit ASCII character set, only 97 characters are printing characters including the space, tab, and newline characters (see Appendix gif). The remaining characters are control characters which, depending on the application, rarely occur in strings. Furthermore, if we assume that letters and digits are the most common characters in strings, then only 62 of the 128 ASCII codes are used frequently. Notice, the letters (both upper and lower case) all fall between tex2html_wrap_inline62738 and tex2html_wrap_inline62740. All the information is in the least significant six bits. Similarly, the digits fall between tex2html_wrap_inline62742 and tex2html_wrap_inline62744--these differ in the least significant four bits. These observations suggest that using tex2html_wrap_inline62746 should work well. I.e., for tex2html_wrap_inline62396, the hash value depends on the last five characters plus two bits of the sixth-last character.

We have developed a hashing scheme which works quite well given strings which differ in the trailing letters. E.g., the strings "temp1", "temp2", and "temp3", all produce different hash values. However, in certain applications the strings differ in the leading letters. E.g., the two Internet domain names  "ece.uwaterloo.ca" and "cs.uwaterloo.ca" collide when using Equation gif. Essentially, the effect of the characters that differ is lost because the corresponding bits have been shifted out of the hash value.

   program11477
Program: Character String Hash Function Definition

This suggests a final modification which shown in Program gif. Instead of losing the b=6 most significant bits when the variable result is shifted left, we retain those bits and exclusive or  them back into the shifted result variable. Using this approach, the two strings "ece.uwaterloo.ca" and "cs.uwaterloo.ca" produce different hash values.

Table gif lists a number of different character strings together with the hash values obtained using Program gif. For example, to hash the string "fyra", the following computation is performed (all numbers in octal):

146 f
tex2html_wrap_inline62752 171 y
tex2html_wrap_inline62752 162 r
tex2html_wrap_inline62752 141a

147706341

 

 

x Hash(x) (octal)
"ett" 01446564
"två" 01656741
"tre" 01656345
"fyra" 0147706341
"fem" 01474455
"sex" 01624470
"sju" 01625365
"åtta" 0140656541
"nio" 01575057
"tio" 01655057
"elva" 044556741
"tolv" 065565566
Table: Sample Character String Keys and the Hash Values Obtained Using Program gif


next up previous contents index

Bruno Copyright © 1997 by Bruno R. Preiss, P.Eng. All rights reserved.