Keys and Hash Functions

Data Structures and Algorithms with Object-Oriented Design Patterns in C++

Keys and Hash Functions

We are designing a container which will be used to hold some number of items of a given set K. In this context, we call the elements of the set K keys . The general approach is to store the keys in an array. The position of a key in the array is given by a function , called a hash function , which determines the position of a given key directly from that key.

In the general case, we expect the size of the set of keys, |K|, to be relatively large or even unbounded. E.g., if the keys are 32-bit integers, then . Similarly, if the keys are arbitrary character strings of arbitrary length, then |K| is unbounded.

On the other hand, we also expect that the actual number of items stored in the container to be significantly less than |K|. I.e., if n is the number of items actually stored in the container, then . Therefore, it seems prudent to use an array of size M, where M is as least as great as the maximum number of items to be stored in the container.

Consequently, what we need is a function . This function maps the set of values to be stored in the container to subscripts in an array of length M. This function is called a hash function .

In general, since , the mapping defined by hash function will be a many-to-one mapping . I.e., there will exist many pairs of distinct keys x and y, such that , for which h(x)=h(y). This situation is called a collision. Several approaches for dealing with collisions are explored in the following sections.

What are the characteristics of a good hash function?

A good hash function avoids collisions.
A good hash function tends to spread keys evenly in the array.
A good hash function is easy to compute.