The underlying idea of hash tables is simple and quite appealing: suppose that, given a key, there were a way of jumping straight to the entry for that key. Then we would never have to search at all; we could just go there!
Of course, we have not yet said how that could be achieved. Assume
that we have an array data to hold our entries. Now if we had a
function h that assigns to each key the index (an integer) where its
entry will be stored, then we could just look up data[h(k)] to find
the entry with the key k.
It would be easier if we could just make our array big enough to hold
all the keys that might appear. For example, if we knew that
our keys were the numbers from 0 to 99 we could just create an array
of size 100 and store the entry with key 67 in data[67]. In this
case the function h would be the identity function; that is, the
function defined by h(k) = k.
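As a minimal sketch of this direct-addressing idea in Python (the helper names insert and lookup are illustrative assumptions, not part of the text):

```python
# Direct addressing: the keys are known to be the integers 0..99,
# so the array has exactly one slot per possible key.
data = [None] * 100


def h(k):
    """Identity function: the key itself is the array index."""
    return k


def insert(k, entry):
    """Store the entry in the slot that h assigns to its key."""
    data[h(k)] = entry


def lookup(k):
    """Jump straight to the slot for k; no searching is needed."""
    return data[h(k)]


insert(67, "entry for key 67")
print(lookup(67))  # entry for key 67
print(lookup(12))  # None, since nothing was stored under key 12
```

Both operations take constant time, but only because the array has room for every key that could ever appear.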
However, this idea is not very practical if we are dealing with a relatively small number of keys out of a huge collection of possible keys. For example, many American companies use their employees' 9-digit social security numbers as keys (British ones don't work quite as well because they are usually a mixture of letters and digits). Obviously there are vastly more such numbers than any individual company will have employees, and it would probably be impossible (and not very clever) to reserve space for all the 1,000,000,000 social security numbers which might occur.
Instead, we use a non-trivial function h, the so-called
hash function, to map the space of possible keys to the set of
indices of our array. For example, if we had 500 employees we might
create an array with 1000 entries and use three digits from their
social security number to determine the place in the array where the
records for a particular employee should be stored.
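One way this might look in Python is sketched below. The text does not say which three digits to use, so taking the last three is an assumption made here for illustration; the key "123456789" and the helper names are likewise hypothetical.

```python
TABLE_SIZE = 1000
data = [None] * TABLE_SIZE


def h(ssn):
    """Hash a 9-digit key (given as a string) to an index in 0..999.

    Assumption: we use the last three digits of the key.
    """
    return int(ssn[-3:])


def insert(ssn, record):
    # Note: this sketch ignores the possibility that the slot is already taken.
    data[h(ssn)] = record


def lookup(ssn):
    return data[h(ssn)]


insert("123456789", "record A")
print(h("123456789"))       # 789
print(lookup("123456789"))  # record A
```

The array needs only 1000 slots rather than one slot for each possible 9-digit number.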
There is an obvious problem with this technique: what if two employees have the same three digits? This is called a collision between the two keys. Much of the remainder of this chapter will be spent on the various strategies for dealing with collisions.
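To make the problem concrete, here is a small Python sketch. It assumes a hash function that takes the last three digits of the key, and both keys are made up for illustration.

```python
def h(ssn):
    """Assumed hash function: the last three digits of a 9-digit key."""
    return int(ssn[-3:])


data = [None] * 1000

key_a = "123456789"  # hypothetical key
key_b = "987654789"  # a different hypothetical key ending in the same digits

print(h(key_a), h(key_b))  # 789 789

# Storing both records naively makes the second overwrite the first.
data[h(key_a)] = "record A"
data[h(key_b)] = "record B"
print(data[h(key_a)])  # record B
```

Without some strategy for handling the collision, the record for the first key is silently lost.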
First of all, of course, one would try to avoid collisions. If the
keys that are likely to actually occur are not evenly spread in the
space of all possible keys, particular attention should be paid to
choosing the function h in such a way that collisions among those keys
are unlikely to occur. If, for example, the first three digits of a
social security number had geographical meaning, then employees would
be particularly likely to share the three digits signifying the region
where the company resides, and so using the first three digits as
the hash value might result in many collisions which could have been
avoided by a more prudent choice.
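The effect of this choice can be illustrated with a short Python sketch. The keys below are synthetic and deliberately idealized: 500 hypothetical employees who all share the made-up prefix "123" and have consecutive remaining digits. The function names are assumptions as well.

```python
from collections import Counter


def first_three(ssn):
    """Candidate hash function: the first three digits of the key."""
    return int(ssn[:3])


def last_three(ssn):
    """Candidate hash function: the last three digits of the key."""
    return int(ssn[-3:])


def count_collisions(keys, h):
    """Count the keys that land in a slot already claimed by an earlier key."""
    slots = Counter(h(k) for k in keys)
    return sum(n - 1 for n in slots.values())


# Synthetic keys "123000000" .. "123000499": same prefix, distinct endings.
keys = [f"123{i:06d}" for i in range(500)]

print(count_collisions(keys, first_three))  # 499
print(count_collisions(keys, last_three))   # 0
```

Real keys would not be this regular, so the second figure is a best case, but the contrast shows why digits that cluster among the keys actually in use make a poor basis for h.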