Hash tables

The underlying idea of hash tables is simple and quite appealing: suppose that, given a key, there were a way of jumping straight to the entry for that key. Then we would never have to search at all; we could just go there!

Of course, we have not yet said how that could be achieved. Assume that we have an array data to hold our entries. If we had a function $h$ that assigns to each key the index (an integer) at which its entry is stored, then we could just look up data[h(k)] to find the entry with the key k.

It would be easiest if we could just make our array big enough to hold all the keys that might appear. For example, if we knew that our keys were the numbers from 0 to 99, we could create an array of size 100 and store the entry with key 67 in data[67]. In this case the function $h$ would be the identity function; that is, the function defined by $h(k)=k$.
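As a minimal sketch of this special case, assuming the keys are integers from 0 to 99 (the helper names insert and lookup are illustrative, not part of the notes):

```python
# Direct addressing: keys are assumed to be integers in the range 0..99.
data = [None] * 100          # one slot for every possible key


def h(k):
    """Identity hash function: the key is its own index."""
    return k


def insert(k, entry):
    """Store the entry in the slot that belongs to key k."""
    data[h(k)] = entry


def lookup(k):
    """Jump straight to the slot for key k; no searching is needed."""
    return data[h(k)]


insert(67, "entry for key 67")
print(lookup(67))  # prints: entry for key 67
print(lookup(12))  # prints: None (nothing stored under key 12)
```

Both operations take constant time, but only because the array has a slot for every key that could possibly occur.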

However, this idea is not very practical if we are dealing with a relatively small number of keys out of a huge collection of possible keys. For example, many American companies use their employees' 9-digit social security numbers as keys (British ones don't work quite as well because they are usually a mixture of letters and digits). There are vastly more such numbers than any individual company will have employees, and it would be wasteful, and often impossible, to reserve space for all the 1,000,000,000 social security numbers that might occur.

Instead, we use a non-trivial function $h$, the so-called hash function, to map the space of possible keys to the set of indices of our array. For example, if we had 500 employees, we might create an array with 1000 entries and use three digits from each employee's social security number to determine the place in the array where that employee's record should be stored.
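The following sketch illustrates such a hash function. The text does not say which three digits to use; taking the last three is an assumption made here for simplicity, and the key and record shown are made up. Collisions are deliberately ignored.

```python
TABLE_SIZE = 1000                 # array with 1000 entries, as in the text
data = [None] * TABLE_SIZE


def h(ssn):
    """Hash a 9-digit number to an index 0..999.

    Assumption: we use the last three digits, i.e. the remainder mod 1000.
    """
    return ssn % TABLE_SIZE


def insert(ssn, record):
    """Store the record at the hashed index (collisions are ignored here)."""
    data[h(ssn)] = record


def lookup(ssn):
    """Return whatever is stored at the hashed index."""
    return data[h(ssn)]


insert(123456789, "record of a hypothetical employee")
print(h(123456789))       # prints: 789
print(lookup(123456789))  # prints: record of a hypothetical employee
```

The array now needs only 1000 slots rather than one per possible key, at the price that different keys may be mapped to the same index.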

There is an obvious problem with this technique: what if two employees have the same three digits? This is called a collision between the two keys. Much of the remainder of this chapter will be spent on the various strategies for dealing with collisions.
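To make the problem concrete, here is a small sketch in which two made-up 9-digit keys share their last three digits. The choice of the last three digits as the hash, and the insert helper that merely detects a collision, are assumptions for illustration; no resolution strategy is attempted.

```python
TABLE_SIZE = 1000
data = [None] * TABLE_SIZE


def h(ssn):
    """Assumed hash function: the last three digits of the key."""
    return ssn % TABLE_SIZE


def insert(ssn, record):
    """Store (key, record); return False if another key already owns the slot."""
    i = h(ssn)
    if data[i] is not None and data[i][0] != ssn:
        return False              # collision: left unresolved in this sketch
    data[i] = (ssn, record)
    return True


a, b = 123456789, 987654789       # two different hypothetical keys
print(h(a), h(b))                 # prints: 789 789

first_ok = insert(a, "first employee")
second_ok = insert(b, "second employee")
print(first_ok, second_ok)        # prints: True False
```

Storing the key alongside the record is what lets the table notice that slot 789 already belongs to a different employee.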

First of all, of course, one would try to avoid collisions. If the keys that are likely to occur are not evenly spread over the space of all possible keys, particular attention should be paid to choosing the function $h$ in such a way that collisions among those keys are unlikely. If, for example, the first three digits of a social security number had geographical meaning, then a company's employees would be particularly likely to share the three digits signifying the region where the company resides. Choosing the first three digits as the hash function might then result in many collisions that a more prudent choice would have avoided.
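The effect can be illustrated with a small experiment on synthetic data. Everything here is an assumption made for the sake of the sketch: the 500 keys are randomly generated, all share the hypothetical regional prefix 123, and a collision is counted whenever a key hashes to an index that an earlier key already occupies.

```python
import random

random.seed(0)                    # fixed seed so the run is repeatable

# 500 made-up 9-digit keys that all start with the same three digits.
keys = [123_000_000 + tail for tail in random.sample(range(10**6), 500)]


def first_three(ssn):
    """Hash by the leading three digits of a 9-digit key."""
    return ssn // 10**6


def last_three(ssn):
    """Hash by the trailing three digits of the key."""
    return ssn % 1000


def count_collisions(keys, h):
    """Count keys that hash to an index already taken by an earlier key."""
    occupied = set()
    collisions = 0
    for k in keys:
        i = h(k)
        if i in occupied:
            collisions += 1
        else:
            occupied.add(i)
    return collisions


print(count_collisions(keys, first_three))  # prints: 499 (all keys share one slot)
print(count_collisions(keys, last_three))   # far fewer, but usually not zero
```

With the first three digits, every key lands in slot 123, so all but the first collide. The last three digits spread the same keys over the whole array, although some collisions remain even then.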


Martin Escardo 2005-01-11