Hashing in Forth
Xan Gregg
Durham, North Carolina
"The time/space tradeoff is what hashing is all about."
Hashing provides a fast way to search a large, unsorted data set at the cost
of extra memory. Robert Sedgewick, in his book Algorithms, concisely describes
hashing as "directly referencing records in a table by doing arithmetic
transformations on keys into table addresses." That should make sense to
you by the end of this article, but first, let's consider a simple example.
Suppose you have to write code to manage a database of about 50,000 records
referenced by 16-bit record numbers. Record insertions and deletions are
common, so they can't be too slow, and record look-ups are frequent and must be
fast. You are given 64K of RAM in addition to the memory and disk space
occupied by the data, and you know that each record is referenced by a unique
three-letter code, like airports are in the U.S.
As a Forth programmer, you realize that a three-letter string is also a
three-digit base-26 number, and you make a table with 26 x 26 x 26=17,576
entries, with each entry containing the record number. Insertion and deletion
are straightforward -- you just have to update the table with each operation.
Finding a record from its key involves only packing three letters into a 15-bit
number and using it as an index into the table. Then you have the record
number.
If you can do that, you already understand the basic concepts of hashing.
Hashing requires a hash function and a hash table. The hash function converts a
key value, such as a text string or a large number, into a hash table address.
Each entry in the hash table points to a record in the data set.
In the example above, the code to pack three letters into a 15-bit number
was the hash function, and the table of record numbers was the hash table. The
hash function might look like Figure One.
Figure One. Example hash function.
: ID>INDEX ( addr -- n ) \ addr points to three uppercase letters
COUNT [CHAR] A - \ addr+1 n1
SWAP COUNT [CHAR] A - \ n1 addr+2 n2
SWAP C@ [CHAR] A - \ n1 n2 n3
26 * + 26 * + + ; \ n
However, since you have 64K of memory and look-up speed is so important, you
could multiply by 32 instead of 26 so that you could use a shift operation by
changing each 26 * into 5 LSHIFT. Having an efficient hash function is very
important, and it is often written in assembler.
Hashing becomes more interesting when the key is too big to be packed into a
number and when the keys are not unique. For example, what if the key was a
person's last name? That's more realistic than the unique, three-letter key
given above. It's not obvious how to use hashing in this situation. We need a
hash function to convert a variable-length string into a table index, and we
need a way of dealing with multiple records that are mapped to the same table
index.
The hash function can be almost anything. Here are some possibilities for
the last-name key mapping into a 15-bit number:
- Use first three letters, five bits per letter.
- Use last three letters, five bits per letter.
- Use first five letters, three bits per letter.
- Multiply all letters, modulo 2^15.
- Use five bits from first letter, two bits from next five letters.
- Use three bits from length byte, three bits from first four letters.
- Sum all letters, modulo 215.
- Sum all letters letter(i)*2^i, modulo 2^15.
I'm sure you can imagine more. The hard part is picking a good one. You want
a function that is quick to execute and provides a fairly random distribution
of keys. #1, above, is not good because many of the names probably start with
the same letters, providing a poor distribution. #4 is bad for two reasons:
multiplying is usually not a good idea for speed's sake, and the distribution
is not good because multiplying several numbers rarely produces odd numbers. #7
is bad because the sum of all of the letters will not approach 215. #8 tries to
overcome this shortcoming by incorporating a shift into the add, and it may do
well with some fine-tuning.
Any of the other hash functions may also do well with tuning. For instance,
when you take three bits from a letter, how you do decide which three bits? The
high three bits is obviously no good, since all upper-case letters start with
the same three bits. The low three bits seems okay, but it's not perfect. H, P,
and X map to 0; B, J, R, and Z map to 1; and G, O, and W map to 7; and those
hardly seem like equal groupings. (But who knows? Maybe there are as many
Greggs, Ouversons, and Wests as there are Browns, Joneses, Rathers, and
Zwickkers.) The only way to know for sure is to try each possibility on
the entire data set and see which one produces the best distribution.
Unfortunately, you often have to develop your code with only a subset of the
data, which requires you to rely more on your intuition.
Before getting too deep into selection of hashing functions, we need to
address the second problem: what to do when two or more keys share the same
index into the hash table, which is called a collision. There are several
solutions, but the one I prefer is to add a field to each record that can point
to another record. Then each table entry becomes the head of a linked list,
with the other links being in the new field we added to each record.
Another solution is to use a secondary hash function to produce another
index or just add a number (relatively prime to the table size) to the index to
get a new index. In either case, you have the possibility of another collision,
which requires finding another hash index, and so on. This solution has the
requirement that the hash table have at least one entry for each record.
Either case requires more work during the search. You must compare the key
against the record found, and, if it is not equal, you must check the next
record, whether from the linked list or from another hash table entry. And
similarly, insertion and deletion become more complicated as well.
So hashing isn't quite as simple as it appeared in our first example, but it
is still very useful when search time is critical. A classic usage for hashing
is in compiler symbol tables. A program may have thousands of symbols which a
compiler must look up very frequently by name as it scans the source code. I
implemented hashing in the MacForth and PowerMacForth vocabularies, and I know
several other Forths also use hashing.
Listing One contains some generic hashing code. FILL-TABLE inefficiently
allocates records on the fly in the data space, but the basic words
(INSERT-RECORD, DELETE-RECORD, and FIND-RECORD) provide a good initial base for
anyone trying to get started with hashing. The code uses the linked-list
approach for collision handling.
ANALYZE-HASH is a useful word for getting some information about how well
the hash function maps the keys. It produces lines numbered zero through nine
which show how many linked lists there are of that size (or more, in the case
of nine), and another line that shows the average list length (not counting the
empty lists). You want to minimize that value. Another useful analysis tool I
have used in the past is to plot the hash table showing the list length at
every entry. Sometimes that reveals patterns of indices that are never hit or
that are hit too often.
Table One gives the average list lengths for various hash functions given a
data set of 3448 Forth words and hash table sizes of 1024 and 4096 entries.
Table One. Average list lengths of various hash functions.
Hash Function W/1024 W/4096
Low bits of first three chars 5.23 3.44
Low bits of length and first two chars 4.36 2.30
Low bits of last three chars 6.31 4.41
Product of first three chars 6.64 3.91
Sum of char(i) * 2^i 3.56 1.57
All of these hash functions provide decent results. A binary search of the
same data would require us to look at log2 3448=11.8 records. The last one is
very close to optimal (3448/1024=3.37), and is a favorite of mine. It is
similar to the function used in PowerMacForth, and it has the advantage that
many characters factor into the result.
This table is the first mention of hash table size since our original, 64K
hash table. The more entries you have in the table, the better your hashing
will work (because there will be fewer collisions), but, of course, it takes
more memory. The time/space tradeoff is what hashing is all about.
Happy hashing!
View Xan Gregg's Forth
hashing code on the Web.
Download
the hashing code via ftp.
This article first appeared in Forth
Dimensions XVII/4.
Readers can contact Xan Gregg via e-mail. He is a Macintosh
programmer at Scholastic, Inc. ,working on educational software, and lives in
Durham, North Carolina. Xan has been programming with MacForth since 1984 and,
in his free time, plays Ultimate Frisbee.
|