Hashing in Forth

Xan Gregg
Durham, North Carolina

"The time/space tradeoff is what hashing is all about."

Hashing provides a fast way to search a large, unsorted data set at the cost of extra memory. Robert Sedgewick, in his book Algorithms, concisely describes hashing as "directly referencing records in a table by doing arithmetic transformations on keys into table addresses." That should make sense to you by the end of this article, but first, let's consider a simple example.

Suppose you have to write code to manage a database of about 50,000 records referenced by 16-bit record numbers. Record insertions and deletions are common, so they can't be too slow, and record look-ups are frequent and must be fast. You are given 64K of RAM in addition to the memory and disk space occupied by the data, and you know that each record is referenced by a unique three-letter code, like airports are in the U.S.

As a Forth programmer, you realize that a three-letter string is also a three-digit base-26 number, and you make a table with 26 x 26 x 26=17,576 entries, with each entry containing the record number. Insertion and deletion are straightforward -- you just have to update the table with each operation. Finding a record from its key involves only packing three letters into a 15-bit number and using it as an index into the table. Then you have the record number.

If you can do that, you already understand the basic concepts of hashing. Hashing requires a hash function and a hash table. The hash function converts a key value, such as a text string or a large number, into a hash table address. Each entry in the hash table points to a record in the data set.

In the example above, the code to pack three letters into a 15-bit number was the hash function, and the table of record numbers was the hash table. The hash function might look like Figure One.

Figure One. Example hash function.

: ID>INDEX  ( addr -- n )     \ addr points to three uppercase letters
   COUNT [CHAR] A -           \ addr+1 n1
   SWAP COUNT [CHAR] A -      \ n1 addr+2 n2
   SWAP C@ [CHAR] A -         \ n1 n2 n3
   26 * + 26 * + + ;          \ n

However, since you have 64K of memory and look-up speed is so important, you could multiply by 32 instead of 26 so that you could use a shift operation by changing each 26 * into 5 LSHIFT. Having an efficient hash function is very important, and it is often written in assembler.

Hashing becomes more interesting when the key is too big to be packed into a number and when the keys are not unique. For example, what if the key was a person's last name? That's more realistic than the unique, three-letter key given above. It's not obvious how to use hashing in this situation. We need a hash function to convert a variable-length string into a table index, and we need a way of dealing with multiple records that are mapped to the same table index.

The hash function can be almost anything. Here are some possibilities for the last-name key mapping into a 15-bit number:

Use first three letters, five bits per letter.
Use last three letters, five bits per letter.
Use first five letters, three bits per letter.
Multiply all letters, modulo 2^15.
Use five bits from first letter, two bits from next five letters.
Use three bits from length byte, three bits from first four letters.
Sum all letters, modulo 215.
Sum all letters letter(i)*2^i, modulo 2^15.

I'm sure you can imagine more. The hard part is picking a good one. You want a function that is quick to execute and provides a fairly random distribution of keys. #1, above, is not good because many of the names probably start with the same letters, providing a poor distribution. #4 is bad for two reasons: multiplying is usually not a good idea for speed's sake, and the distribution is not good because multiplying several numbers rarely produces odd numbers. #7 is bad because the sum of all of the letters will not approach 215. #8 tries to overcome this shortcoming by incorporating a shift into the add, and it may do well with some fine-tuning.

Any of the other hash functions may also do well with tuning. For instance, when you take three bits from a letter, how you do decide which three bits? The high three bits is obviously no good, since all upper-case letters start with the same three bits. The low three bits seems okay, but it's not perfect. H, P, and X map to 0; B, J, R, and Z map to 1; and G, O, and W map to 7; and those hardly seem like equal groupings. (But who knows? Maybe there are as many Greggs, Ouversons, and Wests as there are Browns, Joneses, Rathers, and Zwickkers.) The only way to know for sure is to try each possibility on the entire data set and see which one produces the best distribution. Unfortunately, you often have to develop your code with only a subset of the data, which requires you to rely more on your intuition.

Before getting too deep into selection of hashing functions, we need to address the second problem: what to do when two or more keys share the same index into the hash table, which is called a collision. There are several solutions, but the one I prefer is to add a field to each record that can point to another record. Then each table entry becomes the head of a linked list, with the other links being in the new field we added to each record.

Another solution is to use a secondary hash function to produce another index or just add a number (relatively prime to the table size) to the index to get a new index. In either case, you have the possibility of another collision, which requires finding another hash index, and so on. This solution has the requirement that the hash table have at least one entry for each record.

Either case requires more work during the search. You must compare the key against the record found, and, if it is not equal, you must check the next record, whether from the linked list or from another hash table entry. And similarly, insertion and deletion become more complicated as well.

So hashing isn't quite as simple as it appeared in our first example, but it is still very useful when search time is critical. A classic usage for hashing is in compiler symbol tables. A program may have thousands of symbols which a compiler must look up very frequently by name as it scans the source code. I implemented hashing in the MacForth and PowerMacForth vocabularies, and I know several other Forths also use hashing.

Listing One contains some generic hashing code. FILL-TABLE inefficiently allocates records on the fly in the data space, but the basic words (INSERT-RECORD, DELETE-RECORD, and FIND-RECORD) provide a good initial base for anyone trying to get started with hashing. The code uses the linked-list approach for collision handling.

ANALYZE-HASH is a useful word for getting some information about how well the hash function maps the keys. It produces lines numbered zero through nine which show how many linked lists there are of that size (or more, in the case of nine), and another line that shows the average list length (not counting the empty lists). You want to minimize that value. Another useful analysis tool I have used in the past is to plot the hash table showing the list length at every entry. Sometimes that reveals patterns of indices that are never hit or that are hit too often.

Table One gives the average list lengths for various hash functions given a data set of 3448 Forth words and hash table sizes of 1024 and 4096 entries.

Table One. Average list lengths of various hash functions.

Hash Function                            W/1024       W/4096
Low bits of first three chars             5.23         3.44
Low bits of length and first two chars    4.36         2.30
Low bits of last three chars              6.31         4.41
Product of first three chars              6.64         3.91
Sum of char(i) * 2^i                      3.56         1.57

All of these hash functions provide decent results. A binary search of the same data would require us to look at log2 3448=11.8 records. The last one is very close to optimal (3448/1024=3.37), and is a favorite of mine. It is similar to the function used in PowerMacForth, and it has the advantage that many characters factor into the result.

This table is the first mention of hash table size since our original, 64K hash table. The more entries you have in the table, the better your hashing will work (because there will be fewer collisions), but, of course, it takes more memory. The time/space tradeoff is what hashing is all about.

Happy hashing!

View Xan Gregg's Forth hashing code on the Web.

Download the hashing code via ftp.

This article first appeared in Forth Dimensions XVII/4.

Readers can contact Xan Gregg via e-mail. He is a Macintosh programmer at Scholastic, Inc. ,working on educational software, and lives in Durham, North Carolina. Xan has been programming with MacForth since 1984 and, in his free time, plays Ultimate Frisbee.

Hashing in Forth

Xan Gregg Durham, North Carolina

Xan Gregg
Durham, North Carolina