One of the most common activities in programming is storing and retrieving lists of data. You may need to store a list of user names,
process ids, access keys, symbols, file names, URLs, or whatever. How should you go about storing and retrieving items from the list?
If the list is heavy duty, needs to retrieved by a number of different criteria,
and must be persistent over time, you'll probably use a database. But for simple lists that only need to be maintained while the program
is running, you only need grab a basic data structure to hold it. There are many to choose from: arrays, linked lists, binary trees, B-trees,
and hash tables, to name a few.
Arrays
First, consider the array. Also called a vector, this is probably the simplest data structure to use since just about every programming language
provides direct support for creating and indexing them. Whenever you want to hold a list of data, what could be simpler than creating an
array to hold it. Retrieving the data later is quite simple, if not the most efficient means: simply loop thru each index until the required
item is found:
int
findItem (char *item, char *array[], int arraysize)
{
int i;
for (i = 0; i < arraysize; ++i)
{
if (strcmp (array[i], item) == 0)
return i;
}
return -1;
}
There are two problems with using arrays:
Firstly, arrays in C/C++ must be a fixed size, so you have to declare in advance how large the list is. If you're not sure,
then you'll need to make it very large to handle a possible upper bound, which could then lead to large amounts of wasted space. Alternately,
you can write code to grow the array on-the-fly, although this will reduce the efficiency because of extra memory allocations and potentially
large amounts of data copying.
Secondly, the access time to retrieve a particular item grows in direct proportion to the size of the list.
If the number of list items becomes ten times larger, it may take up to ten times longer to find any particular item.
Linked Lists
A linked list will avoid the fixed size problem altogether. They are dynamically sized and can accomodate a range of list sizes from very small
to huge. A linked list is merely a collection of "nodes" -- i.e. a structure that contains, in addition to the data itself, a pointer to
another node. This pointer is typically named "next", to signify the next node in the list. The canonical search algorithm for linked lists
looks like this:
node *
findItem (char *item, node *head)
{
node *p;
for (p = head; p; p = p->next)
{
if (strcmp (p->data, item) == 0)
return p;
}
return NULL;
}
Although the linked list avoids dealing with fixed limits on the number of items in the list, it does nothing to help with access times.
Like arrays, the search for a particular data item is linear -- i.e. you must start at the beginning and search thru every list item until
you find the desired one.
There are several ways to handle this. You could insert the list items in-order as you create the list and then use a binary search later
to find an item. Or you could store the list, instead, in a dynamic container such as a binary tree or one of its various offshoots such as B-Trees.
These can be very efficient solutions, although somewhat tricky to implement. They may also be overkill.
If all you need is a relatively static list of items that, once created, will rarely (if ever) need to be updated, but, instead, requires
accessing the contents as quickly as possible, then what you want is a hash table.
Hash Tables
Hash tables are wonderful. They are extremely simple to write and provide blazingly fast access times. A properly implemented hash table will
outperform any other data structure for retrieving data items from a list. They do have a few downsides, but they're pretty small and easily
manageable.
The idea of a hash table is to divide up all the list items into a collection of sub-divisions which are usually called "buckets". Each bucket
holds a small list of items that can be searched thru very quickly. A simple mathematical function, called a "hash" computes the bucket value for each
list item.
Although there are several variations in how to implement hash tables, by far the most common (especially in C/C++) is to implement
a hash table as an array of linked lists. That is, the hash table is a fixed sized array of node pointers. Each pointer is the head of a linked
list. In this way, even though the array itself has a fixed size, each index (bucket) points to a linked list that can grow indefinitely.
Thus, the hash table has no limits as to how many items can be added.
Here's what a sample hash table looks like:
[0] item -> item -> NULL
[1] item -> NULL
[2] NULL
[3] item -> NULL
...
[N] item -> item -> NULL
Here, the first bucket (index 0) holds two items. The second bucket has one item. The third bucket (index 2) is empty. Etc.
Hash functions
The hash function takes the item data as input and computes the bucket value where it should be placed. The hash function should be very quick,
and it should distribute all the list items as evenly as possible across the buckets.
How do you compute the hash value? Well, there are about a bazillion variations out there, but they all have a common scheme. For a fixed size
array as above, the new item will have its data run thru some arithmetic or bitwise algorithm and the resulting integer value is then reduced to
a value in the range [0, N] where N is the largest array index.
As a demonstration, consider one of the simplest possible hash functions for textual strings. Assume that each string starts with an alphabetic
character whereby the first letter is randomly distributed across all 26 letters of the alphabet. Then a straightforward hash would be:
int
hash (char *s)
{
return (tolower (*s) - 'a');
}
This would yield 26 buckets. All list items that begin with the letter 'a' would go in the first bucket. All those beginning with 'b' would go
into the second bucket, etc.
Unfortunately, in most real world scenarios, this wouldn't work at all. Frequently, the first character of the string data would not be an
alphabetic letter. Also, you want the hash function to handle common situations where the items have many similar characters at the beginning
of the string: for example, you are storing file names in a list and have
"/boot/beos/system/lib/libbe.so"
"/boot/beos/system/lib/libnet.so"
"/boot/beos/system/lib/libroot.so"
. . .
If these all hash to the same bucket, you have to search thru a lot of characters before you find the
right item.
For the example hash function above, the worst case scenario is that every item in the list happens to start with the same letter. If, for instance,
all the items begin with 'a', then the entire table will be stored in the first bucket. The hash table has degenerated into a linked list.
To avoid this, the hash function must be chosen so that the list items are distributed evenly among the buckets. No individual bucket should
(hopefully) contain more than a few list items. This is difficult to do in general. Vast numbers of PhD dissertations have been written in the search for
the "perfect" hash function. Unfortunately, there is no best hash function to use. It will depend on the type and distribution of list items
to be added and the number of buckets.
The good news is that there are a number of simple, general purpose hashes available and they work
just fine for most cases.
The simplest general purpose hash function for hashing strings that I know of is this:
int
hash (char *s)
{
unsigned int h = 0;
while (*s)
{
h = (h << 1) ^ *s++;
}
return (h % TableSize);
}
Here, of course, 'TableSize' refers to a global variable that holds the size of the hash table array.
I first saw this in a C++ book by Bjarne Stroustrup about 10 years ago (it was for a small calculator program where he used a hash table to store
a symbol table for variables). Since then, I've seen this hash function several times in different source code examples.
I like it alot. It's very
efficient and yet so doggone simple. Basically, you just go thru each character of the string, xor'ing its value into the hash integer (which is
very much liked "adding" in the character), then shift the value left by one bit. It's rather like treating the string as one long variably sized
binary (bit) integer whose value is crunched within the bitsize of a standard integer.
I've used the hash function above for many programs that stored strings in a hash table, and I've found that it works very well -- as well,
in most cases, as other, more complex hash functions. However, just to give another example, below is the hash function used in ELF files
(ELF files can store symbol tables and always use the following hash function):
unsigned long
ElfHash (const unsigned char *s)
{
unsigned long h = 0, g;
while (*s)
{
h = (h << 4) + *s++;
if (g = h & 0xf0000000)
h ^= g >> 24;
h &= ~g;
}
return h;
}
No matter what algorithm you use for a hash function, its usefulness ultimately comes down to how many buckets you have in the table. A large
number of buckets will use (probably waste) alot of space and will require a very carefully crafted hash function to make use of all the available
indexes. Likewise, a small number of buckets will produce alot of collisions (i.e. different items that hash to the same bucket) no matter what
hash function you use, good or sloppy.
The trick is to find a reasonable number of buckets to allocate and then use a hash function that works well with the type of list data and
distributes evenly across the given number of buckets. The standard advice for hash tables is to allocate about twice as many buckets as expected
list items. For example, if you expect to insert about 1000 items into the table, allocate about 2000 buckets to hold them. This may sound wasteful,
but it's designed to avoid too many collisions. If your hash is working well enough, you might be able to lower this ratio. In any event,
you'll want to make the number of buckets a prime number because this will also minimize the number of collisions.
Lookups
Generally, you want to do one of two things with a hash table: insert new items or lookup existing items. Since inserting new items also involves
looking thru the table, both activities will need to perform lookups. This common code is usually factored out into a single 'lookup' function:
node *
lookup (char *s, bool create)
{
int h = hash (s);
node *t;
// lookup string in current table
for (t = Table[h]; t; t = t->next)
{
if (strcmp (t->string, s) == 0)
return t;
}
// not found, so add it if specified
if (create)
{
t = (node *) malloc (sizeof (node));
if (t != NULL)
{
t->string = strdup (s);
t->next = Table[h]; // 't' is now the head
Table[h] = t;
}
}
return t;
}
This assumes a hash table stored in a global array called Table, and where each node contains a string element called "string".
The boolean argument determines whether the item should be inserted if not found.
Since the hash function gives the correct bucket for the item, the search loop is simply the standard 'find' algorithm for
linked lists. If a new item is to be created, a new node is allocated and inserted into the bucket's list. This ensures that
the list of pointers in Table[0] ... Table[N] always point to the list head.
Using this function, it is trivial to implement the
'insert' and 'find' functions:
void
insert (char *s)
{
lookup (s, true);
}
bool
find (char *s)
{
return (lookup (s, false) != NULL);
}
Summary
Hash tables are very easy to write and alot of fun to work with. They can be easily adapted to different purposes by just changing the
the definition of the node structure to include whatever type of data you need to store.
You could, of course, try to write a generic hash table that can be used for any type of data. I've tried this myself on a few occasions,
and I find that, generally, it's easier to reimplement a hash table for each particular program than it is to try to create a one-size-fits-all
solution. They are so simple to make, it just isn't worth the bother, IMO.
The Standard Template Library (STL) in C++, and many scripting languages (e.g. Python) offer generic container objects such as sets, dictionaries, etc.
These can be useful, but how do you suppose they are implemented? As hash tables, in most instances. It's in a programmer's best interest to
understand how these basic data structures can be implemented and used. At the very least, it could make understanding the various container classes
provided by your programming environment easier. At best, you might find that you can implement your own hash table that handles a given problem
better than what was given to you.
I've written a sample program that implements a basic hash table for strings. The programs reads all the "words" from a given text file
and then inserts them into a hash table. The size of the hash table is given on the command line. Some basic statistics are gathered and displayed.
The contents of the hash table itself are written out to a text file called 'table.dump'. Enjoy.
Source Code:
hashtab.zip