OpenBeOS Project - Displaying Newsletter

Displaying Newsletter

Newsletter Archive

Issue 5, 18 Nov 2001

In This Issue:

Dictionaries, the OBOS way by Michael Phipps
Hash tables by Daniel Reinhold
What is an operating system? by Michael Phipps


Dictionaries, the OBOS way	by Michael Phipps

I was trying hard to come up with a newsletter topic and I realized that there may be some confusion about kits, clients and servers, and the Be/OBOS way of things.

Client/Server, a buzzword if there ever was one, is an old, elegant concept. The idea is that one entity, the server, offers services. The other entity uses those services. Imagine a Chinese restaurant (MMMmmmmm) as a server. You communicate with the restaurant, giving them what they need to fulfil your order, and they respond with your results. So simple.

Be's marketing department calls these kits. There are three parts to a kit. The server portion of the kit (i.e. part 1) is started from bootscript and runs all of the time -- app_server is a good example.

The client side has two parts: the shared library, and your program. The shared library holds all of the Client/Server communication. This communication, which takes place via BMessages, talks to the server and actually gets the work done. User level code would be very ugly if there were SendMessage calls all over the place. Not to mention hard to write. So Be wrapped those SendMesssage calls in C++ classes, giving us the API that we have all come to know and love. Finally, there is your program. Lovingly crafted to use the C++ API, blissfully ignorant of all of the communication being carried on at its behest.

For my example, I implemented a dictionary. First, the old fashioned way -- a single app that reads the whole list, stores it in an STL set, then queries the list for particular words that it is looking for.

Then I ported the application to a client/server model. The server is a BApplication that waits for a BMessage to come in. When it gets that message, it finds the strings that are to be looked up ("lookupWords") within and looks them up in the set. If they are there, it adds a bool to the BMessage with a value of true, otherwise, it adds a false.

The shared library is a simple C++ class that connects a BMessenger to the server. When Lookup is called, it sends messages to the server to find the words and returns the results to the caller. Finally, I wrote a trial app that uses the class.

There is nothing magic about these clients and servers. Really, any application that can be scripted is a client/server application.

Note: the "words" file is hardcoded to exist as /etc/word_dictionary/words. This file is part of the normal "optional items" install. If you do not have one, you could certainly make one -- one word per line. Note, also, that if you build the sample code, the libDict.so file must go into your ~/config/lib directory or client apps will fail to run.

I hope that this article takes some of the mystery out of client/server (a buzzword if I ever heard one) and kits. Anyone can write a kit. I would encourage you to roll your own. Maybe even something that we can all reuse.

Source Code:
client-server-dictionary.zip


Hash tables	by Daniel Reinhold

One of the most common activities in programming is storing and retrieving lists of data. You may need to store a list of user names, process ids, access keys, symbols, file names, URLs, or whatever. How should you go about storing and retrieving items from the list?

If the list is heavy duty, needs to retrieved by a number of different criteria, and must be persistent over time, you'll probably use a database. But for simple lists that only need to be maintained while the program is running, you only need grab a basic data structure to hold it. There are many to choose from: arrays, linked lists, binary trees, B-trees, and hash tables, to name a few.

Arrays

First, consider the array. Also called a vector, this is probably the simplest data structure to use since just about every programming language provides direct support for creating and indexing them. Whenever you want to hold a list of data, what could be simpler than creating an array to hold it. Retrieving the data later is quite simple, if not the most efficient means: simply loop thru each index until the required item is found:

int
findItem (char *item, char *array[], int arraysize)
    {
    int i;
    for (i = 0; i < arraysize; ++i)
        {
        if (strcmp (array[i], item) == 0)
            return i;
        }
    return -1;
    }

There are two problems with using arrays:

Firstly, arrays in C/C++ must be a fixed size, so you have to declare in advance how large the list is. If you're not sure, then you'll need to make it very large to handle a possible upper bound, which could then lead to large amounts of wasted space. Alternately, you can write code to grow the array on-the-fly, although this will reduce the efficiency because of extra memory allocations and potentially large amounts of data copying.

Secondly, the access time to retrieve a particular item grows in direct proportion to the size of the list. If the number of list items becomes ten times larger, it may take up to ten times longer to find any particular item.

Linked Lists

A linked list will avoid the fixed size problem altogether. They are dynamically sized and can accomodate a range of list sizes from very small to huge. A linked list is merely a collection of "nodes" -- i.e. a structure that contains, in addition to the data itself, a pointer to another node. This pointer is typically named "next", to signify the next node in the list. The canonical search algorithm for linked lists looks like this:

node *
findItem (char *item, node *head)
    {
    node *p;
    for (p = head; p; p = p->next)
        {
        if (strcmp (p->data, item) == 0)
            return p;
        }
    return NULL;
    }

Although the linked list avoids dealing with fixed limits on the number of items in the list, it does nothing to help with access times. Like arrays, the search for a particular data item is linear -- i.e. you must start at the beginning and search thru every list item until you find the desired one.

There are several ways to handle this. You could insert the list items in-order as you create the list and then use a binary search later to find an item. Or you could store the list, instead, in a dynamic container such as a binary tree or one of its various offshoots such as B-Trees. These can be very efficient solutions, although somewhat tricky to implement. They may also be overkill.

If all you need is a relatively static list of items that, once created, will rarely (if ever) need to be updated, but, instead, requires accessing the contents as quickly as possible, then what you want is a hash table.

Hash Tables

Hash tables are wonderful. They are extremely simple to write and provide blazingly fast access times. A properly implemented hash table will outperform any other data structure for retrieving data items from a list. They do have a few downsides, but they're pretty small and easily manageable.

The idea of a hash table is to divide up all the list items into a collection of sub-divisions which are usually called "buckets". Each bucket holds a small list of items that can be searched thru very quickly. A simple mathematical function, called a "hash" computes the bucket value for each list item.

Although there are several variations in how to implement hash tables, by far the most common (especially in C/C++) is to implement a hash table as an array of linked lists. That is, the hash table is a fixed sized array of node pointers. Each pointer is the head of a linked list. In this way, even though the array itself has a fixed size, each index (bucket) points to a linked list that can grow indefinitely. Thus, the hash table has no limits as to how many items can be added.

Here's what a sample hash table looks like:

    [0] item -> item -> NULL
    [1] item -> NULL
    [2] NULL
    [3] item -> NULL
    ...
    [N] item -> item -> NULL

Here, the first bucket (index 0) holds two items. The second bucket has one item. The third bucket (index 2) is empty. Etc.

Hash functions

The hash function takes the item data as input and computes the bucket value where it should be placed. The hash function should be very quick, and it should distribute all the list items as evenly as possible across the buckets.

How do you compute the hash value? Well, there are about a bazillion variations out there, but they all have a common scheme. For a fixed size array as above, the new item will have its data run thru some arithmetic or bitwise algorithm and the resulting integer value is then reduced to a value in the range [0, N] where N is the largest array index.

As a demonstration, consider one of the simplest possible hash functions for textual strings. Assume that each string starts with an alphabetic character whereby the first letter is randomly distributed across all 26 letters of the alphabet. Then a straightforward hash would be:

int
hash (char *s)
    {
    return (tolower (*s) - 'a');
    }

This would yield 26 buckets. All list items that begin with the letter 'a' would go in the first bucket. All those beginning with 'b' would go into the second bucket, etc.

Unfortunately, in most real world scenarios, this wouldn't work at all. Frequently, the first character of the string data would not be an alphabetic letter. Also, you want the hash function to handle common situations where the items have many similar characters at the beginning of the string: for example, you are storing file names in a list and have

"/boot/beos/system/lib/libbe.so"
"/boot/beos/system/lib/libnet.so"
"/boot/beos/system/lib/libroot.so"
. . .

If these all hash to the same bucket, you have to search thru a lot of characters before you find the right item.

For the example hash function above, the worst case scenario is that every item in the list happens to start with the same letter. If, for instance, all the items begin with 'a', then the entire table will be stored in the first bucket. The hash table has degenerated into a linked list.

To avoid this, the hash function must be chosen so that the list items are distributed evenly among the buckets. No individual bucket should (hopefully) contain more than a few list items. This is difficult to do in general. Vast numbers of PhD dissertations have been written in the search for the "perfect" hash function. Unfortunately, there is no best hash function to use. It will depend on the type and distribution of list items to be added and the number of buckets.

The good news is that there are a number of simple, general purpose hashes available and they work just fine for most cases. The simplest general purpose hash function for hashing strings that I know of is this:

int
hash (char *s)
    {
    unsigned int h = 0;
	
    while (*s)
        {
        h = (h << 1) ^ *s++;
        }
		
    return (h % TableSize);
    }

Here, of course, 'TableSize' refers to a global variable that holds the size of the hash table array. I first saw this in a C++ book by Bjarne Stroustrup about 10 years ago (it was for a small calculator program where he used a hash table to store a symbol table for variables). Since then, I've seen this hash function several times in different source code examples.

I like it alot. It's very efficient and yet so doggone simple. Basically, you just go thru each character of the string, xor'ing its value into the hash integer (which is very much liked "adding" in the character), then shift the value left by one bit. It's rather like treating the string as one long variably sized binary (bit) integer whose value is crunched within the bitsize of a standard integer.

I've used the hash function above for many programs that stored strings in a hash table, and I've found that it works very well -- as well, in most cases, as other, more complex hash functions. However, just to give another example, below is the hash function used in ELF files (ELF files can store symbol tables and always use the following hash function):

unsigned long
ElfHash (const unsigned char *s)
    {
    unsigned long h = 0, g;
    while (*s)
        {
        h = (h << 4) + *s++;
        if (g = h & 0xf0000000)
            h ^= g >> 24;
        h &= ~g;
        }
    return h;
    }

No matter what algorithm you use for a hash function, its usefulness ultimately comes down to how many buckets you have in the table. A large number of buckets will use (probably waste) alot of space and will require a very carefully crafted hash function to make use of all the available indexes. Likewise, a small number of buckets will produce alot of collisions (i.e. different items that hash to the same bucket) no matter what hash function you use, good or sloppy.

The trick is to find a reasonable number of buckets to allocate and then use a hash function that works well with the type of list data and distributes evenly across the given number of buckets. The standard advice for hash tables is to allocate about twice as many buckets as expected list items. For example, if you expect to insert about 1000 items into the table, allocate about 2000 buckets to hold them. This may sound wasteful, but it's designed to avoid too many collisions. If your hash is working well enough, you might be able to lower this ratio. In any event, you'll want to make the number of buckets a prime number because this will also minimize the number of collisions.

Lookups

Generally, you want to do one of two things with a hash table: insert new items or lookup existing items. Since inserting new items also involves looking thru the table, both activities will need to perform lookups. This common code is usually factored out into a single 'lookup' function:

node *
lookup (char *s, bool create)
    {
    int   h = hash (s);
    node *t;
	
    // lookup string in current table
    for (t = Table[h]; t; t = t->next)
        {
        if (strcmp (t->string, s) == 0)
            return t;
        }
	
    // not found, so add it if specified
    if (create)
        {
        t = (node *) malloc (sizeof (node));
        if (t != NULL)
            {		
            t->string = strdup (s);
			
            t->next = Table[h];  // 't' is now the head
            Table[h] = t;
            }
        }
	
    return t;
    }

This assumes a hash table stored in a global array called Table, and where each node contains a string element called "string". The boolean argument determines whether the item should be inserted if not found.

Since the hash function gives the correct bucket for the item, the search loop is simply the standard 'find' algorithm for linked lists. If a new item is to be created, a new node is allocated and inserted into the bucket's list. This ensures that the list of pointers in Table[0] ... Table[N] always point to the list head.

Using this function, it is trivial to implement the 'insert' and 'find' functions:

void
insert (char *s)
    {
    lookup (s, true);
    }
bool
find (char *s)
    {
    return (lookup (s, false) != NULL);
    }

Summary

Hash tables are very easy to write and alot of fun to work with. They can be easily adapted to different purposes by just changing the the definition of the node structure to include whatever type of data you need to store.

You could, of course, try to write a generic hash table that can be used for any type of data. I've tried this myself on a few occasions, and I find that, generally, it's easier to reimplement a hash table for each particular program than it is to try to create a one-size-fits-all solution. They are so simple to make, it just isn't worth the bother, IMO.

The Standard Template Library (STL) in C++, and many scripting languages (e.g. Python) offer generic container objects such as sets, dictionaries, etc. These can be useful, but how do you suppose they are implemented? As hash tables, in most instances. It's in a programmer's best interest to understand how these basic data structures can be implemented and used. At the very least, it could make understanding the various container classes provided by your programming environment easier. At best, you might find that you can implement your own hash table that handles a given problem better than what was given to you.

I've written a sample program that implements a basic hash table for strings. The programs reads all the "words" from a given text file and then inserts them into a hash table. The size of the hash table is given on the command line. Some basic statistics are gathered and displayed. The contents of the hash table itself are written out to a text file called 'table.dump'. Enjoy.

Source Code:
hashtab.zip


What is an operating system?	by Michael Phipps

Merriam Webster (online) has this to say:

Function: noun
Date: 1961
:software that controls the operation of a computer and directs the processing
of programs (as by assigning storage space in memory and controlling input and
output functions)

whatis.com has a lot more to say, referring to GUIs, parallel processing, and more. Still, all in all, operating systems aren't what they used to be. When MS-DOS first came out, it shipped with very little in the way of useful programs. Edlin and debug do not constitute useful to most people. MS-DOS, out of the box, provided an environment for software to run in. Some abstraction of the hardware, a file system and some small ability to maintain your files.

MacOS raised the bar a great deal, bringing not only a GUI to the table, but also two "real" applications -- MacWrite and MacPaint. Not to be left behind, Microsoft introduced Windows, which shipped with Paint and Write (great names, folks). Along with MineSweeper and Solitare. Woohoo. Ever since, the ante has been upped every release. Microsoft now tells us that it isn't really an operating system without a browser. Apple tells us that unless you can burn DVDs, your computer is inferior. Viewing movies, making movies, browsing web pages, reading email, all of this out of the box.

It seems like an operating system is much like a motor of yesteryear. You see, you used to be able to buy a motor. Plug it in and it spins. What good is that? None. You had to buy attachments for them. Mixing attachments. Fan blades. Etc. Now, people buy a fan or a mixer. The operating system isn't going the way of the motor, though. Motors became cheaper. Operating systems are becoming more expensive to build. Instead of selling a word processor with a built-in OS (the internet appliance way), people want an operating system with a word processor built in.

Where does that leave our baby, BeOS (or even OpenBeOS)? R5 shipped with something in the neighborhood of 50 applications, Not bad, for an OS with no applications. Still, the ability to work out of the box is an interesting concept. The big boys out there seem to have a very maximal view of operating systems. I honestly think that Microsoft would bundle Office with Windows, if they thought that it could make that fly.

The opposite perspective would be to have a minimal install of the operating system. The Joe Friday "just the facts" approach. A tiny download of a few hundred K that contains the kernel, the servers and preferences apps. Maybe bash is a separate download. Maybe GNU command line tools are a separate download. Net+? Separate download. Ditto for development tools.

Some sort of a high quality online package manager would be needed to make the install/backout of apps smooth and easy. People could custom design their environment. Suddenly have a need to use IRC? Fire up the package manager and grab the IRC client. Of course, all of this could be provided on CDs, assuming that pressed CDs were ever to be made.

In fact, because of the nature of BeOS and its server based functionality, it may well be possible to upgrade whole pieces of the OS (new servers) right from this package manager. The package manager should be able to check online, as well, to see if there are upgrades for packages that you already have installed. Configurably, maybe even upgrading the application for you while you are sleeping, keeping the old one for 30 days or some such criteria.

The issue that could well stand in the way between a nice editorial and the reality is money. As much as we, the open source community like to forget about money, it is a real concern. We depend on (sponge off of, really) sourceforge and others. We use their bandwidth, their tech support and their services to build our product. I doubt, though, that sourceforge would be amenable to hosting a package management system of the sort that I describe. Even so, could one place commercial applications there? And doesn't BeBits do this for us, today?

All good questions for the community to debate. Mini vs maxi? Free vs commercial? Centralized applications vs distributed on individual web sites? I look forward to your comments.