Be Newsletters - Volume 4: 1999

Issue 4-28, July 14, 1999

Be Engineering Insights: BeOS Kernel Programming Part IV: Bus Managers

By Brian Swetland

This week we will once again delve into the mysteries of writing code that operates in the realm of the BeOS kernel. Recently, Ficus expounded on kernel modules in his article

Be Engineering Insights: Programmink Ze Quernelle, Part 3: Le Module

We'll dig a bit deeper into this subject by looking at a specific class of modules—bus managers.

Bus managers provide an abstraction layer to drivers. Instead of burdening drivers with the knowledge of how to interact with devices on a bus (e.g., PCI, ISA, SCSI, USB), a common interface to each class of bus is provided.

The PCI bus manager allows a driver to locate and interact with devices on the PCI bus (consult your local header file /boot/develop/headers/be/drivers/PCI.h for gory details). Once a driver obtains a pci_module_info pointer from get_module(B_PCI_MODULE_NAME, (module_info **) &pci), it may use provided functions like get_nth_pci_info() to iterate over devices on the bus. Once a device is located, functions like read_io_8(), write_io_32(), and friends allow I/O space access (on x86 platforms, where there is a separate I/O space).

PCI and ISA are fairly boring bus managers. Where bus managers get interesting is when they have two sides. On the "top" side, an interface is provided to drivers. On the "bottom" side, an interface is provided to busses. A driver using a bus manager like this needs to know nothing about how the underlying bus works—it only needs to know about the interface provided by the bus manager.

Brian Talks About SCSI—No Surprises Here

Let's take a concrete example—SCSI. The SCSI standard defines a number of things, but the most interesting to a driver writer is a command protocol. To interact with a SCSI device, a 6, 10, or 12 byte command is sent to the device, some data is sent or received (depending on the command), and a response code is returned. The specification defines the gritty details of the electrical and signaling properties of the bus, arbitration, and other stuff that someone writing a tape driver doesn't want to worry about.

SCSI devices all accept the same basic commands and adhere to the electrical and signaling standards defined in the specification. SCSI controller cards all use totally different mechanisms for issuing these commands (ranging from twiddling the bits almost directly to passing commands to an intelligent controller using an elegant shared-memory mailbox mechanism).

The SCSI bus manager provides an interface to drivers that adheres to the Common Access Method (ANSI X3.232-1996). ^[3] It uses three function calls: one to allocate a command block, one to issue a command, and one to release the command block afterwards. The command block is filled with the SCSI command, information about the data to send or receive, and some other useful flags. The command block refers to the SCSI device using three numbers: the path (which maps to a specific hardware controller), the target (the SCSI ID of a device on the bus), and the lun (the SCSI Logical Unit Number, which is 0 for most devices). The driver using the SCSI bus manager need not concern itself with what hardware is actually associated with a particular path.

/* from /boot/develop/headers/be/drivers/CAM.h */

struct cam_for_driver_module_info {
    bus_manager_info minfo;
    CCB_HEADER * (*xpt_ccb_alloc)(void);
    void (*xpt_ccb_free)(void *ccb);
    long (*xpt_action)(CCB_HEADER *ccbh);
};

#define B_CAM_FOR_DRIVER_MODULE_NAME \
    "bus_managers/scsi/driver/v1"

The SCSI bus manager doesn't know how to talk to any specific host controllers. It relies on a number of modules (living in /system/add-ons/kernel/busses/scsi) to actually speak to the hardware. The bus manager is a bookkeeper—it loads the scsi bus modules and lets them search for supported hardware. If they find any, they inform the bus manager with the xpt_bus_register() call and are kept loaded. If not, they're unloaded. Registered busses are assigned a number by which drivers can address them.

/* from /boot/develop/headers/be/drivers/CAM.h */

struct cam_for_sim_module_info {
    bus_manager_info minfo;
    long (*xpt_bus_register) (CAM_SIM_ENTRY *sim);
    long (*xpt_bus_deregister) (long path);
};

#define B_CAM_FOR_SIM_MODULE_NAME \
    "bus_managers/scsi/sim/v1"

The SCSI bus manager itself exists as a module which lives at /system/add-ons/kernel/bus_managers/scsi. It exports two distinct modules (bus_managers/scsi/driver/v1 and bus_managers/scsi/sim/v1) from one binary; this is a neat feature of the kernel module system.

A Module Of Many Names

How does get_module(B_CAM_FOR_DRIVER_MODULE_NAME, &cam) actually get the right module, you may ask? The module manager looks for modules by first prepending the user config directory for kernel add-ons and then prepending the system config directory for kernel add-ons to the module name and looking for a binary. If that doesn't exist, subpaths are sliced off the end until a match is found (or there are no more subpaths). So in the case of this get_module() call, the following items are attempted:

/boot/home/config/add-ons/kernel/bus_managers/scsi/driver/v1
/boot/home/config/add-ons/kernel/bus_managers/scsi/driver
/boot/home/config/add-ons/kernel/bus_managers/scsi
/boot/home/config/add-ons/kernel/bus_managers
/boot/beos/system/add-ons/kernel/bus_managers/scsi/driver/v1
/boot/beos/system/add-ons/kernel/bus_managers/scsi/driver
/boot/beos/system/add-ons/kernel/bus_managers/scsi

Here it stops because a file is found. Within that binary is a symbol called "modules" that contains the list of modules which exist in the binary:

module_info *modules[] =
{
    (module_info *) &cam_for_sim_module,
    (module_info *) &cam_for_driver_module,
    NULL
};

Should the correct module not exist in this list, the module manager continues to look until it exhausts all the possible binary names. At that point it must report failure.
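The path slicing described above can be sketched in a few lines of portable C. This is only an illustration of the search order, not the module manager's actual code: candidate_path() and print_candidates() are hypothetical helpers, and the real kernel also checks whether each binary exists and whether it contains the requested module.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): build the idx-th candidate
 * binary path for module `name` under add-on directory `root`.
 * Candidate 0 is the full name; each later candidate drops one more
 * trailing subpath. Returns 1 on success, 0 when no subpaths remain
 * or the result does not fit in `out`.
 */
static int
candidate_path(const char *root, const char *name, int idx,
    char *out, size_t outlen)
{
    size_t len = strlen(name);
    int n;

    assert(root != NULL && out != NULL && idx >= 0);
    while (idx-- > 0) {
        /* back up to the last '/' still inside the name */
        while (len > 0 && name[len - 1] != '/')
            len--;
        if (len == 0)
            return 0;   /* nothing left to slice off */
        len--;          /* drop the '/' as well */
    }
    if (len == 0)
        return 0;
    n = snprintf(out, outlen, "%s/%.*s", root, (int)len, name);
    return n > 0 && (size_t)n < outlen;
}

/* Print every candidate for one root, in the order they would be tried. */
static void
print_candidates(const char *root, const char *name)
{
    char path[256];
    int i;

    for (i = 0; candidate_path(root, name, i, path, sizeof(path)); i++)
        printf("%s\n", path);
}
```

Calling print_candidates() first with /boot/home/config/add-ons/kernel and then with /boot/beos/system/add-ons/kernel walks the same sequence as the list above, except that the real search stops at the first binary that exists and exports the requested name.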

The module_info structures referred to here include their full name (e.g., bus_managers/scsi/driver/v1), which is used to determine which module is desired. This feature allows a single module binary to support two different interfaces (like the SCSI bus manager's driver and bus interfaces) or different versions of the same interface. What if we wanted to provide a new version of the driver interface (bus_managers/scsi/driver/v2)? We could include it in the modules list alongside the old interface (kept to provide backward compatibility). Old drivers would get the old version, new drivers would get the new one. Everyone would be happy.
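To make the name matching concrete, here is a minimal sketch of picking one module out of a NULL-terminated list by its full name. The module_info type below is a stripped-down stand-in that keeps only a name field, find_module() is a hypothetical helper, and the v2 entry is the imagined future interface from the paragraph above; none of this is the module manager's real code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's module_info; only the name
 * matters for this sketch. */
typedef struct {
    const char *name;
} module_info;

/* Hypothetical helper: return the entry whose name matches exactly,
 * or NULL if the list does not contain it. */
static module_info *
find_module(module_info **list, const char *name)
{
    assert(list != NULL && name != NULL);
    for (; *list != NULL; list++) {
        if (strcmp((*list)->name, name) == 0)
            return *list;
    }
    return NULL;
}

static module_info sim_v1    = { "bus_managers/scsi/sim/v1" };
static module_info driver_v1 = { "bus_managers/scsi/driver/v1" };
static module_info driver_v2 = { "bus_managers/scsi/driver/v2" }; /* imagined */

/* One binary, three names: old and new drivers each find their own. */
static module_info *modules[] = {
    &sim_v1,
    &driver_v1,
    &driver_v2,
    NULL
};
```

With this list, find_module(modules, "bus_managers/scsi/driver/v1") hands an old driver the old interface, while a request for a name that is not in the list returns NULL, at which point the search would move on to the next candidate binary.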

More About Finding Modules

get_module() is well and good if you happen to know the name of the module you're looking for (which is the case for a driver hunting for the SCSI or PCI bus manager). What can you do if you don't know the name? The SCSI bus manager needs to try to load all available SCSI busses, but looking for them can be tricky. Luckily the kernel provides some handy tools:

/* somewhere in the bowels of scsi_cam.c */

void *ml;
size_t sz;
char name[B_PATH_NAME_LENGTH];


ml = open_module_list("busses/scsi/");
while((sz = B_PATH_NAME_LENGTH) &&
  (read_next_module_name(ml,name,&sz) == B_OK)){
  cam_load_module(name);
}
close_module_list(ml);

This snippet of code allows the SCSI bus manager to iterate over all available modules that have names starting with busses/scsi. The function cam_load_module() actually does a get_module() on the current module and sees if it registers itself with the SCSI bus manager or not.

Erratum for Be Engineering Insights: Device Drivers

By Rico Tudor

The close() function in the sample driver that appeared in my article last week contains an error. In addition to providing a correction, this erratum includes a full explanation of the device open/close protocol.

When user code invokes system call open(), a new "session" begins with a call to device open(). This results in the creation of a cookie (if you desire), and a file descriptor with reference count 1.

A subsequent call to system call dup() increments that ref count. Creating a new team with system call load_image() or fork() duplicates all the parent's file descriptors for use by the child, raising the ref counts accordingly.

The ref count of a file descriptor is decremented on the explicit call of system call close(), or the implicit system call close() when a team exits. A system call close() of a file descriptor with ref count 1 will trigger the device close. For a team with two or more threads, that final system call close() can create a subtle situation: one thread is blocked or running in one of the device read/write/ioctl functions, when the other thread calls device close.

The sole purpose of device close is to encourage blocked calls—of the same session—to unblock and go home.

Once the driver is free of all other callers for a given session, driver shutdown can proceed to the final stage: disabling the hardware, removing interrupt handlers, and freeing kernel resources. Your driver performs these chores in the device free function. The kernel guarantees that the call is single-threaded with respect to this session. Of course, you must remain vigilant for the threads of other sessions, and protect shared data appropriately: an example is ocsem protecting nopen.
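The ordering of these calls is easy to get wrong, so here is a small user-space model of one session. It is a sketch under stated assumptions, not kernel code: struct session and every session_*() function are hypothetical, and the counters simply record when the device close and device free hooks would run.

```c
#include <assert.h>

/* Hypothetical model of one open session; not a kernel structure. */
struct session {
    int refs;         /* reference count on the file descriptor    */
    int in_driver;    /* threads currently inside read/write/ioctl */
    int close_calls;  /* how many times the device close hook ran  */
    int free_calls;   /* how many times the device free hook ran   */
};

/* System call open(): new session, ref count 1. */
static void
session_open(struct session *s)
{
    s->refs = 1;
    s->in_driver = 0;
    s->close_calls = 0;
    s->free_calls = 0;
}

/* System call dup(), or a new team inheriting the descriptor. */
static void
session_dup(struct session *s)
{
    assert(s->refs > 0);
    s->refs++;
}

/* A thread enters a device read/write/ioctl function. */
static void
session_io_begin(struct session *s)
{
    assert(s->refs > 0);
    s->in_driver++;
}

/* Device free runs once: session closed and no caller left inside. */
static void
maybe_free(struct session *s)
{
    if (s->refs == 0 && s->in_driver == 0 && s->free_calls == 0)
        s->free_calls++;
}

/* The thread leaves the device function. */
static void
session_io_end(struct session *s)
{
    assert(s->in_driver > 0);
    s->in_driver--;
    maybe_free(s);
}

/* System call close(): only the last one triggers device close. */
static void
session_close(struct session *s)
{
    assert(s->refs > 0);
    if (--s->refs > 0)
        return;
    s->close_calls++;   /* device close: nudge blocked callers home */
    maybe_free(s);
}
```

In this model, an open followed by a dup needs two closes before device close runs, and if a thread is still inside a write at that point, device free waits until session_io_end() has been called for it.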

The original device write function had a race condition between acquire_sem_etc() and has_signals_pending(), causing wbsem to miscount. has_signals_pending() is a function internal to the kernel, hence it is not documented and it should not be used. Instead, the revised code uses the return status of acquire_sem_etc(): now the semaphore is unchanged should an error occur. Not shown here is a mechanism to unblock I/O threads: you'll want one for drivers with slow, blocking I/O. Even fast devices would benefit if you expect lost interrupts or other erroneous hardware behavior.

static status_t
qq_close( void *v)
{
    return (B_OK);
}

static status_t
qq_free( void *v)
{
    struct client *c = v;
    struct device *d = c->d;
    acquire_sem( d->ocsem);
    if (--d->nopen == 0) {
        (*isa->write_io_8)( d->ioport+MCR, 0);
        (*isa->write_io_8)( d->ioport+IER, 0);
        remove_io_interrupt_handler( d->irq, qq_int, d);
    }
    release_sem( d->ocsem);
    free( v);
    return (B_OK);
}

static status_t
qq_write( void *v, off_t o, const void *buf, size_t *nbyte)
{
    cpu_status cs;
    struct client *c = v;
    struct device *d = c->d;
    uint n = 0;
    while (n < *nbyte) {
        status_t s = acquire_sem_etc( d->wbsem, 1,
                                    B_CAN_INTERRUPT, 0);
        if (s < B_OK) {
            *nbyte = n;
            return (s);
        }
        d->wcur = 0;
        d->wmax = min( *nbyte-n, sizeof( d->wbuf));
        memcpy( d->wbuf, (uchar *)buf+n, d->wmax);
        (*isa->write_io_8)( d->ioport+IER, IER_THRE);
        acquire_sem( d->wfsem);
        n += d->wmax;
        release_sem( d->wbsem);
    }
    return (B_OK);
}

Developers' Workshop: Copying Files

By Christopher Tate

In my last column, I provided a simple function called CopyFile(). Unsurprisingly, this function copies a file under the BeOS, including not only the "ordinary" data of the file but also any attributes that the file may include. This week, I'll extend the function to attempt to discern whether there is enough space on the destination volume for the file before actually performing the copy. The source code for this extended version of CopyFile() is available on the Be FTP site at this URL:

<ftp://ftp.be.com/pub/samples/storage_kit/CopyFile.zip>

Let's look at the new function prototype first:

status_t CopyFile(const entry_ref& source,
    const entry_ref& dest,
    void* buffer = NULL,
    size_t bufferSize = 0,
    bool preflight = false,
    bool createIndices = false);

There are two new arguments since last time: preflight and createIndices. The first of these specifies whether or not to analyze the source file to determine whether it will fit on the destination volume; the second indicates whether the copy routine should ensure that file attributes which are indexed on the source volume are also indexed on the destination volume.

I'll discuss indices a little later; first let's look at preflighting. In CopyFile.cpp there's a function called preflight_file_size() which estimates the storage required for a given file. I'll walk through its implementation briefly, starting with its prototype:

status_t preflight_file_size(const entry_ref& fileRef,
    const fs_info& srcInfo,
    const fs_info& destInfo,
    off_t* outBlocksNeeded);

First off, note that two of the arguments are fs_info structures. These structures, obtained through the Storage Kit's fs_stat_dev() function, describe whole file systems. Not all BFS volumes are the same; in particular, BFS supports a variety of fundamental block sizes. Storage on a disk isn't continuous; it's divided into discrete units called "blocks." Everything on a disk (file data, attribute data, and file system control structures) occupies an integral number of blocks. A given file's data will consume more or fewer disk blocks depending on the destination volume's block size. For this reason, the preflight operation calculates the file's storage requirements in terms of blocks, not bytes.

Calculating the number of blocks consumed for the file's ordinary data is easy:

struct stat info;
off_t diskBlockSize = destInfo.block_size;
BFile file(&fileRef, O_RDONLY);
file.GetStat(&info);
off_t blocksNeeded = 1 + (off_t) ceil(double(info.st_size) /
    double(diskBlockSize));

BFile::GetStat() returns a variety of useful information about a file; in this case we only care about its data size. We then count how many blocks the data will require, rounding up. The extra 1 is because there is a one-block file system control structure called an "inode" for every file. The inode is where the file system stores various information about the file such as when it was last modified, what user "owns" it, whether it's write-protected, etc.
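If you'd rather not go through floating point, the same count can be done with integer arithmetic. The following is a sketch in plain C, with blocks_for_data() as a hypothetical helper name and int64_t standing in for off_t; it gives the same result as the ceil() expression for non-negative sizes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: blocks needed for `bytes` of file data on a
 * volume with the given block size, plus one block for the inode.
 */
static int64_t
blocks_for_data(int64_t bytes, int64_t block_size)
{
    assert(bytes >= 0 && block_size > 0);
    /* integer round-up division, then the one-block inode */
    return 1 + (bytes + block_size - 1) / block_size;
}
```

For example, a 2500-byte file on a volume with 1024-byte blocks needs three data blocks plus the inode, so blocks_for_data(2500, 1024) is 4; an empty file still costs the one inode block.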

In the case of a destination file system that does not support attributes, we're done counting blocks. However, the usual case of copying files onto BFS volumes is more interesting; attribute storage under BFS is complex. Proceeding through the code we see that preflight_file_size() checks whether the destination volume supports attributes by verifying that the B_FS_HAS_ATTR flag is set in its fs_info structure, then enters the following loop:

off_t fastSpaceUsed = 0;
char attrName[B_ATTR_NAME_LENGTH];
file.RewindAttrs();
while ((err = file.GetNextAttrName(attrName)) == B_NO_ERROR)
{
    const size_t FAST_ATTR_OVERHEAD = 9;
    const off_t  INODE_SIZE = 232;

    attr_info info;
    file.GetAttrInfo(attrName, &info);
    off_t fastLength = info.size + strlen(attrName) +
        FAST_ATTR_OVERHEAD;
    if (fastSpaceUsed + fastLength < diskBlockSize - INODE_SIZE)
    {
        fastSpaceUsed += fastLength;
    }
    else
    {
        blocksNeeded += 1 + off_t(ceil(double(info.size) /
            double(diskBlockSize)));
    }

    // index calculations; see below
}

The principle of iterating over the source file's attributes is one we saw last time, but the rest of the code deserves some explanation. BFS uses two different schemes for storing attributes: a "fast" scheme in which attributes are actually stored in leftover space within the file's main inode, and a slower scheme once that space fills up. The above code calculates how much fast-area storage would be required to hold the attribute, then checks to see whether there's enough fast-area space available. If so, the attribute's storage requirements are added to the fast-area tally, otherwise a separate attribute inode and attribute data blocks will be used for it. In that case, the storage requirement calculation is the same as for the file's data portion.

The nine-byte FAST_ATTR_OVERHEAD constant is calculated based on the scheme that BFS uses for storing attributes in the fast area. The overhead is 4 bytes for the attribute's type, 2 bytes for a name-length indicator, 2 bytes for the attribute's data length, plus 1 byte for a trailing NUL at the end of the attribute's name. Similarly, the current size of the BFS inode structure is 232 bytes; whatever is left over in the inode block is available for fast attribute storage.
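Here is the fast-area decision worked through with actual numbers, as a standalone C sketch. The two constants are the ones quoted above; struct attr_tally and tally_attribute() are hypothetical names that mirror the loop body without touching the Storage Kit, and int64_t stands in for off_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    FAST_ATTR_OVERHEAD = 9,   /* 4 type + 2 name length + 2 data length + 1 NUL */
    INODE_SIZE = 232          /* BFS inode size quoted in the article */
};

/* Running totals for one file's attributes (hypothetical structure). */
struct attr_tally {
    int64_t fast_used;   /* bytes claimed in the inode's leftover space */
    int64_t blocks;      /* blocks for attributes stored the slow way   */
};

/* Account for one attribute, using the same test as the loop above. */
static void
tally_attribute(struct attr_tally *t, const char *name, int64_t size,
    int64_t block_size)
{
    int64_t fast_length;

    assert(t != NULL && name != NULL);
    assert(size >= 0 && block_size > INODE_SIZE);

    fast_length = size + (int64_t)strlen(name) + FAST_ATTR_OVERHEAD;
    if (t->fast_used + fast_length < block_size - INODE_SIZE) {
        /* fits in the fast area: no extra blocks */
        t->fast_used += fast_length;
    } else {
        /* separate attribute inode plus rounded-up data blocks */
        t->blocks += 1 + (size + block_size - 1) / block_size;
    }
}
```

On a volume with 1024-byte blocks the fast area holds 1024 - 232 = 792 bytes. A 20-byte attribute named "BEOS:TYPE" costs 20 + 9 + 9 = 38 bytes there and no blocks. A following 800-byte attribute named "big" would need 812 bytes, which no longer fits, so it is charged one inode block plus one data block.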

There is one more feature of the Be file system that complicates storage requirement estimation, and that's the concept of indexed attributes. BFS can be instructed to maintain an index of all files that contain a particular named attribute; this index then allows the file system to search for files whose attributes match various criteria by scanning the indices rather than having to scan all files. This is the secret of the Tracker's blazingly fast "Find..." capability. The cost of this feature is disk space: the file system has to duplicate the attribute data within the index structures on disk.

To estimate the amount of extra storage needed for indexed attributes, the preflight routine keeps track of the cumulative size of all indexed attributes within the attribute-scanning loop:

// index calculations
struct index_info indexInfo;
if (!fs_stat_index(srcInfo.dev, attrName, &indexInfo))
{
    indexedData += min(info.size, 256);
}

then, after all attributes have been examined:

blocksNeeded += 2 * indexedData / diskBlockSize + 2;

Indexed attribute data is duplicated within the volume's indices, up to a maximum of 256 bytes per attribute. We keep track of the total amount of data that will be added to indices by the copy operation, then estimate its storage cost as twice the number of blocks necessary to hold the data contiguously, plus a small amount to account for indexing of non-attribute data such as file names and modification times. The exact number of blocks required cannot be determined accurately because it depends on the state of the indices prior to the copy operation. Double the minimum is an educated guess based on the observation that, on average, the blocks within a given index structure under BFS tend to be about half-full.
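A quick worked example may help here. The sketch below restates the two index expressions as plain C functions; indexed_bytes() and index_blocks() are hypothetical names, int64_t stands in for off_t, and the arithmetic is the same integer arithmetic as in the snippets above.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INDEXED_BYTES = 256 };   /* per-attribute cap quoted in the article */

/* Bytes one indexed attribute adds to the indexed-data tally. */
static int64_t
indexed_bytes(int64_t attr_size)
{
    assert(attr_size >= 0);
    return attr_size < MAX_INDEXED_BYTES ? attr_size : MAX_INDEXED_BYTES;
}

/* Same expression as above: 2 * indexedData / diskBlockSize + 2. */
static int64_t
index_blocks(int64_t indexed_data, int64_t block_size)
{
    assert(indexed_data >= 0 && block_size > 0);
    return 2 * indexed_data / block_size + 2;
}
```

Suppose a file carries three indexed attributes of 40, 300, and 1000 bytes. They contribute 40 + 256 + 256 = 552 bytes of indexed data. On a volume with 1024-byte blocks, 2 * 552 / 1024 is 1 under integer division, so the estimate adds 1 + 2 = 3 blocks.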

It bears repeating that these are *estimates*, not precise calculations. This code may overestimate by a few blocks the amount of storage that will actually be consumed by the copy operation. That's as close an estimate as possible, since certain aspects of the file system, particularly directory and index management, are somewhat nondeterministic. Because conservative estimates are safe, this is an acceptable inaccuracy for our purposes.

Now that I've shown you what's involved in predetermining the amount of storage required for a file, a word of caution is in order. Especially for files with large numbers of attributes, this preflighting function will be SLOW. One of the slowest operations in any file system is the "stat" function, examining a file's vital statistics. Under BFS, because attributes are practically files in themselves, getting attribute information via BNode::GetAttrInfo() is just as bad as BFile::GetStat() for performance. Note that the Tracker itself doesn't use this intricate procedure to preflight disk requirements for a copy operation; it uses a simpler, more conservative heuristic instead—one which doesn't involve getting info on all attributes. But if you know that you're copying files with large amounts of attribute data, or large numbers of files (such as an installer program might do), the more accurate estimation that I've presented here might be worth the trouble.

^[3] BeOS CAM Non-compliance (for nit-pickers)

When I said that the SCSI bus manager adhered to CAM, I lied. Our implementation diverges from the spec in several important ways:

The Execute SCSI IO request is Synchronous (no callback or polling as defined in the spec). You get to wait.
You must lock all memory that will be involved in a SCSI data transfer and should do so BEFORE allocating the CCB.
We currently ignore the no-disconnect, sync/no-sync, and tagged-queueing flags. Our SIMs will negotiate for the best protocol and speed they can handle.