This week we will once again delve into the mysteries of writing code that operates in the realm of the BeOS kernel. Recently, Ficus expounded on kernel modules in his article "Be Engineering Insights: Programmink Ze Quernelle, Part 3: Le Module."
We'll dig a bit deeper into this subject by looking at a specific class of modules—bus managers.
Bus managers provide an abstraction layer to drivers. Instead of burdening drivers with the knowledge of how to interact with devices on a bus (e.g., PCI, ISA, SCSI, USB), a bus manager provides a common interface to each class of bus.
The PCI bus manager allows a driver to locate and interact with devices on the PCI bus (consult your local header file /boot/develop/headers/be/drivers/PCI.h for gory details). Once a driver obtains a pci_module_info pointer from get_module(B_PCI_MODULE_NAME, (module_info **) &pci), it may use provided functions like get_nth_pci_info() to iterate over devices on the bus. Once a device is located, functions like read_io_8(), write_io_32(), and friends allow iospace access (on x86 platforms where there is a separate iospace).
PCI and ISA are fairly boring bus managers. Where bus managers get interesting is when they have two sides. On the "top" side, an interface is provided to drivers. On the "bottom" side, an interface is provided to busses. A driver using a bus manager like this needs to know nothing about how the underlying bus works—it only needs to know about the interface provided by the bus manager.
Let's take a concrete example—SCSI. The SCSI standard defines a number of things, but the most interesting to a driver writer is a command protocol. To interact with a SCSI device, a 6, 10, or 12 byte command is sent to the device, some data is sent or received (depending on the command), and a response code is returned. The specification defines the gritty details of the electrical and signaling properties of the bus, arbitration, and other stuff that someone writing a tape driver doesn't want to worry about.
SCSI devices all accept the same basic commands and adhere to the electrical and signaling standards defined in the specification. SCSI controller cards all use totally different mechanisms for issuing these commands (ranging from twiddling the bits almost directly to passing commands to an intelligent controller using an elegant shared-memory mailbox mechanism).
The SCSI bus manager provides an interface to drivers that adheres to the Common Access Method (ANSI X3.232-1996). [3] It uses three function calls: one to allocate a command block, one to issue a command, and one to release the command block afterwards. The command block is filled with the SCSI command, information about the data to send or receive, and some other useful flags. The command block refers to the SCSI device using three numbers: the path (which maps to a specific hardware controller), the target (the SCSI ID of a device on the bus), and the lun (the SCSI Logical Unit Number, which is 0 for most devices). The driver using the SCSI bus manager need not concern itself with what hardware is actually associated with a particular path.
/* from /boot/develop/headers/be/drivers/CAM.h */
struct cam_for_driver_module_info {
    bus_manager_info minfo;
    CCB_HEADER * (*xpt_ccb_alloc)(void);
    void (*xpt_ccb_free)(void *ccb);
    long (*xpt_action)(CCB_HEADER *ccbh);
};
#define B_CAM_FOR_DRIVER_MODULE_NAME \
    "bus_managers/scsi/driver/v1"
The SCSI bus manager doesn't know how to talk to any specific host controllers. It relies on a number of modules (living in /system/add-ons/kernel/busses/scsi) to actually speak to the hardware. The bus manager is a bookkeeper: it loads the SCSI bus modules and lets them search for supported hardware. If they find any, they inform the bus manager with the xpt_bus_register() call and are kept loaded. If not, they're unloaded. Registered busses are assigned a number by which drivers can address them.
/* from /boot/develop/headers/be/drivers/CAM.h */
struct cam_for_sim_module_info {
    bus_manager_info minfo;
    long (*xpt_bus_register)(CAM_SIM_ENTRY *sim);
    long (*xpt_bus_deregister)(long path);
};
#define B_CAM_FOR_SIM_MODULE_NAME \
    "bus_managers/scsi/sim/v1"
The SCSI bus manager itself exists as a module which lives at /system/add-ons/kernel/bus_managers/scsi. It exports two distinct modules (bus_managers/scsi/driver/v1 and bus_managers/scsi/sim/v1) from one binary (this is a neat feature of the kernel module system).
How does get_module(B_CAM_FOR_DRIVER_MODULE_NAME, &cam) actually get the right module, you may ask? The module manager looks for modules by prepending first the user config directory for kernel add-ons, and then the system config directory for kernel add-ons, to the module name and looking for a binary. If that doesn't exist, subpaths are sliced off the end until a match is found (or there are no more subpaths). So in the case of this get_module() call, the following items are attempted:
/boot/home/config/add-ons/kernel/bus_managers/scsi/driver/v1
/boot/home/config/add-ons/kernel/bus_managers/scsi/driver
/boot/home/config/add-ons/kernel/bus_managers/scsi
/boot/home/config/add-ons/kernel/bus_managers
/boot/beos/system/add-ons/kernel/bus_managers/scsi/driver/v1
/boot/beos/system/add-ons/kernel/bus_managers/scsi/driver
/boot/beos/system/add-ons/kernel/bus_managers/scsi
Here it stops because a file is found. Within that binary is a symbol called "modules" that contains the list of modules which exist in the binary:
module_info *modules[] = {
    (module_info *) &cam_for_sim_module,
    (module_info *) &cam_for_driver_module,
    NULL
};
Should the correct module not exist in this list, the module manager continues to look until it exhausts all the possible binary names. At that point it must report failure.
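The search order described above can be sketched in a few lines of portable C. This is only an illustration of the slicing rule, not the module manager's actual code: candidate_path() and kAddonRoots are names I made up, and the two root directories are simply the ones that appear in the list of attempted paths.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed search roots, copied from the attempted paths listed above. */
static const char *const kAddonRoots[] = {
    "/boot/home/config/add-ons/kernel",
    "/boot/beos/system/add-ons/kernel",
};

/*
 * Hypothetical helper: build the index-th candidate file name for
 * module_name under root. Candidate 0 is the full module name; each later
 * candidate has one more trailing subpath sliced off. Returns 1 on success,
 * 0 when there is nothing left to slice or out is too small.
 */
static int
candidate_path(const char *root, const char *module_name, int index,
    char *out, size_t out_size)
{
    size_t len;
    int i, n;

    assert(root != NULL && module_name != NULL && out != NULL && index >= 0);

    len = strlen(module_name);
    for (i = 0; i < index; i++) {
        /* Walk back to the last '/' in the remaining prefix. */
        while (len > 0 && module_name[len - 1] != '/')
            len--;
        if (len == 0)
            return 0;   /* no more subpaths */
        len--;          /* drop the '/' itself */
    }
    if (len == 0)
        return 0;

    n = snprintf(out, out_size, "%s/%.*s", root, (int)len, module_name);
    return n >= 0 && (size_t)n < out_size;
}
```

Calling candidate_path() with index 0, 1, 2, ... for each root in turn yields the same sequence of names as the list above; a real loader would stop at the first name that refers to an existing file and then consult that binary's modules list.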
The module_info structures referred to here include their full name (e.g., bus_managers/scsi/driver/v1), which is used to determine which module is desired. This feature allows a single module binary to support two different interfaces (like the SCSI bus manager's driver and bus interfaces) or different versions of the same interface. What if we wanted to provide a new version of the driver interface (bus_managers/scsi/driver/v2)? We could include it in the modules list alongside the old interface (kept to provide backward compatibility). Old drivers would get the old version, new drivers would get the new one. Everyone would be happy.
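To make the name matching concrete, here is a small stand-alone sketch of picking a module out of a NULL-terminated list by its full name. It is an assumption-laden toy: demo_module_info stands in for the real module_info (which carries more than a name), find_module() is a made-up helper, and the v2 entry is the hypothetical new interface from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for module_info; only the name matters here. */
typedef struct {
    const char *name;
} demo_module_info;

static demo_module_info sim_v1    = { "bus_managers/scsi/sim/v1" };
static demo_module_info driver_v1 = { "bus_managers/scsi/driver/v1" };
static demo_module_info driver_v2 = { "bus_managers/scsi/driver/v2" }; /* hypothetical */

/* One binary, several modules: old and new driver interfaces side by side. */
static demo_module_info *demo_modules[] = {
    &sim_v1,
    &driver_v1,
    &driver_v2,
    NULL
};

/* Return the entry whose full name matches, or NULL if the list lacks it. */
static demo_module_info *
find_module(demo_module_info **list, const char *name)
{
    assert(list != NULL && name != NULL);

    for (; *list != NULL; list++) {
        if (strcmp((*list)->name, name) == 0)
            return *list;
    }
    return NULL;
}
```

An old driver asking for bus_managers/scsi/driver/v1 and a new one asking for bus_managers/scsi/driver/v2 each get their own entry from the same list, while a name that isn't listed yields NULL, which is the cue to keep searching other binaries.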
get_module() is well and good if you happen to know the name of the module you're looking for (which is the case for a driver hunting for the SCSI or PCI bus manager). What can you do if you don't know the name? The SCSI bus manager needs to try to load all available SCSI buses, but looking for them can be tricky. Luckily the kernel provides some handy tools:
/* somewhere in the bowels of scsi_cam.c */
void *ml;
size_t sz;
char name[B_PATH_NAME_LENGTH];
ml = open_module_list("busses/scsi/");
/* sz must be reset to the buffer size before every read */
while ((sz = B_PATH_NAME_LENGTH) &&
       (read_next_module_name(ml, name, &sz) == B_OK)) {
    cam_load_module(name);
}
close_module_list(ml);
This snippet of code allows the SCSI bus manager to iterate over all available modules that have names starting with busses/scsi. The function cam_load_module() actually does a get_module() on the current module and sees if it registers itself with the SCSI bus manager or not.
The close() function in the sample driver that appeared in my article last week contains an error. In addition to providing a correction, this erratum includes a full explanation of the device open/close protocol.
When user code invokes the system call open(), a new "session" begins with a call to the device open(). This results in the creation of a cookie (if you desire), and a file descriptor with reference count 1.
A subsequent call to the system call dup() increments that ref count. Creating a new team with the system call load_image() or fork() duplicates all the parent's file descriptors for use by the child, raising the ref counts accordingly.
The ref count of a file descriptor is decremented on the explicit call of the system call close(), or the implicit system call close() when a team exits. A system call close() of a file descriptor with ref count 1 will trigger the device close. For a team with two or more threads, that final system call close() can create a subtle situation: one thread is blocked or running in one of the device read/write/ioctl functions when the other thread calls device close.
The sole purpose of device close is to encourage blocked calls—of the same session—to unblock and go home.
Once the driver is free of all other callers for a given session, driver shutdown can proceed to the final stage: disabling the hardware, removing interrupt handlers, and freeing kernel resources. Your driver performs these chores in the device free function. The kernel guarantees that the call is single-threaded with respect to this session. Of course, you must remain vigilant for the threads of other sessions, and protect shared data appropriately: an example is ocsem protecting nopen.
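The ordering described above (the last reference going away triggers device close, and device free waits until no other caller is left in the session) can be modeled with a toy, single-threaded state machine. This is only a sketch to make the sequence explicit; demo_session and the demo_* functions are invented for illustration and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one open session; every name here is hypothetical. */
typedef struct {
    int ref_count;      /* file descriptor references (open, dup, fork) */
    int calls_inside;   /* read/write/ioctl calls currently in the driver */
    int close_called;   /* device close has run */
    int free_called;    /* device free has run */
} demo_session;

/* Device free runs only after close, and only once the session is idle. */
static void
demo_maybe_free(demo_session *s)
{
    if (s->close_called && s->calls_inside == 0 && !s->free_called)
        s->free_called = 1;
}

static void
demo_open(demo_session *s)
{
    assert(s != NULL);
    s->ref_count = 1;
    s->calls_inside = 0;
    s->close_called = 0;
    s->free_called = 0;
}

static void
demo_dup(demo_session *s)
{
    assert(s != NULL && s->ref_count > 0);
    s->ref_count++;
}

/* Dropping the last reference triggers device close. */
static void
demo_close_fd(demo_session *s)
{
    assert(s != NULL && s->ref_count > 0);
    if (--s->ref_count == 0) {
        s->close_called = 1;
        demo_maybe_free(s);
    }
}

static void
demo_call_enter(demo_session *s)
{
    assert(s != NULL);
    s->calls_inside++;
}

static void
demo_call_exit(demo_session *s)
{
    assert(s != NULL && s->calls_inside > 0);
    s->calls_inside--;
    demo_maybe_free(s);
}
```

Walking through open, dup, a blocked write, and two closes shows device close firing on the second close while device free is held back until the write returns.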
The original device write function had a race condition between acquire_sem_etc() and has_signals_pending(), causing wbsem to miscount. has_signals_pending() is a function internal to the kernel, hence it is not documented and it should not be used. Instead, the revised code uses the return status of acquire_sem_etc(): now the semaphore is unchanged should an error occur. Not shown here is a mechanism to unblock I/O threads: you'll want one for drivers with slow, blocking I/O. Even fast devices would benefit if you expect lost interrupts or other erroneous hardware behavior.
static status_t qq_close(void *v)
{
    return (B_OK);
}

static status_t qq_free(void *v)
{
    struct client *c = v;
    struct device *d = c->d;

    acquire_sem(d->ocsem);
    if (--d->nopen == 0) {
        (*isa->write_io_8)(d->ioport + MCR, 0);
        (*isa->write_io_8)(d->ioport + IER, 0);
        remove_io_interrupt_handler(d->irq, qq_int, d);
    }
    release_sem(d->ocsem);
    free(v);
    return (B_OK);
}

static status_t qq_write(void *v, off_t o, const void *buf, size_t *nbyte)
{
    cpu_status cs;
    struct client *c = v;
    struct device *d = c->d;
    uint n = 0;

    while (n < *nbyte) {
        status_t s = acquire_sem_etc(d->wbsem, 1, B_CAN_INTERRUPT, 0);
        if (s < B_OK) {
            *nbyte = n;
            return (s);
        }
        d->wcur = 0;
        d->wmax = min(*nbyte - n, sizeof(d->wbuf));
        memcpy(d->wbuf, (uchar *)buf + n, d->wmax);
        (*isa->write_io_8)(d->ioport + IER, IER_THRE);
        acquire_sem(d->wfsem);
        n += d->wmax;
        release_sem(d->wbsem);
    }
    return (B_OK);
}
In my last column, I provided a simple function called CopyFile(). Unsurprisingly, this function copies a file under the BeOS, including not only the "ordinary" data of the file but also any attributes that the file may include. This week, I'll extend the function to attempt to discern whether there is enough space on the destination volume for the file before actually performing the copy. The source code for this extended version of CopyFile() is available on the Be FTP site at this URL:
<ftp://ftp.be.com/pub/samples/storage_kit/CopyFile.zip>
Let's look at the new function prototype first:
status_t CopyFile(const entry_ref& source, const entry_ref& dest,
    void* buffer = NULL, size_t bufferSize = 0,
    bool preflight = false, bool createIndices = false);
There are two new arguments since last time: preflight and createIndices. The first of these specifies whether or not to analyze the source file to determine whether it will fit on the destination volume; the second indicates whether the copy routine should ensure that file attributes which are indexed on the source volume are also indexed on the destination volume.
I'll discuss indices a little later; first let's look at preflighting. In CopyFile.cpp there's a function called preflight_file_size() which estimates the storage required for a given file. I'll walk through its implementation briefly, starting with its prototype:
status_t preflight_file_size(const entry_ref& fileRef,
    const fs_info& srcInfo, const fs_info& destInfo,
    off_t* outBlocksNeeded);
First off, note that two of the arguments are fs_info structures. These structures, obtained through the Storage Kit's fs_stat_dev() function, describe whole file systems. Not all BFS volumes are the same; in particular, BFS supports a variety of fundamental block sizes. Storage on a disk isn't continuous; it's divided into discrete units called "blocks." Everything on a disk (file data, attribute data, and file system control structures) occupies an integral number of blocks. A given file's data will consume more or fewer disk blocks depending on what the destination volume's block size is. For this reason, the preflight operation calculates the file's storage requirements in terms of blocks, not bytes.
Calculating the number of blocks consumed for the file's ordinary data is easy:
struct stat info;
off_t diskBlockSize = destInfo.block_size;
BFile file(&fileRef, O_RDONLY);
file.GetStat(&info);
off_t blocksNeeded = 1 + (off_t)ceil(double(info.st_size) / double(diskBlockSize));
BFile::GetStat() returns a variety of useful information about a file; in this case we only care about its data size. We then count how many blocks the data will require, rounding up. The extra 1 is because there is a one-block file system control structure called an "inode" for every file. The inode is where the file system stores various information about the file such as when it was last modified, what user "owns" it, whether it's write-protected, etc.
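If you'd like to check the arithmetic without a BFS volume at hand, the same calculation can be written as a small stand-alone function. This is a sketch, not code from CopyFile.cpp: data_blocks_needed() is a name I made up, and it uses integer round-up instead of ceil() on doubles, which gives the same answer for these inputs while avoiding floating point.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blocks needed for a file's ordinary data on a volume with the given
 * block size: one block for the inode plus the data size rounded up to
 * a whole number of blocks.
 */
static int64_t
data_blocks_needed(int64_t file_size, int64_t block_size)
{
    assert(file_size >= 0 && block_size > 0);

    /* (a + b - 1) / b is the integer form of ceil(a / b) for a >= 0. */
    return 1 + (file_size + block_size - 1) / block_size;
}
```

For example, a 5000-byte file needs 1 + 5 = 6 blocks on a volume with 1024-byte blocks, but only 1 + 3 = 4 blocks when the block size is 2048 bytes, which is why the estimate has to be made against the destination volume's block size.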
In the case of a destination file system that does not support attributes, we're done counting blocks. However, the usual case of copying files onto BFS volumes is more interesting; attribute storage under BFS is complex. Proceeding through the code we see that preflight_file_size() checks whether the destination volume supports attributes by verifying that the B_FS_HAS_ATTR flag is set in its fs_info structure, then enters the following loop:
off_t fastSpaceUsed = 0;
char attrName[B_ATTR_NAME_LENGTH];
file.RewindAttrs();
while ((err = file.GetNextAttrName(attrName)) == B_NO_ERROR) {
    const size_t FAST_ATTR_OVERHEAD = 9;
    const off_t INODE_SIZE = 232;
    attr_info info;
    file.GetAttrInfo(attrName, &info);
    off_t fastLength = info.size + strlen(attrName) + FAST_ATTR_OVERHEAD;
    if (fastSpaceUsed + fastLength < diskBlockSize - INODE_SIZE) {
        fastSpaceUsed += fastLength;
    } else {
        blocksNeeded += 1 + off_t(ceil(double(info.size) / double(diskBlockSize)));
    }
    // index calculations; see below
}
The principle of iterating over the source file's attributes is one we saw last time, but the rest of the code deserves some explanation. BFS uses two different schemes for storing attributes: a "fast" scheme in which attributes are actually stored in leftover space within the file's main inode, and a slower scheme once that space fills up. The above code calculates how much fast-area storage would be required to hold the attribute, then checks to see whether there's enough fast-area space available. If so, the attribute's storage requirements are added to the fast-area tally, otherwise a separate attribute inode and attribute data blocks will be used for it. In that case, the storage requirement calculation is the same as for the file's data portion.
The nine-byte FAST_ATTR_OVERHEAD constant is calculated based on the scheme that BFS uses for storing attributes in the fast area. The overhead is 4 bytes for the attribute's type, 2 bytes for a name-length indicator, 2 bytes for the attribute's data length, plus 1 byte for a trailing null byte at the end of the attribute's name. Similarly, the current size of the BFS inode structure is 232 bytes; whatever is left over in the inode block is available for fast attribute storage.
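Here is the per-attribute decision from the loop pulled out into a stand-alone function, so the numbers can be tried by hand. Treat it as a sketch under the same assumptions as the article's code (9 bytes of overhead, a 232-byte inode); attr_blocks_needed() and the DEMO_ constants are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    DEMO_FAST_ATTR_OVERHEAD = 9,    /* 4 type + 2 name length + 2 data length + 1 null */
    DEMO_INODE_SIZE = 232           /* inode size quoted in the text */
};

/*
 * Account for one attribute. If it fits in what remains of the fast area,
 * add it to *fast_used and return 0 extra blocks. Otherwise return the
 * blocks for a separate attribute inode plus its data, rounded up.
 */
static int64_t
attr_blocks_needed(int64_t data_size, const char *name, int64_t block_size,
    int64_t *fast_used)
{
    int64_t fast_length;

    assert(data_size >= 0 && name != NULL && fast_used != NULL);
    assert(block_size > DEMO_INODE_SIZE);

    fast_length = data_size + (int64_t)strlen(name) + DEMO_FAST_ATTR_OVERHEAD;
    if (*fast_used + fast_length < block_size - DEMO_INODE_SIZE) {
        *fast_used += fast_length;
        return 0;
    }
    return 1 + (data_size + block_size - 1) / block_size;
}
```

On a volume with 1024-byte blocks the fast area holds just under 792 bytes, so a 20-byte attribute named "BEOS:TYPE" costs 38 bytes of fast space and no extra blocks, whereas a 2000-byte attribute overflows the fast area and costs three blocks: one attribute inode plus two data blocks.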
There is one more feature of the Be file system that complicates storage requirement estimation, and that's the concept of indexed attributes. BFS can be instructed to maintain an index of all files that contain a particular named attribute; this index then allows the file system to search for files whose attributes match various criteria by scanning the indices rather than having to scan all files. This is the secret of the Tracker's blazingly fast "Find..." capability. The cost of this feature is disk space: the file system has to duplicate the attribute data within the index structures on disk.
To estimate the amount of extra storage needed for indexed attributes, the preflight routine keeps track of the cumulative size of all indexed attributes within the attribute-scanning loop:
// index calculations
struct index_info indexInfo;
if (!fs_stat_index(srcInfo.dev, attrName, &indexInfo)) {
    indexedData += min(info.size, 256);
}
then, after all attributes have been examined:
blocksNeeded += 2 * indexedData / diskBlockSize + 2;
Indexed attribute data is duplicated within the volume's indices, up to a maximum of 256 bytes per attribute. We keep track of the total amount of data that will be added to indices by the copy operation, then estimate its storage cost as twice the number of blocks necessary to hold the data contiguously, plus a small amount to account for indexing of non-attribute data such as file names and modification times. The exact number of blocks required cannot be determined accurately because it depends on the state of the indices prior to the copy operation. Double the minimum is an educated guess based on the observation that, on average, the blocks within a given index structure under BFS tend to be about half-full.
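The index estimate is easy to try on its own as well. The sketch below mirrors the two expressions from the preflight code in plain C; indexed_bytes() and index_blocks_needed() are hypothetical names, and the 256-byte cap and the "+ 2" fudge factor are taken directly from the text rather than from any BFS documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of one attribute that end up duplicated in an index (capped at 256). */
static int64_t
indexed_bytes(int64_t attr_size)
{
    assert(attr_size >= 0);
    return attr_size < 256 ? attr_size : 256;
}

/*
 * Estimated blocks consumed in the indices: twice the blocks needed to
 * hold the indexed data (index blocks are assumed to be about half full),
 * plus 2 for non-attribute indices such as name and modification time.
 * Uses the same truncating integer division as the article's expression.
 */
static int64_t
index_blocks_needed(int64_t indexed_data, int64_t block_size)
{
    assert(indexed_data >= 0 && block_size > 0);
    return 2 * indexed_data / block_size + 2;
}
```

With 1024-byte blocks, a file whose indexed attributes total 700 bytes is estimated at 2 * 700 / 1024 + 2 = 3 index blocks, and a file with no indexed attributes still reserves the 2-block allowance.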
It bears repeating that these are *estimates*, not precise calculations. This code may overestimate by a few blocks the amount of storage that will actually be consumed by the copy operation. That's as close an estimate as possible, since certain aspects of the file system, particularly directory and index management, are somewhat nondeterministic. Because conservative estimates are safe, this is an acceptable inaccuracy for our purposes.
Now that I've shown you what's involved in predetermining the amount of storage required for a file, a word of caution is in order. Especially for files with large numbers of attributes, this preflighting function will be SLOW. One of the slowest operations in any file system is the "stat" function, examining a file's vital statistics. Under BFS, because attributes are practically files in themselves, getting attribute information via BNode::GetAttrInfo() is just as bad as BFile::GetStat() for performance. Note that the Tracker itself doesn't use this intricate procedure to preflight disk requirements for a copy operation; it uses a simpler, more conservative heuristic instead, one which doesn't involve getting info on all attributes. But if you know that you're copying files with large amounts of attribute data, or large numbers of files (such as an installer program might do), the more accurate estimation that I've presented here might be worth the trouble.
[3] BeOS CAM Non-compliance (for nit-pickers)
When I said that the SCSI bus manager adhered to CAM, I lied. Our implementation diverges from the spec in several important ways:
The Execute SCSI IO request is Synchronous (no callback or polling as defined in the spec). You get to wait.
You must lock all memory that will be involved in a SCSI data transfer and should do so BEFORE allocating the CCB.
We currently ignore the no-disconnect, sync/no-sync, and tagged-queueing flags. Our SIMs will negotiate for the best protocol and speed they can handle.