The main purpose of the engineering articles is to deliver simple and accurate technical information that's meaningful to as many people as possible. Thus, simple articles are better than complex ones—a rule I follow as often as I can.
But not today.
With apologies to the average developer who is never going to deal with the subtleties I'm about to discuss, I just can't stop myself! After many general information articles on various subjects, I felt an irresistible urge to write some strong technical content. You've been warned, now continue at your own risk.
At the heart of the BeOS graphic system lies the application server. It has been the scene of continuous cleaning, reorganization, and optimization during the last 12 months, and more will happen, at least until Release 6.
Recently, we discovered a nasty interaction between the app server and the kernel's thread scheduler. This problem had very serious consequences, and solving it has brought a significant performance improvement to Release 4. It also pointed out an interesting class of problems that some of you could encounter in the future (if you haven't already).
So in this article, we are going to study different mechanisms you can use to synchronize threads, and try to describe their behavior under well-defined stress conditions. For that we used one or two "worker" threads, executing the same reference task again and again. They had to use a synchronization mechanism to guarantee that only one thread executed the task at a time (from now on, the reference task will be called the "critical section").
To simulate the CPU noise generated by other unrelated threads, we sometimes added an additional dummy thread. The dummy thread didn't vie for the critical section—in other words, it didn't interfere directly with the other two threads.
All measurements were done on a Pentium II 350MHz in conditions as stable and similar as possible, with most of the servers, applications and services disabled, but with some daemons left running. In this environment, all code that we executed was always available in the level 2 cache (if not in the level 1!). In the real world, of course, this isn't a realistic assumption, but as you will see, studying the perfect case is hard enough already...
In this test, one B_NORMAL_PRIORITY thread repeatedly executed a 6.1us ("us" = "microsecond") task that's protected by a simple semaphore:
while (true) {
    acquire_sem(g_sem);
    do_critical_section(6.1);
    release_sem(g_sem);
}
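The setup code isn't shown above; here's a minimal sketch of what it could look like, assuming the semaphore is created with an initial count of 1 so that the first acquire_sem() succeeds immediately (the function and semaphore names are placeholders):

#include <OS.h>

static sem_id g_sem;

static void setup_semaphore(void)
{
    /* initial count of 1: the first acquire_sem() succeeds at once */
    g_sem = create_sem(1, "critical section");
}

static void teardown_semaphore(void)
{
    delete_sem(g_sem);  /* also unblocks any remaining waiters with an error */
}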
We ran the loop for about two seconds and recorded the latency (time consumption) for each lock and for each unlock. We then plotted the results on separate graphs (i.e. one for locking, another for unlocking). These graphs represent a very rich source of information, and we're going to be studying them throughout most of this article.
The graph's horizontal axis is latency in microseconds; the vertical axis is the hit count, or the number of locks/unlocks at a particular latency. Both axes are exponential and elided where necessary (to fit the format of an email, and reduce the length of this article). All graphs are supposed to be viewed with a monospaced font.
This first graph shows the locking latency:
-------------One thread, semaphore, locking-------------
[Graph: hit count (vertical axis, exponential) vs. locking latency in us (horizontal axis, exponential, 0 to 2048); a dominant spike (A) near 3.5us, a smaller spike (B) near 8.5us, a weak spike (C) near 30us, and scattered noise (#) out to beyond 2048us.]
The (A) spike shows the case where the semaphore is acquired without blocking. This happens 99.6% of the time with an average latency of 3.5us.
In the (B) spike, a scheduler interruption occurs while acquire_sem() is executing, but without rescheduling (i.e., acquire_sem() continues after the interruption). Probability: 0.3%; average latency: 8.5us. So both (A) and (B) represent the same phenomenon, except that (B) is delayed 5.0us by the scheduler interruption. This delay is only a part of the more generic "scheduler noise," and so it doesn't represent any relevant information. In further comments, we will call this phenomenon the "scheduler echo"; we will identify it when needed, but we won't spend any more time commenting on it.
(C) is another component of the noise, weaker than the "scheduler echo," but with a higher latency. It's probably the signature of a daemon (most likely a real-time priority thread) that runs regularly (for example, 60 times a second) for a short amount of time (15 to 20us). Once the two necessary context-switch latencies are added, that leaves us with what we will call the "daemon echo." In this case it's an "echo" of (A), with an additional latency of 27us.
What's left—the scattered "#"s—is the rest of the noise generated by either the scheduler or some other daemons whenever they used some CPU. Although the noise delay can reach 3000us, this is actually reasonable, since it reflects quite accurately the maximum scheduling quantum. This contains even less relevant information than the two previously described "echoes," so from now on we won't pay any attention to it.
Here's the graph for unlocking:
-------------One thread, semaphore, unlocking------------------
[Graph: hit count vs. unlocking latency in us (both axes exponential, 0 to 2048); a dominant spike (A) near 3.5us, a "scheduler echo" spike (B), a weak "daemon echo" spike (C), and scattered noise (#).]
(A) Again, the non-blocking case is the overwhelming winner: 99.06% and 3.5us.
(B) "scheduler echo" of (A), has a higher probability (0.86%) than in the analogous locking test, but that's because the unlocking measurement includes the noise associated with both the unlocking and the critical section. This is an artifact of measuring the locking and unlocking with only two timestamps :
while (true) {
    TIME_STAMP();
    acquire_sem(g_sem);
    TIME_STAMP();
    do_critical_section(6.1);
    release_sem(g_sem);
}
This allowed us to reduce the impact of time-measurement on the system, as we can just subtract the latency of the critical section from the second latency to get the unlocking latency. But there was no way to avoid increasing the magnitude of the noise. Still, it's only noise -- as long as we can identify it (and ignore it), we don't care about its magnitude.
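The TIME_STAMP() macro itself isn't shown in the article. Here's a minimal sketch of the idea, assuming it simply records system_time() (the BeOS microsecond clock) into a preallocated array so the measurement itself stays cheap; the names and the array size are placeholders:

#include <OS.h>

#define MAX_STAMPS (1024 * 1024)

static bigtime_t g_stamps[MAX_STAMPS];  /* preallocated: no allocation in the loop */
static vint32    g_count = 0;

/* each call records one timestamp; atomic_add() keeps the index
   safe even with two worker threads stamping concurrently */
#define TIME_STAMP() \
    (g_stamps[atomic_add(&g_count, 1)] = system_time())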
(C) is the "daemon echo" of (A).
Round-trip: More than 99% of the time, a trip through the entire loop (acquire the semaphore, perform the critical section, release the semaphore) takes about 13.1us (3.5 acquisition, 6.1 critical section, 3.5 release).
We replaced the semaphore with a benaphore. You can look at one of the previous Engineering Insights articles for more detailed information about the benaphore: Be Engineering Insights: Benaphores.
while (true) {
    if (atomic_add(&g_lock, 1) > 0)
        acquire_sem(g_ben);
    do_critical_section(6.1);
    if (atomic_add(&g_lock, -1) > 1)
        release_sem(g_ben);
}
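As with the semaphore test, the initialization isn't shown. A minimal sketch, following the standard benaphore pattern described in the article linked above: the counter starts at 0, and the semaphore is created empty so the fast path never touches the kernel:

#include <OS.h>

static vint32 g_lock = 0;  /* benaphore counter: threads in or waiting */
static sem_id g_ben;

static void setup_benaphore(void)
{
    /* created with a count of 0: only threads that lose the
       atomic_add() race ever block on it */
    g_ben = create_sem(0, "benaphore");
}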
-------------One thread, benaphore, locking-------------
[Graph: hit count vs. locking latency in us (both axes exponential, 0 to 128); an overwhelming spike (A) near 0.3us, with a tiny "scheduler echo" (B) and an even weaker "daemon echo" (C).]
As expected, the probability and timing are much improved:
(A) represents the main, fast case of the benaphore (when we don't have to take the semaphore at all), with an outrageously good probability of 99.96%. The second piece of good news: as expected, the latency is down to 0.3us (12x faster than the semaphore).
(B) is the "scheduler echo" of (A). Its probability is very low because (A) runs for such a short time that it's rarely interrupted by the scheduler.
Same thing for (C), a very weak "daemon echo" of (A).
-------------One thread, benaphore, unlocking-------------
[Graph: hit count vs. unlocking latency in us (both axes exponential, 0 to 2048); a dominant spike (A) near 0.3us, a "scheduler echo" (B), a "daemon echo" (C), and scattered noise (#).]
(A) is the logical mirror of the previous (A), with a similarly good latency of 0.3us.
(B) "scheduler echo" of (A), is stronger than the previous (B). This was expected, because we get the noise for both the unlocking and the critical section.
Same comment for both (C), the "daemon echo" of (A), and (#) the other noise.
Round-trip: Greater than 99% probability of a 6.7us total round-trip -- pretty good for a 6.1us critical section. But, then again, this is an ideal—and so unrealistic—test.
NOTE: In all our tests, the unlocking statistics are mostly similar to one or the other of the two already shown. Unlocking never blocks, so, as you might expect, it doesn't really affect overall system performance. Since there isn't much to be learned from them, we won't show unlocking graphs for any of the other tests; we'll just comment on them when necessary.
Let's see what happens when we introduce a second B_NORMAL_PRIORITY thread. We'll have two equivalent threads fighting over the lock. First, we use a semaphore:
-------------Two threads, semaphore, locking-------------
[Graph: hit count vs. locking latency in us (both axes exponential, 0 to 2048); a non-blocking spike (B) near 3.5us, the dominant "ping pong" spike (A) near 24us, its echoes (C) and (D) at higher latencies, and scattered noise (#).]
NOTE: This is a graph of only one of the threads. The profile of the other thread is very similar, as expected, with the same spikes in terms of both probability and latency:
(A): 24.138us (85.677%) against 24.168us (85.285%)
(B):  3.545us (11.741%) against  3.543us (12.159%)
(C): 29.059us ( 2.303%) against 28.878us ( 2.236%)
(D): 52.718us ( 0.137%) against 51.378us ( 0.157%)
So from now on, we won't comment about the second thread at all.
(A) The most common case is no longer unblocked acquisition. Instead, the threads are starting to "ping pong": 85.6% of the time, the thread must wait to acquire the semaphore, with a latency of 24.1us. That time includes the first thread blocking, a context switch, the second thread unblocking, the second thread executing the critical section, the second thread blocking, another context switch, and finally our first thread unblocking. That also implies that a context switch takes about 5us—a very impressive number, but not very realistic, as in this case everything clearly happens in the level 1 cache.
(B) is the unblocked acquisition spike. The latency is logically the same as in the one-thread test (3.5us), but the probability has dropped to 11.7%. This is easy to explain: whenever the scheduler suspends both threads in a state where no one owns the semaphore, whatever thread starts first is free from any synchronization issue, as if it was alone. And that lasts until the scheduler suspends it again, which most likely puts the system back into "ping pong" mode.
(C & D) are the "scheduler echo" and "daemon echo" of (A).
Round-trip: 24.1 + 6.1 + 5.0 = 35.2us for 7/8 of the iterations. The "ping pong" clearly affected the performance, but it's not a total disaster compared to the 13.1 * 2 = 26.2us of the one-thread semaphore test (don't forget, now we have 2 threads working in parallel!).
Mostly out of curiosity, let's see what happens when we use a benaphore in the two-thread context.
-------------Two threads, benaphore, locking-------------
[Graph: hit count vs. locking latency in us (both axes exponential, 0 to 2048); a non-blocking spike (B) near 0.3us, a smaller spike (C) just above it, the dominant "ping pong" spike (A) near 24.7us, its echoes (D) and (E), and scattered noise (#).]
(A, D & E) The ping pong mode and its two "echoes" comprise 82% of the iterations with a latency of 24.7us (+5.0us and +27.0us for the echoes).
(B) Non-blocking acquisition comprises 14.3% at 0.31us (a result similar to the non-blocking acquisition for two threads using a semaphore, just faster because we now have a benaphore).
(C) This aberration is the result of a race condition in which the benaphore becomes a semaphore: Thread A sets the benaphore counter (bc) to 1 and enters the critical section. Thread B comes along, sets bc to 2, and heads for acquire_sem()—BUT—the scheduler switches to thread A before thread B gets to actually invoke acquire_sem(). Thread A leaves the critical section, decrements bc to 1, and calls release_sem(). Since no one is actually waiting for the semaphore (since thread B never got that far), thread A sets the semaphore counter to 1. After that, thread A increments bc to 2 and calls acquire_sem(). Since the semaphore has already been released, thread A acquires it without blocking. Then it passes through the critical section, decrements bc to 1, releases the semaphore... and so on until the scheduler decides it's B's turn. In other words, all those acquire_sem() and release_sem() calls were unnecessary, but they don't really hurt all that much (it's still better than playing ping pong).
As expected, in ping pong mode the benaphore is slightly slower than the semaphore (24.7us vs. 24.1us), but the non-blocking case is still much faster (0.31us vs. 3.5us). The unlocking test (not shown) also satisfies this expectation.
Round-trip: A little bit better than the semaphore. The improvement in non-blocking performance makes up for the degraded ping pong. In real life, it's not clear that we would get a similar behavior, but in any case, even if the benaphore gets slower than the semaphore, it won't be by much.
When two threads have to fight for ownership of a shared critical section, performance can be significantly reduced. The snappy little benaphore degenerates into a semaphore, and CPU efficiency is reduced by a small factor—the two-thread tests are 1.4 to 2.5 times less efficient than the one-thread tests. This would seem bad enough already, but who knows what's going to happen once we move those tests from an ideal and isolated environment into a more realistic one. Clearly a story for next week...
Harlan Ellison once noted that first impressions are deadly, dangerous, and deceiving. Since this is my debut Newsletter article, I'm aware of what's at stake. I've had my hair cut. I've shaved my face raw and put a styptic pen in my pocket for the nicks. I'm wearing that same collared shirt, spiffy tie, and cotton slacks that almost lost me the job when I interviewed here. All this so that you'll walk away from this column with that warm, fuzzy, BeOS-coding kind of feeling. So much for introductions.
An odd phenomenon occurs after midnight on summer weekends here in the land of suburban sprawl. In eerie emulation of the coastal grunion runs, hordes of Hollywood zombies issue forth from their movie theater coffins. At that time of night activities are scarce—absent certain chemical and hormonal addictions, and/or a zoot suit and the irrepressible urge to dance.
For the rest of us there's the 24-hour delicatessen. Some go for the food; some go to avoid the grunion. Me, I head to the deli to marvel at one thing: a laminated, folded sheet of cardboard stock which, when unfolded, stands two feet high and three feet wide. This item is not just a menu—it's a symbol of the bewildering diversity of American culture, of the consumer's RIGHT TO CHOOSE. Confronted by this formidable selection, I forego exercising my birthright and always get the corned beef Reuben and the vanilla Coke.
So what has that got to do with the BeOS, you ask? The connection is menus. These are not just the cornerstone of the contemporary GUI; they also show off BMessages and object-oriented design to good effect. And, they provide a safe, wholesome springboard to the wonderful world of the Interface Kit (see the foreword on the Importance of First Impressions).
To whet your appetite for menus, I've concocted an example of an application that tweaks menus a bit more than the average Be application. Appropriately enough, it's called MenuWorld. Go ahead, download it:
ftp://ftp.be.com/pub/samples/interface_kit/MenuWorld.zip
The following notes boil down the essence of this example.
Ninety percent of menu behavior is defined in four classes:
BMessage: A command that is executed when a menu item is selected. If you're unfamiliar with BMessage, check the Application Kit documentation in The Be Book: A romantic relationship awaits you.

BMenu: A subclass of BView that displays a pop-up or pull-down menu. The most commonly used functions here are AddItem(), AddSeparatorItem(), and RemoveItem(). Note that AddItem() takes either a single item or a whole submenu. You've seen hierarchical menus before; you know the drill. Also note that BMenus appear and disappear in pop-up BWindows, which makes an interesting flourish to apply to other pop-up UI widgets—such as tool tips.

BMenuBar: A subclass of BMenu that specifically displays a pull-down menu bar. This is often added to a window in the window's constructor, and automatically sizes itself quite nicely.

BMenuItem: Represents one drawable item in the menu. Although it's not actually a view (it uses the BMenu it's contained in for drawing), it's capable of invoking a command, as any decent BInvoker can. This is an excellent candidate for subclassing if you've got some custom item drawing to do (and I did; see the MenuWorld class BitmapMenuItem).
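To see how the four classes fit together, here's a minimal sketch of the usual pattern: building a menu bar in a window's constructor. The command constant and the labels are hypothetical, not taken from MenuWorld:

#include <Menu.h>
#include <MenuBar.h>
#include <MenuItem.h>
#include <Message.h>
#include <Window.h>

static const uint32 kMsgAbout = 'abot';  /* hypothetical command constant */

/* in a BWindow subclass constructor: */
BMenuBar* bar = new BMenuBar(BRect(0, 0, 0, 0), "menu bar");  /* sizes itself */
BMenu* file = new BMenu("File");
file->AddItem(new BMenuItem("About MenuWorld...", new BMessage(kMsgAbout)));
file->AddSeparatorItem();
file->AddItem(new BMenuItem("Quit", new BMessage(B_QUIT_REQUESTED), 'Q'));
bar->AddItem(file);  /* AddItem() accepts a whole menu as a submenu entry */
AddChild(bar);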
There are three different kinds of menus as far as drawing goes: column-drawn (standard), row-drawn (e.g., menu bars), and custom-drawn (called "matrix," though there's no restriction on how the items are arranged). You specify which one you're using when you create the menu. MenuWorld proudly presents all three methods.
If you're arranging menu items in a column, things couldn't be easier. The menu works with your menu items to figure out how big to make the menu, and how to arrange the items within the menu. The menu also takes care of drawing submenu triangles and shortcut symbols. All you need to do is create your menu items and add them to the menu. Working with row-by-row organization is similar.
If you're doing a custom arrangement of items, things get more interesting. Here are some points to consider:
You need to calculate the bounds of the menu, and the frame of each item in the menu. This can't be done directly with standard BMenuItems, so you'd better be ready to derive your own menu item subclass, and provide some ability to calculate the item's frame based on either its contents, the menu's dimensions, or both. (Note that blindly calling the menu item's GetContentSize() function won't work—it's protected, and can only be called by the BMenu class, not you!) In MenuWorld, the menu displays icons, whose sizes are well defined, so this is not difficult. Each icon is represented by the simple class BitmapMenuItem, and the class TestMenuBuilder does the actual work of building the menu items and menu.
The padding that would be inserted around normal BMenuItems in a column arrangement to take care of the submenu triangles and shortcuts is noticeably missing in a row or custom arrangement. So, if you want those submenu wedges and shortcuts, or the hefty left and right margins that these imply, you're going to need to cook them up yourself. (A sketch of a matrix arrangement follows these points.)
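Here's a rough sketch of the matrix approach, using the BMenu constructor that takes an explicit width and height, and the AddItem() variant that takes a frame rectangle. The cell size, the grid, and the IconMenuItem class are hypothetical stand-ins for MenuWorld's BitmapMenuItem machinery:

/* a matrix menu: we position every item ourselves */
const float kCell = 36.0;  /* hypothetical: a 32x32 icon plus padding */
BMenu* icons = new BMenu("icons", 4 * kCell, 2 * kCell);  /* width/height => matrix */
for (int32 i = 0; i < 8; i++) {
    BRect frame((i % 4) * kCell, (i / 4) * kCell,
                (i % 4) * kCell + kCell - 1.0, (i / 4) * kCell + kCell - 1.0);
    icons->AddItem(new IconMenuItem(/* ... */), frame);  /* our frame, our drawing */
}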
Stuffing your menus with items is easy, and can be done in several places:
For many purposes, it's enough to add the items once at the beginning, generally when your window is being built. MenuWorld is built up this way (see MenuWindow::BuildFileMenu()).
Simple item changes traceable to a single action can be done at any time. For instance, in MenuWorld, when the button is clicked, the window immediately adds the menu to the menu bar. Similarly, when the button is pressed while a top-level menu is selected, the window immediately removes the menu from the menu bar. You can easily extend this approach to swap in and out different menu bars from the window as needed (the OLE paradigm of menu-document relations comes hurtling to mind, but try doing this in MFC under any other circumstances!). In MenuWorld, clicking on the Hide User Test Items check box swaps the menu bar that contains the user menu items for a menu bar which doesn't.
If you have a fixed set of menus in the menu bar, you may want to defer touching up the menu items until the menus are actually being shown (i.e., in BWindow::MenusBeginning()). Here you can enable/disable items, or even completely build up the contents of menus. If your menus have especially volatile contents, or the interactions that cause menu items to be enabled or disabled are particularly complicated, this is a great way to go. MenuWorld demonstrates the approach by dynamically building up the contents of each user-specified menu in MenuWindow::MenusBeginning().
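A minimal sketch of this deferred touch-up, assuming a window with a hypothetical fDirty flag and a "Save" item (neither is from MenuWorld):

void MyWindow::MenusBeginning()
{
    BWindow::MenusBeginning();

    /* adjust items just before the menus are drawn */
    BMenuItem* save = KeyMenuBar()->FindItem("Save");
    if (save != NULL)
        save->SetEnabled(fDirty);  /* enable only when there are unsaved changes */
}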
Since BMenuItems inherit from BInvoker, you can pull many of the same tricks with them as you can with controls and other invokers. In particular, your messages can be arbitrarily complex, and you don't need to dispatch them to the main window. For instance, the -> menu item dispatches to the global application object. Also note that each user item contains the same message ID, so they all get handled in the same place (a separate field identifies the actual name of the item which sent the message); the test items are set up in a similar manner. Again, try achieving that same kind of flexibility with MFC!
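A sketch of that pattern: one shared command ID, a string field naming the item, and a target other than the window. The constant and the field name are made up for illustration:

static const uint32 kMsgUserItem = 'usri';  /* hypothetical shared command ID */

BMessage* msg = new BMessage(kMsgUserItem);
msg->AddString("item_name", "Sprocket");  /* identifies which item fired */
BMenuItem* item = new BMenuItem("Sprocket", msg);
item->SetTarget(be_app);  /* dispatch to the global application object */
menu->AddItem(item);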
There is a price to this flexibility, however. Assuming MenuWorld were a full-featured menu editor, let's say we wanted to save the menus we've created and use them in our application, à la Visual C++ menu resource editing. The Be way to do this is to archive the menu bar and its constituents into a BMessage, then write the flattened BMessage data to a BResources object.
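A sketch of that path, assuming menuBar points to the BMenuBar and res to an open BResources object; the resource ID and name are arbitrary:

BMessage archive;
if (menuBar->Archive(&archive, true) == B_OK) {  /* deep: include items and submenus */
    ssize_t size = archive.FlattenedSize();
    char* buffer = new char[size];
    if (archive.Flatten(buffer, size) == B_OK)
        res->AddResource(B_MESSAGE_TYPE, 1, buffer, size, "menu bar");
    delete[] buffer;
}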
The problem, which ties in to a recent missive on BeDevTalk, is that the BMenuItems do not store their targets—indeed, they have no inherent way to identify a target once they've been archived and reconstituted. One thing you can do pretty easily is to set the target of each menu item to the main window when you instantiate them. However, if you want to set special BMenuItem targets in MenuWorld, coming up with an archiving mechanism is non-trivial. Some third-party apps on BeWare, such as AppSketcher and Interface Elements, address this problem:

http://www.be.com/beware/Development.html#Cat_Interface%20Creation
That's it for this week. I leave you with a German blessing: "Froehlich sei das Codeverbessern; guten Appetit!" ("May the code improving be merry; enjoy your meal!")
BeDevTalk is an unmonitored discussion group in which technical information is shared by Be developers and interested parties. In this column, we summarize some of the active threads, listed by their subject lines as they appear, verbatim, in the mail.
To subscribe to BeDevTalk, visit the mailing list page on our web site: http://www.be.com/aboutbe/mailinglists.html.
The opening parry regarded the proper seeding of the random number generator: Call srandom() (or srand()) only once, and before any random() (rand()) calls. The safest bet is to set the seed in main().
Other details popped out:
The high bits are more random than the low bits. If you want to restrict the range of the random number, you should divide rather than mask (or mod).
If you want a random number that's bigger than 16 bits, combine multiple random() calls.
random() is more random than rand().
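A quick sketch of those three notes in plain C, assuming an old-style 15-bit rand() (the exact widths vary by C library):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand(time(NULL));  /* seed exactly once, before any calls */

    /* restrict the range by dividing, so the more-random high bits
       pick the result (a mask or mod would use the low bits) */
    int die = rand() / (RAND_MAX / 6 + 1) + 1;  /* 1..6 */

    /* build a wider value by combining two calls */
    unsigned long wide = ((unsigned long)rand() << 15) | (unsigned long)rand();

    printf("die: %d  wide: %lu\n", die, wide);
    return 0;
}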
But is it truly random? Some folks think it's about as random as you can get. Others say that because the implementation varies between compilers, you shouldn't rely on random(). Instead, you should find a good algorithm and roll it yourself.
How do you put an object in a shared area so more than one application can use it? For example, how do you share a BBitmap? As asked by Ingo Weinhold:
“The only way I see is to archive the BBitmap object and flatten this archive to the area. [However,] each application has to unflatten and reinstantiate the object, which won't save any memory...”
“So let's assume it's possible...other applications could clone the area, but the logical addresses may differ. An object that has stored pointers to address its data will get into trouble if its logical address changes. Correct?”
First things first: No, you can't share an object (not safely and predictably, anyway). As for creating an object at a specific address, Jon Watte reminds us that C++ provides a "placement" allocation.
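For the curious, here's a sketch of placement allocation into a BeOS area. MySharedData is a hypothetical pointer-free struct; as the thread concluded, choosing the address doesn't make internal pointers valid in another team's address space:

#include <new>  /* (or <new.h> on older compilers) */
#include <OS.h>

struct MySharedData { int32 counter; };  /* hypothetical: no pointers, no vtable */

void share_something(void)
{
    void* base;
    area_id area = create_area("shared data", &base, B_ANY_ADDRESS,
                               B_PAGE_SIZE, B_NO_LOCK,
                               B_READ_AREA | B_WRITE_AREA);
    if (area < B_OK)
        return;

    /* "placement" new: construct the object at an address we chose */
    MySharedData* data = new (base) MySharedData;
    data->counter = 0;
}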
Announcing a grass roots movement to petition Apple to release G3 specs to Be. The devil's advocate response from Geoffrey Clements (which, read literally, depicts a bizarre anatomy):
“Who cares, really? Yeah a G3 running the Mac OS will kick a Pentium II running Windows's ass. But a dual Pentium II running the BeOS will Kick the G3's ass.”
Yes, but with a proviso:
“BeOS is a faster operating system, so long as it doesn't get klunkier in each release. Its been rather noticeable that BeBoxen that could zip speedily along in DR8.2 (BeOS in 1996) are almost unusably slow in R3. I'm speculating that one of these slowdowns is due to the MIME typing file handling system.”
The proposed solution: Make MIME-typing optional. "What?!?" quoth Chris Herborth:
“How could the filesystem's MIME file attributes slow down the display and applications in R3? The filesystem in R3 is _orders_of_magnitude_ faster than the one in DR8. And it wasn't the only thing that changed between DR8 and R3.”
“If the file typing is made optional, we'll have the same useless filesystem as Windows and UNIX; our files won't know what they are, and we'll have to rely on stupid file extensions for everything.”
In a spawn of the BeOS-on-G3 thread, listeners discussed system performance. Is it degrading? Is it the app server's fault? MIME types? Anti-aliased fonts? After some discussion...
THE BE LINE: (From George Hoffman).
“The graphics team is well aware of the state of app_server performance. Not all of the perceived slowdown from PR2 to R3 is the fault of the app_server, but there are hard numbers which do indicate a slight slowdown in graphics performance. ... One of the big priorities we've had for R4 is performance, and I'm pleased to say that the R4 development app_server is quite a bit faster than R3, and, more importantly, _feels_ faster.”
(Also, see Pierre Raynaud-Richard's article, above.)
Delesley S. Hutchins is having trouble with Be's (ahem) streamlined debugger:
“...I can't get the debugger to work... I type "help", and nothing happens... it doesn't even go back to the debugger prompt.”
From the scept'red isle, that other Eden, demi-paradise, royal throne of kings and Tinky Winky, Simon Clarke suggests, moddishly:
“Do you have

    export BEDEBUG=true

set in your user setup environment file? I had the same problem if I went into the debugger from the alert that a crashed app shows. However, with the debugger on all the time it works fine.”