I've spent way too much time over the last four years reading and writing frame buffer memory. Admitting that I have this problem is, for me, cleansing. Almost spiritual. Let me share my enlightenment. And if you're doing rendering using the Game Kit, or working with off screen bitmaps, you might even find my confessions interesting.
The first thing that you realize when you use a fast machine (like my beloved 225 MHz PowerPC) is that PCI is a pain. This thing is so slow compared to the speed of the CPU that thinking a little bit about the way you will access memory can have some nice paybacks.
For the next few examples, let's assume that the frame buffer is in 32-bit mode (4 bytes per pixel) and has a width of 800 pixels.
The easiest way to draw a square 128 pixels wide is:
    #define BUFFER_BASE 0x14000020
    #define ROWBYTE     (800*4)

    void test1()
    {
        ulong *p;
        ulong my_color;
        int x;
        int y;

        my_color = 0x00ff0000; // a nice red color.
        for (y = 0; y < 128; y++) {
            p = (ulong*)(BUFFER_BASE + (y * ROWBYTE));
            for (x = 0; x < 128; x++) {
                *p++ = my_color;
            }
        }
    }
On my machine with a Twin Turbo video card, this piece of code runs in 3550 usecs. This is a bandwidth of about 18 MB/sec. If we run the same test using conventional memory, the code now runs in only 730 usecs -- about 5 times faster!! Excuse me, I have to get the phone.
I'm back. Moral #1: When you have a LOT of rendering to do, it's faster to render in an off screen bitmap and then blit the final result to the screen. This is even more apparent if you try to READ stuff out of the buffer. If we change the previous code into
    for (y = 0; y < 128; y++) {
        p = (ulong*)(BUFFER_BASE + (y * ROWBYTE));
        for (x = 0; x < 128; x++) {
            *p++ ^= 0xffffffff;
        }
    }
The execution time now goes up to a huge 15100 microseconds, more than 4 times slower than the previous write-only case. Say it again: reading the contents of a PCI frame buffer is a bad idea! In real memory, it takes only 800 microseconds. As you can see, the off screen method is the clear winner.
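A minimal sketch of Moral #1 in portable C: do the read-modify-write work in an off screen buffer in main memory, then copy the finished rows to the frame buffer in one write-only pass. The function names, the `uint32_t` pixel type, and the byte-stride parameters are assumptions made for illustration; they are not Game Kit calls. On real hardware `dst` would point at the frame buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return a pointer to row y of a 32-bit pixel buffer with the given stride. */
static uint32_t *row_at(uint32_t *buf, size_t row_bytes, int y)
{
    return (uint32_t *)((uint8_t *)buf + (size_t)y * row_bytes);
}

/* Fill a size x size square with one color (write only). */
static void fill_square(uint32_t *buf, size_t row_bytes, int size, uint32_t color)
{
    for (int y = 0; y < size; y++) {
        uint32_t *p = row_at(buf, row_bytes, y);
        for (int x = 0; x < size; x++)
            *p++ = color;
    }
}

/* Invert a size x size square. This reads every pixel, so it belongs
   in main memory rather than in the frame buffer. */
static void invert_square(uint32_t *buf, size_t row_bytes, int size)
{
    for (int y = 0; y < size; y++) {
        uint32_t *p = row_at(buf, row_bytes, y);
        for (int x = 0; x < size; x++)
            *p++ ^= 0xffffffffu;
    }
}

/* Copy the finished rows to the destination, one row at a time. */
static void blit(uint32_t *dst, size_t dst_row_bytes,
                 uint32_t *src, size_t src_row_bytes,
                 int width_px, int height)
{
    assert(width_px >= 0 && height >= 0);
    for (int y = 0; y < height; y++)
        memcpy(row_at(dst, dst_row_bytes, y),
               row_at(src, src_row_bytes, y),
               (size_t)width_px * sizeof(uint32_t));
}
```

For the 128-pixel square used in the tests, the off screen buffer would be 128 x 128 pixels with a stride of 128*4 bytes, and the destination stride would be ROWBYTE.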
Back to the simple writing case. It turns out that writing doubles into the frame buffer helps the performance of the PCI transaction:
    void test1()
    {
        double *p;
        double temp_double;
        ulong my_color;
        int x;
        int y;

        my_color = 0x00ff0000; // a nice red color.
        // Pack two copies of the pixel into one 8-byte double.
        *((ulong*)&temp_double) = my_color;
        *(1 + (ulong*)&temp_double) = my_color;
        for (y = 0; y < 128; y++) {
            p = (double*)(BUFFER_BASE + (y * ROWBYTE));
            for (x = 0; x < 128/2; x++) {
                *p++ = temp_double;
            }
        }
    }
This one runs in 1970 usecs, about 1.8 times as fast as the one using 32-bit transfers!
If we unroll the loop...
    for (x = 0; x < 128/8; x++) {
        *p++ = temp_double;
        *p++ = temp_double;
        *p++ = temp_double;
        *p++ = temp_double;
    }
...we don't actually gain anything. This runs in exactly the same time as the non-unrolled version. This is because any overhead between the write instructions is hidden by the time taken to do the write. If you do some computation between writes, you may find that it is free. For example:
    for (x = 0; x < 128/2; x++) {
        *p++ = temp_double;
        my_color += (x << 24) | (x << 8) ^ x;
        *((ulong*)&temp_double) = my_color;
        my_color += (x << 24) | (x << 8) | x;
        *(1 + (ulong*)&temp_double) = my_color;
    }
It looks busy. It is busy. But this thing still runs in EXACTLY the same time—so much for optimization! (By the way, this random piece of code looks very nice if you run it a few thousand times, try it!)
Note that although the double write only needs to be aligned on a 4-byte boundary, you should stick with 8-byte alignment. 4-byte alignment carries an 80% performance penalty.
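A minimal sketch of keeping the wide writes on 8-byte boundaries when a span of pixels starts at an arbitrary pixel: write one 32-bit pixel first if needed, do the body with 8-byte stores, and finish with a single pixel if the length was odd. The name `fill_span` is made up for illustration, and the 8-byte `memcpy` stands in for the double store used above, which is an assumption about how a compiler will emit it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `count` 32-bit pixels starting at `p`. 8-byte stores are only
   issued at 8-byte-aligned addresses. `p` must be 4-byte aligned. */
static void fill_span(uint32_t *p, size_t count, uint32_t color)
{
    uint32_t two[2] = { color, color };
    uint64_t pair;

    assert(((uintptr_t)p & 3u) == 0);
    memcpy(&pair, two, sizeof pair);   /* two pixels packed into 8 bytes */

    /* Leading pixel: one 32-bit store brings p up to an 8-byte boundary. */
    if (count > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = color;
        count--;
    }

    /* Body: aligned 8-byte stores, two pixels at a time. */
    while (count >= 2) {
        memcpy(p, &pair, sizeof pair);
        p += 2;
        count -= 2;
    }

    /* Trailing pixel, if one is left over. */
    if (count > 0)
        *p = color;
}
```

With this shape, the misaligned case costs at most two extra 32-bit writes per span instead of slowing down every wide write.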
A few nice tricks
When doing graphic-intensive operations in 32-bit mode, you may find some of these functions useful. They're a collection of tricks to speed up common blending operations. Have fun decoding them!
The first one blends two RGB values...
The trivial implementation would be:
    ulong calc_blend(ulong color1, ulong color2)
    {
        ulong result;

        result  = ((1 + ((color1 >> 24) & 0xff) + ((color2 >> 24) & 0xff)) >> 1) << 24;
        result |= ((1 + ((color1 >> 16) & 0xff) + ((color2 >> 16) & 0xff)) >> 1) << 16;
        result |= ((1 + ((color1 >> 8) & 0xff) + ((color2 >> 8) & 0xff)) >> 1) << 8;
        result |= ((1 + ((color1) & 0xff) + ((color2) & 0xff)) >> 1);
        return result;
    }
The fast version is:
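    ulong calc_blend(ulong color1, ulong color2)
    {
        // Halve each channel, then add back the low bit both inputs share.
        // This rounds down, so a channel can come out 1 lower than in the
        // trivial version.
        return (((color1 & 0xFEFEFEFE) >> 1) +
                ((color2 & 0xFEFEFEFE) >> 1) +
                (color1 & color2 & 0x01010101L));
    }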
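A small sketch for checking the blend trick against a per-channel reference. The trivial version rounds halves up (it adds 1 before shifting), while the fast version adds back only the low bit common to both inputs, which rounds down. Using `|` in place of `&` on the low bits reproduces the round-up result exactly. The function names and the `uint32_t` type are assumptions for illustration, and the `|` variant is a swapped-in alternative, not the version given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-channel reference: average with halves rounded up. */
static uint32_t blend_reference(uint32_t c1, uint32_t c2)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (c1 >> shift) & 0xff;
        uint32_t b = (c2 >> shift) & 0xff;
        result |= ((a + b + 1) >> 1) << shift;
    }
    return result;
}

/* Word-at-a-time average, halves rounded down (low bits combined with &). */
static uint32_t blend_round_down(uint32_t c1, uint32_t c2)
{
    return ((c1 & 0xFEFEFEFEu) >> 1) + ((c2 & 0xFEFEFEFEu) >> 1)
         + (c1 & c2 & 0x01010101u);
}

/* Word-at-a-time average, halves rounded up (low bits combined with |). */
static uint32_t blend_round_up(uint32_t c1, uint32_t c2)
{
    return ((c1 & 0xFEFEFEFEu) >> 1) + ((c2 & 0xFEFEFEFEu) >> 1)
         + ((c1 | c2) & 0x01010101u);
}

/* Compare the round-up variant with the reference on a few sample colors. */
static void check_blend(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0xffffffffu, 0x00ff0000u, 0x01010101u, 0x80fe7f01u
    };
    const size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            assert(blend_round_up(samples[i], samples[j]) ==
                   blend_reference(samples[i], samples[j]));
}
```

Neither variant can carry between channels, because each halved channel is at most 127 and the correction adds at most 1.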
A fast color addition with clipping to 0xff:
    ulong calc_add(ulong c1, ulong c2)
    {
        return ((((((c1 ^ c2) >> 1) ^ ((c1 >> 1) + (c2 >> 1))) & 0x80808080L) >> 7) * 0xFF) | (c1 + c2);
    }
Now subtraction:
    ulong calc_sub(ulong c1, ulong c2)
    {
        c2 ^= 0xFFFFFFFFL;
        return ((((((c1 ^ c2) >> 1) ^ ((c1 >> 1) + (c2 >> 1))) & 0x80808080L) >> 7) * 0xFF) & (c1 + c2 + 1);
    }
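A plain per-channel reference is handy when decoding or testing word-at-a-time tricks like these, since a carry that slips from one channel into the next is easy to miss by eye. The sketch below clamps each 8-bit channel separately; the names `add_reference`, `sub_reference`, and `check_reference` are made up for illustration and are not part of the tricks above.

```c
#include <assert.h>
#include <stdint.h>

/* Per-channel addition, clamped to 0xff. */
static uint32_t add_reference(uint32_t c1, uint32_t c2)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((c1 >> shift) & 0xff) + ((c2 >> shift) & 0xff);
        if (sum > 0xff)
            sum = 0xff;
        result |= sum << shift;
    }
    return result;
}

/* Per-channel subtraction, clamped to 0. */
static uint32_t sub_reference(uint32_t c1, uint32_t c2)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (c1 >> shift) & 0xff;
        uint32_t b = (c2 >> shift) & 0xff;
        result |= (a > b ? a - b : 0u) << shift;
    }
    return result;
}

/* A few hand-checked cases. */
static void check_reference(void)
{
    assert(add_reference(0x000000ffu, 0x00000001u) == 0x000000ffu);
    assert(add_reference(0x80402010u, 0x80402010u) == 0xff804020u);
    assert(sub_reference(0x10203040u, 0x20101050u) == 0x00102000u);
}
```

Running a fast version and the reference over the same set of colors, including channels at or near 0x00 and 0xff, shows quickly whether the two agree for the inputs you care about.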
By the way, thanks to Pierre for some of these tricks!
You know how you read magazine articles that start, "By the time you read this..."
Well, it's the Thursday before our developer's conference and I'm sitting here making up things that will sound OK by the time you read this next week. How about starting with things we do know?
We now have accelerated 3D support for OpenGL®!! How the heck? Well, start with the Diamond Monster3D board, add the current 2.4 glide library support from 3Dfx, and mash on the freely available Mesa OpenGL® implementation, and there you have it. I'm telling you, you haven't done 3D until you've done 3D with a nice cheap hardware accelerator.
ftp://ftp.be.com/pub/dr9/samples/glide.zip
ftp://ftp.be.com/pub/dr9/samples/mesafx.zip
This will work with any of the 3Dfx Voodoo based graphics cards, though not Voodoo Rush at the moment. Go buy one of these cards (~$200), plug it into your Mac, and have at it. If you are at the developer's conference, you'll see this in action; if not, you'll just have to wait.
Speaking of the conference, what else will you see? Well, since you won't see this until next week anyway, you'll see a whole bunch of other nifty graphics stuff, as well as a glimpse of what I fondly call WetTV. Not to dwell on my favorite Bt848 subject, but I'm programming and I can't stop!! Have you ever watched television where channel changes have movie transitions? You will need a towel to run this application because after you do, you will find that you have peed your pants with excitement. Bold statements? Yes, of course. This stuff is kicking, and you won't see it on XYZ operating systems because those programmers simply aren't as motivated. They're too busy pooh-poohing how little chance we have of succeeding.
My brother has this favorite little statement, "We're going to go eat from the big dog's bowl while he's not looking."
So here we are. Some software available. Some interesting hardware support, including a new processor, and hungry agile programmers who are willing to take advantage of superior technology. Thanks for all your support.
So what kind of support do you get from us? Let me tell you a story. We were at the local sandwich shop waiting for our food. Geoff had just come from the local hardware store where he had bought some stain or some such. He was trying to balance the can on his head, but lacking enough hair, it didn't stay too well and ended up on the floor. Boy that stuff spreads fast!! He and one of the other customers quickly mopped it up while Brian and I quickly got our food and distanced ourselves from the scene. To redeem himself, he did the 3Dfx support.
That's the kind of tech support personnel we have around here! A little foolhardy, but we can program up a storm when the need arises. I hope you all benefit from the fruits of our labors.
Six months ago, we exited the hardware business. We loved our BeBox very much, but we love to create opportunities for BeOS developers even more. Our software running on Power Mac compatibles was warmly received. As a result, BeOS developers could see a much broader installed base than the one provided by BeBoxen. Apple, Power Computing, Motorola and Umax were very helpful in making this possible.
Once it became clear we were in the business of adding value to popular hardware, the next logical step didn't require much thought. We claimed and proved the portability of our OS; porting it to Intel Architecture systems made sense. At a time when our engineering resources were stretched by our work on the major improvements in the Preview Release, Intel helped us to get started by providing engineers who, for a while, came to work in our cramped cubicles. Their contribution to this important project is gratefully acknowledged: We wouldn't be this far along without their excellent work.
Now, such a move raises many new questions. I'll answer a few today, leaving the rest to other columns—or BeWeek columnists.
First, does this put us more squarely in competition with Microsoft? In other words, are we even crazier than previously perceived?
Crazy, perhaps, but not suicidal. Actually, as one of our co-founders, Steve Sakoman, remarks, one must be crazy to do something original, as opposed to derivative. But not all craziness is productive. What do we have to gain by competing more directly with Microsoft?
Let's start by noting that, more and more, when you write a line of C++, or Java code, you could be competing with Microsoft, whose strategy could be summarized by one word: Everything.
But universality has its drawbacks. Windows 95 is an excellent general purpose desktop OS. Windows NT is a holy terror in the enterprise market. Are we going to be flattened by these two steamrollers? For us, the idea is to exist to the left or right of them, not in their path. Put another way, our focus is the digital media content creation space.
The situation pits a dedicated tool, the BeOS, against respected general purpose platforms. Some developers and users will prefer the benefits of specialization; others will pick the general purpose platform. Historically, this leads to 75-25% or 80-20% situations.
Let's continue by noting that many advanced PC users already run more than one OS. Popular software tools called boot managers provide for such coexistence: Windows NT offers its own, and there are the extremely successful System Commander and Lilo, a very nice Linux utility. We're proud of our work and we see incredible potential in our OS, but the logical consequence of specialization is coexistence with general purpose products, as opposed to attempting to displace them with a (yet) unproven OS such as ours.
Second, does this mean we are abandoning the PowerPC? Again, no. Why should we do this at the very moment our OS could run on most personal computers? As we have said in the past, we are processor agnostic.
Agnostic, and hopeful. Comparing the performance between Intel-based PCs and PowerPC systems, we see unrealized potential in the PowerPC space. In the Intel market, competitive forces have honed many parts of the system, chip sets, bus, memory, disks, graphic accelerators...
As a result, with roughly equivalent Pentium and PowerPC processors, system performance tends to be superior on the Intel Architecture side. It's not always pretty, but advances such as USB and FireWire are about to remove many scars from the past—and it is fast and inexpensive. A system based on an Intel dual Pentium Pro motherboard, with high-speed SCSI storage, Ethernet, sound, nice video, etc., can be had for about $2,500, monitor included.
On the other hand, until recently, the Mac market has been deprived of the competitive forces that make hardware subsystems more efficient and less expensive. This is where we see an opportunity for the PowerPC. There is a chance the much awaited CHRP will finally become a reality. If it does, an active Mac clone industry will finally actualize the "power" in PowerPC.
I wrote above that we were processor agnostic and hopeful. We aren't blind either. Apple is still struggling with its licensing dilemma. Like everyone else, I've read the New York Times story reporting the Apple Board's statement to clone makers that they were, in essence, no longer wanted. I hope the NYT was misinformed, but I'm struggling with the knowledge that John Markoff, the reporter, is well connected and very careful.
We'll see. In the meantime, we have to take care of our business, which is to expand opportunities for BeOS developers. That's what we are doing with the Intel Architecture version.