If you look at the BeOS R4 datasheet <http://www-classic.be.com/products/beos/beos_datasheet.html> you'll find many bold keywords that describe the best features of our modern OS -- virtual memory being one of them. On one hand, virtual memory is certainly a great feature, as it makes involuntary cross-application memory corruption impossible, and lets you virtually increase the amount of usable memory by putting extra data on a hard drive, in what's called a "swap file".
From my point of view, though, this great feature has one big drawback: it makes a typical careless developer think that the memory available is almost limitless.... Why should he care about optimizing memory usage, since modern computers ship with more and more memory, and there will always be still more virtual memory available on a multi-gigabyte hard drive?
By not caring about memory usage, a developer buys his application a one-way, first class ticket to performance hell. First, virtual memory may be huge, but it's certainly not fast. At one end, regular RAM supports bandwidth up to x100MB/s and responds to a random request in a fraction of microsecond. At the other end, an average hard drive won't even reach 10MB/s, and worse yet, it will need many milliseconds to reply to a random request. In practice, once you start "swapping" actively, it's common to see average performance become 10 or even 100 times slower. For a short operation, the performance impact can even reach ratios as high as 1,000 or 10,000 times slower. You don't want that to happen to your application, do you?
But even if you don't reach such extreme—sadly not so uncommon -- slow-downs, using more memory is still a bad idea. Modern processors have become so fast during the last 10 years that RAM technology hasn't been able to keep up the pace. That's why all modern processors don't access the main memory directly, but go through multiple levels of memory caches. The L1 (or level 1 cache) is the closest to the processor, and also the fastest, but because it's extremely expensive, it's usually rather small (typically 32K or 64K). Beyond that, most processors have an L2 (level 2 cache), slower than the L1 but faster than the main memory. L2 is still expensive, but not as much as L1; its typical size is around 256K - 1MB. In real life, both of those caches are far to small to contain all the data commonly used by a program, so your machine will spend a big chunk of its time exchanging data between L1, L2, and main memory. And this is often where your application's performance bottleneck will be.
So if you want some simple advice, try to keep the memory footprint of your application as small as possible. Here are a few very simple things you can do to achieve that:
When allocating integer arrays, pick the smaller integer type that fits your need. For example, if you're going to store values between 0 and 199, don't just use an int or int32 by default. A uint8 will do nicely, and will use four times less memory.
When you define structs and classes that could be allocated in huge numbers, try to chose the integer types in a similarly efficient way, and order them sensibly. A simple way to do that is to always put the longest types first. For example, compare these 2 structs:
struct _poor_order_ { int8a8
; doubled64
; int8b8
; int32a32
; int16a16
; }; struct _good_order_ { doubled64
; int32a32
; int16a16
; int8a8
; int8b8
; };
gcc will tell you that
sizeof
(_poor_order_) == 24;sizeof
(_good_order_) == 16;
Guess which one you want to use...?
When you need large buffers, think twice before taking the laziest path, which is to allocate those buffers as much space as first seems to be necessary. Just take a few minutes to ask yourself these simple questions:
Do you really use all those bits of information?
For example, the following bool array:
bool flags1
[1024*1024]; // 1 MB
can be successfully replaced by:
uint8 flags8
[128*1024]; // 128 KB
By just modifying your code like this:
if (flags1
[index
] ==true
) -> if ((flags8
[index
>>3]>>(index
&7)) == 1)flags1
[index
] =true
; ->flags8
[index
>>3] |= 1<<(index
&7);flags1
[index
] =false
; ->flags8
[index
>>3] &= 0xff-(1<<(index
&7));
The flags8 code will be slower, but if you don't use it a lot, it will reduce your memory footprint so much that it will most likely be worth it.
Do you really need all that data?
For example, if you have an array of int32 that you only use in two places, like so:
int32datas
[1024*1024]; ...datas
[index
] =value
; ... if (datas
[index
] > 100000)
You may want to optimize:
boolgreater_than_100000
[1024*1024]; ...greater_than_100000
[index
] = (value
> 100000); ... if (greater_than_100000
[index
])
This is already four times smaller. And then you can also use the previous trick and get it 32 times smaller...
Do you really need a buffer that big? Hmmm... count again, maybe by changing the way you order datas in your array, you can make it quite a bit smaller. Definitively worth it.
Do you really need a new buffer? Maybe you already have another buffer where you're storing temporary data, and you'll never need them at the same time... so why not use it again?
Beyond that, there a few less simple things you should think about:
When you use a ring-buffer or multiple buffer, try to make them as small as possible. Do you really need a triple- buffer? Wouldn't a double-buffer do the same job with a smarter glue code?
Beware of excessive parallelism. Multi-threading is great but it has its limits. For example, if you need to apply four filters on a buffer stream, you can do it by applying them in one thread, one after another, and in that case you may be able to just use one buffer per frame, and maybe a few more for the I/O queue. If you try to use one thread per filter, you'll need buffer queues between each thread (so that you uncouple them and allow them to potentially run in parallel on a quadriprocessor machine). But by doing that, you may easily end up with a dozen buffers, many of them simultaneously used by your four threads. As a result, your L2 cache will be trashing all the time on a single CPU machine. The single threaded version could end up being quite a bit faster in some cases, and the memory footprint would certainly be much smaller.
In a more general way, it's usually a bad idea to cut a high-bandwidth data-stream in too many pieces—especially on a single-processor machine (that is still the common case). An optimal solution could be dynamically adjusting the count of processing threads, based on the CPU count.
There's much more to be said about reducing memory usage, but this is only a newsletter article, not a book, so I'll add just one more thing. On a modern operating system, memory is always accessed through the CPU's MMU, which gives us an easy way to know exactly what memory is used (physical or virtual). In the case of the BeOS, you access that information through the "area" API (check: TheKernelKit_Areas.html for more details). The "listarea" shell command is the standard way to dump the current state of the memory mapping. It's full of interesting information, but it's a little difficult to use. So to make it easier for you to track the dynamic memory allocation behavior of your program, I wrote a little application, named "AreaWatch", which displays the same information in a more accessible way. You'll find it (with the source code and a detailed ReadMe) at <ftp://ftp.be.com/pub/contrib/develop/AreaWatch.zip>. It contains quite a bit of small nifty tricks, so don't hesitate to look at the ReadMe in details.
In the rush of working on Media Kit sample code, I thought I'd pause for
a moment and zoom in on a few simple classes that I've created for my
media applications. You might find them useful when writing your own
applications as well. These classes are MediaNodeWrapper
,
Connection
, and
NodeLoopManager
.
The MediaNodeWrapper
class is a class I'm using more and more to work
with media nodes. Its function is to wrap around a media_node structure,
and provides a simplified set of methods for issuing commands to the
node. These commands end up calling Media Roster functions to work on the
media_node. To make them easy to remember, there is a tight
correspondence between BMediaRoster
methods and
MediaNodeWrapper
methods.
One feature that I added to MediaNodeWrapper
was the ability to "lock"
certain nodes, such as the system mixer, that you as an application
shouldn't be messing with. Issuing commands to these nodes has no effect.
So, when I want to tell a node to start, I don't have to think about
whether this was a system node or one of my own—I simply lock the node
wrapper when I first create it, and then treat it like any other node.
This makes it a snap, for example, to start up a miscellaneous group of
nodes.
The facilities for connecting nodes in the Media Kit are quite rich, and allow you to do some pretty fancy stuff. However, in working with nodes, I noticed that I tended to follow one simple pattern when connecting nodes:
Get one output from the upstream node.
Get one input from the downstream node.
Connect them together using the upstream node's format, and remember the information for the connection.
Later on, disconnect them.
The Connection
class implements this connection pattern. You use
Connection::Make()
to attempt the connection. If the connection is
successful, you get a pointer to a Connection
object back. This
connection object combines the information you get from the media_input
and media_output structures into one cohesive structure. Finally, when
you're done with the connection, you simply delete the Connection
object.
Here's an example of MediaNodeWrapper and Connection in action. Let's say we had a node that we created, and we wanted to hook it up to the system mixer and start it. First, to connect and start two generic nodes, we'd do something like this:
Connection
*ConnectAndStart
(const MediaNodeWrapper&unode
, const MediaNodeWrapper&dnode
) { // Set upstream's time source to downstream's t.s.unode
.SlaveTo
(dnode
);Connection
*c
=Connection
::Make
(unode
,dnode
); if (c
) {BTimeSource
*ts
=unode
.MakeTimeSource
(); bigtime_ttpStart
=tp
->Now
() +unode
.GetLatency
();unode
.Start
(tpStart
);dnode
.Start
(tpStart
); returnB_OK
; } returnc
; }
Then, we use this to connect our node to the system mixer:
media_nodemyNode
,mixerNode
; ..MediaNodeWrapper
myWrap
(myNode
);MediaNodeWrapper
mixerWrap
(mixerNode
,true
); // lockedConnectAndStart
(myNode
,mixerNode
);
Up until now, the methods I've presented for looping play (i.e. automatically starting over when you reach the end of a file) have not been satisfactory for most applications. I implemented this by writing my own file reader node and altering its behavior to do this automatically. However, for people who are using external nodes (e.g. the system extractor node that you'll likely get if you sniff a media file), this approach won't work. We need a way to get any node to rewind to the beginning of a piece of media when it reaches the end, and to keep doing this until we've stopped the node.
One approach to this problem might be to simply queue up a long list of
Seeks to media time 0 at regular intervals. The idea here is that, every
time the node reaches the end of the file, it will process a Seek command
and start over. However, there might be nodes out there which can only
queue >one pending Seek at a time (remember, this is all that a node is
guaranteed to give you). So, a better approach is to send one Seek
operation at a time, and wait until that Seek()
has been completed before queueing up another one.
There are two questions that remain to be answered:
How long should the node play before Seeking? In the case of a file, we want to play the entire file before looping back to the beginning. This means that we need to know the duration of the media in the file to set up the Seek correctly. Fortunately, this duration is given to you when you call BMediaRoster::SetRefFor on a file interface node.
How do we know when the node has finished handling its Seek operation? One way would be to note the performance time that the Seek is supposed to happen at, and wait until we reach that performance time. But doing this is harder than it may at first appear, because you'll need to know the latency of the node to figure out when it will reach the desired performance time—and the latency may change at any time!
The failsafe way to do this is to have the node inform us when it's
reached a certain performance time. In R4.5, we allow you to do just
this: with BMediaRoster::SyncToNode()
. This method will simply block until
the node has reached the desired performance time.
Some of you node writers may be wondering how this works. Well, when the
application calls SyncToNode()
, the new method
BMediaNode::AddTimer()
gets
called on the node side, with the target performance time and a cookie
that the node must remember. When the performance time is reached, the
node calls BMediaNode::TimerExpired
with the
cookie. TimerExpired()
then
sends a notification back to the roster. (If your node derives from
BMediaEventLooper
, relax: we take care of all of
this for you.)
NodeLoopManager
is a class that manages looping play on the application
side of the fence. You create it and tell it the starting media time and
loop duration. Then when you call NodeLoopManager::Start()
, it sends Seek
commands periodically to the node, each time waiting until the previous
Seek()
is handled before sending the next one.
NodeLoopManager
spawns a
thread to do the waiting and looping, so that it doesn't interfere with
your application's operation. When you want to stop the node, use
NodeLoopManager::Stop()
so that the node stops looping as well.
You can see all three of these classes in action by perusing the new-and-improved Mix-A-Lot:
<ftp://ftp.be.com/pub/samples/media_kit/MixALot.zip>