Scheduler work progress
Thanks to the generosity of Haiku supporters, I will be able to continue my work on the scheduler in November. It’s high time I wrote a report on what has already been done. As mentioned before, my work can be tracked in the scheduler branch of my repository. Commit descriptions and some comments in the scheduler code contain more detailed motivation behind some of the decisions I had to make. In this post, though, I will concentrate on what my work has looked like so far and what I plan to do next.
I started with the machine-dependent part: detecting CPU and cache topology. I didn’t need that information until recently, but I wanted to have the low-level code written before I moved on to reworking the scheduler. CPU vendors make this a bit more complicated than one might expect, since AMD and Intel provide topology information in different ways. Moreover, for both AMD and Intel, older processors report such data differently than newer ones, so there are actually four different ways of detecting CPU topology. Currently, Haiku recognizes three levels of CPU topology: package (i.e. socket), core and SMT. However, it was implemented in such a way that adding another level in the future will be quite trivial.
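To give a rough idea of the result, here is a minimal sketch of a three-level topology description in which adding a level only means extending an enum; the names and layout are purely illustrative and not the actual kernel structures.

```cpp
// Illustrative only: a three-level CPU topology description.
// Not the actual Haiku kernel code.
#include <stdint.h>

enum cpu_topology_level {
	CPU_TOPOLOGY_SMT,		// logical processor within a core
	CPU_TOPOLOGY_CORE,		// core within a package
	CPU_TOPOLOGY_PACKAGE,	// physical package (socket)

	CPU_TOPOLOGY_LEVELS		// number of levels; a new level is just another entry
};

struct cpu_topology_info {
	// for each logical CPU, its identifier at every topology level,
	// e.g. id[CPU_TOPOLOGY_CORE] is the index of the core it belongs to
	int32_t id[CPU_TOPOLOGY_LEVELS];
};
```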
Once I was done with the low-level stuff, I started working on the simple scheduler and improving its performance on single-processor machines. I had two main goals: to make it O(1) with regard to the number of ready threads (i.e. all scheduler operations take constant time regardless of how many threads there are) and to prevent thread starvation. In order to achieve constant complexity I used a priority queue built from 121 lists (because there are 121 possible thread priorities). To improve performance I added a bitmap that allows the scheduler to quickly find the highest-priority ready thread.
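As an illustration of that design, here is a simplified sketch with one list per priority level and a bitmap with one bit per level; the names, the use of std::deque and the GCC __builtin_clz call are all stand-ins rather than the real scheduler code.

```cpp
// Sketch of an O(1) run queue: one list per priority plus a bitmap so the
// highest non-empty priority level is found by scanning a few 32-bit words.
#include <cstdint>
#include <cstring>
#include <deque>

const int kPriorityCount = 121;		// possible thread priorities 0..120

struct Thread;						// placeholder for the kernel thread structure

struct SimpleRunQueue {
	std::deque<Thread*>	fQueue[kPriorityCount];
	uint32_t			fBitmap[(kPriorityCount + 31) / 32];

	SimpleRunQueue()
	{
		memset(fBitmap, 0, sizeof(fBitmap));
	}

	void Enqueue(Thread* thread, int priority)
	{
		fQueue[priority].push_back(thread);
		fBitmap[priority / 32] |= 1u << (priority % 32);
	}

	Thread* DequeueHighest()
	{
		// scan a handful of words instead of all ready threads
		for (int word = (kPriorityCount + 31) / 32 - 1; word >= 0; word--) {
			if (fBitmap[word] == 0)
				continue;
			// highest set bit in this word (GCC builtin)
			int priority = word * 32 + 31 - __builtin_clz(fBitmap[word]);
			Thread* thread = fQueue[priority].front();
			fQueue[priority].pop_front();
			if (fQueue[priority].empty())
				fBitmap[word] &= ~(1u << (priority % 32));
			return thread;
		}
		return NULL;	// no ready threads
	}
};
```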
Solving the thread starvation problem was trickier. When a thread uses up its whole time slice without voluntarily giving up the processor (being pre-empted by a higher-priority thread doesn’t count), its effective priority is decreased. This fixed problems like #8007, since no thread (except real-time threads, which are not affected by these changes) can prevent lower-priority threads from executing for an extended time. There was still another problem, though. When two CPU-bound threads had earned their maximum penalty, both had their effective priority reduced to 1, and the relation between their priorities was lost. That wasn’t right: a CPU-bound thread with an initial priority of, say, 20 should be given more processor time than a CPU-bound thread with an initial priority of 5. I dealt with that by introducing a limit on how much a thread’s priority can be reduced, which depends on the initial thread priority. After reaching this limit the thread priority cycles between 1 and the limit. This means that the thread still can’t starve anyone, but it is given more CPU time than lower-priority threads.
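Purely as an illustration of the cycling, here is one possible formula, assuming, only for the sake of the example, that a thread may lose at most about a third of its initial priority; the actual limit and formula in the scheduler may differ.

```cpp
// Illustrative only: derive an effective priority from the initial priority
// and the accumulated penalty. The limit (a third of the initial priority)
// and the cycling formula are example choices, not the real scheduler code.
// Real-time threads would bypass this entirely.
#include <algorithm>

int EffectivePriority(int initialPriority, int penalty)
{
	int maxPenalty = std::max(1, initialPriority / 3);	// assumed limit

	if (penalty <= maxPenalty)
		return initialPriority - penalty;

	// beyond the limit the effective priority cycles between 1 and the
	// lowered value, so the thread cannot starve anyone but still gets more
	// CPU time than genuinely low-priority threads
	int lowered = std::max(1, initialPriority - maxPenalty);
	return 1 + (penalty - maxPenalty) % lowered;
}
```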
Then I could concentrate on multiprocessor systems. First, I chose to add support for systems in which all logical processors share all levels of cache (e.g. a single-core processor with HyperThreading). Since the caches are shared, the scheduler doesn’t have to care about cache affinity, so a single shared run queue can be used, which greatly simplifies load balancing. I introduced a priority queue of logical processors, implemented as a heap and thus making the scheduler O(log n) with regard to the number of CPUs, which is used to track the priorities of the threads the processors are currently executing. When a thread becomes ready, its priority is checked against the lowest priority of any running thread, and if it is higher the corresponding running thread is pre-empted.
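A minimal sketch of that wake-up check could look as follows, with the heap keyed by the priority each logical processor is currently running; the names are invented, and a real implementation also has to update a CPU’s key when it switches threads, which std::priority_queue does not support directly.

```cpp
// Sketch of a min-heap of logical processors keyed by the priority of the
// thread each one is currently running. All heap operations are O(log n)
// in the number of CPUs.
#include <queue>
#include <vector>

struct CPUEntry {
	int	fCPU;		// logical processor index
	int	fPriority;	// priority of the thread this CPU is running
};

struct LowerPriority {
	bool operator()(const CPUEntry& a, const CPUEntry& b) const
	{
		return a.fPriority > b.fPriority;	// min-heap on priority
	}
};

std::priority_queue<CPUEntry, std::vector<CPUEntry>, LowerPriority> gCPUHeap;

bool ShouldPreempt(int readyThreadPriority)
{
	// pre-empt the CPU running the lowest-priority thread if the newly
	// ready thread has a higher priority
	return !gCPUHeap.empty()
		&& readyThreadPriority > gCPUHeap.top().fPriority;
}
```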
Having done that, I finally started working on the affine scheduler. It uses a separate run queue for each core so that threads rarely change the core they are running on (actually, at the moment they don’t change cores at all, since thread migration is not ready yet). When a new thread arrives (one that hasn’t been running on any core yet), the scheduler decides which core’s run queue it should be enqueued in. If there are idle cores, one of them is chosen; which one depends on the current scheduler mode of operation. There are currently two modes, which can be changed at any time: performance and power saving. In performance mode the scheduler first tries to wake as many packages as possible, and only then starts waking subsequent cores on each package. The motivation is that when only one core in a package is active, it has the shared cache levels all to itself and can enter boost mode, which raises its frequency above the standard level. In power saving mode the scheduler does the opposite: it tries to keep as many packages as possible idle and first wakes cores on packages that are active anyway. Moreover, in both modes the cores to wake are chosen in LIFO order, i.e. the core that has been idle for the shortest time is woken first, which helps both performance and power saving.
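The sketch below illustrates how an idle core could be picked under the two modes, with each package keeping a LIFO stack of its idle cores; the structures, names and the exact tie-breaking are simplified assumptions, not the actual code.

```cpp
// Illustrative idle-core selection: performance mode spreads work across
// packages, power saving mode packs work onto already-active packages, and
// in both modes the most recently idled core is woken first (LIFO).
#include <cstddef>
#include <vector>

struct Package {
	int					fCoreCount;	// total cores in this package
	std::vector<int>	fIdleCores;	// LIFO stack: most recently idled core on top
};

std::vector<Package> gPackages;

// Returns the core to wake for a newly ready thread, or -1 if no core is idle.
int ChooseIdleCore(bool powerSaving)
{
	Package* candidate = NULL;

	for (size_t i = 0; i < gPackages.size(); i++) {
		Package& package = gPackages[i];
		if (package.fIdleCores.empty())
			continue;	// no idle core in this package

		bool packageFullyIdle
			= (int)package.fIdleCores.size() == package.fCoreCount;

		// performance: prefer an entirely idle package, so the woken core
		// has the shared cache to itself and may enter boost mode;
		// power saving: prefer a package that is already partially active,
		// keeping the remaining packages asleep
		bool preferred = powerSaving ? !packageFullyIdle : packageFullyIdle;
		if (candidate == NULL || preferred) {
			candidate = &package;
			if (preferred)
				break;
		}
	}

	if (candidate == NULL)
		return -1;

	// LIFO choice: the core idle for the shortest time is woken first
	int core = candidate->fIdleCores.back();
	candidate->fIdleCores.pop_back();
	return core;
}
```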
That’s pretty much what I have done so far. The affine scheduler still needs quite a lot of work before it is complete. Choosing a core for a thread when there are no idle ones is not finished yet, there is no thread migration to ensure proper load balancing, and there is still a lot to do to reduce power consumption. Also, because the affine scheduler doesn’t use a global run queue, it could help with lock contention. Unfortunately, Haiku uses gSchedulerLock, which acts like a big kernel lock and allows only one CPU to enter the scheduler at any given time. It is also used heavily by the mutex implementation, which makes things even worse. I plan to replace this giant lock with more fine-grained locking once the affine scheduler is in a more “ready to use” state.
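To illustrate the direction, here is a sketch of the difference between one global lock and one lock per core run queue; this is only an illustration of the idea, not the planned implementation, and std::mutex merely stands in for the kernel’s spinlock primitive.

```cpp
// Sketch only: with per-core run queue locks, CPUs scheduling threads on
// different cores no longer serialize on a single scheduler lock.
#include <mutex>

typedef std::mutex Spinlock;		// stand-in for the kernel spinlock

struct Thread;						// placeholder

// today: a single lock guards every scheduler operation on every CPU
Spinlock gSchedulerLock;

// finer-grained alternative: one lock per core's run queue
const int kMaxCores = 64;			// assumed limit, illustrative only

struct CoreRunQueue {
	Spinlock fLock;					// protects only this core's ready threads
	// ... per-core run queue data ...
};

CoreRunQueue gCoreRunQueues[kMaxCores];

void EnqueueOnCore(int core, Thread* thread)
{
	std::lock_guard<Spinlock> locker(gCoreRunQueues[core].fLock);
	// ... insert the thread into this core's queue only; other cores are
	// free to enter the scheduler at the same time ...
	(void)thread;
}
```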