Implementing Synchronization Primitives for Boost on Windows Platforms
by Anthony Williams
Introduction
The project to rewrite the Boost thread library started in 2005, as part of the drive to have the whole of Boost covered by the Boost Software License; Boost.Thread was under a different (though similar) license, and the original author could no longer be contacted. Though this issue has now been resolved, as the original author was contacted and agreed to the license change, efforts were underway to reimplement the library, and they continue unabated. The reimplementation effort is driven in part due to the proposals for threading to be added to the next version of the C++ Standard, scheduled to be released in 2009.
This article describes the Windows implementations of mutexes for boost in CVS on the thread_rewrite branch at the time of writing. As discussions continue, and alternative implementations are proposed, the final version used in Boost release 1.35 may differ from that described here.
What's a mutex
Mutex is short for "mutual exclusion", and they are used to protect data — only one thread is permitted to "own" the mutex at one time, so only one thread is permitted to access that data at once, the rest are excluded. The other important job performed by a mutex is to ensure that when a thread owns the mutex, the values it sees for the protected data are exactly the same as the values seen by the last thread to own the mutex, including any writes made by that thread. This is particularly important on systems with multiple CPUs, where the threads might be running on different CPUs, and the data will reside in the associated CPU caches for some indeterminate amount of time without specific instructions to ensure cache coherency, and visibility of the data between CPUs.
Native Windows Synchronization Objects
Windows provides an assortment of synchronization objects: CRITICAL_SECTION
s,
Semaphores, Events and Mutexes. Though all of them can be used to ensure that
only one thread at a time has access to a given block of data, CRITICAL_SECTION
s
and Mutexes are the only ones designed for the purpose.
A CRITICAL_SECTION
is a relatively light-weight object, and can only be used
within a single process. Use of a CRITICAL_SECTION
requires use of the
CRITICAL_SECTION
-specific API. In contrast, the other synchronization objects
are kernel objects, and can be used to provide synchronization between
processes, as well as within a single process. All these objects can be used
with the generic HANDLE
-based APIs, such as WaitForSingleObject
and
WaitForMultipleObjectsEx
.
Atomic Operations
As well as the synchronization objects described above, the Windows API provides
a set of atomic operations, commonly implemented as compiler intrinsics that
compile down to a single processor instruction. The most basic of these is the
InterlockedExchange
API, which atomically swaps two 32-bit values, such that all
CPUs and cores in the system will see either the original value, or the new
value.
This property is common to all the atomic APIs (all of which share the
Interlocked
prefix), which include increment, decrement, add, and
compare-and-exchange. The latter comes in both 32-bit, and (on Windows Vista,
Windows Server 2003, and 64-bit platforms) 64-bit variants, and is the basic
workhorse of thread synchronization — all other operations can be built on
it.
The semantics of these basic atomic operations also include the full memory barrier semantics necessary for synchronization between threads running on different CPUs — all writes to memory that occur in the source code before the atomic operation will be completed (and made visible to other CPUs) before the atomic operation, and all reads that in the source code after the atomic operation will not proceed until the atomic operation has complete.
Newer versions of Windows such as Windows Server 2003 and Windows Vista also
provide additional variants of some of the atomic operations that only provide a
write barrier (the xxxRelease
variants), or a read barrier (the xxxAcquire
variants), but these are not used for the implementation under discussion.
Choosing a mutex implementation
The simplest way to implement a mutex would be to write a wrapper class for one of the native Windows synchronization objects; after all, that's what they're there for. Unfortunately, they all have their problems. The Mutex, Event and Semaphore are kernel objects, so every synchronization call requires a context switch to the kernel. This can be rather expensive, especially so when there is no contention.
Boost mutexes are only designed for use within a single process, so the
CRITICAL_SECTION
looks appealing. Unfortunately, there are problems with this,
too. The first of these is that it requires explicit initialization, which means
it cannot reliably be used as part of an object with static storage duration —
the standard static-initialization-order problem is compounded by the potential
of race conditions, especially if the mutex is used as a local static. On most
compilers, dynamic initialization of objects with static storage duration is not
thread-safe, so two threads may race to run the initialization, potentially
leading to the initialization being run twice, or one thread proceding without
waiting for the initialization being run by the other thread to complete.
The second problem is that you can't do a timed wait on a CRITICAL_SECTION
,
which means we need another solution for a mutex that supports timed waits,
anyway.
There is also another problem with using CRITICAL_SECTION
s as a high-performance
mutex, which is that a thread unlocking the CRITICAL_SECTION
will hand-off
ownership to a waiting thread. More on this below.
The need for speed
The primary reason for using anything other than a Windows Mutex kernel object is the need for speed — otherwise, the ease of implementation when using a Mutex object would make it an obvious choice. The fastest way to lock a mutex is using atomic operations — do an atomic exchange of the "locked" flag to set it, and check the old value; if it was already locked, this thread has to wait, otherwise it has the lock — but the problem comes with the "this thread has to wait" part. A waiting thread has to consume as near to zero CPU time as possible until the mutex becomes unlocked, in order to not slow down the running threads.
long flag; void lock() { while(!InterlockedExchange(&flag,1)) { wait(); } } void unlock() { InterlockedExchange(&flag,0); wake_a_waiting_thread(); }
The simple answer, which is the one used by the Windows CRITICAL_SECTION
, is to
use a Windows auto-reset Event to handle contention. When a thread tries to lock
the mutex, and finds it already locked, then it blocks on the event. When the
thread that owns the lock unlocks it, if there is a thread blocked on the Event,
it signals the event to wake the waiting thread. This also handles contention
and priority neatly, in that Windows will wake exactly one thread waiting on an
auto-reset Event, and the kernel can choose the highest priority thread to wake.
The hard part here is the "if there is a thread blocked on the Event" part. Windows won't tell us, so we have to maintain a count of waiting threads in the mutex. We could just set the Event every time, rather than keeping a count, but that would lead to a potentially-unnecessary kernel API call on unlock, and would also cause the first thread to enter the wait loop afterwards to wake immediately, as the Event would stay set. This spurious wake up would just consume CPU for no gain.
Keeping Count
The simplest way to keep count is just to increment the flag rather than always set it to 1. If we just incremented it to 1, then we're the only thread, so we've got the lock. Otherwise, someone else has the lock, so we must wait.
On unlock, we can decrement the count. If it hits zero, it was just us, so we do nothing, otherwise we must wake another thread, and that thread now has the lock:
void lock() { if(InterlockedIncrement(&flag)!=1) { wait_on_event(); } } void unlock() { if(InterlockedDecrement(&flag)!=0) { wake_a_waiting_thread(); } }
Hand-off
This is essentially how CRITICAL_SECTION
s work, and is a nice simple
scheme. Unfortunately, it suffers from a problem called "hand-off". When the
current thread unlocks the mutex, if there are any other threads waiting, it
"hands off" the mutex to one of these even if that thread is lower priority
than the unlocking thread. If the mutex is being locked and unlocked in a loop,
the high priority thread may come round to the lock again, before the lower
priority thread (which now owns the mutex) wakes up. In tight loops, this is
quite likely, since the high priority thread won't yield to the lower priority
thread until it either blocks or hits the end of its timeslice. This means that
the high priority thread now has to wait for the lower priority thread, even
when it could have continued.
To solve this, we need a different scheme, where the mutex is truly unlocked by
unlock()
, however many waiting threads there are.
Avoiding hand-off with compare-and-swap
The solution I've implemented is to reserve one bit of the flag for the "locked" state, whilst keeping the remaining bits for the count of "interested" threads. Consequently, this can no longer be implemented as a single atomic instruction, since it is not a simple addition, or even a simple mask. This is where compare-and-exchange/compare-and-swap (CAS) comes into its own: by using CAS in a loop, you can do any operation you like to a single 32-bit field (or 64-bit field, on those platforms that support it). CAS only sets the field to the new value if it was equal to the specified "old value", but it returns the old one, whatever it was. That way you know whether your set succeeded, and you can decide whether to try again, and what the new value should be based on the actual current value.
Consequently, try_lock()
looks like this:
bool try_lock() { long old_flag=0; do { long const new_flag=(old_flag+1)|lock_flag_value; long const current_flag= InterlockedCompareExchange(&flag,new_flag,old_flag); if(current_flag==old_flag) { return true; } old_flag=current_flag; } while(!(old_flag&lock_flag_value)); return false; }
Which basically says: take the current value, increment the "interested threads" count and set the lock flag. If we succeed in this, we've got the lock. If not, the state must have changed, so try again, unless another thread has the lock, in which case we're done.
Plain lock()
is essentially a try_lock()
in a loop, waiting on an event. The
first time through, we increment the "interested threads" count, whether or not
we get the lock, to mark that we're waiting. If the lock flag was set already,
then we wait on the event. When we wake, we just try and set the lock flag. If
it was still set, we wait on the event again.
bool lock() { long old_flag=0; while(true) { long const new_flag=(old_flag+1)|lock_flag_value; long const current_flag= InterlockedCompareExchange(&flag,new_flag,old_flag); if(current_flag==old_flag) { break; } old_flag=current_flag; } while(old_flag&lock_flag_value) { wait_on_event(); do { long const new_flag=old_flag|lock_flag_value; long const current_flag= InterlockedCompareExchange(&flag,new_flag,old_flag); if(current_flag==old_flag) { break; } old_flag=current_flag; } while(!(old_flag&lock_flag_value)); } }
In contrast, the unlock()
code is quite straight-forward, since the modification
to the flag value is simple: clear the lock flag, and take one off the
count. Since we know the lock flag is always set, this is a simple subtraction,
which we can do with the atomic exchange-and-add operation. If the old value
returned by the exchange-and-add shows there was another thread waiting, signal
the event to wake it.
void unlock() { long const offset=lock_flag_value+1; long const old_flag_value=InterlockedExchangeAdd(&flag,-offset); if(old_flag_value!=offset) { wake_a_waiting_thread(); } }
Timing out
Adding a timeout to the lock function is relatively straight-forward with this
scheme. We need to add a timeout to the wait_on_event()
call, and if it times
out, then we decrease the count of "interested threads" and return false to
indicate that we haven't got the lock. If the wait doesn't time out, then we
proceed just like lock()
, returning true if we do get the lock.
bool timed_lock(timeout_type timeout) { long old_flag=0; while(true) { long const new_flag=(old_flag+1)|lock_flag_value; long const current_flag= InterlockedCompareExchange(&flag,new_flag,old_flag); if(current_flag==old_flag) { break; } old_flag=current_flag; } while(old_flag&lock_flag_value) { if(!wait_on_event(milliseconds_remaining(timeout))) { InterlockedDecrement(&flag); return false; } do { long const new_flag=old_flag|lock_flag_value; long const current_flag= InterlockedCompareExchange(&flag,new_flag,old_flag); if(current_flag==old_flag) { break; } old_flag=current_flag; } while(!(old_flag&lock_flag_value)); } return true; }
The only issue here is the fact that the timeout on the wait and the decrement of the count are not a single atomic operation. Consequently, if this thread is the only waiting thread, and the current owner unlocks the mutex between the wait timing out and the flag being decremented, then the event will be signalled, even though there are no waiting threads. This will cause a spurious wake-up of the next thread to wait on the mutex, which is unfortunate, but not a disaster. The alternative, which is that rather than just decrementing the count, we try and acquire the lock, and only decrement the count if we don't get it, also has the potential for spurious wake-ups. If the timing-out thread is not the only waiting thread, but the mutex is unlocked between the timeout and the decrement, then another of the waiting threads will wake. If we acquire the lock instead of just decrementing the count, then the other waiting thread will have woken for no reason.
Events and Initialization
Up to now I've glossed over the wait_on_event()
and wake_a_waiting_thread()
calls, but they're quite important to the whole scheme. They're also rather
simple:
bool wait_on_event(unsigned milliseconds=INFINITE) { return WaitForSingleObject(get_event(),milliseconds)==0; } void wake_a_waiting_thread() { SetEvent(get_event()); }
Since we're using an auto-reset event, the event object is automatically reset
when the first waiting thread wakes from the WaitForSingleObject call. The
complication is get_event()
. In order to permit static initialization, our mutex
cannot have a constructor, so the members have to be initialized with fixed
values — this was one of the problems with CRITICAL_SECTION
s. Therefore, we
need to create the event object in the get_event()
call, in such a way that if
multiple threads call get_event()
concurrently, they all return the same event
object. We do this with yet more atomic operations — first we read the current
Event handle; if it's NULL
, then we create a new Event, and try and swap it into
place. If we succeeded, then we're using the Event we created, otherwise another
thread beat us to it, so we use that, and destroy the one we created.
HANDLE event; HANDLE get_event() { HANDLE* current_event= InterlockedCompareExchangePointer(&event,0,0); if(current_event) { return current_event; } HANDLE* new_event=CreateEvent(NULL,false,false,NULL); if(!new_event) { throw thread_resource_error(); } current_event= InterlockedCompareExchangePointer(&event,new_event,0); if(!current_event) { return new_event; } else { CloseHandle(new_event); return current_event; } }
Initialization is thus straight-forward: for objects with static storage duration, zero initialization will suffice, and for objects with non-static storage duration, explicit initialization to zero using an aggregate initializer will suffice.
void f() { static mutex sm; // zero init works OK for static mutex m={0}; // aggregate initialization required for auto static mutex sm2={0}; // aggregate init works OK for static, too }
This is important, as any use of a constructor would require dynamic initialization, which occurs at runtime, and therefore may be subject to race conditions. This is particularly a problem for objects with static storage duration at block scope, since the constructor is run the first time control passes through the definition of the object; if multiple threads are running concurrently, it is possible for two threads to believe they are "the first", and thus both run the constructor. It is also possible that one thread will start running the constructor, and another thread will then proceed before the constructor has completed.
Cleaning up
Since such a mutex uses a Windows Event object, it needs to tidy up after itself, otherwise you have a resource leak, which is less than ideal.
For automatic and dynamic mutexes, clean-up is easy: the destructor should destroy the Event. For statics, clean-up is a bit more complicated.
Destructors for objects with static storage duration are called in the reverse order of their construction. Unfortunately, this is not necessarily the best order, especially for block-scope statics. Block-scope statics have the additional problem of a potential race condition on "the first time through", which can lead to the destructor being run multiple times on popular compilers, which is undefined behaviour — compilers often use the equivalent of a call to atexit in order to ensure that destruction is done in the reverse order of construction, and the initialization race that may cause the constructor to be run twice may also cause the destructor to be registered twice. If any threads continue after main, this problem is compounded, as the unpredictable destruction order means that the mutex may be accessed after it has been destroyed, again leading to undefined behaviour.
We do know, however, that Windows Objects allocated by a program are freed when the program exits, so for objects of static storage duration, we can get by without a destructor, which neatly sidesteps the "access after destruction" and "multiple destruction" issues. It does make it hard to use the same class for static objects and non-static objects, though.
Copping Out
The current boost spec requires that instances of the mutex type are usable without an explicit initializer. This means that it is not possible to satisfy the requirements for both static and non-static storage duration without resorting to compiler-specific techniques, or some form of "named mutex" technique that doesn't require storing state within the mutex.
The current implementation of boost::mutex
cops out, and has a default
constructor and a destructor. This makes it dangerous to use as a block-scope
static, but yields correct behaviour for objects with non-static storage
duration, and objects of namespace scope, provided care is taken with
initialization order.
Though this is the same as for previous boost releases, this situation is less than ideal, however, and the search continues for a way of ensuring correct initialization in all cases.
It has been suggested that boost adds a static_mutex
, which would then be
portable to other platforms, but this is not available at this time, and
building a safe-for-all-uses mutex would be preferable.
The basic_timed_mutex
used in the Windows implementation has no constructor or
destructor, and could therefore safely be used with static storage duration as
described above. Use at non-static storage duration requires calling the init()
and destroy()
member functions in place of the constructor and destructor. This
class is a detail of the current thread_rewrite Windows implementation, however,
and its use is not therefore recommended.
Generalizing to Other Platforms
Whilst the implementation described here is Windows-specific, the majority of
the code is just using the InterlockedXXX
functions. These functions are
generally just compiler intrinsics for single CPU instructions, so could easily
be implemented on non-Windows platforms with a bit of inline assembly, or by
replacing them with the equivalent calls to OS functions.
The larger bit of non-portability comes from the use of Event objects. These are
essentially just binary semaphores, so can easily be replaced by semaphores on
those systems that support them (e.g. POSIX systems). POSIX semaphores are not
quite ideal, though — they would have to be dynamically allocated using new
or
malloc
in get_event
, and they aren't limited to just set/unset, so there is the
potential of spurious wake-ups. This wouldn't prevent the scheme described from
working, since it allows for spurious wake-ups, but it would be less than
ideal. A condition variable could be used instead, but that also has the
potential for spurious wake-ups, and might have more overhead, since it requires
use of an OS mutex in addition.
Future Plans
Work is still under way to identify a solution to the initialization and clean up problems. Under the new C++ Standard, initialization may be easier, as there is a proposal under consideration for generalized constant expressions, which would enable simple default constructors to be used for static initialization, and thus solve the race conditions due to initialization. Unfortunately, this does not solve the corresponding clean up problems. There are also proposals under consideration to address thread-safe initialization of static objects.
In any case, the new C++ Standard won't be published before 2009, and it will then be a few years before compilers supporting it become common, so this is still an important issue.
Conclusion
Implementing synchronization primitives is hard. Ensuring they are correct is hard enough, but fairness issues and initialization problems just make the whole thing harder.
Thankfully, most of the time we can just rely on libraries such as Boost, and not have to think about the issues. It does mean that as the implementor of such a library it is even more important to get things right.
As a library user, understanding the issues involved can help us to see the reason behind particular restrictions the library places on us, and can help us write better code.
Acknowledgements
Thanks to Peter Dimov and Alexander Terekhov who pointed out the hand-off problem to me, and suggested using a swap-based method to avoid it. Thanks also to Roland Schwarz for reviewing the code, and proposing alternative initialization and implementation techniques.
Design and Content Copyright © 2005-2025 Just Software Solutions Ltd. All rights reserved. | Privacy Policy