Blog Archive for June 2008
Condition Variable Spurious Wakes
Friday, 27 June 2008
Condition variables are a useful mechanism for waiting until an
event occurs or some "condition" is satisfied. For example, in my implementation
of a thread-safe queue I use a condition variable to avoid
busy-waiting in wait_and_pop()
when the queue is
empty. However, condition variables have one "feature" which is a
common source of bugs: a wait on a condition variable may return even
if the condition variable has not been notified. This is called a
spurious wake.
Spurious wakes cannot be predicted: they are essentially random from the user's point of view. However, they commonly occur when the thread library cannot reliably ensure that a waiting thread will not miss a notification. Since a missed notification would render the condition variable useless, the thread library wakes the thread from its wait rather than take the risk.
Bugs due to spurious wakes
Consider the code for wait_and_pop from my thread-safe queue:
void wait_and_pop(Data& popped_value)
{
    boost::mutex::scoped_lock lock(the_mutex);
    while(the_queue.empty())
    {
        the_condition_variable.wait(lock);
    }
    popped_value=the_queue.front();
    the_queue.pop();
}
If we know that there's only one consumer thread, it would be tempting to write this with an if instead of a while, on the assumption that there's only one thread waiting, so if it's been notified, the queue must not be empty:
if(the_queue.empty()) // Danger, Will Robinson
{
    the_condition_variable.wait(lock);
}
With the potential of spurious wakes this is not safe: the wait might finish even if the condition variable was not notified. We therefore need the while, which has the added benefit of allowing multiple consumer threads: we don't need to worry that another thread might remove the last item from the queue, since we're checking whether the queue is empty before proceeding.
That's the beginner's bug, and one that's easily overcome with a simple rule: always check your predicate in a loop when waiting with a condition variable. The more insidious bug comes from timed_wait().
Timing is everything
condition_variable::wait() has a companion function that allows the user to specify a time limit on how long they're willing to wait: condition_variable::timed_wait(). This function comes as a pair of overloads: one that takes an absolute time, and one that takes a duration. The absolute time overload will return once the clock reaches the specified time, whether or not it was notified. The duration overload will return once the specified duration has elapsed: if you say to wait for 3 seconds, it will stop waiting after 3 seconds. The insidious bug comes from the overload that takes a duration.
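To make the difference concrete, here's a quick sketch of my own (not from the original queue code) showing what a call to each overload looks like with Boost.Thread, reusing the_mutex and the_condition_variable from the queue:

// Illustrative sketch only: real code would check a predicate in a loop, as below.
boost::mutex::scoped_lock lock(the_mutex);

// Duration overload: waits up to 3 seconds measured from the point of this call.
bool const notified_in_time=
    the_condition_variable.timed_wait(lock,boost::posix_time::seconds(3));

// Absolute-time overload: waits until the given clock time is reached.
boost::system_time const deadline=
    boost::get_system_time()+boost::posix_time::seconds(3);
bool const notified_before_deadline=
    the_condition_variable.timed_wait(lock,deadline);

// Both overloads return false if the time limit was reached without a notification.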
Suppose we wanted to add a timed_wait_and_pop() function to our queue that allowed the user to specify a duration to wait. We might be tempted to write it as:
template<typename Duration>
bool timed_wait_and_pop(Data& popped_value, Duration const& timeout)
{
    boost::mutex::scoped_lock lock(the_mutex);
    while(the_queue.empty())
    {
        if(!the_condition_variable.timed_wait(lock,timeout))
            return false;
    }
    popped_value=the_queue.front();
    the_queue.pop();
    return true;
}
At first glance this looks fine: we're handling spurious wakes by looping on the timed_wait() call, and we're passing the timeout in to that call. Unfortunately, the timeout is a duration, so every call to timed_wait() will wait up to the specified amount of time. If the timeout was 1 second, and the timed_wait() call woke due to a spurious wake after 0.9 seconds, the next time round the loop would wait for a further 1 second. In theory this could continue ad infinitum, completely defeating the purpose of using timed_wait() in the first place.
The solution is simple: use the absolute time overload instead. By specifying a particular clock time as the timeout, the remaining wait time decreases with each call. This requires that we determine the final timeout prior to the loop:
template<typename Duration>
bool timed_wait_and_pop(Data& popped_value, Duration const& wait_duration)
{
    boost::system_time const timeout=boost::get_system_time()+wait_duration;
    boost::mutex::scoped_lock lock(the_mutex);
    while(the_queue.empty())
    {
        if(!the_condition_variable.timed_wait(lock,timeout))
            return false;
    }
    popped_value=the_queue.front();
    the_queue.pop();
    return true;
}
Though this solves the problem, it's an easy mistake to make. Thankfully, there is a better way to wait that doesn't suffer from this problem: pass the predicate to the condition variable.
Passing the predicate to the condition variable
Both wait() and timed_wait() come with additional overloads that allow the user to specify the condition being waited for as a predicate. These overloads encapsulate the while loops from the examples above, and ensure that spurious wakes are correctly handled. All that is required is that the condition being waited for can be checked by means of a simple function call or a function object, which is passed as an additional parameter to the wait() or timed_wait() call.

wait_and_pop() can therefore be written like this:
struct queue_not_empty
{
    std::queue<Data>& queue;

    queue_not_empty(std::queue<Data>& queue_):
        queue(queue_)
    {}
    bool operator()() const
    {
        return !queue.empty();
    }
};

void wait_and_pop(Data& popped_value)
{
    boost::mutex::scoped_lock lock(the_mutex);
    the_condition_variable.wait(lock,queue_not_empty(the_queue));
    popped_value=the_queue.front();
    the_queue.pop();
}
and timed_wait_and_pop() can be written like this:
template<typename Duration>
bool timed_wait_and_pop(Data& popped_value, Duration const& wait_duration)
{
    boost::mutex::scoped_lock lock(the_mutex);
    if(!the_condition_variable.timed_wait(lock,wait_duration,
                                          queue_not_empty(the_queue)))
        return false;
    popped_value=the_queue.front();
    the_queue.pop();
    return true;
}
Note that what we're waiting for is the queue not to be empty — the predicate is the reverse of the condition we would put in the while loop. This will be much easier to specify when compilers implement the C++0x lambda facilities.
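For example, a sketch of wait_and_pop() using a C++0x lambda as the predicate might look like this (illustrative only, assuming a compiler with lambda support; the queue members are the same as above):

void wait_and_pop(Data& popped_value)
{
    boost::mutex::scoped_lock lock(the_mutex);
    // The lambda states the condition directly, capturing the queue by reference.
    the_condition_variable.wait(lock,[&]{return !the_queue.empty();});
    popped_value=the_queue.front();
    the_queue.pop();
}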
Conclusion
Spurious wakes can cause some unfortunate bugs, which are hard to track down precisely because of their unpredictability. These problems can be avoided by ensuring that plain wait() calls are made in a loop, and that the timeout is correctly calculated for timed_wait() calls. If the predicate can be packaged as a function or function object, using the predicated overloads of wait() and timed_wait() avoids all the problems.
Posted by Anthony Williams
[/ threading /] permanent link
Comments Now Enabled - what would you like to see?
Thursday, 26 June 2008
I have now updated my blog engine to allow comments on my blog posts, so please give it a whirl.
To kick things off, please add a comment on this entry if there's something you'd like me to cover on my blog, and I'll pick the ones I feel able to write about as topics for future posts.
If you're viewing this post in an RSS reader, you'll have to actually go to the website to comment. If you're viewing this post on one of the blog directory pages, click on the title or follow the "Permanent Link" to get to the entry page.
Any comments I feel are inappropriate or spam will be deleted.
Posted by Anthony Williams
[/ news /] permanent link
Exceptions make for Elegant Code
Friday, 06 June 2008
On this week's Stack Overflow podcast, Joel comes out quite strongly against exceptions, on the basis that they are hidden flow paths. Whilst I can sympathise with the idea of making every possible control path in a routine explicitly visible, having just had to write some C code for a recent project I have to say that doing so actually makes the code a lot harder to follow: the code for what the routine is really doing is hidden amongst a load of error checking.
Whether or not you use exceptions, you have the same number of possible flow paths. With exceptions, however, the code can be a lot cleaner than with error codes, as you don't have to write a check after every function call to verify that it succeeded before proceeding with the rest of the function. Instead, the code tells you when something has gone wrong by throwing an exception.
Exceptions also simplify the function signature: rather than having to add an additional parameter to hold the potential error code, or to hold the function result (because the return value is used for the error code), exceptions allow the function signature to specify exactly what is appropriate for the task at hand, with errors being reported "out-of-band". Yes, some functions use errno, which helps by providing a similar out-of-band error channel, but it's not a panacea: you have to check and clear it between every call, otherwise you might be passing invalid data into subsequent functions. Also, it requires that you have a value you can use for the return type in the case that an error occurs. With exceptions you don't have to worry about either of these, as they interrupt the code at the point of the error, and you don't have to supply a return value.
Here are three implementations of the same function using error code returns, errno and exceptions:
int foo_with_error_codes(some_type param1,other_type param2,result_type* result)
{
    int error=0;
    intermediate_type temp;
    if((error=do_blah(param1,23,&temp)) ||
       (error=do_flibble(param2,temp,result)))
    {
        return error;
    }
    return 0;
}

result_type foo_with_errno(some_type param1,other_type param2)
{
    errno=0;
    intermediate_type temp=do_blah(param1,23);
    if(errno)
    {
        return dummy_result_type_value;
    }
    return do_flibble(param2,temp);
}

result_type foo_with_exceptions(some_type param1,other_type param2)
{
    return do_flibble(param2,do_blah(param1,23));
}
Error Recovery
In all three cases, I've assumed that there's no recovery required if do_blah succeeds but do_flibble fails. If recovery were required, additional code would be needed. It could be argued that this is where the problems with exceptions begin, as the code paths for exceptions are hidden, and it is therefore unclear where the cleanup must be done. However, if you design your code with exceptions in mind, I find you still get elegant code. try/catch blocks are ugly: this is where deterministic destruction comes into its own. By encapsulating resources, and performing changes in an exception-safe manner, you end up with elegant code that behaves gracefully in the face of exceptions, without cluttering the "happy path". Here's some code:
int foo_with_error_codes(some_type param1,other_type param2,result_type* result)
{
    int error=0;
    intermediate_type temp;
    if(error=do_blah(param1,23,&temp))
    {
        return error;
    }
    if(error=do_flibble(param2,temp,result))
    {
        cleanup_blah(temp);
        return error;
    }
    return 0;
}

result_type foo_with_errno(some_type param1,other_type param2)
{
    errno=0;
    intermediate_type temp=do_blah(param1,23);
    if(errno)
    {
        return dummy_result_type_value;
    }
    result_type res=do_flibble(param2,temp);
    if(errno)
    {
        cleanup_blah(temp);
        return dummy_result_type_value;
    }
    return res;
}

result_type foo_with_exceptions(some_type param1,other_type param2)
{
    return do_flibble(param2,do_blah(param1,23));
}

result_type foo_with_exceptions2(some_type param1,other_type param2)
{
    blah_cleanup_guard temp(do_blah(param1,23));
    result_type res=do_flibble(param2,temp);
    temp.dismiss();
    return res;
}
In the error code cases, we need to clean up explicitly on error, by calling cleanup_blah. In the exception case we've got two possibilities, depending on how your code is structured. In foo_with_exceptions, everything is just handled directly: if do_flibble doesn't take ownership of the intermediate data, it cleans itself up. This might well be the case if do_blah returns a type that handles its own resources, such as std::string or boost::shared_ptr. If explicit cleanup might be required, we can write a resource management class such as the blah_cleanup_guard used by foo_with_exceptions2, which takes ownership of the effects of do_blah, and calls cleanup_blah in its destructor unless we call dismiss to indicate that everything went OK.
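The post doesn't show blah_cleanup_guard itself, but a minimal sketch might look something like the following; it assumes do_blah returns an intermediate_type by value, and that an implicit conversion is enough for do_flibble to accept the guard in place of the raw value:

// Hypothetical sketch of blah_cleanup_guard (not shown in the original post).
class blah_cleanup_guard
{
    intermediate_type value;
    bool dismissed;

    blah_cleanup_guard(blah_cleanup_guard const&);            // non-copyable
    blah_cleanup_guard& operator=(blah_cleanup_guard const&);
public:
    explicit blah_cleanup_guard(intermediate_type value_):
        value(value_),dismissed(false)
    {}
    operator intermediate_type&()   // lets the guard be passed straight to do_flibble
    {
        return value;
    }
    void dismiss()                  // call once ownership has safely passed on
    {
        dismissed=true;
    }
    ~blah_cleanup_guard()
    {
        if(!dismissed)
        {
            cleanup_blah(value);
        }
    }
};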
Real Examples
That's enough waffling about made up examples, let's look at some real code. Here's something simple: adding a new value to a
dynamic array of DataType
objects held in a simple dynamic_array
class. Let's assume that objects of
DataType
can somehow fail to be copied: maybe they allocate memory internally, which may therefore fail. We'll also use
a really dumb algorithm that reallocates every time a new element is added. This is not for any reason other than it simplifies the
code: we don't need to check whether or not reallocation is needed.
If we're using exceptions, that failure will manifest as an exception, and our code looks like this:
class DataType
{
public:
    DataType(const DataType& other);
};

class dynamic_array
{
private:
    class heap_data_holder
    {
        DataType* data;
        unsigned initialized_count;
    public:
        heap_data_holder():
            data(0),initialized_count(0)
        {}
        explicit heap_data_holder(unsigned max_count):
            data((DataType*)malloc(max_count*sizeof(DataType))),
            initialized_count(0)
        {
            if(!data)
            {
                throw std::bad_alloc();
            }
        }
        void append_copy(DataType const& value)
        {
            new (data+initialized_count) DataType(value);
            ++initialized_count;
        }
        void swap(heap_data_holder& other)
        {
            std::swap(data,other.data);
            std::swap(initialized_count,other.initialized_count);
        }
        unsigned get_count() const
        {
            return initialized_count;
        }
        ~heap_data_holder()
        {
            for(unsigned i=0;i<initialized_count;++i)
            {
                data[i].~DataType();
            }
            free(data);
        }
        DataType& operator[](unsigned index)
        {
            return data[index];
        }
    };

    heap_data_holder data;

    // no copying for now
    dynamic_array& operator=(dynamic_array& other);
    dynamic_array(dynamic_array& other);
public:
    dynamic_array()
    {}

    void add_element(DataType const& new_value)
    {
        heap_data_holder new_data(data.get_count()+1);
        for(unsigned i=0;i<data.get_count();++i)
        {
            new_data.append_copy(data[i]);
        }
        new_data.append_copy(new_value);
        new_data.swap(data);
    }
};
On the other hand, if we can't use exceptions, the code looks like this:
class DataType
{
public:
    DataType(const DataType& other);
    int get_error();
};

class dynamic_array
{
private:
    class heap_data_holder
    {
        DataType* data;
        unsigned initialized_count;
        int error_code;
    public:
        heap_data_holder():
            data(0),initialized_count(0),error_code(0)
        {}
        explicit heap_data_holder(unsigned max_count):
            data((DataType*)malloc(max_count*sizeof(DataType))),
            initialized_count(0),
            error_code(0)
        {
            if(!data)
            {
                error_code=out_of_memory;
            }
        }
        int get_error() const
        {
            return error_code;
        }
        int append_copy(DataType const& value)
        {
            new (data+initialized_count) DataType(value);
            if(data[initialized_count].get_error())
            {
                int const error=data[initialized_count].get_error();
                data[initialized_count].~DataType();
                return error;
            }
            ++initialized_count;
            return 0;
        }
        void swap(heap_data_holder& other)
        {
            std::swap(data,other.data);
            std::swap(initialized_count,other.initialized_count);
        }
        unsigned get_count() const
        {
            return initialized_count;
        }
        ~heap_data_holder()
        {
            for(unsigned i=0;i<initialized_count;++i)
            {
                data[i].~DataType();
            }
            free(data);
        }
        DataType& operator[](unsigned index)
        {
            return data[index];
        }
    };

    heap_data_holder data;

    // no copying for now
    dynamic_array& operator=(dynamic_array& other);
    dynamic_array(dynamic_array& other);
public:
    dynamic_array()
    {}

    int add_element(DataType const& new_value)
    {
        heap_data_holder new_data(data.get_count()+1);
        if(new_data.get_error())
            return new_data.get_error();
        for(unsigned i=0;i<data.get_count();++i)
        {
            int const error=new_data.append_copy(data[i]);
            if(error)
                return error;
        }
        int const error=new_data.append_copy(new_value);
        if(error)
            return error;
        new_data.swap(data);
        return 0;
    }
};
It's not too dissimilar, but there are a lot of checks for error codes: add_element has gone from 10 lines to 17, which is almost double, and there are also additional checks in the heap_data_holder class. In my experience, this is typical: if you have to explicitly write error checks at every failure point rather than use exceptions, your code can get quite a lot larger for no gain. Also, the constructor of heap_data_holder can no longer report failure directly: it must store the error code for later retrieval. To my eyes, the exception-based version is a whole lot clearer and more elegant, as well as being shorter: a net gain over the error-code version.
Conclusion
I guess it's a matter of taste, but I find code that uses exceptions is shorter, clearer, and actually has fewer bugs than code that uses error codes. Yes, you have to think about the consequences of an exception, and at which points in the code an exception can be thrown, but you have to do that anyway with error codes, and it's easy to write simple resource management classes to ensure everything is taken care of.
Posted by Anthony Williams
[/ design /] permanent link
Tags: exceptions, elegance, software