Blog Archive for / 2008 / 12 /
Managing Threads with a Vector
Wednesday, 10 December 2008
One of the nice things about C++0x is the support for move
semantics that comes from the new Rvalue
Reference language feature. Since this is a language feature, it
means that we can easily have types that are movable but not
copyable without resorting to std::auto_ptr
-like
hackery. One such type is the new std::thread
class. A
thread of execution can only be associated with one
std::thread
object at a time, so std::thread
is not copyable, but it is movable — this
allows you to transfer ownership of a thread of execution between
objects, and return std::thread
objects from
functions. The important point for today's blog post is that it allows
you to store std::thread
objects in containers.
Move-aware containers
The C++0x standard containers are required to be move-aware, and
move objects rather than copy them when changing
their position within the container. For existing copyable types that
don't have a specific move constructor or move-assignment
operator that takes an rvalue reference this amounts to the same
thing — when a std::vector
is resized, or an
element is inserted in the middle, the elements will be copied to
their new locations. The important difference is that you can now
store types that only have a move-constructor and
move-assignment operator in the standard containers because the
objects are moved rather than copied.
This means that you can now write code like:
std::vector<std::thread> v; v.push_back(std::thread(some_function));
and it all "just works". This is good news for managing multiple
threads where the number of threads is not known until run-time
— if you're tuning the number of threads to the number of
processors, using
std::thread::hardware_concurrency()
for example. It also
means that you can then use the
std::vector<std::thread>
with the standard library
algorithms such as std::for_each
:
void do_join(std::thread& t) { t.join(); } void join_all(std::vector<std::thread>& v) { std::for_each(v.begin(),v.end(),do_join); }
If you need an extra thread because one of your threads is blocked
waiting for something, you can just use insert()
or
push_back()
to add a new thread to the
vector
. Of course you can also just move threads into or
out of the vector
by indexing the elements directly:
std::vector<std::thread> v(std::thread::hardware_concurrency()); for(unsigned i=0;i<v.size();++i) { v[i]=std::thread(do_work); }
In fact, many of the examples in my
book use std::vector<std::thread>
for managing
the threads, as it's the simplest way to do it.
Other containers work too
It's not just std::vector
that's required to be
move-aware — all the other standard containers are too. This
means you can have a std::list<std::thread>
, or a
std::deque<std::thread>
, or even a
std::map<int,std::thread>
. In fact, the whole C++0x
standard library is designed to work with move-only types such as
std::thread
.
Try it out today
Wouldn't it be nice if you could try it out today, and get used to
using containers of std::thread
objects without having to
wait for a C++0x compiler? Well, you can — the
0.6 beta of the just::thread
C++0x
Thread Library released last Friday provides a specialization of
std::vector<std::thread>
so that you can write code
like in these examples and it will work with Microsoft Visual Studio
2008. Sign up at the just::thread
Support Forum to download it today.
Posted by Anthony Williams
[/ threading /] permanent link
Tags: threads, C++0x, vector, multithreading, concurrency
Stumble It! | Submit to Reddit
| Submit to DZone
If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.
Peterson's lock with C++0x atomics
Friday, 05 December 2008
Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:
"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."
I'd like to offer an analysis of both algorithms to see if they are correct, below. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
This is something I wholeheartedly agree with. If you weren't
convinced by my previous post on Memory
Models and Synchronization, maybe the proof below will convince
you to stick to memory_order_seq_cst
(the default) unless
you really need to do otherwise.
C++0x memory ordering recap
In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:
- Synchronizes-with
- An operation A synchronizes-with an
operation B if A is a store to some
atomic variable
m
, with an ordering ofstd::memory_order_release
, orstd::memory_order_seq_cst
, B is a load from the same variablem
, with an ordering ofstd::memory_order_acquire
orstd::memory_order_seq_cst
, and B reads the value stored by A. - Happens-before
- An operation A happens-before an
operation B if:
- A is performed on the same thread as B, and A is before B in program order, or
- A synchronizes-with B, or
- A happens-before some other operation C, and C happens-before B.
std::memory_order_consume
, but this is enough for now.
If all your operations use std::memory_order_seq_cst
,
then there is the additional constraint of total ordering, as I
mentioned before, but neither of the implementations in question use
any std::memory_order_seq_cst
operations, so we
can leave that aside for now.
Now, let's look at the implementations.
Bartosz's implementation
I've extracted the code for Bartosz's implementation from his posts, and it is shown below:
class Peterson_Bartosz { private: // indexed by thread ID, 0 or 1 std::atomic<bool> _interested[2]; // who's yielding priority? std::atomic<int> _victim; public: Peterson_Bartosz() { _victim.store(0, std::memory_order_release); _interested[0].store(false, std::memory_order_release); _interested[1].store(false, std::memory_order_release); } void lock() { int me = threadID; // either 0 or 1 int he = 1 ? me; // the other thread _interested[me].exchange(true, std::memory_order_acq_rel); _victim.store(me, std::memory_order_release); while (_interested[he].load(std::memory_order_acquire) && _victim.load(std::memory_order_acquire) == me) continue; // spin } void unlock() { int me = threadID; _interested[me].store(false,std::memory_order_release); } }
There are three things to prove with Peterson's lock:
- If thread 0 successfully acquires the lock, then thread 1 will not do so;
- If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
- If thread 0 fails to acquire the lock, then thread 1 does so.
Let's look at each in turn.
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially _victim
is 0, and the
_interested
variables are both false
. The
call to lock()
from thread 0 will then set
_interested[0]
to true
, and
_victim
to 0.
The loop then checks _interested[1]
, which is still
false
, so we break out of the loop, and the lock is
acquired.
So, what about thread 1? Thread 1 now comes along and tries to
acquire the lock. It sets _interested[1]
to
true
, and _victim
to 1, and then enters the
while
loop. This is where the fun begins.
The first thing we check is _interested[0]
. Now, we
know this was set to true
in thread 0 as it acquired the
lock, but the important thing is: does the CPU running thread 1 know
that? Is it guaranteed by the memory model?
For it to be guaranteed by the memory model, we have to prove that
the store to _interested[0]
from thread 0
happens-before the load from thread 1. This is trivially true
if we read true
in thread 1, but that doesn't help: we
need to prove that we can't read false
. We
therefore need to find a variable which was stored by thread 0, and
loaded by thread 1, and our search comes up empty:
_interested[1]
is loaded by thread 1 as part of the
exchange
call, but it is not written by thread 0, and
_victim
is written by thread 1 without reading the value
stored by thread 0. Consequently, there is no ordering
guarantee on the read of _interested[0]
, and thread 1
may also break out of the while
loop and acquire
the lock.
This implementation is thus broken. Let's now look at Dmitriy's implementation.
Dmitriy's implementation
Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:
std::atomic<int> flag0(0),flag1(0),turn(0); void lock(unsigned index) { if (0 == index) { flag0.store(1, std::memory_order_relaxed); turn.exchange(1, std::memory_order_acq_rel); while (flag1.load(std::memory_order_acquire) && 1 == turn.load(std::memory_order_relaxed)) std::this_thread::yield(); } else { flag1.store(1, std::memory_order_relaxed); turn.exchange(0, std::memory_order_acq_rel); while (flag0.load(std::memory_order_acquire) && 0 == turn.load(std::memory_order_relaxed)) std::this_thread::yield(); } } void unlock(unsigned index) { if (0 == index) { flag0.store(0, std::memory_order_release); } else { flag1.store(0, std::memory_order_release); } }
So, how does this code fare?
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially the turn
, flag0
and
flag1
variables are all 0. The call to
lock()
from thread 0 will then set flag0
to
1, and turn
to 1. These variables are essentially
equivalent to the variables in Bartosz's implementation, but
turn
is set to 0 when _victim
is set to 1,
and vice-versa. That doesn't affect the logic of the code.
The loop then checks flag1
, which is still 0, so we
break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to
acquire the lock. It sets flag1
to 1, and
turn
to 0, and then enters the while
loop. This is where the fun begins.
As before, the first thing we check is flag0
. Now, we
know this was set to 1 in thread 0 as it acquired the lock, but the
important thing is: does the CPU running thread 1 know that? Is it
guaranteed by the memory model?
Again, for it to be guaranteed by the memory model, we have to
prove that the store to flag0
from thread 0
happens-before the load from thread 1. This is trivially true
if we read 1 in thread 1, but that doesn't help: we need to prove that
we can't read 0. We therefore need to find a variable which
was stored by thread 0, and loaded by thread 1, as before.
This time our search is successful: turn
is set using
an exchange
operation, which is a read-modify-write
operation. Since it uses std::memory_order_acq_rel
memory
ordering, it is both a load-acquire and a store-release. If the load
part of the exchange
reads the value written by thread 0,
we're home dry: turn
is stored with a similar
exchange
operation with
std::memory_order_acq_rel
in thread 0, so the store from
thread 0 synchronizes-with the load from thread 1.
This means that the store to flag0
from thread 0
happens-before the exchange
on turn
in thread 1, and thus happens-before the load
in
the while
loop. The load
in the
while
loop thus reads 1 from flag0
, and
proceeds to check turn
.
Now, since the store to turn
from thread 0
happens-before the store from thread 1 (we're relying on that
for the happens-before relationship on flag0
,
remember), we know that the value to be read will be the value we
stored in thread 1: 0. Consequently, we keep looping.
OK, so if the store to turn
in thread 1 reads the
value stored by thread 0 then thread 1 will stay out of the lock, but
what if it doesn't read the value store by thread 0? In this
case, we know that the exchange
call from thread 0 must
have seen the value written by the exchange
in thread 1
(writes to a single atomic variable always become visible in the same
order for all threads), which means that the write to
flag1
from thread 1 happens-before the read in
thread 0 and so thread 0 cannot have acquired the lock. Since this was
our initial assumption (thread 0 has acquired the lock), we're home
dry — thread 1 can only acquire the lock if thread 0 didn't.
If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock
OK, so we've got as far as thread 0 acquiring the lock and thread 1
waiting. What happens if thread 0 now releases the lock? It does this
simply by writing 0 to flag0
. The while
loop
in thread 1 checks flag0
every time round, and breaks out
if the value read is 0. Therefore, thread 1 will eventually acquire
the mutex. Of course, there is no guarantee when it will
acquire the mutex — it might take arbitrarily long for the the
write to flag0
to make its way to thread 1, but it will
get there in the end. Since flag0
is never written by
thread 1, it doesn't matter whether thread 0 has already released the
lock when thread 1 starts waiting, or whether thread 1 is already
waiting — the while
loop will still terminate, and
thread 1 will acquire the lock in both cases.
That just leaves our final check.
If thread 0 fails to acquire the lock, then thread 1 does so
We've essentially already covered this when we checked that thread
1 doesn't acquire the lock if thread 0 does, but this time we're going
in reverse. If thread 0 doesn't acquire the lock, it is because it
sees flag1
as 1 and turn
as 1. Since
flag1
is only written by thread 1, if it is 1 then thread
1 must have at least called lock()
. If thread 1 has
called unlock
then eventually flag1
will be
read as 0, so thread 0 will acquire the lock. So, let's assume for now
that thread 1 hasn't got that far, so flag1
is still
1. The next check is for turn
to be 1. This is the value
written by thread 0. If we read it as 1 then either the write to
turn
from thread 1 has not yet become visible to thread
0, or the write happens-before the write by thread 0, so the
write from thread 0 overwrote the old value.
If the write from thread 1 happens-before the write from
thread 0 then thread 1 will eventually see turn
as 1
(since the last write is by thread 0), and thus thread 1 will acquire
the lock. On the other hand, if the write to turn
from
thread 0 happens-before the write to turn
from
thread 1, then thread 0 will eventually see the turn
as 0
and acquire the lock. Therefore, for thread 0 to be stuck waiting the
last write to turn
must have been by thread 0, which
implies thread 1 will eventually get the lock.
Therefore, Dmitriy's implementation works.
Differences, and conclusion
The key difference between the implementations other than the
naming of the variables is which variable the exchange
operation is applied to. In Bartosz's implementation, the
exchange
is applied to _interested[me]
,
which is only ever written by one thread for a given value of
me
. In Dmitriy's implementation, the
exchange
is applied to the turn
variable,
which is the variable updated by both threads. It therefore acts as a
synchronization point for the threads. This is the key to the whole
algorithm — even though many of the operations in Dmitriy's
implementation use std::memory_order_relaxed
, whereas
Bartosz's implementation uses std::memory_order_acquire
and std::memory_order_release
everywhere, the single
std::memory_order_acq_rel
on the exchange
on
the right variable is enough.
I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, C++0x, atomics, memory model, synchronization
Stumble It! | Submit to Reddit
| Submit to DZone
If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.
Rvalue References and Perfect Forwarding in C++0x
Wednesday, 03 December 2008
One of the new features
in C++0x
is the rvalue reference. Whereas the a "normal" lvalue
reference is declared with a single ampersand &
,
an rvalue reference is declared with two
ampersands: &&
. The key difference is of course
that an rvalue reference can bind to an rvalue, whereas a
non-const
lvalue reference cannot. This is primarily used
to support move semantics for expensive-to-copy objects:
class X { std::vector<double> data; public: X(): data(100000) // lots of data {} X(X const& other): // copy constructor data(other.data) // duplicate all that data {} X(X&& other): // move constructor data(std::move(other.data)) // move the data: no copies {} X& operator=(X const& other) // copy-assignment { data=other.data; // copy all the data return *this; } X& operator=(X && other) // move-assignment { data=std::move(other.data); // move the data: no copies return *this; } }; X make_x(); // build an X with some data int main() { X x1; X x2(x1); // copy X x3(std::move(x1)); // move: x1 no longer has any data x1=make_x(); // return value is an rvalue, so move rather than copy }
Though move semantics are powerful, rvalue references offer more than that.
Perfect Forwarding
When you combine rvalue references with function templates you get an interesting interaction: if the type of a function parameter is an rvalue reference to a template type parameter then the type parameter is deduce to be an lvalue reference if an lvalue is passed, and a plain type otherwise. This sounds complicated, so lets look at an example:
template<typename T> void f(T&& t); int main() { X x; f(x); // 1 f(X()); // 2 }
The function template f
meets our criterion above, so
in the call f(x)
at the line marked "1", the template
parameter T
is deduced to be X&
,
whereas in the line marked "2", the supplied parameter is an rvalue
(because it's a temporary), so T
is deduced to
be X
.
Why is this useful? Well, it means that a function template can
pass its arguments through to another function whilst retaining the
lvalue/rvalue nature of the function arguments by
using std::forward
. This is called "perfect
forwarding", avoids excessive copying, and avoids the template
author having to write multiple overloads for lvalue and rvalue
references. Let's look at an example:
void g(X&& t); // A void g(X& t); // B template<typename T> void f(T&& t) { g(std::forward<T>(t)); } void h(X&& t) { g(t); } int main() { X x; f(x); // 1 f(X()); // 2 h(x); h(X()); // 3 }
This time our function f
forwards its argument to a
function g
which is overloaded for lvalue and rvalue
references to an X
object. g
will
therefore accept lvalues and rvalues alike, but overload resolution
will bind to a different function in each case.
At line "1", we pass a named X
object
to f
, so T
is deduced to be an lvalue
reference: X&
, as we saw above. When T
is an lvalue reference, std::forward<T>
is a
no-op: it just returns its argument. We therefore call the overload
of g
that takes an lvalue reference (line B).
At line "2", we pass a temporary to f
,
so T
is just plain X
. In this
case, std::forward<T>(t)
is equivalent
to static_cast<T&&>(t)
: it ensures that
the argument is forwarded as an rvalue reference. This means that
the overload of g
that takes an rvalue reference is
selected (line A).
This is called perfect forwarding because the same
overload of g
is selected as if the same argument was
supplied to g
directly. It is essential for library
features such as std::function
and std::thread
which pass arguments to another (user
supplied) function.
Note that this is unique to template functions: we can't do this
with a non-template function such as h
, since we don't
know whether the supplied argument is an lvalue or an rvalue. Within
a function that takes its arguments as rvalue references, the named
parameter is treated as an lvalue reference. Consequently the call
to g(t)
from h
always calls the lvalue
overload. If we changed the call
to g(std::forward<X>(t))
then it would always
call the rvalue-reference overload. The only way to do this with
"normal" functions is to create two overloads: one for lvalues and
one for rvalues.
Now imagine that we remove the overload of g
for
rvalue references (delete line A). Calling f
with an
rvalue (line 2) will now fail to compile because you can't
call g
with an rvalue. On the other hand, our call
to h
with an rvalue (line 3) will still
compile however, since it always calls the lvalue-reference
overload of g
. This can lead to interesting problems
if g
stores the reference for later use.
Further Reading
For more information, I suggest reading the accepted rvalue reference paper and "A Brief Introduction to Rvalue References", as well as the current C++0x working draft.
Posted by Anthony Williams
[/ cplusplus /] permanent link
Tags: rvalue reference, cplusplus, C++0x, forwarding
Stumble It! | Submit to Reddit
| Submit to DZone
If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.
Design and Content Copyright © 2005-2025 Just Software Solutions Ltd. All rights reserved. | Privacy Policy