Just Software Solutions

Blog Archive for December 2008

Managing Threads with a Vector

Wednesday, 10 December 2008

One of the nice things about C++0x is the support for move semantics that comes from the new Rvalue Reference language feature. Since this is a language feature, it means that we can easily have types that are movable but not copyable without resorting to std::auto_ptr-like hackery. One such type is the new std::thread class. A thread of execution can only be associated with one std::thread object at a time, so std::thread is not copyable, but it is movable — this allows you to transfer ownership of a thread of execution between objects, and return std::thread objects from functions. The important point for today's blog post is that it allows you to store std::thread objects in containers.
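
For example, here is a minimal sketch of transferring thread ownership (some_function is a placeholder for any callable):

#include <thread>
#include <utility>

void some_function() {} // placeholder: any callable will do

std::thread make_thread()
{
    std::thread t(some_function);
    return t; // moved out: ownership transfers to the caller
}

void example()
{
    std::thread t1=make_thread();  // t1 owns the new thread
    std::thread t2(std::move(t1)); // transfer ownership: t1 is now empty
    t2.join();
}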

Move-aware containers

The C++0x standard containers are required to be move-aware, and to move objects rather than copy them when changing their position within the container. For existing copyable types that don't have a dedicated move constructor or move-assignment operator taking an rvalue reference, this amounts to the same thing — when a std::vector is resized, or an element is inserted in the middle, the elements are copied to their new locations. The important difference is that you can now store types that have only a move constructor and move-assignment operator in the standard containers, because the objects are moved rather than copied.

This means that you can now write code like:

std::vector<std::thread> v;

v.push_back(std::thread(some_function));

and it all "just works". This is good news for managing multiple threads where the number of threads is not known until run-time — for example, if you're tuning the number of threads to the number of processors with std::thread::hardware_concurrency(). It also means that you can use std::vector<std::thread> with the standard library algorithms such as std::for_each:

void do_join(std::thread& t)
{
    t.join();
}

void join_all(std::vector<std::thread>& v)
{
    std::for_each(v.begin(),v.end(),do_join);
}

If you need an extra thread because one of your threads is blocked waiting for something, you can just use insert() or push_back() to add a new thread to the vector. Of course, you can also move threads into or out of the vector by indexing the elements directly:

std::vector<std::thread> v(std::thread::hardware_concurrency());

for(unsigned i=0;i<v.size();++i)
{
    v[i]=std::thread(do_work);
}

In fact, many of the examples in my book use std::vector<std::thread> for managing the threads, as it's the simplest way to do it.
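
Putting these pieces together, here is a complete sketch (do_work is a placeholder, and note that hardware_concurrency() may return 0 on some implementations):

#include <thread>
#include <vector>
#include <algorithm>

void do_work()
{
    // placeholder: the real work goes here
}

void do_join(std::thread& t)
{
    t.join();
}

int main()
{
    std::vector<std::thread> v(std::thread::hardware_concurrency());
    for(unsigned i=0;i<v.size();++i)
        v[i]=std::thread(do_work);
    std::for_each(v.begin(),v.end(),do_join);
}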

Other containers work too

It's not just std::vector that's required to be move-aware — all the other standard containers are too. This means you can have a std::list<std::thread>, or a std::deque<std::thread>, or even a std::map<int,std::thread>. In fact, the whole C++0x standard library is designed to work with move-only types such as std::thread.
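
For example, here is a small sketch of threads keyed by an id in a std::map (worker is a placeholder):

#include <thread>
#include <map>

void worker(int id)
{
    // placeholder: do something with id
}

void example()
{
    std::map<int,std::thread> threads;
    for(int i=0;i<4;++i)
        threads[i]=std::thread(worker,i); // move-assigned into the map
    for(std::map<int,std::thread>::iterator it=threads.begin();
        it!=threads.end();++it)
        it->second.join();
}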

Try it out today

Wouldn't it be nice if you could try it out today, and get used to using containers of std::thread objects without having to wait for a C++0x compiler? Well, you can — the 0.6 beta of the just::thread C++0x Thread Library released last Friday provides a specialization of std::vector<std::thread>, so you can write code like the examples above and it will work with Microsoft Visual Studio 2008. Sign up at the just::thread Support Forum to download it today.

Posted by Anthony Williams
[/ threading /] permanent link

Peterson's lock with C++0x atomics

Friday, 05 December 2008

Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:

"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."

I'd like to offer an analysis of both algorithms to see if they are correct, below. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:

"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."

This is something I wholeheartedly agree with. If you weren't convinced by my previous post on Memory Models and Synchronization, maybe the proof below will convince you to stick to memory_order_seq_cst (the default) unless you really need to do otherwise.

C++0x memory ordering recap

In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:

Synchronizes-with
An operation A synchronizes-with an operation B if:
  • A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst,
  • B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and
  • B reads the value stored by A.
Happens-before
An operation A happens-before an operation B if:
  • A is performed on the same thread as B, and A is before B in program order, or
  • A synchronizes-with B, or
  • A happens-before some other operation C, and C happens-before B.
There are a few more nuances to do with std::memory_order_consume, but this is enough for now.

If all your operations use std::memory_order_seq_cst, then there is the additional constraint of total ordering, as I mentioned before, but neither of the implementations in question use any std::memory_order_seq_cst operations, so we can leave that aside for now.
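
To make these relationships concrete, here is a minimal sketch (not taken from either implementation under discussion): the store-release to ready synchronizes-with the load-acquire that reads true, so the plain store to data happens-before the read in the consumer, and the assert cannot fire.

#include <atomic>
#include <thread>
#include <cassert>

int data=0;
std::atomic<bool> ready(false);

void producer()
{
    data=42;                                     // plain store
    ready.store(true,std::memory_order_release); // store-release
}

void consumer()
{
    while(!ready.load(std::memory_order_acquire)) // load-acquire
        ;                                         // spin until we see the store
    assert(data==42); // guaranteed by the happens-before relationship
}

int main()
{
    std::thread t1(producer),t2(consumer);
    t1.join();
    t2.join();
}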

Now, let's look at the implementations.

Bartosz's implementation

I've extracted the code for Bartosz's implementation from his posts, and it is shown below:

class Peterson_Bartosz
{
private:
    // indexed by thread ID, 0 or 1
    std::atomic<bool> _interested[2];
    // who's yielding priority?
    std::atomic<int> _victim;
public:
    Peterson_Bartosz()
    {
       _victim.store(0, std::memory_order_release);
       _interested[0].store(false, std::memory_order_release);
       _interested[1].store(false, std::memory_order_release);
    }
    void lock()
    {
       int me = threadID; // either 0 or 1
       int he = 1 - me; // the other thread
       _interested[me].exchange(true, std::memory_order_acq_rel);
       _victim.store(me, std::memory_order_release);
       while (_interested[he].load(std::memory_order_acquire)
           && _victim.load(std::memory_order_acquire) == me)
          continue; // spin
    }
    void unlock()
    {
        int me = threadID;
        _interested[me].store(false,std::memory_order_release);
    }
};

There are three things to prove with Peterson's lock:

  • If thread 0 successfully acquires the lock, then thread 1 will not do so;
  • If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
  • If thread 0 fails to acquire the lock, then thread 1 does so.

Let's look at each in turn.

If thread 0 successfully acquires the lock, then thread 1 will not do so

Initially _victim is 0, and the _interested variables are both false. The call to lock() from thread 0 will then set _interested[0] to true, and _victim to 0.

The loop then checks _interested[1], which is still false, so we break out of the loop, and the lock is acquired.

So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets _interested[1] to true, and _victim to 1, and then enters the while loop. This is where the fun begins.

The first thing we check is _interested[0]. Now, we know this was set to true in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

For it to be guaranteed by the memory model, we have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. This is trivially true if we read true in thread 1, but that doesn't help: we need to prove that we can't read false. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty: _interested[1] is loaded by thread 1 as part of the exchange call, but it is not written by thread 0, and _victim is written by thread 1 without reading the value stored by thread 0. Consequently, there is no ordering guarantee on the read of _interested[0], and thread 1 may also break out of the while loop and acquire the lock.

This implementation is thus broken. Let's now look at Dmitriy's implementation.
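
As an aside (this is not from Bartosz's post): one way to make this version correct is to fall back to the default std::memory_order_seq_cst ordering on every operation. The resulting total order restores the textbook proof of Peterson's algorithm, at the cost of stronger ordering than strictly necessary:

    void lock()
    {
        int me = threadID; // either 0 or 1
        int he = 1 - me;   // the other thread
        _interested[me].store(true); // std::memory_order_seq_cst by default
        _victim.store(me);
        while (_interested[he].load() && _victim.load() == me)
            continue; // spin
    }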

Dmitriy's implementation

Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:

std::atomic<int> flag0(0),flag1(0),turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed);
        turn.exchange(1, std::memory_order_acq_rel);

        while (flag1.load(std::memory_order_acquire)
            && 1 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed);
        turn.exchange(0, std::memory_order_acq_rel);

        while (flag0.load(std::memory_order_acquire)
            && 0 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(0, std::memory_order_release);
    }
    else
    {
        flag1.store(0, std::memory_order_release);
    }
}

So, how does this code fare?

If thread 0 successfully acquires the lock, then thread 1 will not do so

Initially the turn, flag0 and flag1 variables are all 0. The call to lock() from thread 0 will then set flag0 to 1, and turn to 1. These variables are essentially equivalent to the variables in Bartosz's implementation, but turn is set to 0 when _victim is set to 1, and vice-versa. That doesn't affect the logic of the code.

The loop then checks flag1, which is still 0, so we break out of the loop, and the lock is acquired.

So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets flag1 to 1, and turn to 0, and then enters the while loop. This is where the fun begins.

As before, the first thing we check is flag0. Now, we know this was set to 1 in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

Again, for it to be guaranteed by the memory model, we have to prove that the store to flag0 from thread 0 happens-before the load from thread 1. This is trivially true if we read 1 in thread 1, but that doesn't help: we need to prove that we can't read 0. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, as before.

This time our search is successful: turn is set using an exchange operation, which is a read-modify-write operation. Since it uses std::memory_order_acq_rel memory ordering, it is both a load-acquire and a store-release. If the load part of the exchange reads the value written by thread 0, we're home dry: turn is stored with a similar exchange operation with std::memory_order_acq_rel in thread 0, so the store from thread 0 synchronizes-with the load from thread 1.

This means that the store to flag0 from thread 0 happens-before the exchange on turn in thread 1, and thus happens-before the load in the while loop. The load in the while loop thus reads 1 from flag0, and proceeds to check turn.

Now, since the store to turn from thread 0 happens-before the store from thread 1 (we're relying on that for the happens-before relationship on flag0, remember), we know that the value to be read will be the value we stored in thread 1: 0. Consequently, we keep looping.

OK, so if the exchange on turn in thread 1 reads the value stored by thread 0 then thread 1 will stay out of the lock, but what if it doesn't read the value stored by thread 0? In this case, we know that the exchange call from thread 0 must have seen the value written by the exchange in thread 1 (writes to a single atomic variable always become visible in the same order for all threads), which means that the write to flag1 from thread 1 happens-before the read in thread 0, and so thread 0 cannot have acquired the lock. Since this contradicts our initial assumption (thread 0 has acquired the lock), we're home dry — thread 1 can only acquire the lock if thread 0 didn't.

If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock

OK, so we've got as far as thread 0 acquiring the lock and thread 1 waiting. What happens if thread 0 now releases the lock? It does this simply by writing 0 to flag0. The while loop in thread 1 checks flag0 every time round, and breaks out if the value read is 0. Therefore, thread 1 will eventually acquire the mutex. Of course, there is no guarantee when it will acquire the mutex — it might take arbitrarily long for the write to flag0 to make its way to thread 1, but it will get there in the end. Since flag0 is never written by thread 1, it doesn't matter whether thread 0 has already released the lock when thread 1 starts waiting, or whether thread 1 is already waiting — the while loop will still terminate, and thread 1 will acquire the lock in both cases.

That just leaves our final check.

If thread 0 fails to acquire the lock, then thread 1 does so

We've essentially already covered this when we checked that thread 1 doesn't acquire the lock if thread 0 does, but this time we're going in reverse. If thread 0 doesn't acquire the lock, it is because it sees flag1 as 1 and turn as 1. Since flag1 is only written by thread 1, if it is 1 then thread 1 must have at least called lock(). If thread 1 has called unlock() then eventually flag1 will be read as 0, so thread 0 will acquire the lock. So, let's assume for now that thread 1 hasn't got that far, so flag1 is still 1. The next check is for turn to be 1. This is the value written by thread 0. If we read it as 1 then either the write to turn from thread 1 has not yet become visible to thread 0, or the write from thread 1 happens-before the write from thread 0, so the write from thread 0 overwrote the old value.

If the write from thread 1 happens-before the write from thread 0 then thread 1 will eventually see turn as 1 (since the last write is by thread 0), and thus thread 1 will acquire the lock. On the other hand, if the write to turn from thread 0 happens-before the write to turn from thread 1, then thread 0 will eventually see turn as 0 and acquire the lock. Therefore, for thread 0 to be stuck waiting, the last write to turn must have been by thread 0, which implies that thread 1 will eventually get the lock.

Therefore, Dmitriy's implementation works.

Differences, and conclusion

The key difference between the implementations, other than the naming of the variables, is which variable the exchange operation is applied to. In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is updated by both threads. It therefore acts as a synchronization point for the threads. This is the key to the whole algorithm — even though many of the operations in Dmitriy's implementation use std::memory_order_relaxed, whereas Bartosz's implementation uses std::memory_order_acquire and std::memory_order_release everywhere, the single std::memory_order_acq_rel exchange on the right variable is enough.

I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:

"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."

Posted by Anthony Williams
[/ threading /] permanent link

Rvalue References and Perfect Forwarding in C++0x

Wednesday, 03 December 2008

One of the new features in C++0x is the rvalue reference. Whereas a "normal" lvalue reference is declared with a single ampersand &, an rvalue reference is declared with two ampersands: &&. The key difference is of course that an rvalue reference can bind to an rvalue, whereas a non-const lvalue reference cannot. This is primarily used to support move semantics for expensive-to-copy objects:

class X
{
    std::vector<double> data;
public:
    X():
        data(100000) // lots of data
    {}
    X(X const& other): // copy constructor
        data(other.data)   // duplicate all that data
    {}

    X(X&& other):  // move constructor
        data(std::move(other.data)) // move the data: no copies
    {}

    X& operator=(X const& other) // copy-assignment
    {
        data=other.data; // copy all the data
        return *this;
    }

    X& operator=(X && other) // move-assignment
    {
        data=std::move(other.data); // move the data: no copies
        return *this;
    }

};

X make_x(); // build an X with some data

int main()
{
    X x1;
    X x2(x1); // copy
    X x3(std::move(x1)); // move: x1 no longer has any data

    x1=make_x(); // return value is an rvalue, so move rather than copy
}

Though move semantics are powerful, rvalue references offer more than that.

Perfect Forwarding

When you combine rvalue references with function templates you get an interesting interaction: if the type of a function parameter is an rvalue reference to a template type parameter, then the type parameter is deduced to be an lvalue reference if an lvalue is passed, and a plain type otherwise. This sounds complicated, so let's look at an example:

template<typename T>
void f(T&& t);

int main()
{
    X x;
    f(x);   // 1
    f(X()); // 2
}

The function template f meets our criterion above, so in the call f(x) at the line marked "1", the template parameter T is deduced to be X&, whereas in the line marked "2", the supplied parameter is an rvalue (because it's a temporary), so T is deduced to be X.

Why is this useful? Well, it means that a function template can pass its arguments through to another function whilst retaining the lvalue/rvalue nature of the function arguments by using std::forward. This is called "perfect forwarding": it avoids excessive copying, and it saves the template author from having to write separate overloads for lvalue and rvalue references. Let's look at an example:

void g(X&& t); // A
void g(X& t);  // B

template<typename T>
void f(T&& t)
{
    g(std::forward<T>(t));
}

void h(X&& t)
{
    g(t);
}

int main()
{
    X x;
    f(x);   // 1
    f(X()); // 2
    h(x);
    h(X()); // 3
}

This time our function f forwards its argument to a function g which is overloaded for lvalue and rvalue references to an X object. g will therefore accept lvalues and rvalues alike, but overload resolution will bind to a different function in each case.

At line "1", we pass a named X object to f, so T is deduced to be an lvalue reference: X&, as we saw above. When T is an lvalue reference, std::forward<T> is a no-op: it just returns its argument. We therefore call the overload of g that takes an lvalue reference (line B).

At line "2", we pass a temporary to f, so T is just plain X. In this case, std::forward<T>(t) is equivalent to static_cast<T&&>(t): it ensures that the argument is forwarded as an rvalue reference. This means that the overload of g that takes an rvalue reference is selected (line A).

This is called perfect forwarding because the same overload of g is selected as if the same argument had been supplied to g directly. It is essential for library features such as std::function and std::thread, which pass arguments through to another (user-supplied) function.
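
As a sketch of how such a facility might pass its arguments through (a hypothetical helper, using the variadic templates from the C++0x working draft):

#include <utility>

template<typename F,typename... Args>
void call_with(F f,Args&&... args)
{
    f(std::forward<Args>(args)...); // each argument keeps its lvalue/rvalue-ness
}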

Note that this is unique to template functions: we can't do this with a non-template function such as h, since we don't know whether the supplied argument is an lvalue or an rvalue. Within a function that takes its arguments as rvalue references, the named parameter is treated as an lvalue reference. Consequently the call to g(t) from h always calls the lvalue overload. If we changed the call to g(std::forward<X>(t)) then it would always call the rvalue-reference overload. The only way to do this with "normal" functions is to create two overloads: one for lvalues and one for rvalues.
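
For example, a sketch of the two-overload approach for h might look like this (using the X and g declarations from above; std::move lives in <utility>):

void h(X& t)  // lvalue overload
{
    g(t);     // t is an lvalue, so this calls g(X&)
}

void h(X&& t) // rvalue overload
{
    g(std::move(t)); // t is a named parameter, so cast it back to an rvalue
}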

Now imagine that we remove the overload of g for rvalue references (delete line A). Calling f with an rvalue (line 2) will now fail to compile, because you can't call g with an rvalue. On the other hand, our call to h with an rvalue (line 3) will still compile, since it always calls the lvalue-reference overload of g. This can lead to interesting problems if g stores the reference for later use.

Further Reading

For more information, I suggest reading the accepted rvalue reference paper and "A Brief Introduction to Rvalue References", as well as the current C++0x working draft.

Posted by Anthony Williams
[/ cplusplus /] permanent link
