Peterson's lock with C++0x atomics
Friday, 05 December 2008
Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:
"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."
I'd like to offer an analysis of both algorithms below, to see whether they are correct. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
This is something I wholeheartedly agree with. If you weren't convinced by my previous post on Memory Models and Synchronization, maybe the proof below will convince you to stick to memory_order_seq_cst (the default) unless you really need to do otherwise.
C++0x memory ordering recap
In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:
- Synchronizes-with
- An operation A synchronizes-with an operation B if A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst, B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and B reads the value stored by A.
- Happens-before
- An operation A happens-before an operation B if:
- A is performed on the same thread as B, and A is before B in program order, or
- A synchronizes-with B, or
- A happens-before some other operation C, and C happens-before B.
There are more relationships, such as those involving std::memory_order_consume, but this is enough for now.
If all your operations use std::memory_order_seq_cst, then there is the additional constraint of total ordering, as I mentioned before, but neither of the implementations in question uses any std::memory_order_seq_cst operations, so we can leave that aside for now.
Now, let's look at the implementations.
Bartosz's implementation
I've extracted the code for Bartosz's implementation from his posts, and it is shown below:
```cpp
class Peterson_Bartosz
{
private:
    // indexed by thread ID, 0 or 1
    std::atomic<bool> _interested[2];
    // who's yielding priority?
    std::atomic<int> _victim;
public:
    Peterson_Bartosz()
    {
        _victim.store(0, std::memory_order_release);
        _interested[0].store(false, std::memory_order_release);
        _interested[1].store(false, std::memory_order_release);
    }
    void lock()
    {
        int me = threadID;  // either 0 or 1
        int he = 1 - me;    // the other thread
        _interested[me].exchange(true, std::memory_order_acq_rel);
        _victim.store(me, std::memory_order_release);
        while (_interested[he].load(std::memory_order_acquire)
               && _victim.load(std::memory_order_acquire) == me)
            continue;  // spin
    }
    void unlock()
    {
        int me = threadID;
        _interested[me].store(false, std::memory_order_release);
    }
};
```
There are three things to prove with Peterson's lock:
- If thread 0 successfully acquires the lock, then thread 1 will not do so;
- If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
- If thread 0 fails to acquire the lock, then thread 1 does so.
Let's look at each in turn.
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially _victim is 0, and the _interested variables are both false. The call to lock() from thread 0 will then set _interested[0] to true, and _victim to 0.

The loop then checks _interested[1], which is still false, so we break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets _interested[1] to true, and _victim to 1, and then enters the while loop. This is where the fun begins.
The first thing we check is _interested[0]. Now, we know this was set to true in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

For it to be guaranteed by the memory model, we have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. This is trivially true if we read true in thread 1, but that doesn't help: we need to prove that we can't read false. We therefore need to find another variable which was stored by thread 0 and loaded by thread 1, and our search comes up empty: _interested[1] is loaded by thread 1 as part of the exchange call, but it is not written by thread 0, and _victim is written by thread 1 without reading the value stored by thread 0. Consequently, there is no ordering guarantee on the read of _interested[0], and thread 1 may also break out of the while loop and acquire the lock.
This implementation is thus broken. Let's now look at Dmitriy's implementation.
Dmitriy's implementation
Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:
```cpp
std::atomic<int> flag0(0), flag1(0), turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed);
        turn.exchange(1, std::memory_order_acq_rel);
        while (flag1.load(std::memory_order_acquire)
               && 1 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed);
        turn.exchange(0, std::memory_order_acq_rel);
        while (flag0.load(std::memory_order_acquire)
               && 0 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
        flag0.store(0, std::memory_order_release);
    else
        flag1.store(0, std::memory_order_release);
}
```
So, how does this code fare?
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially the turn, flag0 and flag1 variables are all 0. The call to lock() from thread 0 will then set flag0 to 1, and turn to 1. These variables are essentially equivalent to the variables in Bartosz's implementation, but turn is set to 0 when _victim is set to 1, and vice-versa. That doesn't affect the logic of the code.

The loop then checks flag1, which is still 0, so we break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets flag1 to 1, and turn to 0, and then enters the while loop. This is where the fun begins.
As before, the first thing we check is flag0. Now, we know this was set to 1 in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?
Again, for it to be guaranteed by the memory model, we have to prove that the store to flag0 from thread 0 happens-before the load from thread 1. This is trivially true if we read 1 in thread 1, but that doesn't help: we need to prove that we can't read 0. We therefore need to find another variable which was stored by thread 0 and loaded by thread 1, as before.

This time our search is successful: turn is set using an exchange operation, which is a read-modify-write operation. Since it uses std::memory_order_acq_rel memory ordering, it is both a load-acquire and a store-release. If the load part of the exchange reads the value written by thread 0, we're home dry: turn is stored with a similar exchange operation with std::memory_order_acq_rel in thread 0, so the store from thread 0 synchronizes-with the load from thread 1.
This means that the store to flag0 from thread 0 happens-before the exchange on turn in thread 1, and thus happens-before the load in the while loop. The load in the while loop thus reads 1 from flag0, and proceeds to check turn.
Now, since the store to turn from thread 0 happens-before the store from thread 1 (we're relying on that for the happens-before relationship on flag0, remember), we know that the value to be read will be the value we stored in thread 1: 0. Consequently, we keep looping.
OK, so if the exchange on turn in thread 1 reads the value stored by thread 0 then thread 1 will stay out of the lock, but what if it doesn't read the value stored by thread 0? In this case, we know that the exchange call from thread 0 must have seen the value written by the exchange in thread 1 (writes to a single atomic variable always become visible in the same order for all threads), which means that the write to flag1 from thread 1 happens-before the read in thread 0, and so thread 0 cannot have acquired the lock. Since this was our initial assumption (thread 0 has acquired the lock), we're home dry: thread 1 can only acquire the lock if thread 0 didn't.
If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock
OK, so we've got as far as thread 0 acquiring the lock and thread 1 waiting. What happens if thread 0 now releases the lock? It does this simply by writing 0 to flag0. The while loop in thread 1 checks flag0 every time round, and breaks out if the value read is 0. Therefore, thread 1 will eventually acquire the mutex. Of course, there is no guarantee when it will acquire the mutex: it might take arbitrarily long for the write to flag0 to make its way to thread 1, but it will get there in the end. Since flag0 is never written by thread 1, it doesn't matter whether thread 0 has already released the lock when thread 1 starts waiting, or whether thread 1 is already waiting: the while loop will still terminate, and thread 1 will acquire the lock in both cases.
That just leaves our final check.
If thread 0 fails to acquire the lock, then thread 1 does so
We've essentially already covered this when we checked that thread 1 doesn't acquire the lock if thread 0 does, but this time we're going in reverse. If thread 0 doesn't acquire the lock, it is because it sees flag1 as 1 and turn as 1. Since flag1 is only written by thread 1, if it is 1 then thread 1 must have at least called lock(). If thread 1 has called unlock() then eventually flag1 will be read as 0, so thread 0 will acquire the lock. So, let's assume for now that thread 1 hasn't got that far, so flag1 is still 1. The next check is for turn to be 1. This is the value written by thread 0. If we read it as 1 then either the write to turn from thread 1 has not yet become visible to thread 0, or the write happens-before the write by thread 0, so the write from thread 0 overwrote the old value.

If the write from thread 1 happens-before the write from thread 0 then thread 1 will eventually see turn as 1 (since the last write is by thread 0), and thus thread 1 will acquire the lock. On the other hand, if the write to turn from thread 0 happens-before the write to turn from thread 1, then thread 0 will eventually see turn as 0 and acquire the lock. Therefore, for thread 0 to be stuck waiting, the last write to turn must have been by thread 0, which implies thread 1 will eventually get the lock.
Therefore, Dmitriy's implementation works.
Differences, and conclusion
The key difference between the implementations, other than the naming of the variables, is which variable the exchange operation is applied to. In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is the variable updated by both threads. It therefore acts as a synchronization point for the threads. This is the key to the whole algorithm: even though many of the operations in Dmitriy's implementation use std::memory_order_relaxed, whereas Bartosz's implementation uses std::memory_order_acquire and std::memory_order_release everywhere, the single std::memory_order_acq_rel on the exchange on the right variable is enough.
I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, C++0x, atomics, memory model, synchronization
Design and Content Copyright © 2005-2024 Just Software Solutions Ltd. All rights reserved. | Privacy Policy
20 Comments
IMO Dmitriy's implementation cannot be called Peterson's lock. If you have atomic exchange, you could implement a usual spinlock without cross-busy flags. The point of the algorithm is to use plain loads and stores to achieve the synchronization.
I see how to convert the fully sequentially consistent algorithm into e.g. something which uses some relaxed ops and one seq_cst fence in the locking path, but it is interesting whether it is possible to have the real Peterson's lock without seq_cst, using a more relaxed model.
Hello Anthony,
I have one question regarding Bartosz's implementation. Could you please point the mistake in my logic?
> We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty

But why can't this be _interested[0] itself? There is a store to _interested[0] in thread 0 and a load from _interested[0] in thread 1.
An operation A happens-before an operation B if A synchronizes-with B. An operation A synchronizes-with an operation B if A is a store to some atomic variable m, with an ordering of std::memory_order_release, B is a load from the same variable m, with an ordering of std::memory_order_acquire, and B reads the value stored by A.
So A (store to _interested[0] in thread 0) happens before B (load from _interested[0] in thread 1).
If I try to think in low level terms I don't understand either. Load with acquire semantics generates memory fence, which is supposed to invalidate all cache lines in CPU-1, while memory fence, generated by store with release semantics in CPU-0 is supposed to empty store buffer.
How can it read the old value?
Thanks in advance.
Hi, very appreciate your post! I have two questions.
1. I think in Dmitriy's implementation, the acquire/release on flag0 and flag1 are unnecessary, right?
2. On x64, will memory_order_acq_rel turn into an "mfence" instruction? I compiled and disassembled, but didn't find it. Why?
Thanks a lot
Sorry, didn't notice David's comment and your answer.
Your answer is
"Synchronizes with" does not affect ordering on a single variable. If operations on one variable synchronize-with each other, that imposes orderings on accesses to *other* variables.
But you said, that An operation A happens-before an operation B if ... A synchronizes-with B. So this is not "Synchronizes with", this is happens before. I get something wrong, but cannot understand what exactly..
I understand that as: if A synchronizes-with B then A happens-before B. So in this specific situation synchronizes-with and happens-before are the same thing. Is that right?

Hi Derek,
1. The acquire/release on the flag0 and flag1 variables are necessary to ensure that it acts as a lock: the release store in the unlock synchronizes with the acquire-load in the next lock, to ensure that the data modified while the lock was held is now visible to the second thread.
2. What memory_order_acq_rel does depends on the operation. In this case, it is applied to turn.exchange, which maps to an XCHG instruction on x64. This already has memory_order_acq_rel semantics in the hardware, so the memory ordering is only necessary to inform the compiler, and prevent it moving operations around in a way that would violate the constraint.
Got it. C++ Concurrency in Action explains that relations in great details. Thanks very much.
very appreciate your reply! very clear! I have written a peterson lock in C and have some question about data modification in lock visibility. seems this webpage doesn't support a too long comments, so I asked in stackoverflow. please take a look : )
http://stackoverflow.com/questions/43444676/peterson-algorithm-about-data-modification-in-lock-visible-to-second-thread-with
very appreciate! Thanks again !
You write: "We therefore need to find a variable which was stored by thread 0, and loaded by thread 1...". I think that sentence would be a LOT clearer if you replaced the phrase "a variable" with the phrase "another variable". Is that right?
I think what you're saying is that if we want to prove that Thread 0's store(A) will be visible during Thread 1's load(A), we need to find some *other* variable B != A such that a store_and_release(B) appears after store(A) in Thread 0 and also a load_and_acquire(B) appears before load(A) in Thread 1. And even then, we can't say that the load will definitely reflect the store; all we can say is that *if* Thread 0 has executed the store_and_release(B) before Thread 1 gets to its load_and_acquire(B), then Thread 1's load(A) will be able to see Thread 0's store(A).
But I admit that I still get confused whenever I try to think through an example independently. I have to keep reminding myself that the issue is not whether store(A) *actually happens* before load(A). We're assuming that the store has in fact happened. What matters is whether Thread 1 *knows* that the store has happened. And that depends on a lot of other loads and stores.
Hi Anthony, thanks very much for a great article.
Regarding the concluding statement "the single std::memory_order_acq_rel on the exchange on the right variable is enough": for x86_64, would std::memory_order_relaxed be sufficient here? I ask, as x86_64's xchg is an automatically locked, serializing instruction (I believe), which means that the call to exchange will always be a sync point?
@Carl: The memory ordering constraints do more than affect the choice of instruction: they limit compiler optimizations too. Relaxed instructions can be moved around by the compiler relative to other operations, whereas acquire-release operations cannot. If you want acquire-release, say so.
Thanks Anthony... I really like your work!
If I may clarify, is
"turn.exchange(0, std::memory_order_acq_rel);"
sensibly equivalent to :(imagine no preemption so atomicity is conserved)
"var localTurn = turn.load(std::memory_order_acq); turn.store(1, std::memory_order_rel);"
or do CAS operations have a more "special" property (as hinted by the paragraph I quoted)?

Hello Anthony,
I'm a beginner in memory ordering and I'm trying to understand your proof. I hope my questions are not too stupid.
The part I don't get in your handling of Bartosz's implementation is "We have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty".
If I understand correctly, "happens-before" means "A synchronizes-with B" in that particular case. This means (1) - A is a store with an ordering of std::memory_order_release, and (2) - B is a load with an ordering of std::memory_order_acquire
Clearly in thread 0 there is a store to _interested[0] : _interested[0].exchange(true, std::memory_order_acq_rel);
If memory_order_acq_rel has both acquire and release semantic, then I believe it matches (1).
Then in thread 1, there is a load from _interested[0] in the while loop : while (_interested[0].load(std::memory_order_acquire) ...
Which seems to match (2).
So can you tell me where my reasoning is wrong and why I should discard _interested[0] as a synchronization variable ?
Then in conclusion you explain : "In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is the variable updated by both threads. It therefore acts as a synchronization point for the threads."
This seem to imply that in order to "synchronizes-with", a load and a store is not enough, you have to perform 2 stores ?
I'm confused, please help me understand.
David
David:
You are right that the store does synchronize-with the load *if the load sees the value stored*.
However, the problem is that the load *might not see the store* unless there is some *other* cause for synchronization. Even though thread 0 has stored true to _interested[0], thread 1 might still read false due to the vagaries of caching and the lack of explicit memory ordering. If thread 1 reads false then it breaks out of the while loop, potentially prematurely.
"Synchronizes with" does not affect ordering on a single variable. If operations on one variable synchronize-with each other, that imposes orderings on accesses to *other* variables.
In Dmitriy's implementation, would it still be correct to use 'turn.store(1)' instead of 'turn.exchange(1, std::memory_order_acq_rel)'?
No, it would not. The exchange is part of the synchronization mechanism: it ensures that the other thread sees the store to the relevant flag value if it sees the store to turn.