Peterson's lock with C++0x atomics
Friday, 05 December 2008
Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:
"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."
I'd like to offer an analysis of both algorithms below, to see whether they are correct. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
This is something I wholeheartedly agree with. If you weren't convinced by my previous post on Memory Models and Synchronization, maybe the proof below will convince you to stick to memory_order_seq_cst (the default) unless you really need to do otherwise.
C++0x memory ordering recap
In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:
- Synchronizes-with
- An operation A synchronizes-with an operation B if A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst, B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and B reads the value stored by A.
- Happens-before
- An operation A happens-before an operation B if:
- A is performed on the same thread as B, and A is before B in program order, or
- A synchronizes-with B, or
- A happens-before some other operation C, and C happens-before B.
There are more relationships, such as those involving std::memory_order_consume, but this is enough for now.
If all your operations use std::memory_order_seq_cst, then there is the additional constraint of total ordering, as I mentioned before, but neither of the implementations in question uses any std::memory_order_seq_cst operations, so we can leave that aside for now.
Now, let's look at the implementations.
Bartosz's implementation
I've extracted the code for Bartosz's implementation from his posts, and it is shown below:
```cpp
class Peterson_Bartosz
{
private:
    // indexed by thread ID, 0 or 1
    std::atomic<bool> _interested[2];
    // who's yielding priority?
    std::atomic<int> _victim;
public:
    Peterson_Bartosz()
    {
        _victim.store(0, std::memory_order_release);
        _interested[0].store(false, std::memory_order_release);
        _interested[1].store(false, std::memory_order_release);
    }
    void lock()
    {
        int me = threadID;  // either 0 or 1
        int he = 1 - me;    // the other thread
        _interested[me].exchange(true, std::memory_order_acq_rel);
        _victim.store(me, std::memory_order_release);
        while (_interested[he].load(std::memory_order_acquire)
               && _victim.load(std::memory_order_acquire) == me)
            continue;  // spin
    }
    void unlock()
    {
        int me = threadID;
        _interested[me].store(false, std::memory_order_release);
    }
};
```
There are three things to prove with Peterson's lock:
- If thread 0 successfully acquires the lock, then thread 1 will not do so;
- If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
- If thread 0 fails to acquire the lock, then thread 1 does so.
Let's look at each in turn.
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially _victim is 0, and the _interested variables are both false. The call to lock() from thread 0 will then set _interested[0] to true, and _victim to 0.

The loop then checks _interested[1], which is still false, so we break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets _interested[1] to true, and _victim to 1, and then enters the while loop. This is where the fun begins.
The first thing we check is _interested[0]. Now, we know this was set to true in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

For it to be guaranteed by the memory model, we have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. This is trivially true if we read true in thread 1, but that doesn't help: we need to prove that we can't read false. We therefore need to find another variable which was stored by thread 0 and loaded by thread 1, and our search comes up empty: _interested[1] is loaded by thread 1 as part of the exchange call, but it is not written by thread 0, and _victim is written by thread 1 without reading the value stored by thread 0. Consequently, there is no ordering guarantee on the read of _interested[0], and thread 1 may also break out of the while loop and acquire the lock.
This implementation is thus broken. Let's now look at Dmitriy's implementation.
Dmitriy's implementation
Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:
```cpp
std::atomic<int> flag0(0), flag1(0), turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed);
        turn.exchange(1, std::memory_order_acq_rel);
        while (flag1.load(std::memory_order_acquire)
               && 1 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed);
        turn.exchange(0, std::memory_order_acq_rel);
        while (flag0.load(std::memory_order_acquire)
               && 0 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
        flag0.store(0, std::memory_order_release);
    else
        flag1.store(0, std::memory_order_release);
}
```
So, how does this code fare?
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially the turn, flag0 and flag1 variables are all 0. The call to lock() from thread 0 will then set flag0 to 1, and turn to 1. These variables are essentially equivalent to the variables in Bartosz's implementation, but turn is set to 0 when _victim is set to 1, and vice-versa. That doesn't affect the logic of the code.

The loop then checks flag1, which is still 0, so we break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets flag1 to 1, and turn to 0, and then enters the while loop. This is where the fun begins.
As before, the first thing we check is flag0. Now, we know this was set to 1 in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?
Again, for it to be guaranteed by the memory model, we have to prove that the store to flag0 from thread 0 happens-before the load from thread 1. This is trivially true if we read 1 in thread 1, but that doesn't help: we need to prove that we can't read 0. We therefore need to find another variable which was stored by thread 0 and loaded by thread 1, as before.

This time our search is successful: turn is set using an exchange operation, which is a read-modify-write operation. Since it uses std::memory_order_acq_rel memory ordering, it is both a load-acquire and a store-release. If the load part of the exchange reads the value written by thread 0, we're home dry: turn is stored with a similar exchange operation with std::memory_order_acq_rel in thread 0, so the store from thread 0 synchronizes-with the load from thread 1.
This means that the store to flag0 from thread 0 happens-before the exchange on turn in thread 1, and thus happens-before the load in the while loop. The load in the while loop thus reads 1 from flag0, and proceeds to check turn.
Now, since the store to turn from thread 0 happens-before the store from thread 1 (we're relying on that for the happens-before relationship on flag0, remember), we know that the value to be read will be the value we stored in thread 1: 0. Consequently, we keep looping.
OK, so if the exchange on turn in thread 1 reads the value stored by thread 0 then thread 1 will stay out of the lock, but what if it doesn't read the value stored by thread 0? In this case, we know that the exchange call from thread 0 must have seen the value written by the exchange in thread 1 (writes to a single atomic variable always become visible in the same order for all threads), which means that the write to flag1 from thread 1 happens-before the read in thread 0, and so thread 0 cannot have acquired the lock. Since this was our initial assumption (thread 0 has acquired the lock), we're home dry: thread 1 can only acquire the lock if thread 0 didn't.
If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock
OK, so we've got as far as thread 0 acquiring the lock and thread 1 waiting. What happens if thread 0 now releases the lock? It does this simply by writing 0 to flag0. The while loop in thread 1 checks flag0 every time round, and breaks out if the value read is 0. Therefore, thread 1 will eventually acquire the mutex. Of course, there is no guarantee when it will acquire the mutex: it might take arbitrarily long for the write to flag0 to make its way to thread 1, but it will get there in the end. Since flag0 is never written by thread 1, it doesn't matter whether thread 0 has already released the lock when thread 1 starts waiting, or whether thread 1 is already waiting: the while loop will still terminate, and thread 1 will acquire the lock in both cases.
That just leaves our final check.
If thread 0 fails to acquire the lock, then thread 1 does so
We've essentially already covered this when we checked that thread 1 doesn't acquire the lock if thread 0 does, but this time we're going in reverse. If thread 0 doesn't acquire the lock, it is because it sees flag1 as 1 and turn as 1. Since flag1 is only written by thread 1, if it is 1 then thread 1 must have at least called lock(). If thread 1 has called unlock() then eventually flag1 will be read as 0, so thread 0 will acquire the lock. So, let's assume for now that thread 1 hasn't got that far, so flag1 is still 1. The next check is for turn to be 1. This is the value written by thread 0. If we read it as 1 then either the write to turn from thread 1 has not yet become visible to thread 0, or the write happens-before the write by thread 0, so the write from thread 0 overwrote the old value.

If the write from thread 1 happens-before the write from thread 0 then thread 1 will eventually see turn as 1 (since the last write is by thread 0), and thus thread 1 will acquire the lock. On the other hand, if the write to turn from thread 0 happens-before the write to turn from thread 1, then thread 0 will eventually see turn as 0 and acquire the lock. Therefore, for thread 0 to be stuck waiting, the last write to turn must have been by thread 0, which implies thread 1 will eventually get the lock.
Therefore, Dmitriy's implementation works.
Differences, and conclusion
The key difference between the implementations, other than the naming of the variables, is which variable the exchange operation is applied to. In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is the variable updated by both threads. It therefore acts as a synchronization point for the threads. This is the key to the whole algorithm: even though many of the operations in Dmitriy's implementation use std::memory_order_relaxed, whereas Bartosz's implementation uses std::memory_order_acquire and std::memory_order_release everywhere, the single std::memory_order_acq_rel on the exchange on the right variable is enough.
I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, C++0x, atomics, memory model, synchronization
Design and Content Copyright © 2005-2024 Just Software Solutions Ltd. All rights reserved. | Privacy Policy
20 Comments
IMO Dmitriy's implementation cannot be called Peterson's lock. If you have atomic exchange, you could implement a usual spinlock without cross-busy flags. The point of the algorithm is to use plain loads and stores to achieve the synchronization.
I see how to convert the fully sequentially consistent algorithm into e.g. something which uses some relaxed ops and one seq_cst fence in the locking path, but it is interesting whether it is possible to have the real Peterson's lock without seq_cst, using a more relaxed model.
Hello Anthony,
I have one question regarding Bartosz's implementation. Could you please point the mistake in my logic?
> We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty

But why can't this be _interested[0] itself? There is a store to _interested[0] in thread 0 and a load from _interested[0] in thread 1.
An operation A happens-before an operation B if A synchronizes-with B. An operation A synchronizes-with an operation B if A is a store to some atomic variable m, with an ordering of std::memory_order_release, B is a load from the same variable m, with an ordering of std::memory_order_acquire, and B reads the value stored by A.
So A (store to _interested[0] in thread 0) happens before B (load from _interested[0] in thread 1).
If I try to think in low level terms I don't understand either. Load with acquire semantics generates memory fence, which is supposed to invalidate all cache lines in CPU-1, while memory fence, generated by store with release semantics in CPU-0 is supposed to empty store buffer.
How can it read the old value?
Thanks in advance.
Hi, very appreciate your post! I have two questions.
1. I think in Dmitriy's implementation, the acquire/release on flag0 and flag1 are unnecessary, right?
2. On x64, will memory_order_acq_rel turn into an "mfence" instruction? I compiled and disassembled, but didn't find it. Why?
Thanks a lot
Sorry, didn't notice David's comment and your answer.
Your answer is
"Synchronizes with" does not affect ordering on a single variable. If operations on one variable synchronize-with each other, that imposes orderings on accesses to *other* variables.
But you said, that An operation A happens-before an operation B if ... A synchronizes-with B. So this is not "Synchronizes with", this is happens before. I get something wrong, but cannot understand what exactly..
I understand that as: if A synchronizes-with B then A happens-before B. So in this specific situation synchronizes-with and happens-before are the same thing. Is that right?

Hi Derek,
1. The acquire/release on the flag0 and flag1 variables are necessary to ensure that it acts as a lock: the release store in the unlock synchronizes with the acquire-load in the next lock, to ensure that the data modified while the lock was held is now visible to the second thread.
2. What memory_order_acq_rel does depends on the operation. In this case, it is applied to turn.exchange, which maps to an XCHG instruction on x64. This already has memory_order_acq_rel semantics in the hardware, so the memory ordering is only necessary to inform the compiler, and prevent it moving operations around in a way that would violate the constraint.
Got it. C++ Concurrency in Action explains that relations in great details. Thanks very much.
very appreciate your reply! very clear! I have written a peterson lock in C and have some question about data modification in lock visibility. seems this webpage doesn't support a too long comments, so I asked in stackoverflow. please take a look : )
http://stackoverflow.com/questions/43444676/peterson-algorithm-about-data-modification-in-lock-visible-to-second-thread-with
very appreciate! Thanks again !
You write: "We therefore need to find a variable which was stored by thread 0, and loaded by thread 1...". I think that sentence would be a LOT clearer if you replaced the phrase "a variable" with the phrase "another variable". Is that right?
I think what you're saying is that if we want to prove that Thread 0's store(A) will be visible during Thread 1's load(A), we need to find some *other* variable B != A such that a store_and_release(B) appears after store(A) in Thread 0 and also a load_and_acquire(B) appears before load(A) in Thread 1. And even then, we can't say that the load will definitely reflect the store; all we can say is that *if* Thread 0 has executed the store_and_release(B) before Thread 1 gets to its load_and_acquire(B), then Thread 1's load(A) will be able to see Thread 0's store(A).
But I admit that I still get confused whenever I try to think through an example independently. I have to keep reminding myself that the issue is not whether store(A) *actually happens* before load(A). We're assuming that the store has in fact happened. What matters is whether Thread 1 *knows* that the store has happened. And that depends on a lot of other loads and stores.
Hi Anthony, thanks very much for a great article.
Regarding the concluding statement "the single std::memory_order_acq_rel on the exchange on the right variable is enough": for x86_64, would std::memory_order_relaxed be sufficient here? I ask, as x86_64's xchg is an automatically locked, serializing instruction (I believe), which means that the call to exchange will always be a sync point?
@Carl: The memory ordering constraints do more than affect the choice of instruction: they limit compiler optimizations too. Relaxed instructions can be moved around by the compiler relative to other operations, whereas acquire-release operations cannot. If you want acquire-release, say so.
Thanks Anthony... I really like your work!
If I may clarify, is
"turn.exchange(0, std::memory_order_acq_rel);"
sensibly equivalent to :(imagine no preemption so atomicity is conserved)
"var localTurn = turn.load(std::memory_order_acq); turn.store(1, std::memory_order_rel);"
or do CAS operations have a more "special" property (as hinted by the paragraph I quoted)?

Hello Anthony,
I'm a beginner in memory ordering and I'm trying to understand your proof. I hope my questions are not too stupid.
The part I don't get in your handling of Bartosz's implementation is "We have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty".
If I understand correctly, "happens-before" means "A synchronizes-with B" in that particular case. This means (1) - A is a store with an ordering of std::memory_order_release, and (2) - B is a load with an ordering of std::memory_order_acquire
Clearly in thread 0 there is a store to _interested[0] : _interested[0].exchange(true, std::memory_order_acq_rel);
If memory_order_acq_rel has both acquire and release semantic, then I believe it matches (1).
Then in thread 1, there is a load from _interested[0] in the while loop : while (_interested[0].load(std::memory_order_acquire) ...
Which seems to match (2).
So can you tell me where my reasoning is wrong and why I should discard _interested[0] as a synchronization variable ?
Then in conclusion you explain : "In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is the variable updated by both threads. It therefore acts as a synchronization point for the threads."
This seem to imply that in order to "synchronizes-with", a load and a store is not enough, you have to perform 2 stores ?
I'm confused, please help me understand.
David
David:
You are right that the store does synchronize-with the load *if the load sees the value stored*.
However, the problem is that the load *might not see the store* unless there is some *other* cause for synchronization. Even though thread 0 has stored true to _interested[0], thread 1 might still read false due to the vagaries of caching and the lack of explicit memory ordering. If thread 1 reads false then it breaks out of the while loop, potentially prematurely.
"Synchronizes with" does not affect ordering on a single variable. If operations on one variable synchronize-with each other, that imposes orderings on accesses to *other* variables.
In Dmitriy's implementation, would it still be correct to use 'turn.store(1)' instead of 'turn.exchange(1, std::memory_order_acq_rel)'?
No, it would not. The exchange is part of the synchronization mechanism: it ensures that the other thread sees the store to the relevant flag value if it sees the store to turn.