Implementing Dekker's algorithm with Fences
Tuesday, 27 July 2010
Dekker's algorithm is one of the most basic algorithms for mutual exclusion, alongside Peterson's algorithm and Lamport's bakery algorithm. It has the nice property that it only requires load and store operations rather than exchange or test-and-set, but it still requires some level of ordering between the operations. On a weakly-ordered architecture such as PowerPC or SPARC, a correct implementation of Dekker's algorithm thus requires the use of fences or memory barriers in order to ensure correct operation.
The code
For those of you who just want the code, here it is: Dekker's algorithm in C++, with explicit fences.
std::atomic<bool> flag0(false), flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false, std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false, std::memory_order_relaxed);
}
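In the listing, the protected work goes where the "// critical section" comments sit. As a quick illustration of how the functions might be exercised, here is a minimal test harness. This is my addition rather than part of the algorithm: it assumes the statement ++counter has been substituted for the "// critical section" comment in both p0() and p1(), and that counter is declared before them.

#include <cassert>
#include <thread>

// ... together with the declarations and p0()/p1() from the listing above.

int counter = 0; // plain, non-atomic data protected by Dekker's algorithm

int main()
{
    const int iterations = 100000;
    std::thread t0([&]{ for (int i = 0; i < iterations; ++i) p0(); });
    std::thread t1([&]{ for (int i = 0; i < iterations; ++i) p1(); });
    t0.join();
    t1.join();
    // If mutual exclusion holds, no increments are lost.
    assert(counter == 2 * iterations);
}

If mutual exclusion were broken, the two unsynchronized increments could interleave and the final count could fall short of the expected total.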
The analysis
If you're like me then you'll be interested in why stuff works, rather than just taking the code. Here is my analysis of the required orderings, and how the fences guarantee those orderings.
Suppose thread 0 and thread 1 enter p0 and p1 respectively at the same time. They both set their respective flags to true, execute the fence and then read the other flag at the start of the while loop.

If both threads read false then both will enter the critical section, and the algorithm doesn't work. It is the job of the fences to ensure that this doesn't happen.
The fences are marked with memory_order_seq_cst, so either the fence in p0 is before the fence in p1 in the global ordering of memory_order_seq_cst operations, or vice-versa. Without loss of generality, we can assume that the fence in p0 comes before the fence in p1, since the code is symmetric. The store to flag0 is sequenced before the fence in p0, and the fence in p1 is sequenced before the read from flag0. Therefore the read from flag0 must see the value stored (true), so p1 will enter the while loop.
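This guarantee is the classic store-buffering pattern in disguise. As a minimal sketch of it in isolation (my distillation, not code from the post), the following program can never finish with both r1 and r2 false: whichever seq_cst fence comes first in the total order forces the other thread's load to see the store.

#include <atomic>
#include <thread>

std::atomic<bool> x(false), y(false);
bool r1, r2;

void t0()
{
    x.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void t1()
{
    y.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main()
{
    std::thread a(t0), b(t1);
    a.join();
    b.join();
    // r1 == false && r2 == false is impossible: one fence precedes the
    // other in the seq_cst total order, so at least one load sees true.
}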
On the other side, there is no such guarantee for the read from flag1 in p0, so p0 may or may not enter the while loop. If p0 reads the value of false for flag1, it will not enter the while loop, and will instead enter the critical section, but that is OK since p1 has entered the while loop.
Though flag0 is not set to false until p0 has finished the critical section, we need to ensure that p1 does not see this until the values modified in the critical section are also visible to p1, and this is the purpose of the release fence prior to the store to flag0 and the acquire fence after the while loop. When p1 reads the value false from flag0 in order to exit the while loop, it must be reading the value stored by p0 after the release fence at the end of the critical section. The acquire fence after the load guarantees that all the values written before the release fence prior to the store are visible, which is exactly what we need here.
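This is the standard fence-based release/acquire handoff, and it works for any data, not just the flags here. As a minimal sketch of the pattern in isolation (my example, with hypothetical names payload and ready): if the consumer's relaxed load sees the value written by the producer's relaxed store, the two fences synchronize, and everything written before the release fence is visible after the acquire fence.

#include <atomic>
#include <thread>

int payload = 0;                // ordinary non-atomic data
std::atomic<bool> ready(false);

void producer()
{
    payload = 42;                                        // write the data
    std::atomic_thread_fence(std::memory_order_release); // release fence before the store
    ready.store(true, std::memory_order_relaxed);
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed))
        ; // spin until the store becomes visible
    std::atomic_thread_fence(std::memory_order_acquire); // acquire fence after the load
    // The fences synchronize because the load saw the flag value,
    // so payload is guaranteed to be 42 here.
    int v = payload;
    (void)v;
}

int main()
{
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}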
If p0 reads true for flag1, it will enter the while loop rather than the critical section. Both threads are now looping, so we need a way to ensure that exactly one of them breaks out. This is the purpose of the turn variable. Initially, turn is 0, so p1 will enter the if and set flag1 to false, whilst p0 will not enter the if. Because p1 set flag1 to false, eventually p0 will read flag1 as false and exit the outer while loop to enter the critical section. On the other hand, p1 is now stuck in the inner while loop because turn is 0. When p0 exits the critical section it sets turn to 1. This will eventually be seen by p1, allowing it to exit the inner while loop. When the store to flag0 becomes visible, p1 can then exit the outer while loop too.
If turn had been 1 initially (because p0 was the last thread to enter the critical section) then the inverse logic would apply, and p0 would enter the inner loop, allowing p1 to enter the critical section first.
Second time around
If p0 is called a second time whilst p1 is still in the inner loop then we have a similar situation to the start of the function: p1 may exit the inner loop and store true in flag1 whilst p0 stores true in flag0. We therefore need the second memory_order_seq_cst fence after the store to the flag in the inner loop. This guarantees that at least one of the threads will see the flag from the other thread set when it executes the check in the outer loop. Without this fence, both threads can read false, and both can enter the critical section.
Alternatives
You could put the ordering constraints on the loads and stores themselves rather than using fences. Indeed, the default memory ordering for atomic operations in C++ is memory_order_seq_cst, so the algorithm would "just work" with plain loads and stores to atomic variables. However, by using memory_order_relaxed on the loads and stores we can add fences to ensure we have exactly the ordering constraints required.
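For comparison, here is a sketch of p0 with the constraints on the operations themselves (my rendering of the alternative just described; p1 is symmetric). Every load and store defaults to memory_order_seq_cst, so no explicit fences are required, at the cost of making every access sequentially consistent.

void p0()
{
    flag0.store(true); // memory_order_seq_cst by default
    while (flag1.load())
    {
        if (turn.load() != 0)
        {
            flag0.store(false);
            while (turn.load() != 0)
            {
            }
            flag0.store(true);
        }
    }
    // critical section
    turn.store(1);
    flag0.store(false);
}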
Posted by Anthony Williams
Tags: concurrency, synchronization, Dekker, fences, cplusplus
9 Comments
You should be able to call p0 multiple times from the same thread, and similarly for p1.
Then the algorithm looks suspicious. There is a seq_cst fence between the store to flag0 and the load from flag1 on the main path. However, there is no such fence on the contention path. This can result in the following execution sequence:

T0: (contention path) flag0 = true; load from flag1;
T1: (main path) flag1 = true; seq_cst fence; load from flag0;

Then T0 can load false from flag1, and T1 can load false from flag0. What am I missing?
Oops. I missed the fence after the store in the contention path. I've updated the code and added an explanation.
Thanks for the code :)
I thought the draft C++0x standard rules around seq_cst imposed no ordering constraints between non-atomic (or relaxed) loads and stores. If that is the case, how does adding an explicit seq_cst fence (say, between the flag0.store and flag1.load in p0) add a store-load memory barrier?
I always thought this was a terrible gap in the standard that makes composing SC and non-SC code impossible to reason about. If I place an explicit seq_cst fence (not attached to a particular variable) then I would expect it to insert a MEMBAR #StoreLoad | #LoadLoad | #LoadStore | #StoreStore, assuming SPARC RMO. That's how it should work. But I thought the standard did not allow it. What changed?
A simple example is:
assume x == y == 0 and r1 == r2 == 0
Thread 0:
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r1 = y.load(memory_order_relaxed);

Thread 1:
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r2 = x.load(memory_order_relaxed);
Note that here I'm not using acquire-release semantics. Now I would expect that, since the fences are globally ordered, it must NEVER be possible to see r1 == r2 == 0 after these threads execute.
For that matter, I could mark the stores and loads here with release and acquire respectively, and the result would still be the same. At least to me, that's intuitive behavior. But as I understood it, the standard did not guarantee this unless I missed an update somewhere.
Anthony, I reviewed your manuscript but I don't recall if you clear this up. Could you clear this up somewhere in the section on atomics, and maybe add this example and walk through it very carefully? I think this is the minimum code possible to show why seq_cst is necessary (intuitively at least), and it would be useful to walk through the above proof.
Contrast that code with this version, using only acquire-release semantics:
assume x == y == 0 and r1 == r2 == 0
Thread 0:
x.store(1, memory_order_release);  // A
r1 = y.load(memory_order_acquire); // B

Thread 1:
y.store(1, memory_order_release);  // C
r2 = x.load(memory_order_acquire); // D
Now, it's clear that r1 == r2 == 0 is possible. For a simple existence proof, assume Thread 1 observes A and B in the order {B, A}. One possible global ordering is {B, C, D, A}. Obviously, Thread 1 must be self-consistent so {C, D} ordering must be preserved. But in this case Thread 1 observed A and B out of order and this resulted in r1 == r2 == 0.
I really think a simple example like this would demonstrate the differences between one-sided fences across a pair of threads and two-sided fences on a single thread, and why both models are necessary, depending on whether you need the #StoreLoad.
Wow, if it doesn't work this way I'd hate to be a compiler vendor supporting the C++11 memory model. ;-)
@Michael: I realize this comes six years too late, but indeed the result r1 == r2 == 0 is not possible when using sc fences. One can verify this with cppmem using this code snippet:
int main() {
  atomic_int x = 0;
  atomic_int y = 0;
  {{{ {
        x.store(1, memory_order_relaxed);
        fence(memory_order_seq_cst);
        r1 = y.load(memory_order_relaxed);
      }
  ||| {
        y.store(1, memory_order_relaxed);
        fence(memory_order_seq_cst);
        r2 = x.load(memory_order_relaxed);
      }
  }}};
  return 0;
}
This gives four "executions" consistent with the C++11 memory model: r1=1, r2=0; r1=0, r2=1; and two with r1=1, r2=1.
The reason that r1=0, r2=0 is not possible is that one fence must come before the other in the single total order of seq_cst operations, an order that ALL threads agree on. This connects the chain (for example): x=0 happens before x=1, which is sequenced before fence 1, *which comes before fence 2 in the total order*, which is sequenced before the read of x. So that read must see 1 and cannot see 0.
Hello,
I know this post is a little old, but I hope you will answer some of my questions :)
The part of your analysis I do not fully understand is this one:
"On the other side, there is no such guarantee for the read from flag1 in p0, so p0 may or may not enter the while loop. If p0 reads the value of false for flag1, it will not enter the while loop, and will instead enter the critical section, but that is OK since p1 has entered the while loop."
From my understanding, the std::memory_order_seq_cst fence will operate as an MFENCE: data after the fence will be up to date. So p1 will store its value and the fence will drain the store buffer, updating the value in thread 0's (p0's) cache. Therefore, p0 should see the value?
Can you tell me where I'm wrong?
Sincerely
In the example being discussed, we "know" that the fence in p0 occurs before the fence in p1. However, we have no information about whether the read from flag1 in p0 comes before or after the write to flag1 from p1 becomes visible to p0. It would be entirely possible for p0 to run to completion before p1's store to flag1 became visible.
A fence with memory_order_seq_cst does not enforce immediate visibility to other threads (and neither does an MFENCE instruction). The C++0x memory ordering constraints are just that --- ordering constraints. memory_order_seq_cst operations form a total order, but there are no restrictions on what that order is, except that it must be agreed on by all threads, and it must not violate other ordering constraints. In particular, threads may continue to see "stale" values for some time, provided they see values in an order consistent with the constraints.