Implementing Dekker's algorithm with Fences
Tuesday, 27 July 2010
Dekker's algorithm is one of the most basic algorithms for mutual exclusion, alongside Peterson's algorithm and Lamport's bakery algorithm. It has the nice property that it only requires load and store operations rather than exchange or test-and-set, but it still requires some level of ordering between the operations. On a weakly-ordered architecture such as PowerPC or SPARC, a correct implementation of Dekker's algorithm thus requires the use of fences or memory barriers in order to ensure correct operation.
The code
For those of you who just want the code, here it is: Dekker's algorithm in C++, with explicit fences.
std::atomic<bool> flag0(false), flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false, std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false, std::memory_order_relaxed);
}
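In the listing, the protected work goes where the "// critical section" comments sit. As a quick illustration of how the functions might be exercised, here is a minimal test harness. This is my addition rather than part of the algorithm: it assumes the statement ++counter has been substituted for the "// critical section" comment in both p0() and p1(), and that counter is declared before them.

#include <cassert>
#include <thread>

// ... together with the declarations and p0()/p1() from the listing above.

int counter = 0; // plain, non-atomic data protected by Dekker's algorithm

int main()
{
    const int iterations = 100000;
    std::thread t0([&]{ for (int i = 0; i < iterations; ++i) p0(); });
    std::thread t1([&]{ for (int i = 0; i < iterations; ++i) p1(); });
    t0.join();
    t1.join();
    // If mutual exclusion holds, no increments are lost.
    assert(counter == 2 * iterations);
}

If mutual exclusion were broken, the two unsynchronized increments could interleave and the final count could fall short of the expected total.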
The analysis
If you're like me then you'll be interested in why stuff works, rather than just taking the code. Here is my analysis of the required orderings, and how the fences guarantee those orderings.
Suppose thread 0 and thread 1 enter p0 and p1 respectively at the same time. They both set their respective flags to true, execute the fence and then read the other flag at the start of the while loop.

If both threads read false then both will enter the critical section, and the algorithm doesn't work. It is the job of the fences to ensure that this doesn't happen.
The fences are marked with memory_order_seq_cst, so either the fence in p0 is before the fence in p1 in the global ordering of memory_order_seq_cst operations, or vice-versa. Without loss of generality, we can assume that the fence in p0 comes before the fence in p1, since the code is symmetric. The store to flag0 is sequenced before the fence in p0, and the fence in p1 is sequenced before the read from flag0. Therefore the read from flag0 must see the value stored (true), so p1 will enter the while loop.
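This guarantee is the classic store-buffering pattern in disguise. As a minimal sketch of it in isolation (my distillation, not code from the post), the following program can never finish with both r1 and r2 false: whichever seq_cst fence comes first in the total order forces the other thread's load to see the store.

#include <atomic>
#include <thread>

std::atomic<bool> x(false), y(false);
bool r1, r2;

void t0()
{
    x.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void t1()
{
    y.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main()
{
    std::thread a(t0), b(t1);
    a.join();
    b.join();
    // r1 == false && r2 == false is impossible: one fence precedes the
    // other in the seq_cst total order, so at least one load sees true.
}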
On the other side, there is no such guarantee for the read from flag1 in p0, so p0 may or may not enter the while loop. If p0 reads the value of false for flag1, it will not enter the while loop, and will instead enter the critical section, but that is OK since p1 has entered the while loop.
Though flag0 is not set to false until p0 has finished the critical section, we need to ensure that p1 does not see this until the values modified in the critical section are also visible to p1, and this is the purpose of the release fence prior to the store to flag0 and the acquire fence after the while loop. When p1 reads the value false from flag0 in order to exit the while loop, it must be reading the value stored by p0 after the release fence at the end of the critical section. The acquire fence after the load guarantees that all the values written before the release fence prior to the store are visible, which is exactly what we need here.
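This is the standard fence-based release/acquire handoff, and it works for any data, not just the flags here. As a minimal sketch of the pattern in isolation (my example, with hypothetical names payload and ready): if the consumer's relaxed load sees the value written by the producer's relaxed store, the two fences synchronize, and everything written before the release fence is visible after the acquire fence.

#include <atomic>
#include <thread>

int payload = 0;                // ordinary non-atomic data
std::atomic<bool> ready(false);

void producer()
{
    payload = 42;                                        // write the data
    std::atomic_thread_fence(std::memory_order_release); // release fence before the store
    ready.store(true, std::memory_order_relaxed);
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed))
        ; // spin until the store becomes visible
    std::atomic_thread_fence(std::memory_order_acquire); // acquire fence after the load
    // The fences synchronize because the load saw the flag value,
    // so payload is guaranteed to be 42 here.
    int v = payload;
    (void)v;
}

int main()
{
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}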
If p0 reads true for flag1, it will enter the while loop rather than the critical section. Both threads are now looping, so we need a way to ensure that exactly one of them breaks out. This is the purpose of the turn variable. Initially, turn is 0, so p1 will enter the if and set flag1 to false, whilst p0 will not enter the if. Because p1 set flag1 to false, eventually p0 will read flag1 as false and exit the outer while loop to enter the critical section. On the other hand, p1 is now stuck in the inner while loop because turn is 0. When p0 exits the critical section it sets turn to 1. This will eventually be seen by p1, allowing it to exit the inner while loop. When the store to flag0 becomes visible, p1 can then exit the outer while loop too.
If turn had been 1 initially (because p0 was the last thread to enter the critical section) then the inverse logic would apply, and p0 would enter the inner loop, allowing p1 to enter the critical section first.
Second time around
If p0 is called a second time whilst p1 is still in the inner loop then we have a similar situation to the start of the function: p1 may exit the inner loop and store true in flag1 whilst p0 stores true in flag0. We therefore need the second memory_order_seq_cst fence after the store to the flag in the inner loop. This guarantees that at least one of the threads will see the flag from the other thread set when it executes the check in the outer loop. Without this fence, both threads can read false, and both can enter the critical section.
Alternatives
You could put the ordering constraints on the loads and stores themselves rather than using fences. Indeed, the default memory ordering for atomic operations in C++ is memory_order_seq_cst, so the algorithm would "just work" with plain loads and stores to atomic variables. However, by using memory_order_relaxed on the loads and stores we can add fences to ensure we have exactly the ordering constraints required.
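For comparison, here is a sketch of p0 with the constraints on the operations themselves (my rendering of the alternative just described; p1 is symmetric). Every load and store defaults to memory_order_seq_cst, so no explicit fences are required, at the cost of making every access sequentially consistent.

void p0()
{
    flag0.store(true); // memory_order_seq_cst by default
    while (flag1.load())
    {
        if (turn.load() != 0)
        {
            flag0.store(false);
            while (turn.load() != 0)
            {
            }
            flag0.store(true);
        }
    }
    // critical section
    turn.store(1);
    flag0.store(false);
}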
Posted by Anthony Williams
Tags: concurrency, synchronization, Dekker, fences, cplusplus
9 Comments
You should be able to call p0 multiple times from the same thread, and similarly for p1.
Then the algorithm looks suspicious. There is a seq_cst fence between the store to flag0 and the load from flag1 on the main path. However, there is no such fence on the contention path. This can result in the following execution sequence:

T0: (contention path) flag0 = true; load from flag1;
T1: (main path) flag1 = true; seq_cst fence; load from flag0;

Then T0 can load false from flag1, and T1 can load false from flag0. What am I missing?
Oops. I missed the fence after the store in the contention path. I've updated the code and added an explanation.
Thanks for the code :)
I thought the draft C++0x standard rules around seq_cst imposed no ordering constraints between non-atomic (or relaxed) loads and stores. If that is the case, how does adding an explicit seq_cst fence (say, between the flag0.store and flag1.load in p0) add a store-load memory barrier?
I always thought this was a terrible gap in the standard that makes composing SC and non-SC code impossible to reason about. If I place an explicit seq_cst fence (not attached to a particular variable) then I would expect it to insert a MEMBAR #StoreLoad | #LoadLoad | #LoadStore | #StoreStore, assuming SPARC RMO. That's how it should work. But I thought the standard did not allow it. What changed?
A simple example is:
assume x == y == 0 and r1 == r2 == 0
Thread 0:
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r1 = y.load(memory_order_relaxed);

Thread 1:
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r2 = x.load(memory_order_relaxed);
Note that here I'm not using acquire-release semantics. Now I would expect that, since the fences are globally ordered, it must NEVER be possible to see r1 == r2 == 0 after these threads execute.
For that matter, I could mark the stores and loads here with release and acquire respectively, and the result would still be the same. At least to me, that's intuitive behavior. But as I understood it, the standard did not guarantee this unless I missed an update somewhere.
Anthony, I reviewed your manuscript but I don't recall if you clear this up. Could you clear this up somewhere in the section on atomics, and maybe add this example and walk through it very carefully? I think this is the minimum code possible to show why seq_cst is necessary (intuitively at least), and it would be useful to walk through the above proof.
Contrast that code with this version, using only acquire-release semantics:
assume x == y == 0 and r1 == r2 == 0
Thread 0:
x.store(1, memory_order_release);  // A
r1 = y.load(memory_order_acquire); // B

Thread 1:
y.store(1, memory_order_release);  // C
r2 = x.load(memory_order_acquire); // D
Now, it's clear that r1 == r2 == 0 is possible. For a simple existence proof, assume Thread 1 observes A and B in the order {B, A}. One possible global ordering is {B, C, D, A}. Obviously, Thread 1 must be self-consistent so {C, D} ordering must be preserved. But in this case Thread 1 observed A and B out of order and this resulted in r1 == r2 == 0.
I really think a simple example like this would demonstrate the differences between one-sided fences across a pair of threads and two-sided fences on a single thread, and why both models are necessary, depending on whether you need the #StoreLoad.
Wow, if it doesn't work this way I'd hate to be a compiler vendor supporting the C++11 memory model. ;-)
@Michael: I realize this comes six years too late, but indeed the result r1 == r2 == 0 is not possible when using sc fences. One can verify this with cppmem using this code snippet:
int main() {
  atomic_int x = 0;
  atomic_int y = 0;
  {{{ {
        x.store(1, memory_order_relaxed);
        fence(memory_order_seq_cst);
        r1 = y.load(memory_order_relaxed);
      }
  ||| {
        y.store(1, memory_order_relaxed);
        fence(memory_order_seq_cst);
        r2 = x.load(memory_order_relaxed);
      }
  }}};
  return 0;
}
This gives four "executions" consistent with the C++11 memory model: r1=1, r2=0; r1=0, r2=1; and two with r1=1, r2=1.
The reason that r1=0, r2=0 is not possible is that one fence must come before the other in the single total order of seq_cst operations, an order that ALL threads agree on. This connects the chain (for example): x=0 happens before x=1, which is sequenced before fence 1, *which comes before fence 2 in the total order*, which is sequenced before the read of x. So that read must see 1 and cannot see 0.
Hello,
I know this post is a little old, but I hope you will answer some of my questions :)
The part of your analysis I do not fully understand is this one:
"On the other side, there is no such guarantee for the read from flag1 in p0, so p0 may or may not enter the while loop. If p0 reads the value of false for flag1, it will not enter the while loop, and will instead enter the critical section, but that is OK since p1 has entered the while loop."
From my understanding, the std::memory_order_seq_cst fence will operate as an MFENCE: data after the fence will be up to date. So p1 will store its value and the fence will drain the store buffer, updating the value in thread 0's (p0's) cache. Therefore, p0 should see the value?
Can you tell me where I'm wrong?
Sincerely
In the example being discussed, we "know" that the fence in p0 occurs before the fence in p1. However, we have no information about whether the read from flag1 in p0 comes before or after the write to flag1 from p1 becomes visible to p0. It would be entirely possible for p0 to run to completion before p1's store to flag1 became visible.
A fence with memory_order_seq_cst does not enforce immediate visibility to other threads (and neither does an MFENCE instruction). The C++0x memory ordering constraints are just that --- ordering constraints. memory_order_seq_cst operations form a total order, but there are no restrictions on what that order is, except that it must be agreed on by all threads, and it must not violate other ordering constraints. In particular, threads may continue to see "stale" values for some time, provided they see values in an order consistent with the constraints.