Multithreading in C++0x part 6: Lazy initialization and double-checked locking with atomics
Thursday, 13 August 2009
This is the sixth in a series of blog posts introducing the new C++0x thread library. So far we've looked at the various ways of starting threads in C++0x and protecting shared data with mutexes. See the end of this article for a full set of links to the rest of the series.
I had intended to write about the use of the new std::lock() for avoiding deadlock in this article. However, there was a post on comp.std.c++ this morning about lazy initialization, and I thought I'd write about that instead. std::lock() can wait until next time.
Lazy Initialization
The classic lazy-initialization case is where you have an object that is expensive to construct but isn't always needed. In this case you can choose to only initialize it on first use:
class lazy_init
{
    mutable std::unique_ptr<expensive_data> data;
public:
    expensive_data const& get_data() const
    {
        if(!data)
        {
            data.reset(new expensive_data);
        }
        return *data;
    }
};
However, we can't use this idiom in multi-threaded code, since there would be a data race on the accesses to data. Enter std::call_once() — by using an instance of std::once_flag to protect the initialization we can make the data race go away:
class lazy_init
{
    mutable std::once_flag flag;
    mutable std::unique_ptr<expensive_data> data;

    void do_init() const
    {
        data.reset(new expensive_data);
    }
public:
    expensive_data const& get_data() const
    {
        std::call_once(flag,&lazy_init::do_init,this);
        return *data;
    }
};
Concurrent calls to get_data() are now safe: if the data has already been initialized they can just proceed concurrently. If not, then all concurrent callers except one will wait until that one thread has completed the initialization.
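As a minimal usage sketch (not from the original article; the reader function and two-thread setup are purely illustrative), several threads can safely race to trigger the initialization:

#include <functional>
#include <thread>

void reader(lazy_init const& cache)
{
    expensive_data const& d=cache.get_data(); // safe from any thread
    (void)d; // ... use d ...
}

int main()
{
    lazy_init cache;
    std::thread t1(reader,std::cref(cache)); // both threads may call get_data()
    std::thread t2(reader,std::cref(cache)); // concurrently; do_init() runs once
    t1.join();
    t2.join();
}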
Reinitialization
This is all very well if you only want to initialize the data once. However, what if you need to update the data — perhaps it's a cache of some rarely-changing data that's expensive to come by? std::call_once() doesn't support multiple calls (hence the name). You could of course protect the data with a mutex, as shown below:
class lazy_init_with_cache
{
    mutable std::mutex m;
    mutable std::shared_ptr<const expensive_data> data;
public:
    std::shared_ptr<const expensive_data> get_data() const
    {
        std::lock_guard<std::mutex> lk(m);
        if(!data)
        {
            data.reset(new expensive_data);
        }
        return data;
    }

    void invalidate_cache()
    {
        std::lock_guard<std::mutex> lk(m);
        data.reset();
    }
};
Note that in this case we return a std::shared_ptr<const expensive_data> rather than a reference, to avoid a race condition on the data itself — this ensures that the copy held by the calling code will be valid (if out of date) even if another thread calls invalidate_cache() before the data can be used.
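A short sketch of the lifetime guarantee this buys us (single-threaded here purely for illustration, and not from the original article):

void snapshot_example(lazy_init_with_cache& cache)
{
    std::shared_ptr<const expensive_data> snapshot=cache.get_data();
    cache.invalidate_cache();           // the cache drops its own reference
    expensive_data const& d=*snapshot;  // still valid: our copy of the
    (void)d;                            // shared_ptr keeps the object alive
}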
This "works" in the sense that it avoids data races, but if the
updates are rare and the reads are frequent then this may cause
unnecessary serialization when multiple threads
call get_data()
concurrently. What other options do we
have?
Double-checked locking returns
Much has been written about how double-checked locking is broken when using multiple threads. However, the chief cause of the problem is that the sample code uses plain non-atomic operations to check the flag outside the mutex, so is subject to a data race. You can overcome this by careful use of the C++0x atomics, as shown in the example below:
class lazy_init_with_cache
{
    mutable std::mutex m;
    mutable std::shared_ptr<const expensive_data> data;
public:
    std::shared_ptr<const expensive_data> get_data() const
    {
        std::shared_ptr<const expensive_data> result=
            std::atomic_load_explicit(&data,std::memory_order_acquire);
        if(!result)
        {
            std::lock_guard<std::mutex> lk(m);
            result=data;
            if(!result)
            {
                result.reset(new expensive_data);
                std::atomic_store_explicit(&data,result,std::memory_order_release);
            }
        }
        return result;
    }

    void invalidate_cache()
    {
        std::lock_guard<std::mutex> lk(m);
        std::shared_ptr<const expensive_data> dummy;
        std::atomic_store_explicit(&data,dummy,std::memory_order_relaxed);
    }
};
Note that in this case, all writes to data use atomic operations, even those within the mutex lock. This is necessary in order to ensure that the atomic load operation at the start of get_data() actually has a coherent value to read — there's no point doing an atomic load if the stores are not atomic, otherwise you might atomically load some half-written data. Also, the atomic load and store operations ensure that the reference count on the std::shared_ptr object is correctly updated, so that the expensive_data object is correctly destroyed when the last std::shared_ptr object referencing it is destroyed.
If our atomic load returns a non-NULL value then we can use it, just as we did before. However, if it returns NULL then we need to lock the mutex and try again. This time we can use a plain read of data, since the mutex is locked. If we still get NULL then we need to do the initialization. However, we can't just call data.reset() like before, since that is not atomic. Instead we must create a local std::shared_ptr instance with the value we want, and store that value with an atomic store operation. We can use result for the local value, since we want the value in that variable anyway.
In invalidate_cache() we must also store the value using std::atomic_store_explicit(), in order to ensure that the NULL value is correctly read in get_data(). Note that we must also lock the mutex here, in order to avoid a data race with the initialization code inside the mutex lock in get_data().
Memory ordering
By using std::atomic_load_explicit() and std::atomic_store_explicit() we can specify the memory ordering requirements of the operations. We could have just used std::atomic_load() and std::atomic_store(), but those would have implied std::memory_order_seq_cst, which is overkill in this scenario. What we need is to ensure that if a non-NULL value is read in get_data() then the actual creation of the associated object happens-before the read. The store in get_data() must therefore use std::memory_order_release, whilst the load uses std::memory_order_acquire.
On the other hand, the store in invalidate_cache() can merrily use std::memory_order_relaxed, since there is no data associated with the store: if the load in get_data() reads NULL then the mutex will be locked, which will handle any necessary synchronization.
Whenever you use atomic operations, you have to make sure that the memory ordering is correct, and that there are no races. Even in a case as simple as this it is not trivial, and I would not recommend it unless profiling has shown that this is really a problem.
Update: As if to highlight my previous point about the trickiness of atomic operations, Dmitriy correctly points out in the comments that the use of std::shared_ptr to access the expensive_data implies a reference count, which is a real performance suck in multithreaded code. Whatever memory ordering constraints we put on it, every thread doing the reading has to update the reference count. This is thus a source of contention, and can seriously limit scalability even if it doesn't force full serialization. The same issues apply to multiple-reader single-writer mutexes — Joe Duffy has written about them over on his blog. Time it on your platform with just a mutex (i.e. no double-checked locking), and with the atomic operations, and use whichever is faster. Alternatively, use a memory reclamation scheme specially tailored to your usage.
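A rough sketch of such a timing comparison (the harness, clock and iteration count here are illustrative assumptions, not from the original article; a realistic test should also run multiple reader threads):

#include <chrono>
#include <iostream>

template<typename Cache>
void time_reads(Cache& cache,unsigned iterations)
{
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point const start=clock::now();
    for(unsigned i=0;i<iterations;++i)
        cache.get_data(); // measure only the read path
    clock::time_point const end=clock::now();
    std::cout<<std::chrono::duration_cast<std::chrono::microseconds>(end-start).count()
             <<" us for "<<iterations<<" calls\n";
}

// e.g. compare time_reads(mutex_only_cache,1000000) against
// time_reads(double_checked_cache,1000000) on your target hardware.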
Next time
In the next part of this series I'll cover the use of std::lock() that I was intending to cover in this installment.
Subscribe to the RSS feed or email newsletter for this blog to be sure you don't miss the rest of the series.
Try it out
If you're using Microsoft Visual Studio 2008 or g++ 4.3 or 4.4 on Ubuntu Linux you can try out the examples from this series using our just::thread implementation of the new C++0x thread library. Get your copy today.
Note: since std::shared_ptr is part of the library supplied with the compiler, just::thread cannot provide the atomic functions for std::shared_ptr used in this article.
Multithreading in C++0x Series
Here are the posts in this series so far:
- Multithreading in C++0x Part 1: Starting Threads
- Multithreading in C++0x Part 2: Starting Threads with Function Objects and Arguments
- Multithreading in C++0x Part 3: Starting Threads with Member Functions and Reference Arguments
- Multithreading in C++0x Part 4: Protecting Shared Data
- Multithreading in C++0x Part 5: Flexible locking with std::unique_lock<>
- Multithreading in C++0x part 6: Lazy initialization and double-checked locking with atomics
- Multithreading in C++0x part 7: Locking multiple mutexes without deadlock
- Multithreading in C++0x part 8: Futures, Promises and Asynchronous Function Calls
Posted by Anthony Williams
12 Comments
Hi Dmitriy,
You are right to be concerned about the implementation of std::atomic_load_explicit for shared_ptr. The simplest implementation would use an internal spin lock, which is potentially marginally more efficient than a full-blown mutex, but will still enforce serialization.
Just because algorithms are patented doesn't mean implementations can't use them --- it just means they need to get permission from the patentee (presumably for a fee).
Regarding your second point about reference counting, you're probably right. I was trying to avoid having to deal with the issues around safely accessing the returned object, and atomic shared_ptr access was the simplest way I could think of this morning. It is likely to be considerably less efficient than the call_once scenario, but the circumstances are different.
I think the member function do_init in the 2nd example should be const.
Interesting articles!
SG
Thank you for writing this informative article. One wrinkle I think you are forgetting is that there are varying levels of thread safety. One definition of the lowest level of thread safety is when the following conditions are true:
1) It is safe for several threads to have simultaneous read-only access to the same object.
2) It is safe for several threads to have simultaneous read-write access to different objects.
If you only want to provide basic thread safety then you don't need to protect the invalidate_cache() method. This is a non-const method of the lazy_init_with_cache class, so if a client calls it on an object that is shared by several threads it is the client's responsibility to protect the code, probably with his own mutex or lock. On the other hand, the get_data() method is const, so a client can reasonably expect that it is safe to call this from several threads without locks. Thus it is your responsibility to make get_data() safe to call simultaneously from several threads, but not invalidate_cache().
Well spotted SG. Thanks. I've updated the code.
Hi Joe,
I agree that when we talk about something being "thread-safe" it is important to think about what operations are being done - going for the general case will just lead to unnecessary overhead.
The reason I put the lock in invalidate_cache() is that it prevents a race between invalidate_cache() and get_data(). There's no point locking the mutex in get_data() if another thread can just call invalidate_cache() and stamp all over our data without acquiring the mutex.
If in your particular case the program logic guarantees that no threads can call get_data() whilst any thread is calling invalidate_cache() then you don't need the internal locks.
Hi Anthony,
At least POSIX guarantees that mutex lock and unlock provide a memory barrier (as does thread creation). More specifically, unlock provides (all) memory release semantics while lock provides (all) memory acquire semantics. This way different threads using mutex lock and unlock consistently achieve a coherent view of the data protected by the mutex. Notice that the mutex lock and unlock APIs do not provide a way to specify which memory gets affected, so they must operate in a way that affects all memory.
In this light, it is not necessary to use atomic operations while the mutex is locked, because mutex unlock does memory release anyway.
Another thing I'd like to mention is that the first version of get_data():
std::shared_ptr<const expensive_data> get_data() const
{
    std::lock_guard<std::mutex> lk(m);
    if(!data)
    {
        data.reset(new expensive_data);
    }
    return data; // <---- copying here, the mutex is still held
}                // <---- release the mutex here
copies data to the return value while still holding the mutex, which is suboptimal, since the mutex only protects the initialisation of data. A more optimal version would not hold the mutex after data has been initialised:
std::shared_ptr<const expensive_data> get_data() const
{
    {
        std::lock_guard<std::mutex> lk(m);
        if(!data)
            data.reset(new expensive_data);
    }            // <---- release the mutex here
    return data; // <---- copying data, mutex is not held
}
Max
Hi Max,
Firstly, you're right about the mutex-only version: once data is initialized then you can release the mutex before returning the value.
However, you are wrong about the double-checked locking example. Yes, mutex unlock provides full release semantics, so we do not normally need to use atomics with a mutex. However, we are reading the value *outside* the mutex lock, and only acquiring the lock if this read returns NULL. Other threads can perform this read *even whilst one thread has the mutex locked*. Consequently, we need release semantics on the *store itself*. The fact that the unlock that is executed moments later has release semantics is not enough --- moments later might be too late if a concurrently-executing thread has already read the freshly-stored non-NULL value and tried to use it.
Hi Anthony,
I see what you mean. If the store does not have release semantics there is a possibility that another thread starts using the object after the store but before the release, so that some of the effects of constructing a new object may not have become visible to the other thread. Is that right?
Max
Hi Max,
Yes, that's right.
Anthony,
Thanks for very informative post.
I'm trying to understand the details of your explanation in the "Double-checked locking returns" section and interpolate it to a sort of Singleton application. AFAIU, you suggest the double-checked locking should basically consist of three elements: atomic read, lock guard and atomic load within the guarded scope. Is that correct?
Based on your get_data function, here is an attempt to replicate the equivalent of the get_data() function above, but based on the plain Win32 API:
expensive_data* get_single_data()
{
    static expensive_data* data;
    if (!::InterlockedCompareExchangePointer(reinterpret_cast<LPVOID*>(&data), nullptr, nullptr))
    {
        std::lock_guard<std::mutex> lk(m); // Win32 API equivalent here
        if(!data)
        {
            expensive_data* tdata = new expensive_data();
            //tdata ... fill expensive data
            if (::InterlockedCompareExchangePointer(reinterpret_cast<LPVOID*>(&data), tdata, nullptr))
                delete tdata;
        }
    }
    return data;
}
Would you consider it a correct and equivalent implementation of the double-checked locking variant explained above? (Let's forget about the obvious RAII issue here.)
Mat