
Posted at 2014-06-06 13:27
Tags: atomic fridayqna gcd threading
Friday Q&A 2014-06-06: Secrets of dispatch_once
by Mike Ash  

Reader Paul Kim pointed out an entry on Michael Tsai's blog about making dispatch_once fast. While the comment in the dispatch_once source code is fascinating and informative, it doesn't quite delve into the detail that some would like to see. Since this is one of my favorite hacks, for today's article I'm going to discuss exactly what's going on there and how it all works.

The API
The behavior of dispatch_once is in the name. It does something once and only once.

It takes two parameters. The first is a predicate that tracks the "once". The second is a block to execute on the first call. Calls look like this:

    static dispatch_once_t predicate;
    dispatch_once(&predicate, ^{
        // some one-time task
    });

It's a great API for lazily initializing shared state, whether it's a global dictionary, a singleton instance, a cache, or anything else that needs a bit of setup the first time through.
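
For example, a shared-instance accessor for a hypothetical MyManager class might look like this (a sketch of the usual singleton pattern, not code from any particular project):

    + (instancetype)sharedManager {
        static MyManager *sharedManager;
        static dispatch_once_t predicate;
        dispatch_once(&predicate, ^{
            sharedManager = [[MyManager alloc] init];
        });
        return sharedManager;
    }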

In a single-threaded world, this call would be kind of boring and could be replaced with a simple if statement. However, we live in a multithreaded world, and dispatch_once is thread safe. It's guaranteed that multiple simultaneous calls to dispatch_once from multiple threads will only execute the block once, and all threads will wait until execution is complete before dispatch_once returns. Even that is not too hard to accomplish on your own, but dispatch_once is also extremely fast, and that is really tough to pull off.

Single-Threaded Implementation
Let's first look at a simplistic single-threaded implementation of the function. This is just about useless in a practical sense, but it's good to have a concrete view of the semantics. Note that dispatch_once_t is just a typedef for long, initialized to zero, with the meaning of other values left up to the implementation. Here's the function:

    void SimpleOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        if(!*predicate) {
            block();
            *predicate = 1;
        }
    }

The implementation is easy: if the predicate contains zero, call the block and set it to one. Subsequent calls will see the one and not call the block again. Exactly what we want, aside from being completely unsafe in a multithreaded environment. Two threads could hit the if statement simultaneously, causing them both to call the block, and that's bad. Unfortunately, as is often the case, making this code thread safe means a substantial performance hit.

Performance
When talking about the performance of dispatch_once, there are really three different scenarios to consider:

  1. The first ever call to dispatch_once with a given predicate, which executes the block.
  2. Calls to dispatch_once after the first call, but before the block finishes executing. Here, callers have to wait for the block before they proceed.
  3. Calls to dispatch_once after the first call and after the block has executed. No waiting is required and they can immediately proceed.

The performance of scenario #1 is largely unimportant, as long as it's not absurdly slow. It only happens once, after all.

The performance of scenario #2 is similarly unimportant. It could happen potentially several times, but it only happens while the block is executing. In most cases, it never happens at all. If it does, it probably only happens once. Even in a torture test where lots of threads call dispatch_once simultaneously and the block takes a long time to execute, the number of calls is still going to be limited to thousands. Those calls all have to wait for the block anyway, so it's not that big of a deal if they burn some unnecessary CPU time while they do it.

The performance of scenario #3 is enormously important. Calls of this nature could happen millions or billions of times as a program executes. We want to be able to use dispatch_once to protect one-time calculations whose results are used all over the place. Ideally, sticking dispatch_once on a computation should cost no more than explicitly doing the computation ahead of time and returning the result from some global variable. In other words, once we hit scenario #3, we really want these two chunks of code to perform identically:

    id gObject;

    void Compute(void) {
        gObject = ...;
    }

    id Fetch(void) {
        return gObject;
    }

    id DispatchFetch(void) {
        static id object;
        static dispatch_once_t predicate;
        dispatch_once(&predicate, ^{
            object = ...;
        });
        return object;
    }

When inlined and optimized, the SimpleOnce function comes close to achieving this. In my testing, on my computer, it takes about half a nanosecond to execute. That, then, will be the gold standard for thread-safe versions.
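
Timing numbers like these can be gathered with a simple loop-and-divide harness. Here's an illustrative sketch using mach_absolute_time and a hypothetical Benchmark helper; it's not the exact code behind the measurements in this article:

    #include <dispatch/dispatch.h>
    #include <mach/mach_time.h>
    #include <stdio.h>

    // Call fn in a tight loop and report the average cost per call in
    // nanoseconds. After the first iteration the predicate is already set,
    // so nearly every call measures scenario #3.
    static void Benchmark(const char *name,
                          void (*fn)(dispatch_once_t *, dispatch_block_t)) {
        enum { kIterations = 100000000 };
        static dispatch_once_t predicate;

        mach_timebase_info_data_t timebase;
        mach_timebase_info(&timebase);

        uint64_t start = mach_absolute_time();
        for(long i = 0; i < kIterations; i++)
            fn(&predicate, ^{});
        uint64_t elapsed = mach_absolute_time() - start;

        double nanoseconds = (double)elapsed * timebase.numer / timebase.denom;
        printf("%s: %f ns/call\n", name, nanoseconds / kIterations);
    }

Calling Benchmark("SimpleOnce", SimpleOnce) then prints the steady-state cost per call.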

Locks
The standard way to make something thread safe is to surround all access to shared data with a lock. It's a little tricky here, because there isn't a good place to put the lock next to the data it's protecting. dispatch_once_t is just a long, and there's no room for a lock.

We could just modify the API to take a structure that contains a lock and a flag. But in the interests of retaining the same signature as dispatch_once, and because this is just illustrative code, I decided to go with a single global lock instead. The code uses a static pthread_mutex_t to protect all accesses to the predicate. In a real program, with many different predicates, this would be a terrible idea, as unrelated predicates would still wait on each other. For a quickie example where I'm only testing with one predicate anyway, it's fine. The code is the same as before, except with a lock around it:

    void LockedOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

        pthread_mutex_lock(&mutex);
        if(!*predicate) {
            block();
            *predicate = 1;
        }
        pthread_mutex_unlock(&mutex);
    }

This code is thread safe, but unfortunately it's much slower. On my computer, it takes about 30 nanoseconds per call, compared to half a nanosecond for the previous version. Locks are pretty fast, but not when nanoseconds count.

Spinlocks
A spinlock is a way to implement a lock that tries to minimize overhead in the lock itself. The name comes from how a spinlock implementation "spins" on the lock when it has to wait, repeatedly polling it to see if it's been released. A normal lock will coordinate with the OS to sleep waiting threads, and wake them up when the lock is released. This saves CPU time, but the additional coordination doesn't come for free. By cutting down on that, a spinlock saves time in the case where the lock isn't held, at the expense of less efficient behavior when multiple threads try to take the lock at the same time.
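
To make the idea concrete, here's a minimal spinlock sketch built on an atomic compare-and-swap (the same primitive discussed in more detail below). Real implementations such as OSSpinLock are smarter about waiting, eventually yielding to the scheduler rather than spinning forever:

    typedef volatile long simple_spinlock_t;

    static void SimpleSpinlockLock(simple_spinlock_t *lock) {
        // Atomically try to change 0 to 1, and "spin" until that succeeds.
        while(!__sync_bool_compare_and_swap(lock, 0, 1))
            ;
    }

    static void SimpleSpinlockUnlock(simple_spinlock_t *lock) {
        // Store 0 back with release semantics so the protected writes are
        // visible before the lock appears free.
        __sync_lock_release(lock);
    }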

OS X provides a convenient API for spinlocks with the OSSpinLock functions. Implementing LockedOnce with OSSpinLock is just a matter of changing out a few names:

    void SpinlockOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        static OSSpinLock lock = OS_SPINLOCK_INIT;

        OSSpinLockLock(&lock);
        if(!*predicate) {
            block();
            *predicate = 1;
        }
        OSSpinLockUnlock(&lock);
    }

This is a considerable improvement, clocking in at about 6.5 nanoseconds per call on my computer, versus 30 nanoseconds per call for the pthread_mutex version. However, it's still well short of the half-nanosecond time of the unsafe version.

Atomic Operations
Atomic operations are low-level CPU operations that are always thread safe even without locks. (Technically, they use locks at the hardware level, but that's an implementation detail.) They are what you use to implement locks in the first place. When locks carry too much overhead, directly using atomic operations can be a way to improve performance. Threaded programming without locks can be extremely tricky, so it's not a good idea to do it unless you really, really need it. In this case, we're talking about an OS library that could get a lot of use, so it probably qualifies.

The building block of atomic operations is the compare and swap. This is a single operation that performs the equivalent of:

    BOOL CompareAndSwap(long *ptr, long testValue, long newValue) {
        if(*ptr == testValue) {
            *ptr = newValue;
            return YES;
        }
        return NO;
    }

In words, it tests to see if a location in memory contains a particular value, and replaces it with a new value if so. It returns whether the value matched. Because compare and swap is implemented as an atomic CPU instruction, you're guaranteed that if multiple threads all try to do the same compare-and-swap on a memory location, only one will succeed.

The implementation strategy for this version of the function is to assign three values to the predicate. 0 means that it's never been touched. 1 indicates that the block is currently being executed, and any callers should wait for it. 2 indicates that the block has completed and all callers can return.

A compare and swap will be used to check for 0 and atomically transition to 1 in that case. If that succeeds, then that thread is the first one to make the call, and it will then run the block. After running the block, it will set the predicate to 2 to indicate that it's done.

If the compare and swap fails, then it will just enter a loop, checking for 2 repeatedly until it succeeds. This will cause it to wait for the other thread to finish executing the block.

The first step is to convert the predicate pointer into a volatile pointer, to convince the compiler that the value could be changed by other threads in the middle of the function:

    void AtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        volatile dispatch_once_t *volatilePredicate = predicate;

Then comes the compare and swap. Gcc and clang both provide various builtin functions beginning with __sync that implement atomic operations. There are also newer ones beginning with __atomic, but for this experiment I stuck with what I knew. This call does an atomic compare and swap on predicate, testing for 0 and setting it to 1 if it matches:

        if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {

The function returns true if the operation succeeded. In this case, that means that the predicate was 0 and this call is the first one. That means that the next task is to call the block:

            block();

Once the block completes, it's time to set the predicate to 2 to indicate that fact to any waiting threads, and to any future callers. However, before that, we need a memory barrier to ensure that everyone sees the proper ordering of reads and writes. More on that in a little bit. The __sync_synchronize builtin function performs a memory barrier:

            __sync_synchronize();

Then the predicate can safely be set:

            *volatilePredicate = 2;
        } else {

If the predicate was not 0, then enter a loop to wait for it to become 2. If it's already 2, this loop will terminate immediately. If it's 1, then it will sit on the loop, constantly retesting the predicate value, until it becomes 2:

            while(*volatilePredicate != 2)
                ;

Before returning, there needs to be a memory barrier here as well, to match the barrier above. Again, more on this in a little bit:

            __sync_synchronize();
        }
    }
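
Putting the pieces together, the whole function reads:

    void AtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        volatile dispatch_once_t *volatilePredicate = predicate;

        if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
            block();
            __sync_synchronize();
            *volatilePredicate = 2;
        } else {
            while(*volatilePredicate != 2)
                ;
            __sync_synchronize();
        }
    }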

This works, and should be safe. (Lockless threaded code is tricky enough that I don't want to declare that outright without more thought than I've put into it already. But this is a fairly simple scenario, and in any case we're not actually trying to build something production-worthy here.)

How's performance? Turns out, not so good. On my computer, it takes about 20 nanoseconds per call, quite a bit higher than the spinlock version.

Early Bailout
There's an obvious optimization that can be applied to this code. The common case is when predicate contains 2, but the code first tests for 0. By first testing for 2 and bailing out early, the common case can be made faster. The code for this is simple: add a check for 2 at the top of the function, and if it succeeds, return after a memory barrier:

    void EarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        if(*predicate == 2) {
            __sync_synchronize();
            return;
        }

        volatile dispatch_once_t *volatilePredicate = predicate;

        if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
            block();
            __sync_synchronize();
            *volatilePredicate = 2;
        } else {
            while(*volatilePredicate != 2)
                ;
            __sync_synchronize();
        }
    }

This is a decent improvement over the first version of the code, at about 11.5 nanoseconds per call. However, this is still much slower than the goal of half a nanosecond, and it's even slower than the spinlock code.

Memory barriers aren't free, and that explains why this code is so much slower than the goal. As for why it's slower than spinlocks, there are different kinds of memory barriers available. __sync_synchronize generates an mfence instruction, which is the most paranoid one possible, working with such exotica as SSE4 streaming reads/writes, while OSSpinLock sticks with a cheaper one suitable for normal code. We could fiddle around with the precise barrier used in this code to get better performance, but it's apparent that the cost is still going to be higher than we want, so I'll skip over that.
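
To give a flavor of what that fiddling might look like, the newer __atomic builtins mentioned earlier let you ask for paired acquire/release ordering instead of a full mfence. This is a sketch only, not benchmarked here; and on weakly ordered hardware the acquire loads still carry a cost, which is why the real solution described below does away with the read-side barrier entirely:

    void AcquireReleaseOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        // Fast path: an acquire load pairs with the release store below.
        if(__atomic_load_n(predicate, __ATOMIC_ACQUIRE) == 2)
            return;

        long expected = 0;
        if(__atomic_compare_exchange_n(predicate, &expected, 1,
                                       0 /* strong CAS */,
                                       __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
            block();
            // Publish the block's writes before marking the predicate done.
            __atomic_store_n(predicate, 2, __ATOMIC_RELEASE);
        } else {
            while(__atomic_load_n(predicate, __ATOMIC_ACQUIRE) != 2)
                ;
        }
    }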

Unsafe Early Bailout
Let's look at one more version of the code. It's identical to the previous one, except it gets rid of the memory barriers:

    void UnsafeEarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
        if(*predicate == 2)
            return;

        volatile dispatch_once_t *volatilePredicate = predicate;

        if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
            block();
            *volatilePredicate = 2;
        } else {
            while(*volatilePredicate != 2)
                ;
        }
    }

Unsurprisingly, this performs just as fast as SimpleOnce at about half a nanosecond per call. Since *predicate == 2 is by far the common case, nearly every call just performs that check and returns. This performs the same amount of work as SimpleOnce in the case where the block has already been called.

However, as the name indicates, the lack of memory barriers makes it unsafe. Why?

Branch Prediction, Out of Order Execution, and You
We imagine that our CPUs are simple machines. We tell them to do one thing, and they do it. Then we tell them to do the next thing, and they do it. This repeats until we get bored or the power goes out.

Once upon a time, this was actually true. Old CPUs really were simple machines that worked exactly like this. They fetched an instruction, and they executed it. Then they fetched the next instruction, and they executed it.

Unfortunately, while this approach is simple, cheap, and easy, it's also not very fast. Moore's Law has blessed us with an ever-growing number of transistors to pile into CPUs. The 8086 was built from about 29,000 transistors. An Intel Haswell-architecture CPU contains well over a billion transistors.

The market, desiring better computer games (as well as a few people wanting to more quickly perform various boring business tasks that aren't really worth mentioning) demanded that CPU makers use these advances for better performance. The modern CPU is the product of decades of work to turn more transistors into faster computers.

There are a lot of tricks that go into making CPUs faster. One of those is pipelining. When you look at what goes into executing a single CPU instruction, there are a lot of little steps:

    1. Load the instruction from memory.
    2. Decode the instruction. (That is, figure out which instruction it is, and figure out what the operands are.)
    3. Load the input data.
    4. Perform the operation on the input data.
    5. Save the output.

On an early CPU, the sequence of work would look something like:

    load instruction
    decode
    load data
    operation
    save data
    load instruction
    decode
    load data
    operation
    save data
    ...

But as long as you have the resources for it, you can actually perform many of these actions in parallel. On a pipelined CPU, the sequence of work can end up looking more like this:

    load instruction
    decode                load instruction
    load data             decode                load instruction
    operation             load data             decode
    save data             operation             load data
                          save data             operation
                                                save data

This is a lot faster! With enough transistors, things can become even more complex, with many instructions executing in parallel simultaneously.

Going even beyond this, instructions can be executed completely out of order if it looks like this will speed things up. Unlike the simplified example above, real-world instructions tend to require more steps and there is a lot of variance in how many steps they take. For example, reads and writes to main memory can take an enormous amount of time. By executing other work that comes afterwards in the instruction stream, the CPU can engage in productive work instead of sitting idle. Because of this, the CPU may find it advantageous to issue memory reads and writes out of order from how the corresponding instructions appear in the instruction stream.

The end result of all this stuff is that your code doesn't always run in the order it looks like it runs in. If you write:

    x = 1;
    y = 2;

Your CPU can perform the write to y first. The compiler can also reorder statements like this in some cases, but even if you eliminate that, the CPU can still do it. If you have more than one CPU in your system (as we pretty much always do these days) then other CPUs will see these out of order writes. Even if the writes are performed in order, other CPUs can perform out of order reads. Put it all together, and another thread that's reading x and y can see y = 2 while still seeing the old value for x.

Sometimes you absolutely need these values to be written and read in the proper order, and that's where memory barriers come in. Adding a memory barrier to the above code ensures that x is written first:

    x = 1;
    memory_barrier();
    y = 2;

Similarly, a barrier when reading ensures that reads are performed in the proper order:

    use(x);
    memory_barrier();
    use(y);

However, because the memory barrier's entire purpose in life is defeating the CPU's attempts to go faster, there is an inherent performance penalty.
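
Putting both sides together, a hypothetical producer and consumer publishing data through a flag would pair the barriers like this:

    int data;
    volatile int ready;

    void Producer(void) {
        data = 42;
        __sync_synchronize();   // write barrier: data is written before ready
        ready = 1;
    }

    void Consumer(void) {
        while(!ready)
            ;
        __sync_synchronize();   // read barrier: data is not read before ready
        // data is guaranteed to be 42 here
    }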

This comes into play with dispatch_once and lazy initialization because there are multiple reads and writes done in sequence and their order is extremely important. For example, the typical lazy object initialization pattern looks like:

    static SomeClass *obj;
    static dispatch_once_t predicate;
    dispatch_once(&predicate, ^{ obj = [[SomeClass alloc] init]; });
    [obj doStuff];

If obj was read before predicate, then it's possible that this code could read it when it still contains nil, right before another thread writes the final value out to the variable and sets predicate. This code could then read predicate as indicating that the job is done, and thus proceed with using the uninitialized nil. It's even conceivable that this code could read the correct value for obj but read uninitialized values from the memory allocated for the object, causing a crash when trying to send doStuff.

Thus, dispatch_once needs a memory barrier. But as we've seen, memory barriers are relatively slow, and we don't want to pay that price in the common case if we can avoid it.

Branch Prediction and Speculative Execution
The pipelined, out-of-order model works great for a linear sequence of instructions, but trouble arises for conditional branches. The CPU doesn't know where to start fetching more instructions until the branch condition can be evaluated, which typically depends on the immediately preceding instructions. The CPU has to stop, wait for the previous work to finish, then evaluate the branch condition and resume execution. This is called a pipeline stall, and can cause a significant performance penalty.

In order to compensate for that, CPUs engage in speculative execution. When they see a conditional branch, they make a guess as to which way they think the branch will go. Modern CPUs have sophisticated branch prediction hardware that can typically guess correctly over 90% of the time. They begin executing instructions from the guessed branch, instead of just waiting for the branch condition to be evaluated. If it turns out that the guess was correct, then it just keeps going. If the guess was wrong, it throws away all of the speculative execution and restarts on the other branch.

It is specifically this scenario that's at play on the read side of dispatch_once, the side that we want to make as fast as possible. There is a conditional branch on the value of predicate. The CPU should predict that the branch will take the common path, which bypasses running the block and immediately returns. During speculative execution of this branch, the CPU might load subsequent values from memory before they have been initialized by another thread. If that guess ended up being correct, then it commits the speculative execution which used uninitialized values.

Unbalancing Barriers
Memory barriers generally need to come in pairs: one on the write side to ensure that writes are done in the proper order, and one on the read side to ensure that reads are done in the proper order. However, our performance needs in this particular case are completely asymmetrical. We can tolerate a huge slowdown on the write side, but we want the read side to be as fast as possible.

The trick is to defeat the speculative execution that causes the problem. When the branch prediction is incorrect, the results of the speculative execution are discarded, and that means discarding the potentially uninitialized values in memory. If dispatch_once can force a branch mispredict after the initialized values become visible to all CPUs, the problem is avoided.

There's a critical interval of time between the earliest possible speculative memory read after the conditional branch on predicate and when the conditional branch is actually evaluated. The exact length of that interval will depend on the design of the specific CPU, but it will generally be in the low dozens of CPU cycles at most.

If the write side waits at least that much time between writing out the initialized values and writing the final value to predicate, all is well. It gets a bit tricky to ensure, though, since all of that crazy out-of-order execution comes into play once again.

On Intel CPUs, dispatch_once abuses the cpuid instruction for this purpose. The cpuid instruction exists to get information about the identity and capabilities of the CPU, but it also forcibly serializes the instruction stream and takes a decent amount of time to execute, hundreds of cycles on some CPU models, which is enough to do the job.
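
To give a sense of what that looks like (this is an illustration with a made-up name, not the actual libdispatch macro), a cpuid-based barrier can be written with a bit of inline assembly:

    // Execute cpuid purely for its serializing side effect. The constraints
    // cover the registers cpuid writes, plus a "memory" clobber so the
    // compiler doesn't reorder memory accesses across it.
    static inline void CpuidBarrier(void) {
        unsigned int eax = 0, ebx, ecx, edx;
        __asm__ __volatile__("cpuid"
                             : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                             :
                             : "memory");
    }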

If you look at the implementation of dispatch_once, you'll see no barriers whatsoever on the read side:

    DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
    void
    _dispatch_once(dispatch_once_t *predicate, dispatch_block_t block)
    {
        if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
            dispatch_once(predicate, block);
        }
    }
    #define dispatch_once _dispatch_once

This is in the header, and always inlined into the caller. DISPATCH_EXPECT is a macro that tells the compiler to emit code that tells the CPU that the branch where *predicate is ~0l is the more likely path. This can improve the success of branch prediction, and thus improve performance. Ultimately, this is just a plain if statement with no barriers of any kind. Performance testing bears this out: calls to the real dispatch_once match the 0.5ns goal.
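
For the curious, DISPATCH_EXPECT is built on the compiler's branch-prediction hint. A simplified sketch of such a macro (the real one lives in libdispatch's headers) wraps gcc and clang's __builtin_expect:

    // Tell the compiler which value the expression is expected to have, so it
    // can lay out the likely path as the straight-line fall-through case.
    #define EXPECT_SKETCH(x, v) __builtin_expect((long)(x), (long)(v))

    // The fast path above then reads roughly as:
    //     if (EXPECT_SKETCH(*predicate, ~0l) != ~0l) { /* slow path */ }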

The write side can be found in the implementation. After calling the block and before doing anything else, dispatch_once uses this macro:

    dispatch_atomic_maximally_synchronizing_barrier();

On Intel, the macro generates a cpuid instruction, and it generates the appropriate assembly when targeting other CPU architectures.

Conclusion
Multithreading is a strange and complicated place, made all the more so by modern CPU designs which can do many things behind your back. Memory barriers allow you to inform the hardware when you really need things to happen in a certain order, but they come at a cost. The unique requirements of dispatch_once allow it to work around the problem in an unconventional way. By waiting a sufficiently long time between the relevant memory writes, it ensures that readers always see a consistent picture without needing to pay the cost of a memory barrier on every access.

That's it for today. Come back next time for more exciting stuff, probably about Swift since that's all the talk these days. If you have a topic you'd like to see covered here, and it's something besides "talk about Swift" (which is a fine topic but has sufficient requests at this point), please send it in!


Comments:

Everyone's writing about Swift (me included), but I have yet to see something insightful and competent (again, me included). Since it's even in the name a performance analysis would be really interesting. So far all I've seen are amateurish micro benchmarks that don't tell us anything.

Looking forward to next week!
How does this deal with cpu caches? I assume that's the reason why Greg Parker says that the dispatch_once_t cannot ever have been non-zero. My assumption is that if you ensure it was never non-zero (by always using a local or global static), then the cpus will never have a non-zero value cached, so it can't ever think the value has been initialized when it hasn't.

I'd also like to know if OSMemoryBarrier() at the end of -init is sufficient to allow an instance variable to contain a dispatch_once_t after all. I was told that a long time ago by Andy Matuschak, but Greg's SO answer/comments suggests that's not true. Do you happen to know the answer here? Assuming that you don't hand out your object to another thread inside of -init, the memory barrier there means that the zero value has been flushed before the object ever becomes visible to another thread. And making objects visible to other threads generally involves some form of synchronization as well, which means the other thread should then correctly read the dispatch_once_t value as zero even if it had previously seen that memory location as non-zero in the past. Given this, it seems like it should work. Of course, if you vend your object via a non-synchronizing mechanism (such as a static variable) then all bets are off, but that's a bad idea anyway.
@KevinBallard I delved into this in some detail a while back, over on StackOverflow: http://stackoverflow.com/questions/19832150/using-dispatch-once-t-per-object-and-not-per-class/19835392#19835392

My understanding/belief is that yes, a well-placed memory barrier should allow the use of dispatch_once for ivars (and have gotten assenting reviews from some pretty smart folks.) Naturally, some dude on SO disagrees with me, and asks Greg Parker to come by and school me, but he never does, and so in the absence of hard evidence to the contrary, I remain satisfied with my analysis, and your thinking seems to align as well.

I suspect that the more relevant question here is whether the performance impact of having to use the memory barrier negates the performance advantage of using dispatch_once (over one of the other, slower mechanisms you could use to achieve the same effect.) Naturally, the answer to that will be situation dependent: how many of these objects are you going to create? How many times, per object, is the dispatch_once (or other implementation) going to be queried? And so on.

In the end, if nothing else, using dispatch_once in ivars is a 'controversial' pattern, and one which the docs are on record saying is unsupported/fraught with peril. I would have to have solid empirical evidence that this was the right thing to do for a given situation before I did it again.
Hi, re. that time-critical check for *predicate == 2, instead of prefixing with:
dispatch_atomic_maximally_synchronizing_barrier();

couldn't you just do:
CompareAndSwap(*predicate,2,2)

I.e. a seemingly meaningless swap, a "no-op", but wouldn't that also stop the CPU from reorder/optimize, because it cannot cheat, it has to fulfill "the semantic contract" of the compare-and-swap.

If that worked it would perhaps be regarded as less fishy than the CPUID instruction :-)
With your first lock example, I was expecting that you would write something like the following instead, attempting to only lock on the slow path:

void LockedOnce(dispatch_once_t *predicate, dispatch_block_t block) {
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    if (!*predicate) {
        pthread_mutex_lock(&mutex);
        if (!*predicate) {
            block();
            memory_barrier();
            *predicate = 1;
        }
        pthread_mutex_unlock(&mutex);
    }
}

Is this slow and/or susceptible to the reordering issues (or otherwise broken, for that matter)?
I don't think the cpuid is actually required, but feel free to tell me that I'm wrong.

The comment in dispatch_once talks about systems with weakly ordered memory models like ARM. But x86 has a strongly ordered model: among other things, "any two stores are seen in a consistent order by processors other than those performing the stores" [Intel]. After calling dispatch_once, I expect to see any writes performed by the once routine, but once the store to the dispatch_once_t flag has been observed, this is already guaranteed.

The main reordering pitfall in x86 is that reads may be reordered with earlier writes to different locations, but I don't see how that would cause a problem here.
@Geoff Bailey: if you test the value before engaging the lock, you are not guaranteed to see consistent memory; only the lock makes the outer IF test valid. The lock has an implicit barrier, so to speak.

And you might be tempted to think that this construction


if (!TEST) {
 // dirty read
 GRAB_LOCK();
 if (!TEST) {
   // synchronized read
 }
 RELEASE_LOCK();
}


works, because once the global value has been communicated to all cores, the IF test will always behave correctly. But now you begin running into compiler and code re-ordering issues, where the result of the test may be remembered and the locked code will not actually perform a second read. So this idiom will not be safe across compilers and architectures either.
Nice write-up, thanks. It's amazing the efforts we'll collectively go through to avoid high-latency atomics or memory fence instructions. The technique reminds me a bit of delay-based quiescent consistency. See also http://blogs.oracle.com/dave/entry/qpi_quiescence and https://blogs.oracle.com/dave/resource/Asymmetric-Dekker-Synchronization-140215.txt.

@Canonical Chris: Thank you for your response; I see that I carelessly left out the volatile cast in the code I included. Does your objection still apply if predicate has been made volatile?
Hi Mike --

Great article and explanations. I'd appreciate if you (or someone else) could address the comment by 'comex': is CPUID actually necessary for proper execution on X86? If it wasn't there, what exactly is the problem that would occur, and under what circumstance? Without taking anything away from your article, I've yet to understand what the particular hazard is, and exactly how it is being avoided.
I've been taking a while to formulate a response to the CPUID question, because it's a tough one.

I think the reason may be that strong ordering is no longer guaranteed on x86. It can be, if you set it up to be, but you can also set it up to have weaker guarantees (and potentially better performance). Newer x86 CPUs (P6/PIII era) have ways to control the CPU's ordering guarantees. Memory Type Range Registers allow different regions of physical RAM to be given different ordering characteristics, and Page Attribute Tables allow controlling this on a per-page basis. Intel's manual says:

Intel recommends that software written to run on Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors assume the processor-ordering model or a weaker memory-ordering model. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a strong memory-ordering model, except when using the UC memory type. Despite the fact that Pentium 4, Intel Xeon, and P6 family processors support processor ordering, Intel does not guarantee that future processors will support this model. To make software portable to future processors, it is recommended that operating systems provide critical region and resource control constructs and API’s (application program interfaces) based on I/O, locking, and/or serializing instructions be used to synchronize access to shared areas of memory in multiple-processor systems. Also, software should not depend on processor ordering in situations where the system hardware does not support this memory-ordering model.


I doubt that these capabilities are actually used to weaken the memory ordering constraints on userland-accessible RAM, but it seems like enough to justify some paranoia.
@Geoff Bailey:

Your implementation is missing a memory_barrier() after the first check. Without that, there's no guarantee you'll see a value stored by another core, so you may end up doing the initialization twice. Also, a full barrier can be very slow on some platforms (including older ARM, and possibly upcoming x64); you want to use a release-only barrier after the store, and an acquire-only barrier after the read.

@Canonical Chris: What you're saying is either out of date, or not quite accurate. The reason your check-lock-check doesn't work is that it's missing the fences.

Of course C and POSIX don't guarantee the presence of appropriate fence instructions, but both clang and gcc on x86/x64 and ARM do, so as long as you're not trying to write portable pthreads code, you can use check-lock-check in C.

Of course in C++11, conforming platforms are required to support the appropriate fences (although you also have to declare the tested variable atomic, even though you can read and write it in relaxed order, to prevent compiler reordering), so you can do it portably. Plus, static initializers are guaranteed to be once-protected, so you often don't even need check-lock-checked. And Java 5.0+ has similar guarantees, as does .NET 4.0.)

Unfortunately, C doesn't, and as far as I know neither do ObjC or Swift (not that anyone cares about portability in those languages). But there is a way to do it safely on every platform you're likely to ever want to port to, and there are libraries that wrap things up for you. (Windows would be tricky, except for the fact that Microsoft guarantees sufficient semantics for volatile if you're using native Win32 API threads.)
I would spin on the atomic op, not using a pseudo copy then yield ; I am sure this is faster relinquishing than trying to put or save or not save barriers infinite debate ; then do a weak test on it;


Related to DCLP I always have fun reading this article: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
After reading this post I went over to apple open source to read the dispatch_once implementation for myself. I think I managed to wrap my head around one of the older implementations http://opensource.apple.com/source/libdispatch/libdispatch-228.23/src/once.c. I then read one of the more recent implementations http://opensource.apple.com/source/libdispatch/libdispatch-442.1.4/src/once.c. One of the changes in the new implementation is that threads in the else block now read out *val once before the for loop, as opposed to at the beginning of each iteration. This change confuses me, because as far as I can tell it introduces a potential bug. Consider the following thread interleaving. Assume thread 1 succeeds at the first dispatch_atomic_cmpxchg and enters the if block, while thread 2 enters the else.


dispatch_atomic_maximally_synchronizing_barrier(); // Thread 1
next = *vval; // Thread 2 - reads out pointer to dow on Thread 1's stack
next = dispatch_atomic_xchg(vval, DISPATCH_ONCE_DONE, relaxed); // Thread 1
while (next != tail) // Thread 1 - does not enter the loop
if (dispatch_atomic_cmpxchgvw(vval, next, tail, &next, release)) // Thread 2
// this fails because next points to the dow from Thread 1, but now *vval holds DISPATCH_ONCE_DONE
// Thread 2 loops forever


I would greatly appreciate it if someone can help me figure out the error in my reasoning. Thanks for any help!
Any thoughts on how dispatch_once works when called from Swift? For example, are static predicates still required to be sure that they are non-zero and not cached? Or does the Swift compiler help with that at all?
