Friday Q&A 2015-05-29: Concurrent Memory Deallocation in the Objective-C Runtime
by Mike Ash  

The Objective-C runtime is at the heart of much Mac and iOS code. At the heart of the runtime is the objc_msgSend function, and the heart of that is the method cache. Today I'm going to explore how Apple manages resizing and deallocating method cache memory in a thread safe manner without impacting performance, using a technique you probably won't find in textbooks discussing thread safety.

Message Sending in Concept
objc_msgSend works by looking up the appropriate method implementation for the method being sent, and then jumping to it. Conceptually, looking up the method works like this:

    IMP lookUp(id obj, SEL selector) {
        Class c = object_getClass(obj);

        while(c) {
            for(int i = 0; i < c->numMethods; i++) {
                Method m = c->methods[i];
                if(m.selector == selector) {
                    return m.imp;
                }
            }

            c = c->superclass;
        }

        return _objc_msgForward;
    }

Some names have been changed to protect the innocent. If you're interested in seeing the real code, check out the Objective-C runtime source code:

http://www.opensource.apple.com/source/objc4/

Method Cache
Most Objective-C code sends messages all over the place. If the full search were performed for every message, it would be unbelievably slow.

The solution to this is a cache. Each class has a hash table attached to it which maps selectors to method implementations. The hash table is built for maximum read efficiency, and objc_msgSend uses carefully tuned assembly language code to perform the hash table lookup quickly. This gets a message send in the cached case down to single-digit nanoseconds. The first use of any given message is still unbelievably slow, but after that it's fast.
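
To make this concrete, here's a rough sketch of what such a selector-to-IMP hash table might look like. This is my own illustration, not the runtime's actual code; the real bucket layout and lookup live in objc-cache.mm and the hand-tuned assembly, and the names here are assumptions:

    #include <objc/objc.h>
    #include <stdint.h>

    typedef struct {
        SEL sel; // the selector, used directly as the hash key
        IMP imp; // the cached method implementation
    } bucket_t;

    typedef struct {
        bucket_t *buckets; // power-of-two sized array
        uintptr_t mask;    // capacity - 1, used to wrap indexes
    } cache_t;

    IMP cache_lookup(cache_t *cache, SEL sel) {
        // Hash the selector by its pointer value, then probe linearly.
        // This assumes the table always has at least one empty slot,
        // which the resizing policy guarantees.
        for(uintptr_t i = (uintptr_t)sel & cache->mask;
            cache->buckets[i].sel != NULL;
            i = (i + 1) & cache->mask) {
            if(cache->buckets[i].sel == sel)
                return cache->buckets[i].imp;
        }
        return NULL; // miss: fall back to the full method search
    }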

When we think of a cache, it's usually something with a limited size that's intended to speed up multiple accesses to recently used resources. For example, you might cache images that you load from the internet so that two fetches in quick succession don't hit the network twice. You don't want to use too much memory, though, so you might cap the number of images you keep in the cache at any given time, and throw away the oldest image when a new one comes in after it fills up.

This is a fine approach for many problems but it can have unfortunate performance implications. For example, if you set your image cache to store 40 images, and you run into a case where your application is constantly cycling through 41 images, your cache suddenly becomes completely useless.

For our own apps we can test and tune the caches to avoid this, but the Objective-C runtime doesn't have this option. Because the method cache is so critical to performance, and because each entry is relatively small, the runtime doesn't impose any size limit on the caches, and expands them as necessary to cache all messages that have been sent.

Note that the caches do sometimes get flushed; any time something happens that might cause the cached data to become stale, such as loading new code into the process or modifying a class's method lists, the appropriate caches are destroyed and allowed to refill.

Resizing, Deallocation, and Threads
Resizing the cache is pretty simple in concept. It looks something like:

    bucket_t *newCache = malloc(newSize);
    copyEntries(newCache, class->cache);
    free(class->cache);
    class->cache = newCache;

The Objective-C runtime actually takes a small shortcut here: it doesn't even copy the old entries into the new cache! It's just a cache, after all, and there's no requirement to preserve the data it contains. Entries refill as messages are sent. So it's really just:

    free(class->cache);
    class->cache = malloc(newSize);

In a single-threaded environment, this would be all you need, and this article would be short. But of course the Objective-C runtime has to support multithreaded code, and that means that all of this code has to be thread safe. Any given class's cache can be accessed simultaneously from multiple threads, so this code has to take care to ensure that it tolerates that scenario.

As written here, it won't. There's a window of opportunity after freeing the old cache and before assigning the new cache where another thread might access an invalid cache pointer. This could cause it to see garbage data, or even just crash immediately if the underlying memory was unmapped.
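
To illustrate the window, here's one possible interleaving (illustrative only, not real code):

    // Thread A (resizing)              Thread B (sending a message)
    //
    // free(class->cache);
    //                                  bucket_t *c = class->cache; // dangling!
    //                                  lookUpCache(c);             // use-after-free
    // class->cache = malloc(newSize);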

How can we solve this problem? The typical approach to protecting shared data like this is to use a lock. The code would then look like:

    lock(class->lock);
    free(class->cache);
    class->cache = malloc(newSize);
    unlock(class->lock);

All accesses must be gated by the lock, including reads, for this to work. That means that objc_msgSend would have to acquire the lock, look in the cache, and release the lock. Acquiring and releasing the lock each time would add a lot of overhead, considering that the cache lookup itself only takes a few nanoseconds. The performance impact is just too high.
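
For illustration, the read side in objc_msgSend would then have to look something like this sketch:

    lock(class->lock);
    IMP imp = lookUpCache(class->cache);
    unlock(class->lock);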

We might try to close the window in some other way. For example, what if we allocated and assigned the new cache first, and then deallocated the old cache?

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    free(oldCache);

This helps, but it doesn't solve the problem. Another thread might retrieve the old cache pointer, then get preempted by the system before it can access the contents. The old cache could then be destroyed before the other thread runs again, causing the same problems as before.

What if we put in a delay? Something like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    after(5 /* seconds */, ^{
        free(oldCache);
    });

This is almost certain to work. But it's still conceivable that a thread might get preempted at just the right moment and stay preempted for long enough that the five-second delay fires first. This makes the crash extremely unlikely, but doesn't completely eliminate it.

Rather than an arbitrary delay, how about waiting until the window is surely clear? Let's add a counter to objc_msgSend so that it looks something like:

    gInMsgSend++;
    lookUpCache(class->cache);
    gInMsgSend--;

A proper thread safe version would need to use atomics for the counter and appropriate memory barriers to make sure the dependent loads/stores show up properly. For the purposes of this article, just imagine that stuff is there.
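
For the curious, here's a sketch of what the reader's side might look like with C11 atomics. This is my illustration using the article's pseudocode names, not the runtime's code (the real thing is written in assembly):

    #include <stdatomic.h>

    static atomic_uint gInMsgSend;

    // Conceptual reader side. The default sequentially consistent
    // ordering keeps the counter update ordered against the subsequent
    // read of the cache pointer; weaker orderings would need much more
    // careful analysis.
    atomic_fetch_add(&gInMsgSend, 1);
    lookUpCache(class->cache);
    atomic_fetch_sub(&gInMsgSend, 1);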

With the counter, cache reallocation would look like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    while(gInMsgSend)
        ; // spin
    free(oldCache);

Note that there is no need to block execution of objc_msgSend for this to work properly. Once the cache free code is sure that nothing is in objc_msgSend at any particular moment after it has replaced the cache pointer, it can go ahead and free the old one. Another thread might call out to objc_msgSend while the old cache pointer is being deallocated, but this new call can't possibly see the old pointer anymore, so it's safe.
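
A timeline makes the reasoning clearer (illustrative only):

    // Resizer: class->cache = newCache;  // old pointer now unreachable
    // Resizer: observes gInMsgSend == 0  // no send was in flight
    // Sender:  reads class->cache        // can only see newCache
    // Resizer: free(oldCache);           // safe: nobody holds oldCache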

Spinning is inefficient and inelegant. It's not particularly urgent to free these caches. It's nice to deallocate the memory, but it's not terrible if it takes some time. Rather than spinning, let's keep a list of unfreed caches, and each time something is freed, try to clear everything that's pending:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!gInMsgSend) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }

If a message send is in progress then this won't immediately free the old cache, but that's not a problem. The next time through it will be cleared, or the time after that, or at some point in the future.

This version is pretty close to how the Objective-C runtime actually does it.

Zero-Cost Flags
There's an extreme asymmetry here between the two interacting parts. The objc_msgSend side runs potentially millions of times each second and really needs to be as fast as possible. The best case running time for a single call is just a few nanoseconds. On the other hand, resizing the cache is a rare operation that will typically get less and less common as an app continues to run. Once the app reaches a steady state, no longer loading new code or editing method lists and with the caches as big as they need to be, it'll never happen. Before that, it may happen some hundreds or thousands of times as the caches grow to the size they need, but it's extremely rare in comparison to objc_msgSend and vastly less performance sensitive.

Because of this asymmetry, it's best to put as little as possible on the message send side, even if it makes the cache freeing part much slower. Shaving off one CPU cycle in objc_msgSend at the cost of a million CPU cycles in each cache free operation is a net win, by a huge margin.

Even a global counter is too costly. That's two additional memory accesses within objc_msgSend, which would still add a great deal of overhead. They would need to be atomic and use memory barriers, which makes it even worse. Fortunately, the Objective-C runtime has a technique for reducing the cost on the objc_msgSend side to zero, at the expense of making the cache free code much slower.

The purpose of the hypothetical global counter is to track when any thread is within a particular region of code. Threads already have something that tracks what code they're currently running: the program counter. This is the CPU register which tracks the memory address of the current instruction. Instead of a global counter, we could check each thread's program counter to see if it's within objc_msgSend. If all threads are outside, then it's safe to free the old caches. Here's what that implementation would look like:

    BOOL ThreadsInMsgSend(void) {
        for(thread in GetAllThreads()) {
            uintptr_t pc = thread.GetPC();
            if(pc >= objc_msgSend_startAddress && pc <= objc_msgSend_endAddress) {
                return YES;
            }
        }
        return NO;
    }

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!ThreadsInMsgSend()) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }

Then objc_msgSend doesn't have to do anything special at all. It can access the caches directly without worrying about flagging that access. It just does:

    lookUpCache(class->cache);

The cache free code is pretty inefficient because it needs to examine the state of every thread in the process. But objc_msgSend is as efficient as it would be if it were written for a single-threaded environment, and that's a tradeoff well worth making. This is ultimately how Apple's runtime code works.

The Real Code
Apple's implementation of this technique can be found in the runtime function _collecting_in_critical located in objc-cache.mm.

The critical PC locations are stored in global variables:

    OBJC_EXPORT uintptr_t objc_entryPoints[];
    OBJC_EXPORT uintptr_t objc_exitPoints[];

There are actually multiple objc_msgSend implementations (for things like struct returns), and the internal cache_getImp function also accesses the cache directly. They all need to be checked in order to safely deallocate caches.

The function itself takes no parameters and returns int, which is just used as a boolean flag to indicate whether any threads are in one of the critical functions or not:

    static int _collecting_in_critical(void)
    {

I'm going to skip over the less interesting bits of code in this function in the interest of concentrating on the best parts. If you want to see the whole thing, take a look at opensource.apple.com.

The APIs for getting thread information lie at the Mach level. task_threads gets a list of all threads in a given task (Mach's term for a process), and this code uses it to get the threads in its own process:

        ret = task_threads(mach_task_self(), &threads, &number);

That returns an array of thread_t values in threads, and the number of threads in number. Then it loops over them:

        for (count = 0; count < number; count++)
        {

Fetching the PC for a thread is done in a separate function, which we'll look at shortly:

            pc = _get_pc_for_thread (threads[count]);

It then loops over the entry and exit points and compares with each one:

            for (region = 0; objc_entryPoints[region] != 0; region++)
            {
                if ((pc >= objc_entryPoints[region]) &&
                    (pc <= objc_exitPoints[region])) 
                {
                    result = TRUE;
                    goto done;
                }
            }
        }

After the loop, it returns the result to the caller:

        return result;
    }
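
One of the skipped bits is the cleanup a real caller of task_threads needs: the kernel allocates the returned thread array in the caller's address space, and each entry is a Mach port right that must be released. Here's a minimal sketch of that bookkeeping (my own illustration, not the runtime's exact code):

    #include <mach/mach.h>

    thread_act_array_t threads;
    mach_msg_type_number_t number;
    kern_return_t ret = task_threads(mach_task_self(), &threads, &number);
    if(ret == KERN_SUCCESS) {
        // ... examine each thread here ...

        // Release the port right for each thread, then free the array,
        // which task_threads allocated in our address space.
        for(mach_msg_type_number_t i = 0; i < number; i++) {
            mach_port_deallocate(mach_task_self(), threads[i]);
        }
        vm_deallocate(mach_task_self(), (vm_address_t)threads,
                      number * sizeof(thread_t));
    }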

How does _get_pc_for_thread work? It's a relatively simple bit of code that calls thread_get_state to get the register state of the target thread. The main reason it's in a separate function is because the register state structures are architecture-specific, since each architecture has different registers. That means this function needs a separate implementation for each supported architecture, although the implementations are almost identical. Here's the implementation for x86-64:

    static uintptr_t _get_pc_for_thread(thread_t thread)
    {
        x86_thread_state64_t            state;
        unsigned int count = x86_THREAD_STATE64_COUNT;
        kern_return_t okay = thread_get_state (thread, x86_THREAD_STATE64, (thread_state_t)&state, &count);
        return (okay == KERN_SUCCESS) ? state.__rip : PC_SENTINEL;
    }

Note that rip is the register name of the PC on x86-64; the R stands for "register," and the IP stands for "instruction pointer."
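
For comparison, an arm64 version would look something like this. This is my sketch based on the structures in mach/arm/thread_status.h, not code quoted from the runtime:

    static uintptr_t _get_pc_for_thread(thread_t thread)
    {
        arm_thread_state64_t state;
        unsigned int count = ARM_THREAD_STATE64_COUNT;
        kern_return_t okay = thread_get_state(thread, ARM_THREAD_STATE64, (thread_state_t)&state, &count);
        // __pc holds the program counter; PC_SENTINEL is the runtime's
        // failure marker, as in the x86-64 version above.
        return (okay == KERN_SUCCESS) ? (uintptr_t)state.__pc : PC_SENTINEL;
    }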

The entry and exit points themselves are defined in the assembly language file where the functions in question are defined. They look like this:

    .private_extern _objc_entryPoints
    _objc_entryPoints:
        .quad   _cache_getImp
        .quad   _objc_msgSend
        .quad   _objc_msgSend_fpret
        .quad   _objc_msgSend_fp2ret
        .quad   _objc_msgSend_stret
        .quad   _objc_msgSendSuper
        .quad   _objc_msgSendSuper_stret
        .quad   _objc_msgSendSuper2
        .quad   _objc_msgSendSuper2_stret
        .quad   0

    .private_extern _objc_exitPoints
    _objc_exitPoints:
        .quad   LExit_cache_getImp
        .quad   LExit_objc_msgSend
        .quad   LExit_objc_msgSend_fpret
        .quad   LExit_objc_msgSend_fp2ret
        .quad   LExit_objc_msgSend_stret
        .quad   LExit_objc_msgSendSuper
        .quad   LExit_objc_msgSendSuper_stret
        .quad   LExit_objc_msgSendSuper2
        .quad   LExit_objc_msgSendSuper2_stret
        .quad   0

_collecting_in_critical is used much like in the hypothetical examples above. It's called before the code that frees leftover cache garbage. The runtime actually has two separate modes: one which leaves the garbage for the next time if other threads are in a critical function, and one which spins in a loop until the coast is clear, and always deallocates the garbage:

    // Synchronize collection with objc_msgSend and other cache readers
    if (!collectALot) {
        if (_collecting_in_critical ()) {
            // objc_msgSend (or other cache reader) is currently looking in
            // the cache and might still be using some garbage.
            if (PrintCaches) {
                _objc_inform ("CACHES: not collecting; "
                              "objc_msgSend in progress");
            }
            return;
        }
    } 
    else {
        // No excuses.
        while (_collecting_in_critical()) 
            ;
    }

    // free garbage here

The first mode, which leaves garbage for the next time, is used for normal cache resizes. The spin mode that always frees garbage is used in the runtime function that flushes all caches for all classes, as this would typically generate a large amount of garbage. As best I can tell from examining the code, this only happens when enabling a debug logging facility that logs all message sends to a file. It flushes caches because the method cache interferes with the logging.

Conclusion
Performance and thread safety are often at odds with each other. Often there is asymmetry in how different parts of code access shared data, which allows more efficient thread safety. A global flag or counter that indicates when a mutating action is unsafe can be one way to exploit this. In the Objective-C runtime, Apple takes this a step further and uses the program counter of each thread as an implicit indication of when a thread is taking unsafe action. This is a specialized case and it's hard to see where else the technique could be useful, but it's fascinating to take apart.

That's it for today. Check back next time for more exciting action. Friday Q&A is driven by reader ideas, so if you have an idea you'd like to see covered here, please send it in!


Comments:

A couple of questions, Mike:

1. Which code calls the long-running cache cleaning code, and on which thread? Where does the code check whether the cache needs to be resized at all, without impacting performance?

2. While _collecting_in_critical is executing, it iterates over the threads. This happens on one thread. Isn't it possible that by the time it gets to a thread at the end of the list, a thread it checked earlier has changed its PC and either entered or exited the critical locations?
Good questions.

1. The cache cleanup code is called whenever a cache is resized or caches are flushed. Cache resize happens in the cache fill code. When objc_msgSend can't find an entry in the cache, it then calls into the runtime's C code which performs a slow lookup and puts the appropriate entry in the cache. If the cache is too full at that point, this code will allocate a larger one, put the older one on the garbage list, and try to free the garbage list. This means any message send is potentially doing all this work. But in normal circumstances it will be rare, because after the first time, the result will be cached.

2. It's entirely possible, but it's not a problem. If the thread in question exits the critical locations after being checked, then you have a false positive. This means that you'll fail to free the garbage when you could have done so safely, but that's not a failure, because the garbage will just stay on the list and get freed whenever collection happens and nothing is in a critical location. If the thread in question enters the critical locations after being checked, it can't possibly see the garbage pointer because that pointer has already been replaced in the class structure, so it's still safe to free the garbage. The only case that can cause problems is if a thread was in a critical location before the cache pointer was replaced, and remains in the critical location after the old pointer is freed.
How do they make this scheme work when ASLR is in effect? Is it simply that the addresses for objc_msgSend* aren't randomized due to dynamic linking or some other effect?
When the code in question is relocated, the values stored in objc_entryPoints and objc_exitPoints will be updated accordingly. It's no different from how you can put a value in a function pointer and that value still works if the target function is relocated when loading.
Thinking about it a bit more: it's easy to get the address of the start of a function, but I don't see how to reliably get a pointer to the end without weakening ASLR.
I don't think there's any fundamental difference between the start and the end of a function in this case. They're all just labels that point to locations in the code. It's unusual to create a label at the end of a function rather than the start, but it's fundamentally no different from having a label in the middle of a function, for example to use as a jump target to implement a conditional or a loop.
Again an awesome and well-explained insight into the inner workings of Objective-C. It's always very interesting to read, thank you!
You would need to use a memory barrier after allocating the new cache and before checking for threads in the critical paths.
Otherwise, reordering on an out-of-order architecture like x86 could mean that a thread entering the critical path after the ThreadsInMsgSend check would still be using the old cache.

Could it be that it works without the barrier because there are memory barriers in the malloc implementation used?
It is worth noting that _collecting_in_critical is not signal-safe: a thread can be inside one of the critical regions without its PC being between any of the entry and exit points, if the thread has caught a signal and is running a signal handler.
It seems possible for the second-to-last cache to never get freed, if during the last cache resize there are threads that use objc_msgSend().
Anonymous Coward, that is a very valid insight. Perhaps the Objective-C runtime has a thread dedicated to gobbling up all asynchronous signals. I'm not well-versed enough in the interactions with the Mach kernel to know the intricacies of signal delivery on OS X.
One day, studying the objc_msgSend implementation (as one does… probably after having read some Gregory Parker posts), I started wondering how it could use a cache while avoiding taking a lock, and started looking for the flip side of the cache code. When I discovered what you presented here, I was all “Double-U… Tee… Eff…” Yeah, that stuff you definitely don't learn in any concurrency curriculum. I did stumble at another point on another example of an extremely asymmetric mutual exclusion at http://queue.acm.org/detail.cfm?id=1117403 (“Not Optimizing for the Common Case” part).

The main drawback I can see to the technique used by the Objective-C runtime is that it may not scale beyond 20 or so threads: imagine you have code that is in objc_msgSend about 5% of the time (which is not unusual). If you have more than 20 threads running code with these characteristics, every time the cleanup code attempts to free old caches it will find at least one thread in objc_msgSend and not free anything, theoretically allowing the list of unfreed caches to grow arbitrarily…
That's a wonderful article, thanks for sharing the link.

I've thought about the problem with lots of threads, although usually from a performance perspective: if you have 10,000 threads, _collecting_in_critical might start to become painfully slow. I'm not sure exactly what happens. Your scenario seems feasible. Maybe it doesn't happen much in reality for some reason. Interesting to think about.
You're welcome; I think I got to it from following the Sun/ZFS/Dtrace guys. Interestingly, some of its advice is completely opposite to Apple practice (e.g. we know Apple loooooves layering).
Daniel Breitlauch: I’m guessing all the protected areas are first entry and last exit points for the functions, and that instruction reordering can’t reorder beyond either of these limits.
Wil Shipley, Daniel Breitlauch: The actual code does use memory barriers when installing the new cache buckets. Look for mega_barrier for the X86-64 version in http://www.opensource.apple.com/source/objc4/objc4-646/runtime/objc-cache.mm
For my own objc runtime I can't use a hazard pointer scheme (https://en.wikipedia.org/wiki/Hazard_pointer) like this, for portability reasons. I want it to be based on pthreads/C11 threads only, so I don't have access to the PC of the other threads.

I thought up this scheme called mulle-aba, which is described at https://www.mulle-kybernetik.com/weblog/2015/mulle_aba_how_it_works_1.html . I like to think it's technically superior to Apple's technique as it scales more nicely, but feel free to let me know if you don't think so. ;)

Are threads suspended during `_collecting_in_critical`? If so, how? I don't see any calls to thread_suspend in that function.
Threads aren't suspended. The new pointer is placed before calling _collecting_in_critical, so if no threads are in a critical region then you know it's safe, because any threads that come along afterwards will see the new pointer.
Actually threads _are_ suspended, because thread_get_state() will suspend the target thread. This is required; otherwise even threads not currently in the "critical" section could enter it and still see old pointers.

The suspension happens in the kernel (forcing the thread to be knocked off-core if it was running) and the implied context-switch gives you the proper memory barriers. Unfortunately this suspension-resume is a huge priority inversion (if the thread causing the GC is of very low priority)

BTW: this technique is no longer used this way as of macOS Mojave and whatever iOS version is aligned with it. The high-level idea is similar, but it's done in a more scalable way without the priority inversion (it made the GC about 20x as fast).
That makes sense, thanks. My next question is: does thread_get_state() allow a less intrusive way of getting the thread's backtrace than using a signal handler on that thread? It seems that signal handlers interrupting read and write calls would cause issues if the application doesn't have retry logic.
