Welcome back to Friday Q&A. This week's edition lines up with Apple's release of Snow Leopard, so I'm going to take this opportunity to open up the discussion on previously NDA'd technologies and talk about some of the cool stuff now available in Snow Leopard. For this week I'm going to start what I plan to be an ongoing series on Grand Central Dispatch, a topic suggested by Chris Liscio.
What Is It?
Grand Central Dispatch, or GCD for short, is a low level API which introduces a new way to perform concurrent programming. For basic functionality it's a bit like NSOperationQueue, in that it allows a program's work to be divided up into individual tasks which are then submitted to work queues to run concurrently or serially. It's lower level and higher performance than NSOperationQueue, and is not part of the Cocoa frameworks.
In addition to the facilities for parallel execution of code, GCD also provides a fully integrated event handling system. Handlers can be set up to respond to events on file descriptors, mach ports, and processes, to timers and signals, and to user-generated events. These handlers are executed through the GCD facilities for concurrent execution.
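I'll cover dispatch sources in detail later in this series, but to give a flavor of how they look, here's a minimal sketch of a repeating timer source. The five-second interval and the 100ms leeway are arbitrary choices for the example, not anything GCD requires:

dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
                                                 dispatch_get_main_queue());
// Fire every 5 seconds, starting 5 seconds from now, with 100ms of allowed leeway.
dispatch_source_set_timer(timer,
                          dispatch_time(DISPATCH_TIME_NOW, 5 * NSEC_PER_SEC),
                          5 * NSEC_PER_SEC,
                          100 * NSEC_PER_MSEC);
dispatch_source_set_event_handler(timer, ^{
    NSLog(@"timer fired");
});
// Sources are created suspended, so they must be resumed before they will fire.
// Keep a reference to the source (e.g. in an instance variable) and
// dispatch_release it when you're done with it.
dispatch_resume(timer);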
GCD's API is heavily based around blocks, which I talked about in previous Friday Q&A's, first to introduce the basics of blocks and then to discuss the practical aspects of using blocks in real-world code. While GCD can be used without blocks, by using the traditional C mechanism of providing a function pointer and a context pointer, it's vastly easier to use and ultimately more capable, from a practical standpoint, when used with blocks.
For documentation on GCD, start with man dispatch on a Snow Leopard machine.
Why Use It?
GCD offers many advantages over traditional multi-threaded programming:
- Ease of use: GCD is much easier to work with than threads. Because it's based around work units rather than threads of computation, it can take care of common tasks such as waiting for work to finish, monitoring file descriptors, executing code periodically, and suspending work. The blocks-based APIs make it extremely easy to pass context between different sections of code.
- Efficiency: GCD is implemented in a lightweight manner which makes it practical and fast to use GCD in many places where creating dedicated threads is too costly. This ties into ease of use: part of what makes GCD so easy to use is that for the most part you can just use it, and not worry too much about using it efficiently.
- Performance: GCD automatically scales its use of threads according to system load, which in turn leads to fewer context switches and more computational efficiency.
Although pure C, GCD is built in an object-oriented style. GCD objects are called dispatch objects. Dispatch objects are reference counted, much like Cocoa objects. The dispatch_retain and dispatch_release functions can be used to manipulate the reference count of dispatch objects for the purposes of memory management. Note that unlike Cocoa objects, dispatch objects do not participate in garbage collection, so you will have to manage GCD objects manually even if you have GC enabled.
Dispatch queues and dispatch sources (more on what these are later) can be suspended and resumed, can have an arbitrary context pointer associated with them, and can have a finalizer function associated with them. For more information on these facilities, see man dispatch_object.
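As a quick illustrative sketch of those facilities (the label, context, and finalizer here are made up for the example), usage looks like this:

// Called with the queue's context pointer once the queue's last reference is gone.
static void MyQueueFinalizer(void *context)
{
    free(context);
}

dispatch_queue_t queue = dispatch_queue_create("com.example.widget", NULL);
dispatch_set_context(queue, calloc(1, sizeof(int)));
dispatch_set_finalizer_f(queue, MyQueueFinalizer);

dispatch_retain(queue);   // take an extra reference
dispatch_suspend(queue);  // already-running blocks finish, queued blocks wait
dispatch_resume(queue);   // every suspend must be balanced by a resume
dispatch_release(queue);  // balances the retain
dispatch_release(queue);  // balances the create; the finalizer runs once the queue is destroyed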
Dispatch Queues
A fundamental concept in GCD is that of the dispatch queue. A dispatch queue is an object which accepts jobs and which executes them in the order in which they arrive. A dispatch queue can either be concurrent or serial. A concurrent queue will execute many jobs simultaneously, as appropriate for system load, much like NSOperationQueue. A serial queue will only execute a single job at a time.
There are three main types of queues in GCD:
- The main queue: Analogous to the main thread. In fact, jobs submitted to the main queue execute on the main thread of the process. The main queue can be obtained by calling dispatch_get_main_queue(). Since the main queue is inherently tied to the main thread, it is a serial queue.
- Global queues: Global queues are concurrent queues shared through the entire process. Three global queues exist: a high, a default, and a low priority queue. Global queues can be accessed by calling dispatch_get_global_queue and telling it which priority you want (a short sketch of obtaining these queues follows this list).
- Custom queues: Custom queues (GCD doesn't actually call them this, but it has no specific name for them, so I call them "custom") are queues created with the dispatch_queue_create function. These are serial queues which only execute one job at a time. Because of this, they can be used as a synchronization mechanism, much like a mutex in a traditional threaded program.
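Here's a minimal sketch of obtaining the main queue and the three global queues. The second argument to dispatch_get_global_queue is reserved, so pass 0:

dispatch_queue_t mainQueue    = dispatch_get_main_queue();

dispatch_queue_t highQueue    = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
dispatch_queue_t defaultQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_queue_t lowQueue     = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);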
If you want to use a custom queue, you'll have to create one. To do this, just call dispatch_queue_create. The first parameter is a label, which is purely for debugging purposes. Apple recommends using reverse-DNS naming to give the queue a unique name, like "com.yourcompany.subsystem.task". These names show up in crash logs, can be queried from the debugger, and will help a lot when trying to see where things went wrong. The second argument is an attribute argument which is currently unsupported, so pass NULL.
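As an illustrative sketch (the label here is invented), creating a queue and reading its label back might look like this:

dispatch_queue_t queue = dispatch_queue_create("com.example.myapp.network", NULL);

// The label can be read back at runtime, and shows up in crash logs and the debugger.
printf("created queue: %s\n", dispatch_queue_get_label(queue));

// ...submit work to the queue...

// Custom queues are reference counted; release the queue when you're done with it.
dispatch_release(queue);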
Submitting Jobs
Submitting a job to a queue is easy: call the dispatch_async function, and pass it a queue and a block. The queue will then execute that block when it's that block's turn to execute. Here is an example of executing some long-running job in the background using a global queue:
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    [self goDoSomethingLongAndInvolved];
    NSLog(@"Done doing something long and involved");
});
dispatch_async returns immediately, and then the block will execute asynchronously in the background.
Of course, it's not really very useful to perform an NSLog when the work is done. In a typical Cocoa application, you probably want to update a part of your GUI, and that in turn means running code on the main thread. You can easily accomplish this by using nested dispatches, with the outer one performing the background work, and then from within the background block dispatching onto the main queue, like this:
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    [self goDoSomethingLongAndInvolved];
    dispatch_async(dispatch_get_main_queue(), ^{
        [textField setStringValue:@"Done doing something long and involved"];
    });
});
In addition to dispatch_async, there is also a dispatch_sync function, which does the same thing but waits for the block to complete before returning. In conjunction with the __block type qualifier, this can be used to get a value back from the executing block. For example, you may have some code running on a background thread (or better yet, a non-main dispatch queue) which needs to get a value from a GUI control. You can do this easily by using dispatch_sync and dispatch_get_main_queue:
__block NSString *stringValue;
dispatch_sync(dispatch_get_main_queue(), ^{
    // __block variables aren't automatically retained
    // so we'd better make sure we have a reference we can keep
    stringValue = [[textField stringValue] copy];
});
[stringValue autorelease];
// use stringValue in the background now
An alternative is to nest asynchronous dispatches: hop onto the main queue to fetch the value, then hand it off to a background queue from within that block:

dispatch_queue_t bgQueue = myQueue;
dispatch_async(dispatch_get_main_queue(), ^{
    NSString *stringValue = [[[textField stringValue] copy] autorelease];
    dispatch_async(bgQueue, ^{
        // use stringValue in the background now
    });
});
myQueue could be a custom queue or it could just be one of the global queues.
Replacing Locks
Custom queues can be used as a synchronization mechanism in place of locks. In traditional multi-threaded programming, you might have an object which is designed to be usable from multiple threads. In order to accomplish this, it protects all accesses to shared data using a lock, which you might find in an instance variable:
NSLock *lock;
- (id)something
{
    id localSomething;
    [lock lock];
    localSomething = [[something retain] autorelease];
    [lock unlock];
    return localSomething;
}

- (void)setSomething:(id)newSomething
{
    [lock lock];
    if(newSomething != something)
    {
        [something release];
        something = [newSomething retain];
        [self updateSomethingCaches];
    }
    [lock unlock];
}
To use GCD instead, you would replace the lock with a dispatch queue stored in an instance variable:

dispatch_queue_t queue;

The queue would be created with dispatch_queue_create. You would then wrap all code accessing shared data in dispatch_async or dispatch_sync:
- (id)something
{
    __block id localSomething;
    dispatch_sync(queue, ^{
        localSomething = [something retain];
    });
    return [localSomething autorelease];
}

- (void)setSomething:(id)newSomething
{
    dispatch_async(queue, ^{
        if(newSomething != something)
        {
            [something release];
            something = [newSomething retain];
            [self updateSomethingCaches];
        }
    });
}
At this point you may be asking, this is all well and good, but what's the point? I just switched code from one mechanism to another mechanism that looks pretty much the same. Why would you do this?
There are actually several advantages to the GCD approach:
- Parallelism: Notice how -setSomething: uses dispatch_async in the second version of the code. This means that the call to -setSomething: will return right away, and then the bulk of the work will happen in the background. This could be a significant win if updateSomethingCaches is a costly operation and the caller will be doing something processor intensive as well.
- Safety: With GCD it's impossible to accidentally write a code path that fails to unlock the lock. In normal locked code it's not unusual to inadvertently put a return statement in the middle of the lock, or conditionalize the exit, or something equally unfortunate. With GCD, the queue always continues to run and you can't help but return control to it normally.
- Control: It's possible to suspend and resume dispatch queues at will, which cannot easily be done with a locks-based approach. It's also possible to point a custom queue at another dispatch queue, making it inherit the attributes of that other queue. Using this, the priority of the queue can be adjusted by pointing it at a different global queue, and the queue can even be made to execute code on the main thread if this were required for some reason (see the sketch after this list).
- Integration: The GCD event system integrates with dispatch queues. Any events or timers that the object needs to use can be pointed at the object's queue, causing the handlers to automatically run on that queue, making them automatically synchronized with the object.
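I'll cover queue targeting properly later in the series, but as a taste of the "Control" point above, retargeting a custom queue is a single call to dispatch_set_target_queue (the label below is just an example):

dispatch_queue_t queue = dispatch_queue_create("com.example.subsystem", NULL);

// Make the custom queue inherit the high-priority global queue's attributes.
dispatch_set_target_queue(queue, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0));

// Or, make blocks submitted to it ultimately execute on the main thread.
dispatch_set_target_queue(queue, dispatch_get_main_queue());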
Conclusion
Now you know the basics of Grand Central Dispatch, how to create dispatch queues, how to submit jobs to dispatch queues, and how to use queues as a substitute for locks in multithreaded programs. Next week I'll show you techniques for using GCD to write code which performs parallel processing to extract more performance out of multi-core systems. And in the coming weeks, I'll discuss more of GCD in depth, including the event system and queue targeting.
That wraps up this week's Friday Q&A. Come back next week for more GCD goodness. And just because I have a subject mapped out for a few weeks doesn't mean I don't need your suggestions. Quite the contrary: Friday Q&A is driven by your suggestions, and the more I have, the better posts I'll be able to make when I get to the end of this series. If you have a suggestion for a topic to cover, please post it in the comments or e-mail it directly to me.
Comments:
If it's the former, that would be a pretty nice automatic perceptual speedup just for recompiling in Snow Leopard.
If the latter, well I guess we'll see a new version of Accessorizer pretty soon :)
Note that despite my example, accessors are still almost always the wrong place to enable thread safety. They are far too granular in most cases for this to be a good place to do that. It really is beyond me why Apple made properties atomic by default, given this. You would generally protect an entire subsystem's interface with a lock or queue, and then have free access to all of that subsystem's objects' properties without locking or other synchronization within that.
With GCD and queues, the behavior when an exception is thrown is undefined, since the block could execute on any thread.
So you'd still need an exception handler if you wanted to generically replace @synchronized's lock with a queue, and that is the main source of the performance hit.
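A rough sketch of what such a handler might look like inside a dispatched block (doProtectedWork here is just a stand-in for whatever a @synchronized block would protect):

dispatch_sync(queue, ^{
    @try {
        // protected code, equivalent to the body of a @synchronized block
        [self doProtectedWork];
    }
    @catch (NSException *exception) {
        // Exceptions must not escape the block; letting one propagate
        // through GCD's machinery is undefined behavior.
        NSLog(@"caught %@ inside dispatched block", exception);
    }
});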
BTW: If anyone reading this has ideas for improving & extending Blocks, GCD, or anything else, please file a bug via http://bugreport.apple.com/
Thanks for your (as usual) well written post -- I've come to really enjoy them.
This one comes in particularly handy as I decided to convert something I'm working on to SL last night and started digging through my notes & WWDC videos to refresh myself on how GCD worked -- While the information is there to be had this is a nice concise summary.
One thing that I've never seen data on is the actual overhead of GCD -- at WWDC it was listed as 'small', but what is the break-even point? Is there a figure of merit for how much work you need to perform before throwing it on a background queue is a win?
-John
dispatch_queue_t q1 = dispatch_queue_create("com.example.q1", NULL);
dispatch_queue_t q2 = dispatch_queue_create("com.example.q2", NULL);

dispatch_sync(q1, ^{
    dispatch_sync(q2, ^{
        dispatch_sync(q1, ^{
            printf("Help, I've deadlocked and I can't get up!\n");
        });
    });
});

dispatch_release(q2);
dispatch_release(q1);
Obviously this structure is contrived, but it's extraordinarily easy to create if you start using queues around setters/accessors for interconnected objects. And just to make people aware, trying to check dispatch_get_current_queue() is not safe against this problem. If you take out q2, it is, but that's a dangerous assumption.
Paul: This is a fine point. For this reason you should use the _sync variant as little as possible, and do as little as possible within the block when you do use it.
I definitely agree: anything in a dispatch block should be as small as possible, or else you'll probably wind up in trouble pretty quick.
http://loghound.com/about/blog2/index.php?id=7996637473678888874
Basically, in my simple experiment it took thousands (~2500) of method calls to equal the overhead of any sort of backgrounding approach.
GCD appeared to be slightly faster (150us vs 200us) but they were all pretty long compared to just doing the work directly.
You're also testing two separate mechanisms simultaneously: backgrounding a task, and then bringing it back to the main thread. This is a common pattern and thus a useful measurement, but it would also be interesting to see what it costs to only do the backgrounding.
Finally, you're measuring latency but not throughput, which is commonly going to be more important for real-world tasks.
Your examples get me really excited! GCD seems so simple yet it's so powerful! Thanks for another great Q&A.
Jim: Try this URL:
http://www.mikeash.com/pyblog/rss.py?mode=fulltext
I made it after a similar request but the fellow never told me if it worked well for him so I never got around to making it public.
You're right about the GCD example; I had the order of execution reversed (first the main thread and then the background thread). That was a mistake, and not surprisingly, fixing it didn't change the result (150us).
I updated the post to show the costs of three cases (see the table in the middle):
1) The case I examined (push work off to the background and then have it update in the main thread)
2) The time to just launch case #1 above and return
3) The time to push work off to the background and never call back to the main thread with the result
What surprised me was that case #3, while taking half the time of case #1 for the performSelector/NSThread approach, was significantly more efficient for the blocks/GCD case (a factor of 7 times faster, not the expected 2).
Finally on latency/throughput -- I was trying to answer a very specific question -- how much 'work' would you have to incur before considering pushing the work off to a background task.
My use case was designed around the idea of being called from the event loop and needing to eventually update the UI on the main thread.
Anyway thanks again for the good post.
As for latency/throughput, I submit that throughput is still the more important variable. You would never background a single-threaded task that needed to update the GUI in a swift manner, because there's no advantage over simply performing that task directly in the main thread. The reason to use this pattern is to allow the main thread to do other work in the mean time, which means that it's throughput that counts. As an example, imagine that you have a task which takes 50us to execute directly, 75us to roundtrip through GCD, and GCD can process through a group of these tasks at an average of 25us apiece. It makes sense to shove this stuff through GCD in this case, even if the roundtrip time is worse.
The one case where latency would really matter is if you have an extremely common task which normally takes an extremely short amount of time but occasionally takes a very long time, such that you want to background it so it doesn't lock up the GUI. In this case, pushing all the short ones onto GCD could lead to a significant slowdown. However, for something like this, you could play it highly conservative and only background it if you anticipate it taking more than, say, 50ms. Of course, with your experimental results we know that 50ms is a highly conservative number, so that's certainly useful.
In Erlang, we would say something like A = OtherObject ! {processData}, and A will be immediately assigned "processData" (which is just an interned string, like a Ruby symbol). The calling object is supposed to obtain the result of the processData message by waiting for a response from OtherObject itself, which may come at any time, in any order with other calls -- the waiting itself is also concurrent with any other work the process may be undertaking. Erlang has primitives for setting up send/receive patterns like this, and for setting timeouts for when a process doesn't get a response.
If I've understood it correctly, a rough (I said "rough"!) equivalent of the Erlang convention would be to write "getters" that take a block as a parameter and invoke it asynchronously when it's done, kind of like this:
- (void)getProcessData: (dispatch_queue_t)callbackQueue: (void (^)(NSData *))callbackBlock {
    dispatch_async(myQueue, ^{
        NSData *localData = [self processData]; // synchronous, non-thread-aware getter
        dispatch_async(callbackQueue, ^{
            callbackBlock(localData); // hand the result back on the caller's queue
        });
    });
}
Then call it like so:
...code...
[otherObject getProcessData: myQueue : ^(NSData *processData){
...use processData...
...remainder of method here...
}];
...careful, code here runs before the processData code...
I don't think that this is practical for everyday accessors, but could be an interesting pattern to follow for certain cases.
Your "careful" note wouldn't be necessary, because message sends can't change the environment they originate from. It's simply impossible for a block of Erlang code to "request" data from another process in a blocking manner. All it can do is send bytes down the pipeline, and do X when it receives bytes into its mailbox. In Erlang programs, though I warn you I'm pretty inexperienced in this as well, the pattern of code is more likely to start with an "input" process (or processes -- because they share nothing, any one process can be made into hundreds to improve multiprocessing), which hands execution off to various workers, and then off to various presenters. You don't see the same "Russian-doll" style of calling as in a procedural language, where a runloop frames a program event and the case of the event jumps into a function, which jumps into another, on and on, and then returns out and out and out back to the original scope before moving on. Interactions between processes in Erlang, besides the program arguments, are essentially stateless.
It's also possible that a future version of GCD may add the ability for applications to create their own named concurrent dispatch queues.
Your example confused the heck out of me at first, because I assumed that all queues were concurrent (or else, what's the point!), and had to go to the docs to clear that up!
Great article, thank you. You say in your article that GCD is lower-level and higher-performance than NSOperationQueue. I was under the impression that NSOperationQueue on Snow Leopard actually uses GCD and the overhead it adds on top of GCD is relatively trivial.
Is my understanding incorrect? Is there a significant performance benefit to using GCD directly in Cocoa applications rather than using NSOperationQueue?
Thanks!
Jeff