Welcome back to another Friday Q&A. I'm off to C4 today (hope to see you there!) but I've prepared this in advance so everyone stuck at home (or worse, work) can at least have something interesting to read. Over the past four weeks I've introduced Grand Central Dispatch and discussed the various facilities it provides. In Part I I talked about the basics of GCD and how to use dispatch queues. In Part II I discussed how to use GCD to extract more performance from multi-core machines. In Part III I discussed GCD's event dispatching mechanism, and in Part IV I took care of various odds and ends that I hadn't covered before. This week I'm going to examine a practical application of using GCD to speed up the production of thumbnails for a large quantity of images, a topic suggested by Willie Abrams.
Overview
I'm going to walk through the parallelization of this program in four steps. The first step is the basic serialized program, and the following steps build it into a fully parallel program using GCD. If you'd like to follow along, you can get the full source code for all four steps. Don't run imagegcd2.m, though. You'll see why in a bit.
The Original Program
The program that we're going to work with is a simple thing that goes through the contents of ~/Pictures and generates thumbnails for everything inside. It's a pure command-line program, albeit using Cocoa to do most of the work. This is what its main function looks like:
int main(int argc, char **argv)
{
    NSAutoreleasePool *outerPool = [NSAutoreleasePool new];
    
    NSApplicationLoad();
    
    NSString *destination = @"/tmp/imagegcd";
    [[NSFileManager defaultManager] removeItemAtPath: destination error: NULL];
    [[NSFileManager defaultManager] createDirectoryAtPath: destination
                              withIntermediateDirectories: YES
                                               attributes: nil
                                                    error: NULL];
    
    Start();
    
    NSString *dir = [@"~/Pictures" stringByExpandingTildeInPath];
    NSDirectoryEnumerator *enumerator = [[NSFileManager defaultManager] enumeratorAtPath: dir];
    
    int count = 0;
    for(NSString *path in enumerator)
    {
        NSAutoreleasePool *innerPool = [NSAutoreleasePool new];
        
        if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
        {
            path = [dir stringByAppendingPathComponent: path];
            NSData *data = [NSData dataWithContentsOfFile: path];
            if(data)
            {
                NSData *thumbnailData = ThumbnailDataForData(data);
                if(thumbnailData)
                {
                    NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg", count++];
                    NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                    [thumbnailData writeToFile: thumbnailPath atomically: NO];
                }
            }
        }
        
        [innerPool release];
    }
    
    End();
    
    [outerPool release];
    
    return 0;
}
This is a slightly trimmed-down version of imagegcd1.m, but the important parts are all here. Start and End are just simple timing functions built on gettimeofday. ThumbnailDataForData uses Cocoa to load the data into an image, shrink it proportionally to be no larger than 320x320, and then encode the result as JPEG.
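ThumbnailDataForData itself isn't shown here. Purely for illustration, this is a rough sketch of how such a function might look, reconstructed from the description above rather than taken from the actual source:

#import <Cocoa/Cocoa.h>

static NSData *ThumbnailDataForData(NSData *data)
{
    // Decode the raw file data into an image. Bail out on garbage input.
    NSImage *image = [[[NSImage alloc] initWithData: data] autorelease];
    if(!image)
        return nil;
    
    // Scale proportionally so neither dimension exceeds 320 pixels,
    // and never enlarge a small image.
    NSSize size = [image size];
    CGFloat scale = MIN(320.0 / size.width, 320.0 / size.height);
    if(scale > 1.0)
        scale = 1.0;
    NSSize thumbSize = NSMakeSize(size.width * scale, size.height * scale);
    
    // Draw the scaled-down image into a new image.
    NSImage *thumbnail = [[[NSImage alloc] initWithSize: thumbSize] autorelease];
    [thumbnail lockFocus];
    [image drawInRect: NSMakeRect(0, 0, thumbSize.width, thumbSize.height)
             fromRect: NSZeroRect
            operation: NSCompositeCopy
             fraction: 1.0];
    [thumbnail unlockFocus];
    
    // Re-encode the result as JPEG.
    NSBitmapImageRep *rep = [NSBitmapImageRep imageRepWithData: [thumbnail TIFFRepresentation]];
    return [rep representationUsingType: NSJPEGFileType properties: nil];
}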
Naïve Parallelization
At first glance this looks pretty easy to parallelize. Each iteration through the loop can be pushed onto a GCD global queue, and we can wait for them all to finish at the end by using a dispatch group. One last trick: to ensure that each iteration still gets a unique number for its filename, we'll use OSAtomicIncrement32 to atomically increment count.
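The code below also relies on a helper called BlockWithAutoreleasePool (and, later on, a companion WithAutoreleasePool), which come from the downloadable source and aren't shown in the article. A plausible sketch, assuming they simply wrap a block in its own autorelease pool, since GCD worker threads don't set one up for you:

#import <Foundation/Foundation.h>
#import <dispatch/dispatch.h>

// Run a block inside its own autorelease pool.
void WithAutoreleasePool(dispatch_block_t block)
{
    NSAutoreleasePool *pool = [NSAutoreleasePool new];
    block();
    [pool release];
}

// Return a block that wraps the given block in an autorelease pool.
// Copied (and autoreleased) so it outlives the current stack frame.
dispatch_block_t BlockWithAutoreleasePool(dispatch_block_t block)
{
    return [[^{ WithAutoreleasePool(block); } copy] autorelease];
}

With that in hand, this is what the new code looks like: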
dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
dispatch_group_t group = dispatch_group_create();

__block int32_t count = -1;
for(NSString *path in enumerator)
{
    dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
        if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
        {
            NSString *fullPath = [dir stringByAppendingPathComponent: path];
            NSData *data = [NSData dataWithContentsOfFile: fullPath];
            if(data)
            {
                NSData *thumbnailData = ThumbnailDataForData(data);
                if(thumbnailData)
                {
                    NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                               OSAtomicIncrement32(&count)];
                    NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                    [thumbnailData writeToFile: thumbnailPath atomically: NO];
                }
            }
        }
    }));
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
That's imagegcd2.m. But don't run it!
If you ignored my warning and ran it anyway, you're probably just reloading this page after rebooting your computer. If you haven't run it, what happens (if you have a lot of pictures, at least) is that your computer locks up and you probably can't fix it unless you wait much longer than you'd really like to.
The Problem
What's causing all this trouble? The problem lies in GCD's smarts. GCD runs tasks on a global thread pool whose size is scaled in response to system load. For example, my computer has four cores, so if I load up GCD with work, GCD will run four worker threads, one to keep each core busy. If something else on my computer starts doing work, GCD will scale back a bit to give the other task some room.
However, GCD can also increase the number of active threads. It will do this if one of the worker threads blocks. Imagine these four worker threads running and then suddenly one of them does something like, oh, let's say, read a file. It goes off to wait for the disk, and your cores are being under-utilized. GCD will see this situation and spawn another worker thread to fill the gap.
Now, think about what happens here. The main loop is pushing jobs onto the global queue extremely quickly. GCD will start off with a few worker threads and start popping jobs off the queue. These jobs perform a trifling amount of work up front and then immediately go off and read a file from the disk. The slow, spinning disk.
And let's not forget another important property of the disk: unless you have an SSD or a fancy RAID, it gets substantially slower under contention.
These first four jobs all hit the disk at the same time, and the disk goes crazy trying to fill all four requests at once. GCD, which only looks at CPU usage, sees that the CPU cores are sitting mostly idle and starts spawning more worker threads. These threads also slam into the disk wall, causing GCD to spawn yet more threads, etc.
Eventually the file reads begin to complete. Now, instead of four threads for the four cores, there are hundreds. GCD will scale back if there are too many worker threads using CPU time, but it's limited in when it can scale back. It can't kill worker threads in the middle of a job, and can't even pause them. It has to wait until an entire job has completed before it can kill the thread that job is on. All of these pending in-flight jobs prevent GCD from reducing the worker thread count.
All these hundreds of threads start to finish reading their image data and begin processing it. They get in each other's way on the CPU as well, although the CPU handles contention much better than the disk does. The real trouble is that the first thing these threads do once they have the file data is decode it. If you have a lot of JPEGs, this image data is going to expand by a factor of 10 or more. With hundreds of these things in flight, you'll start to blow out your memory. And what happens when you run out of physical RAM? More disk usage!
Now you have a vicious feedback cycle. Disk contention causes more worker threads, which causes more memory usage, which causes more disk contention. The process runs away until GCD hits its limit of 512 worker threads. With typical picture sizes, 512 in-flight jobs is more than enough to send your system into swap hell from which it will take a long time to recover. Quite likely you won't even be able to kill the job for quite some time.
This is something you really have to watch out for when using GCD. GCD is great for limiting the number of concurrent jobs for CPU usage, but it will do nothing about contention over other resources. If your jobs do IO or anything else that could block for a while, you need to beware of this problem.
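To see the runaway thread creation in isolation, here's a tiny stand-alone demonstration (my own hypothetical example, not from the article's source). Every block immediately blocks, as a stand-in for slow IO; since the CPUs sit idle, GCD keeps adding worker threads:

#include <dispatch/dispatch.h>
#include <unistd.h>

int main(void)
{
    // Each block parks its worker thread. GCD sees idle CPUs and
    // spawns more threads to compensate, up to its internal limit.
    // Watch this process's thread count climb in Activity Monitor.
    for(int i = 0; i < 1000; i++)
        dispatch_async(dispatch_get_global_queue(0, 0), ^{
            sleep(10); // stand-in for a blocking disk read
        });
    
    sleep(30);
    return 0;
}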
The Fix
The root of this whole problem was IO contention leading to runaway feedback. Remove the contention, remove the problem.
GCD makes this easy with custom queues. Custom queues are inherently serialized. If we create a custom queue just for IO and put all file reading/writing onto that queue, then the disk will only be hit up for one file at a time and the contention disappears.
Here's the main loop of our program redone to use an IO queue:
dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);
dispatch_group_t group = dispatch_group_create();

__block int32_t count = -1;
for(NSString *path in enumerator)
{
    if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
    {
        NSString *fullPath = [dir stringByAppendingPathComponent: path];
        dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
            NSData *data = [NSData dataWithContentsOfFile: fullPath];
            if(data)
                dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                   OSAtomicIncrement32(&count)];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                            [thumbnailData writeToFile: thumbnailPath atomically: NO];
                        }));
                    }
                }));
        }));
    }
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
That's imagegcd3.m. It's great how easy GCD makes it to push different parts of a task onto different queues with some simple nesting. This one will behave fairly well... most of the time.
The problem is that it's inherently unstable because the different parts are not synchronized. The flow of data in this code looks like this:
Main Thread           IO Queue           Concurrent Queue
find paths  ------->  read  ---------->  process
                       ...
                      write  <---------  process
Now imagine a machine where the disk is fast enough to read files faster than the CPU can process them. This isn't all that hard to imagine: although the CPU is much faster, it's also doing much more work. The data read from the disk begins to pile up in the queue. This data takes up memory, possibly substantial amounts of memory if you have a lot of big pictures.
Then you run out of physical RAM and begin to swap.
This can lead to another runaway feedback loop like the first one. If anything causes a worker thread to block, GCD will spin up a new one, which will immediately start trying to allocate a bunch of memory and then block because of the ongoing memory pressure. GCD will spin up still more threads, causing more memory pressure, and you're back in swap hell again.
What's interesting about this feedback is that, unlike the first GCD attempt, it's self-regulating to some extent. As IO contention goes through the roof, the IO queue will come to a halt, and won't make any significant progress until the situation has regained sanity. Once it does, you're back to low memory usage and good throughput until the buffered data builds up too far again.
End result: the program alternates between smooth processing and being bogged down.
Note that if the disk is slower, the same problem can still occur, because finished thumbnails will pile up waiting to be written at the end of the pipeline, but it's likely to be much less severe because the quantity of data is so much smaller.
Really Fixing the Problem
Since the problem with the last attempt was a lack of synchronization between the different phases of the operation, let's synchronize them. The simple way to do this is to use a semaphore to limit the number of jobs in flight at any given time.
One question remains: how many jobs should we allow?
Obviously it should scale with the number of CPU cores in the system, because we want to take advantage of whatever is available. Simply limiting to the number of CPU cores is a bad idea, though, because much of each job is IO. And it can't be too high, because then we'll run out of memory.
I decided to allow twice as many jobs as CPU cores. My reasoning is that this will scale up to the point where IO takes as long as processing. If IO takes longer than processing, then IO will be the bottleneck anyway, so there's no sense in having more concurrent jobs than this. If IO takes significantly less time than processing, then GCD will automatically keep the number of worker threads low enough to ensure minimal contention on the CPU.
This is what the main loop now looks like:
dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);
dispatch_group_t group = dispatch_group_create();

int cpuCount = [[NSProcessInfo processInfo] processorCount];
dispatch_semaphore_t jobSemaphore = dispatch_semaphore_create(cpuCount * 2);

__block int32_t count = -1;
for(NSString *path in enumerator)
{
    WithAutoreleasePool(^{
        if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
        {
            NSString *fullPath = [dir stringByAppendingPathComponent: path];
            dispatch_semaphore_wait(jobSemaphore, DISPATCH_TIME_FOREVER);
            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                NSData *data = [NSData dataWithContentsOfFile: fullPath];
                dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                   OSAtomicIncrement32(&count)];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                            [thumbnailData writeToFile: thumbnailPath atomically: NO];
                            dispatch_semaphore_signal(jobSemaphore);
                        }));
                    }
                    else
                        dispatch_semaphore_signal(jobSemaphore);
                }));
            }));
        }
    });
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
Benchmarking
I obtained the following runtimes, on a library of 7913 pictures:
Program | Time (seconds)
---|---
imagegcd1.m | 984
imagegcd2.m | did not run
imagegcd3.m | 300
imagegcd4.m | 279
It's interesting that version 3 performed as well as it did. I did observe it exhibiting the cycling behavior I discussed, but not too often. Most likely this is because my machine has 15GB of RAM. On a less well-endowed system it's likely to perform substantially worse. I observed it using up to 10GB of RAM at one point. If I compile it as 32-bit, it rapidly runs out of virtual memory and crashes. Version 4 never uses any significant amount of RAM.
Conclusion
GCD is a fantastic piece of technology and does a lot of useful things, but it can't do everything for you. In particular, concurrent jobs which perform IO and have the potential to use a lot of memory must be managed carefully. Even so, the facilities that GCD provides make it easy to construct a system which will not overwhelm the computer's resources.
That wraps up this week's Friday Q&A. Come back next week for another exciting edition. In the meantime, send me your topics to discuss!
Comments:
Could a future GCD version also take IO & memory characteristics into account to control the number of worker threads?
@kusmi: I think that's a slippery slope. In order for that to be useful, GCD (and, by extension, the kernel) would need to profile every thread to know its specific impact on the system.
It would be a net loss for GCD to throttle a queue whose jobs are purely CPU-bound, just because another process on the system is hitting the disk or using a lot of memory.
It's impossible for GCD as it stands right now to be able to properly deal with these things, because it can't predict the future. There's no way for GCD to know that a certain thread is going to block on IO, and no way to know whether creating more worker threads is a good idea or not when it does. However, with the addition of more explicit IO facilities it certainly could. For now, at least it provides more than enough infrastructure for you to solve the problem yourself with a fairly small amount of additional work.
Since memory is the problem resource after dealing with the IO contention, why not have your semaphore represent a memory cap in KB or MB? Would it be inefficient to call dispatch_semaphore_wait repeatedly (proportional to the amount of memory you estimate the image needing) before dispatching the operation?
Like natevw, I find it a little bit disappointing that we have to limit concurrency based on cpuCount, when GCD is supposed to abstract that away from us. GCD is a nice abstraction, but like any technology, it can be a leaky abstraction...
Like you say in the comments, the problem seems to be that there's no way for GCD to know that a certain thread is going to block on IO. Thus, maybe one option for "GCD 2.0" would be to allow us GCD users to add a hint in the dispatch calls to let GCD know that the code to run will likely block on IO at some point. A simple flag like this would not be able to describe to GCD all the subtle things that you are going to need with IO, but it might be sufficient to let GCD make the right decision and avoid most of the cases you describe.
In addition, there is one thing that seems like GCD/OS X should be better at, and that is to avoid the swap hell. Why does GCD even start more threads when the OS is struggling with memory? Since GCD never makes any promises on when it will run any of the code, couldn't it simply stop creating more threads? Is there any reason for GCD to think that creating more threads could solve the memory swapping, and that it should just keep happily going? I am really curious what I am missing here. Thanks for your insights :-)
Ultimately, though, I don't see a need for this. You never need more than CPU * 2 jobs in flight at once. The only reason to limit based on memory rather than CPU is if you expect even that small number of jobs to run out of memory. Definitely not the case here. I suppose if you had really big jobs then it might be worthwhile, but since it's extremely difficult to know the memory state of the system as a whole, probably not.
charles: Ultimately, your "IO flag" is simply a way of saying to GCD, "don't start a bunch of new jobs just because this one blocked". Well, that's exactly what using a serial queue for all IO does.
As for why GCD spawns more threads, it has no idea what the effect of doing this will be. Maybe your jobs are already swapped in, won't allocate more memory, and thus will run along maxing out the CPU just fine. Extra jobs could even deallocate memory, for all it knows. There's just no way it can look at the state of the system and conclude, "The system is swapping a lot, thus I should refrain from spawning more jobs."
Imagine if you decide to make 20 "units" (whatever one count of the semaphore represents) of memory available. Then you have 5 jobs which each need 5 units to run. Because the only way to grab those units is to wait on the semaphore sequentially, you have a potential for deadlock. If all jobs manage to grab 4 units at the same time, then none can grab the 5th to be able to proceed. You'd have to use non-blocking waits and complicated logic to back off and try again. Not good.
A better way would be to use a sort of custom semaphore, using a lock and a condition variable to wrap a counter, which would allow threads to atomically decrement it.
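Sketched out, that might look something like this (a hypothetical example, not anything from the article's source):

#import <Foundation/Foundation.h>

@interface ResourceCounter : NSObject
{
    NSCondition *_condition;
    NSInteger _available;
}
- (id)initWithUnits: (NSInteger)units;
- (void)acquireUnits: (NSInteger)units;
- (void)releaseUnits: (NSInteger)units;
@end

@implementation ResourceCounter

- (id)initWithUnits: (NSInteger)units
{
    if((self = [super init]))
    {
        _condition = [[NSCondition alloc] init];
        _available = units;
    }
    return self;
}

- (void)dealloc
{
    [_condition release];
    [super dealloc];
}

// Blocks until all requested units can be taken in one atomic step,
// avoiding the piecemeal-acquisition deadlock described above.
- (void)acquireUnits: (NSInteger)units
{
    [_condition lock];
    while(_available < units)
        [_condition wait];
    _available -= units;
    [_condition unlock];
}

- (void)releaseUnits: (NSInteger)units
{
    [_condition lock];
    _available += units;
    [_condition broadcast];
    [_condition unlock];
}

@end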
One problem is that, while this will help with the problem of bringing your machine to a halt, it'll still give you poor performance on most systems if you don't explicitly serialize your IO. Limiting the number of jobs in flight is good, but if they're still fighting over the hard drive when they start up then your performance is not going to be as good as it could be.
I am new to multi-threading, and I must say I have learned a TON from these articles. It has helped me understand GCD tremendously.
One question I have is why you made the WithAutoreleasePool() function that takes a block of code and wraps it between an [NSAutoreleasePool new] and a release. Is there a reason for this? I find it makes the code harder to read; is it supposed to make it easier to read, or is it for performance, or neither?
Thanks again,
Robby
For some more discussion of some non-GCD ways to take advantage of blocks in your code, you might be interested in reading this article I wrote last winter, which covers some ideas:
http://mikeash.com/?page=pyblog/friday-qa-2008-12-26.html
For SSDs, a serial I/O queue is not very good. These drives handle parallel access just fine (and I expect them to get even better at this in the future, as it's easy to parallelize flash).
I'm tempted to write my own dispatch_* wrappers that take a file path and an amount of memory as a hint :)
I'm surprised that you bemoan the lack of "a sort of GCD for IO" since GCD has a solution for IO which I know you're familiar with: dispatch sources.
Basically, GCD's philosophy is to never block. All threads should always be able to advance on their current task at all times. So, one should use asynchronous, non-blocking IO techniques whenever possible. Dispatch sources provide a means for doing asynchronous, non-blocking IO that (obviously) integrates well with GCD.
I fixed your imagegcd2.m by implementing two utility functions:
void WithDataFromFile(NSString *filePath, dispatch_queue_t queue, void (^block)(NSData *data));
void AfterWritingDataToFile(NSData *data, NSString *filePath, dispatch_queue_t queue, void (^block)(BOOL success));
The first creates a dispatch source to read from the given file path, accumulating an NSData. When it reaches EOF, it submits the provided block to the specified queue and passes in the data object. In event of error, nil is passed in.
The second function creates a dispatch source to write the given data object to the given path. When it's done, it submits the block to the queue, passing a success flag.
Both are pretty straightforward, and were based on the sample code that Apple provides in the Concurrency Programming Guide. I'm not posting them because they're a touch too large for a blog comment, but can if you like.
The read source is scheduled on the low-priority queue, and the write source on the high-priority queue. My reasoning is that reading acquires work and memory load while writing eliminates work and relieves memory load.
Where your code has:
NSData *data = /* ... */;
if(data) /* ... */
I've replaced it with:
dispatch_retain(group);
dispatch_group_enter(group);
WithDataFromFile(fullPath, globalQueue, ^(NSData *data){
    __block BOOL willLeaveAfterWrite = NO;
    if(data) WithAutoreleasePool(^{
        NSData *thumbnailData = ThumbnailDataForData(data);
        if(thumbnailData)
        {
            NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg", OSAtomicIncrement32(&count)];
            NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
            AfterWritingDataToFile(thumbnailData, thumbnailPath, globalQueue, ^(BOOL success){
                dispatch_group_leave(group);
                dispatch_release(group);
            });
            willLeaveAfterWrite = YES;
        }
    });
    // Balance the enter/retain even when the read or thumbnail fails,
    // so the group doesn't wait forever.
    if(!willLeaveAfterWrite)
    {
        dispatch_group_leave(group);
        dispatch_release(group);
    }
});
Still, very nice technique, and thanks for posting it. Very useful in its own right.
#define DISPATCH_QUEUE_WIDTH_ACTIVE_CPUS -1
#define DISPATCH_QUEUE_WIDTH_MAX_PHYSICAL_CPUS -2
#define DISPATCH_QUEUE_WIDTH_MAX_LOGICAL_CPUS -3
void dispatch_queue_set_width(dispatch_queue_t dq, long width);
This function is currently implemented but is private.
With it, we could limit the number of jobs the queue spawns, avoiding all the issues mentioned above.
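Hypothetical usage, assuming the obvious semantics (and remembering that a private API can change or disappear at any time):

dispatch_queue_t queue = dispatch_queue_create("com.example.limitedio", NULL);
// Allow at most two blocks on this queue to run concurrently.
dispatch_queue_set_width(queue, 2);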