Next article: RAOperationQueue, an open-source replacement for NSOperationQueue
Previous article: Key-Value Observing Done Right
Tags: nsoperationqueue osbug
I've been discussing this on cocoa-dev but wanted to make a separate post about it to get the word out further. NSOperationQueue is broken on Leopard and will cause crashes. Don't use it!
The Bug
The bug can be demonstrated very easily with this code:
#import <Foundation/Foundation.h>
@interface Tester : NSObject
{
NSOperationQueue *_queue;
}
- (void)test;
@end
@implementation Tester
- (id)init
{
if((self = [super init]))
{
_queue = [[NSOperationQueue alloc] init];
[_queue setMaxConcurrentOperationCount:1];
}
return self;
}
- (void)test
{
NSInvocationOperation *op = [[NSInvocationOperation alloc]
initWithTarget:self selector:_cmd object:nil];
[_queue addOperation:op];
[op release];
}
@end
int main(int argc, char **argv)
{
[NSAutoreleasePool new];
NSMutableArray *testers = [NSMutableArray array];
int i;
for(i = 0; i < 10; i++)
[testers addObject:[[[Tester alloc] init] autorelease]];
for(Tester *tester in testers)
[tester test];
while(1) sleep(1000);
}
On my machine this will die within 10 seconds, throwing this exception:
*** -[NSInvocationOperation start]: receiver has already started or finished
Which is thrown with this backtrace:
#0 0x96480ff4 in ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ ()
#1 0x9207ee3b in objc_exception_throw ()
#2 0x92db74de in -[NSOperation start] ()
#3 0x92db7112 in __runop ()
#4 0x902ae1f7 in _pthread_wqthread ()
#5 0x902ae0aa in start_wqthread ()
Note that the bug appears to be hardware-dependent. A bunch of people have tried this code and nobody has ever made it crash on a PowerPC machine yet, so it seems that it's Intel-only.
The bug is also very rare. I discovered it in some audio processing code that was enqueueing roughly 500 NSOperations per second. It would take about an hour to crash. This was, as you can imagine, a whole lot of fun to debug.
The crash will still happen even if the above code is modified to enqueue NSOperations separately rather than doing that unusual trick of having one NSOperation enqueue a new one. Leaving off the call to setMaxConcurrentOperationCount:
also doesn't help. Indeed, just about the only thing that seems to help is to only have a single NSOperationQueue running at a time. Since you can't prevent framework code from running its own NSOperationQueue, this is an untenable solution.
The Cause
NSOperationQueue works by managing a thread pool. The pool is global, which is to say that it's shared among all NSOperationQueues in a process. NSOperationQueue uses a private API to examine system load and spawn more worker threads or kill excess ones as needed to keep the system fully utilized without too much overhead.
From what I was able to determine from examining a crashed program, it appears that there's a race condition in the worker thread pool. They dequeue NSOperations from a set of internal queues, and it appears that when the timing is just wrong, the same NSOperation can get dequeued by two worker threads simultaneously. Since this is never supposed to happen, havoc ensues.
What You Can Do About It
Unfortunately the answer is, "not much". The bug is firmly in Apple's code and at this point it's doubtful that they'll fix it in Leopard. Your options are:
- File a bug. Apple probably won't fix it for Leopard, but if enough people complain maybe they'll change their mind. If you do file a bug, you can reference my bug, which is bug ID 6332143.
- Wait for 10.6.. It appears likely that 10.6 will not share this bug. If you can wait for 10.6 to ship, writing your software to require 10.6 is an easy "fix" for this bug.
- Go back to older concurrency models. Switch to raw threads and you'll (mostly) only have your own bugs to contend with, not Apple's. Depending on your needs, it may not be difficult to implement an NSOperationQueue workalike which supports the subset of NSOperationQueue's capabilities that you need. For the software which led me to find this bug, I wrote my own queue subclass which offered prioritized access to a single worker thread in about a day, and made it lockless to boot.
- Eliminate concurrency. Sometimes NSOperation is used just as an optimization to take advantage of multiple cores, or to avoid blocking a thread. In those cases it may be reasonable to simply drop back to a slower serialized method or to simply block until the work is complete.
Comments:
I'm surprised that it didn't crash in the simulator, though. That uses your regular CPU and, as far as I understand it, the regular Mac OS X Foundation libraries. Are you able to produce the crash in a regular Mac OS X program? Not everybody has been able to, even on Intel machines (although more than enough have been able to that it's obvious the bug is real and pretty widespread), so maybe your setup is just somehow immune.
I haven't used NSOperationQueue although I've been thinking about it which was why I (a) read your post, and (b) became somewhat depressed at your conclusions.
Anyway I've nothing to add except to ask if you think that also filing the bug on OpenRadar (http://openradar.appspot.com/) might help raise it's priority? My apologies if it's already there, I couldn't see it.
Regards,
Matt
Still haven't been able to crash the actual iPhone. Hopefully it doesn't affect ARM.
As for iPhones, I suspect that the bug either doesn't happen on ARM or doesn't happen on single-core machines (or both!) so it's probably safe there.
PowerPC dual core is OK.
Using a single NSOprationQueue is OK, but you can't guarantee that there is exactly one, since who knows what Apple's libs do.
Intel multi-chip (more than 2 cores) is easy to crash, as Mike says.
Using addDependency: to specify that Op B should not run until after Op A completes does not help.
Using separate classes for the NSOperations does not help.
http://www.rogueamoeba.com/utm/2008/12/01/raoperationqueue-an-open-source-replacement-for-nsoperationqueue/
It's not a 100% replacement, but it can be very useful in some circumstances.
So, is it an NSOperationQueue bug or an NSInvocationOperation bug?
Your operation may not detect that it's being run twice. I'd recommend adding a counter ivar:
int32_t runCount
And then doing an atomic increment plus assert at the top of your operation code:
assert(OSAtomicIncrement32(&runCount) == 1)
This will either blow up when it runs twice, or prove that it's not happening.
Also, it may be much more rare since you're doing actual work. How long does your operation take to run? I believe that each operation basically has a small chance of triggering the bug. My testing indicates that this chance is something on the order of 0.000005 per operation. (This is a very rough estimate.) If your operations take enough time to execute such that running a couple hundred thousand of them takes longer than you've let it run, that could explain it as well.
It could also be that your code is somehow causing NSOperationQueue to run slightly differently and avoid the crash altogether.
And lastly it's certainly possible that my analysis of the bug is wrong.
What am I missing?
- My guess is that this is not something which will be fixed by adding an ampersand or something. It seems to be a somewhat more fundamental infrastructure problem which will require a good deal of work to fix, which reduces the probability of getting it fixed. This is pure speculation.
- It's extremely rare. The program in which I discovered this spawns off several hundred NSOperations per second. (I should say spawned, past tense, as it now spawns several hundred RAOperations per second.) Even with this it took an hour to crash on average. A more typical use of NSOperation, from what I've seen, is to do tasks which take multiple seconds each, which enormously increases the probable time-to-crash.
- I discovered this bug about a month ago and as far as I've been able to tell, I am the first. (Nobody I've talked to has said they were familiar with it, and bug report with Apple is still open, although admittedly Apple sometimes leaves bug reports open for years for no reason at all.) This means that it's not being hit often enough for people to notice. That I discovered it doesn't really change that. If this were some sort of catastrophic data loss bug then it might get prioritized, but if Leopard has survived without a fix until now, it'll keep on surviving.
- That it hasn't been found leads me to believe that Apple's use of this API in frameworks is pretty limited.
- Apple has not displayed any particular hurry to fix bad bugs in the past. For one amusing example, try using NSKeyedArchiver to archive the string @"$null" and watch what you get back out. I haven't really tracked that bug but it has probably been there since 10.3. A run through the release notes will turn up a bunch of bugs that never got fixed on 10.X but were fixed for 10.X+1. A big difference is that those bugs tend to be reliable and easy to reproduce, whereas NSOperationQueue breaks only rarely. I don't know if that makes it more or less likely that Apple will fix it.
Anyway, it's just my opinion from watching Apple work that they probably won't fix this until 10.6, but I surely could be wrong.
I think this would be the best solution to get apples attention on the bug.
And perhaps a gzipped xcode project would make it easier for others. Then we could just download and run the test on our respective hardware configuration(s) - :)
NSInvocationOperation *op = [[NSInvocationOperation alloc]
initWithTarget:self selector:_cmd object:nil];
It does not make a difference if using a different selector?
- (void)test2
{
NSLog(@"test2");
}
and changed the NSInvocationOperation to:
NSInvocationOperation *op = [[NSInvocationOperation alloc]
initWithTarget:self selector:@selector(test2) object:nil];
and no longer get the crash. I did get the crash in less than 10 seconds if I left the selector as _cmd
The reason your altered test case does not crash is because it does not do anything after the initial 10 operations have executed. Run Activity Monitor and watch your test program's CPU usage sit at zero. Put a loop around the initial invocation pump and watch it crash.
The example does not crash due to an infinite loop. It is very clear that it is crashing due to an uncaught exception thrown by NSOperation. It is much more difficult to discover, but just as clear, that this is due to a race condition in NSOperationQueue's background worker threads.
If you have any evidence that the crash is due to running out of memory or such then please post it, but this sort of baseless guessing game is worse than useless.
I run a Mac Pro with 2 x 8 core processors. As a newcomer to NSOperationQueue and Cocoa in general I would have spent days looking for the cause of this problem.
Naturally I had to reproduce it my own way.
Significantly I used just one queue from main and found that this crashed also!
My tasks did nothing other than run and return. I stuck a counter on it and caused the queue to clear after it had been loaded by n tasks and waited periodically for the queue to clear, all to no avail. For queue lengths of 1000 sometimes it would crash after 20 thousand iterations sometimes 250 thousand exactly as you said it would.
I then decided to give the tasks something to do, just sum the integers from 1 to 1000 and try some different queue lengths. For queue lengths of 10 it would go to 2 million iterations and then hang. For queues of 100 it would go to 3 million and crash with a malloc "unable to allocate" message. Replacing the summation with a sleep(0.01) it got to almost 3 million iterations before collapsing again with the malloc message.
My feeling is that although these latter cases are collapsing for ostensibly different reasons they are nevertheless related. Is is possible that task memory is not being cleared?
Thanks for the alternative code.
Best wishes
Julius
Viz my main looks like this
int main(int argc, char **argv)
{
NSMutableArray * myArray = [[NSMutableArray alloc]init];
NSOperationQueue * _queue = [[NSOperationQueue alloc] init];
[_queue setMaxConcurrentOperationCount:NSOperationQueueDefaultMaxConcurrentOperationCount];
NSInteger iterationsPerStep = 100;
NSInteger interimIterations = 1000;
NSInteger i;
for(i=0; i < 10000;i++) {
NSInteger j;
for(j = 0; j < interimIterations; j++) {
NSInteger k;
for(k = 0; k < iterationsPerStep; k++) {
Tester * myTesterObj = [[Tester alloc]init];
[myArray addObject:myTesterObj];
NSInvocationOperation *op = [[NSInvocationOperation alloc]initWithTarget:myTesterObj
selector:@selector(test)
object:nil];
[_queue addOperation:op];
} // end for k
[_queue waitUntilAllOperationsAreFinished];
myArray = [[NSMutableArray alloc]init];
} // end for j
NSLog(@"iterations so far = %d",(i+1) * iterationsPerStep * interimIterations);
} // end for i
}
The extra iteration loop is just a convenience to facilitate printing the numbers.
My understanding is that the GC should remove all those now free floating no longer used arrays?
Julius
I'll recast the problem and take a look at that asap.
Julius
Also note that your code isn't going to crash according to what's been discovered about the problem so far. The bug only shows up when there are at least two queues in the program. Of course it would be interesting to try it with just one and see if it still crashes, but if you get your memory usage under control and you still don't crash, try adding a second queue.
Next I removed the array of Task objects and replaced it with just a single task object and thereafter all of the NSInvocationOperation objects were created using a call to that object. As in the previous test I cause a wait for the NSOperationQueue to clear every 1000 NSInvocationOperation objects. It crashes with the "receiver has already started or finished" message after 200,000 iterations.
I changed the queue length to 10 and it ran ok until I stopped it at 18 million iterations. Ditto for a queue length of 100. But for a queue length of 500 I did not get past 2 million iterations (4,000 clearings of the queue) before the dreaded "receiver has already started or finished" message.
Unless due to another error it seems then that the problem can also occur with a single queue. The time before errors is erratic, and sometimes if one is running other programs in the foreground these will hang after a few millions of iterations
Finally I ran the same program but with two queues which I waited to clear after putting 10 operations into each of them. Collapse occurred after 640000 iterations with the "receiver has already started or finished" diagnostic.
To make it possible for others to check these results the sources of these XCode 3.1.2 programs running on Mac Pro Leopard 10.5.6 with Garbage Collection Required can be found at:
http://animatedpaint.co.uk/cocoa-examples-zip/DontUseArrayOfTestObj.zip
http://animatedpaint.co.uk/cocoa-examples-zip/NewDontUseOneQueue.zip
http://animatedpaint.co.uk/cocoa-examples-zip/NewDontUseTwoQueues.zip
Best wishes
Julius
I distilled your example code to its essence, removing the GUI and the dependence on garbage collection, and came up with this:
#import <Foundation/Foundation.h>
@interface MyOperation : NSOperation {} @end
@implementation MyOperation
- (void)main
{
usleep(10000);
}
@end
int main(int argc, char **argv)
{
NSAutoreleasePool *outerPool = [[NSAutoreleasePool alloc] init];
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
unsigned count = 0;
while(1)
{
NSAutoreleasePool *innerPool = [[NSAutoreleasePool alloc] init];
MyOperation *op = [[MyOperation alloc] init];
[queue addOperation:op];
[op release];
if(++count % 50000 == 0)
NSLog(@"%u operations", count);
[innerPool release];
}
[outerPool release];
return 0;
}
This crashes reliably on my Mac Pro, often after just a few seconds. Sometimes it takes much longer. But it's not doing any of the "funny" stuff that other people have previously complained about or suggested might be behind the problem. I'm not sure how I missed it before, perhaps because I was stuck with the idea of testing self-repeating operations instead of simply enqueueing them separately, but the result here seems pretty solid.
If you want to test the command-line version with garbage collection enabled:
* invoke objc_startCollectorThread() at the beginning of the main function,
* invoke -[NSAutoreleasePool drain] instead of -[NSAutoreleasePool release],
* optionally, invoke objc_clear_stack() inside the while loop.
"Garbage Collection Programming Guide > Garbage Collection for Cocoa Essentials > Foundation Tools"
http://developer.apple.com/documentation/Cocoa/Conceptual/GarbageCollection/Articles/gcEssentials.html
[Session started at 2009-02-02 15:27:01 -0800.]
2009-02-02 15:27:16.516 NSOperationCrash[24649:10b] 50000 operations
2009-02-02 15:27:29.030 NSOperationCrash[24649:10b] 100000 operations
2009-02-02 15:27:40.879 NSOperationCrash[24649:10b] 150000 operations
...
...the next day...
...
2009-02-03 14:52:08.011 NSOperationCrash[24649:10b] 209300000 operations
2009-02-03 14:52:26.563 NSOperationCrash[24649:10b] 209350000 operations
2009-02-03 14:52:44.770 NSOperationCrash[24649:10b] 209400000 operations
Then I ran Mike's original code, from the top of this page, which creates the multiple operation queues, and it crashed after 1:15 min:sec.
If anyone knows how I might make my Mac Mini more "sensitive", I'll give it a try and post the results.
Model Name: MacBook Pro 15"
Model Identifier: MacBookPro2,2
Processor Name: Intel Core 2 Duo
Processor Speed: 2.33 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 4 GB
Bus Speed: 667 MHz
Boot ROM Version: MBP22.00A5.B07
System Version: Mac OS X 10.5.6 (9G55)
Kernel Version: Darwin 9.6.0
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 677.22.0)
MD5 (Foundation) = 9402bcd843fa783b44879ad65f6f09be
2009-03-01 12:50:08.726 a.out[755:10b] 50000 operations
2009-03-01 12:50:21.473 a.out[755:10b] 100000 operations
2009-03-01 12:50:31.658 a.out[755:2803] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[MyOperation start]: receiver has already started or finished'
2009-03-01 12:50:31.659 a.out[755:2803] Stack: (
2520822027,
2483650107,
2520821483,
2520821546,
2490044589,
2490043730,
2500748303,
2500747970
)
Trace/BPT trap
I tested the codes with my imac with 2GHz Intel Core 2 Duo and Mac OS X 10.5.6. I wish my post will be helpful.
"Fixed problem with NSOperationQueue and PThreads"
<< http://news.worldofapple.com/archives/2009/03/21/apple-seeds-new-build-of-mac-os-x-1057-seed-notes/
10.5.7 is out now. I'd love to hear if this issue was resolved in that fix.
Thanks very much for your time and this great article!
-- Cyrus
Thanks!
-- Cyrus
Comments RSS feed for this page
Add your thoughts, post a comment:
Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.
This is also worrisome because there are a lot of "New to Cocoa" iPhone developers and I've seen a few use NSOperationQueue because of it's simplicity.