Tags: autorelease cocoa objectivec performance
If you've done much Cocoa programming, you've probably run into a situation where you needed to create a local autorelease pool because of some sort of loop. And you've probably run into advice telling you not to create and destroy the pool for every iteration, because that would be slow. I never believed that it could be significant, and I finally took the time to test it today. What's the verdict? Just as I thought, making autorelease pools is really fast.
I wrote a quick test program to allocate a million NSObjects and record the amount of time taken. The program increases the number of autorelease pools used by a factor of ten each time. The first test uses only one pool for the whole run, the second test uses ten, etc. At the end, it uses one pool per object.
You can get the source code to the test program here.
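If you just want the gist, the core of the test looks roughly like this (a simplified sketch; the real program linked above handles the timing and output a bit differently):

#import <Foundation/Foundation.h>
#include <stdio.h>

// Minimal sketch of the benchmark: allocate a million autoreleased
// NSObjects, draining the pool after every objectsPerPool allocations,
// and time each configuration. The real program differs in the details.
static void RunTest(unsigned poolCount, unsigned objectsPerPool)
{
    NSTimeInterval start = [NSDate timeIntervalSinceReferenceDate];
    unsigned i, j;
    for(i = 0; i < poolCount; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        for(j = 0; j < objectsPerPool; j++)
            [[[NSObject alloc] init] autorelease];
        [pool release];
    }
    printf("%u objects per pool: %f seconds\n",
           objectsPerPool,
           [NSDate timeIntervalSinceReferenceDate] - start);
}

int main(void)
{
    unsigned objectsPerPool;
    for(objectsPerPool = 1000000; objectsPerPool >= 1; objectsPerPool /= 10)
        RunTest(1000000 / objectsPerPool, objectsPerPool);
    return 0;
}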
Here are the results, compiled with gcc4 and -Os, run on my PowerBook G4/1.5 with 10.4.6 while relatively unloaded:
Objects    Objects per pool    Run time (seconds)
1000000    1000000             1.65
1000000    100000              1.59
1000000    10000               1.48
1000000    1000                1.46
1000000    100                 1.43
1000000    10                  1.64
1000000    1                   2.91
Interestingly enough, the run time actually goes down as the number of autorelease pools created goes up in the first part of the run. This is probably because more frequent pool creation means the program's memory footprint is smaller, and so less time is spent mucking about with page tables and the like.
At the end the runtime spikes up heavily. But notice that the worst runtime is only about twice as bad as the best runtime, and this makes sense because at the end we're allocating almost twice as many objects. The final run creates and destroys two million objects (one million NSObjects, one million NSAutoreleasePools), and the best run does only a hair over one million.
So what's the conclusion? Creating and destroying an NSAutoreleasePool is about as expensive as creating and autoreleasing an NSObject. This is really insanely cheap. You might as well create a new pool in every iteration of your loop. The only time it makes sense to play games with your pool so that it only gets destroyed and recreated every N iterations is when you're creating just a handful of objects per iteration, and even then your best-case savings will only be about 50%, and that's when you're creating a single object and doing nothing else in the loop. And why would you need that to be super ultra blazing fast anyway?
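Concretely, the pattern I'm talking about is just this (a sketch in manual retain/release style; the function and the per-item work are made up for illustration):

#import <Foundation/Foundation.h>

// One autorelease pool per iteration. Everything autoreleased inside the
// loop body is freed at the end of that pass, so the memory high-water
// mark stays low, and the pool itself costs about as much as one object.
static unsigned TotalDescriptionLength(NSArray *items)
{
    unsigned total = 0;
    unsigned i, count = [items count];
    for(i = 0; i < count; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        // -description returns an autoreleased object; it dies with the pool.
        total += [[[items objectAtIndex:i] description] length];
        [pool release];
    }
    return total;
}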
Premature optimization is the root of all evil. So many people who would otherwise agree with this statement still say things like "you may want to release the pool every 10th iteration". The next time you see someone saying something like that, gently correct them and point them to this post.
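For the record, here's the sort of thing that advice looks like in code (again an illustrative sketch; the interval and the per-item work are made up). It works, but as the numbers above show, the extra bookkeeping buys you essentially nothing over the one-pool-per-iteration version:

#import <Foundation/Foundation.h>

// The commonly recommended "drain the pool every N iterations" trick.
// Per the measurements above, it saves almost nothing compared to
// simply creating and releasing a pool on every pass.
#define kDrainInterval 10

static unsigned TotalDescriptionLengthBatched(NSArray *items)
{
    unsigned total = 0;
    unsigned i, count = [items count];
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    for(i = 0; i < count; i++)
    {
        total += [[[items objectAtIndex:i] description] length];
        if((i + 1) % kDrainInterval == 0)
        {
            [pool release];
            pool = [[NSAutoreleasePool alloc] init];
        }
    }
    [pool release];
    return total;
}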
Comments:
The formula was just a wild guess which I then parametrized to fit the given numbers. The “roundness” of the factors is pure coincidence and I was quite astonished myself. Unfortunately I found out why this formula would have to be verified with different data to have any value: if you compute the part 1.5e-10*S^2 alone, you get these results (in ms):
150.000
1.500
0.015
0.000
0.000
0.000
0.000
This shows that effectively only the first value changes anything at all in the result, and that the linear part 1.5e-3*N takes over for all the other results. While this does not yet mean that the relationship is completely wrong (it sort of proves the constant time per object allocation), it strongly suggests that further investigation is necessary before applying it elsewhere. I fear that memory management has a lot more complex time / size curves (see http://ridiculousfish.com/blog/archives/.. for example).
Hmm. I was amazed though when I first saw the result.
Cheers,
Felix
As a final note, from the data so far it seems like rebuilding AR pools does speed you up significantly. If it does it consistently, it’s not a premature optimization – it’s a best practice. After all, it’s not like it adds hideous code bloat – it’s a couple of “by-rote” lines.
I’m not objecting to rebuilding autorelease pools. Quite the contrary, I’m advocating creating a new pool every time. Very frequently you will find posts that say you should make an autorelease pool in your loop, but that for efficiency you should only destroy and recreate the pool every N iterations, where N is some number larger than 1, because creating and destroying an autorelease pool is perceived as slow.
Here’s a random example pulled off the net (http://forums.macrumors.com/showthread.p..):
It is not necessary to create an autorelease pool at every step. It’s often only necessary to create one pool per project. The objects you create and destroy will still be cleared out once each pass through the event loop. Bear in mind every time you create and destroy a pool, there is some overhead involved, so you generally only want to create subpools when there is a real need to, such as when lots of memory is being allocated in a loop. Even in that case, you often won’t want to do a pool for each loop iteration.
I see this repeated constantly. But as my data shows, the only time you don’t want to do a pool for each loop iteration is when your loop’s work is utterly trivial, and if it’s so trivial then why are you even doing it, or why do you care if it’s fast?
I’m advocating that if you have a long-running loop that autoreleases objects and you need to make a pool, just make one pool per iteration. Don’t play tricks to reduce the number of pools created, because creating and destroying a pool is, as the title says, fast.
I’d add an “Initially,” to that sentence, but otherwise, I agree.
That is, unless you know you have a problem, don’t fiddle with it. Optimize these things later. I wouldn’t really advise against playing with the number of pools in general, like I wouldn’t advise against the use of IMP-caching or other optimizations if the code requires it. Sometimes it even makes sense to actually use different zones (just to further deviate from the topic).
Hope that’s clearer,
Kay
Iterations Objs/Iter t/sec
1e0 1e6 0.741
1e1 1e5 0.603
1e2 1e4 0.589
1e3 1e3 0.596
1e4 1e2 0.604
1e5 1e1 0.692
1e6 1e0 1.491
So, similar results then. Creating/destroying the ARP every time through the loop hurts, but when you drain the pool less often the overhead varies in a non-obvious way, and not by much. It almost certainly depends on the size of the objects and on whether they do anything special in -dealloc, too.
===== warming up whatever caches may exist, ignore this
Testing 1000 iterations, allocating 1 objects per iteration...done
total time: 0.010807 seconds
===== done, results follow
Testing 1 iterations, allocating 1000000 objects per iteration...done
total time: 9.243073 seconds
Testing 10 iterations, allocating 100000 objects per iteration...done
total time: 9.074250 seconds
Testing 100 iterations, allocating 10000 objects per iteration...done
total time: 9.706995 seconds
Testing 1000 iterations, allocating 1000 objects per iteration...done
total time: 8.522664 seconds
Testing 10000 iterations, allocating 100 objects per iteration...done
total time: 5.443504 seconds
Testing 100000 iterations, allocating 10 objects per iteration...done
total time: 6.312481 seconds
Testing 1000000 iterations, allocating 1 objects per iteration...done
total time: 11.958108 seconds
1) It's not always possible to avoid autoreleasing objects. Sometimes you have to return non-retained objects to the caller, and the frameworks will autorelease objects in many methods without your control or consent.
2) It's a lot more convenient to use factory methods instead of alloc/init/release. The only reason not to is to keep your memory high-water-mark reasonably low, and the whole point of this performance measurement is that you can achieve that with a liberal sprinkling of autorelease pools instead. (And achieve it better, since the framework may autorelease objects beyond your control.)
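As an illustration (a hypothetical loop; stringWithFormat: just stands in for any factory method), the two styles look like this, and with a per-iteration pool the convenient one costs essentially nothing extra:

#import <Foundation/Foundation.h>

// Factory method vs. alloc/init/release inside a loop. With a pool per
// iteration, the autoreleased version is freed just as promptly, so the
// main reason to prefer the manual version disappears.
static void LabelItems(unsigned count)
{
    unsigned i;
    for(i = 0; i < count; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        // Convenient: autoreleased, freed when the pool is released.
        NSString *label = [NSString stringWithFormat:@"item %u", i];
        NSLog(@"%@", label);

        // Manual alternative: more typing for the same high-water mark.
        NSString *label2 = [[NSString alloc] initWithFormat:@"item %u", i];
        NSLog(@"%@", label2);
        [label2 release];

        [pool release];
    }
}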
Perhaps it's interesting to note that the line seems to have become flatter with more recent iterations of the OS (ObjC 2) and hardware, but the trend didn't change. This is on a mid-2009 13" MBP (5,5), 2.26 GHz/4 GB, running 10.6.8 (64-bit):
Pools      Objects per pool    Run time (seconds)
1          1000000             0.352932
10         100000              0.335275
100        10000               0.328813
1000       1000                0.328538
10000      100                 0.333225
100000     10                  0.343582
1000000    1                   0.395972
If you take the number of simultaneous objects S and the number of allocations N you can reproduce the measured time t spent on allocation and memory management with the formula T=1.5e-10*S^2+1.5e-3*N:
Objects    O/Pool     t (ms)   S          N          T (ms)   T/t
1000000    1000000    1650     1000001    1000001    1650     100.00%
1000000    100000     1590     100001     1000010    1502      94.43%
1000000    10000      1480     10001      1000100    1500     101.36%
1000000    1000       1460     1001       1001000    1501     102.84%
1000000    100        1430     101        1010000    1515     105.94%
1000000    10         1640     11         1100000    1650     100.61%
1000000    1          2910     2          2000000    3000     103.09%
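(To spell out one row: for the last line, S = 2 and N = 2000000, so T = 1.5e-10*2^2 + 1.5e-3*2000000 ≈ 0 + 3000 = 3000 ms, against the measured 2910 ms.)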
The constant factors 1.5e-10 and 1.5e-3 might be irrelevant for other cases than this, but the following result seems valid:
Total allocation time is directly proportional to N. Memory management time is proportional to S^2.
Felix