Tags: autorelease cocoa objectivec performance
If you've done much Cocoa programming, you've probably run into a situation where you needed to create a local autorelease pool because of some sort of loop. And you've probably run into advice telling you not to create and destroy the pool for every iteration, because that would be slow. I never believed that it could be significant, and I finally took the time to test it today. What's the verdict? Just as I thought, making autorelease pools is really fast.
I wrote a quick test program to allocate a million NSObjects and record the amount of time taken. The program increases the number of autorelease pools used by a factor of ten each time. The first test uses only one pool for the whole run, the second test uses ten, etc. At the end, it uses one pool per object.
You can get the source code to the test program here.
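If you just want the gist, the core of the test looks roughly like this (a simplified sketch; the real program linked above handles the timing and output a bit differently):

#import <Foundation/Foundation.h>
#include <stdio.h>

// Minimal sketch of the benchmark: allocate a million autoreleased
// NSObjects, draining the pool after every objectsPerPool allocations,
// and time each configuration. The real program differs in the details.
static void RunTest(unsigned poolCount, unsigned objectsPerPool)
{
    NSTimeInterval start = [NSDate timeIntervalSinceReferenceDate];
    unsigned i, j;
    for(i = 0; i < poolCount; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        for(j = 0; j < objectsPerPool; j++)
            [[[NSObject alloc] init] autorelease];
        [pool release];
    }
    printf("%u objects per pool: %f seconds\n",
           objectsPerPool,
           [NSDate timeIntervalSinceReferenceDate] - start);
}

int main(void)
{
    unsigned objectsPerPool;
    for(objectsPerPool = 1000000; objectsPerPool >= 1; objectsPerPool /= 10)
        RunTest(1000000 / objectsPerPool, objectsPerPool);
    return 0;
}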
Here are the results, compiled with gcc4 and -Os, run on my PowerBook G4/1.5 with 10.4.6 while relatively unloaded:
Objects    Objects per pool    Run time (seconds)
1000000    1000000             1.65
1000000    100000              1.59
1000000    10000               1.48
1000000    1000                1.46
1000000    100                 1.43
1000000    10                  1.64
1000000    1                   2.91
Interestingly enough, the run time actually goes down as the number of autorelease pools created goes up in the first part of the run. This is probably because more frequent pool creation means the program's memory footprint is smaller, and so less time is spent mucking about with page tables and the like.
At the end the runtime spikes up heavily. But notice that the worst runtime is only about twice as bad as the best runtime, and this makes sense because at the end we're allocating almost twice as many objects. The final run creates and destroys two million objects (one million NSObjects, one million NSAutoreleasePools), and the best run does only a hair over one million.
So what's the conclusion? Creating and destroying an NSAutoreleasePool is about as expensive as creating and autoreleasing an NSObject. This is really insanely cheap. You might as well create a new pool in every iteration of your loop. The only time it makes sense to play games with your pool so that it only gets destroyed and recreated every N iterations is when you're creating just a handful of objects per iteration, and even then your best-case savings will only be about 50%, and that's when you're creating a single object and doing nothing else in the loop. And why would you need that to be super ultra blazing fast anyway?
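Concretely, the pattern I'm talking about is just this (a sketch in manual retain/release style; the function and the per-item work are made up for illustration):

#import <Foundation/Foundation.h>

// One autorelease pool per iteration. Everything autoreleased inside the
// loop body is freed at the end of that pass, so the memory high-water
// mark stays low, and the pool itself costs about as much as one object.
static unsigned TotalDescriptionLength(NSArray *items)
{
    unsigned total = 0;
    unsigned i, count = [items count];
    for(i = 0; i < count; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        // -description returns an autoreleased object; it dies with the pool.
        total += [[[items objectAtIndex:i] description] length];
        [pool release];
    }
    return total;
}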
Premature optimization is the root of all evil. So many people who would otherwise agree with this statement still say things like "you may want to release the pool every 10th iteration". The next time you see someone saying something like that, gently correct them and point them to this post.
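For the record, here's the sort of thing that advice looks like in code (again an illustrative sketch; the interval and the per-item work are made up). It works, but as the numbers above show, the extra bookkeeping buys you essentially nothing over the one-pool-per-iteration version:

#import <Foundation/Foundation.h>

// The commonly recommended "drain the pool every N iterations" trick.
// Per the measurements above, it saves almost nothing compared to
// simply creating and releasing a pool on every pass.
#define kDrainInterval 10

static unsigned TotalDescriptionLengthBatched(NSArray *items)
{
    unsigned total = 0;
    unsigned i, count = [items count];
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    for(i = 0; i < count; i++)
    {
        total += [[[items objectAtIndex:i] description] length];
        if((i + 1) % kDrainInterval == 0)
        {
            [pool release];
            pool = [[NSAutoreleasePool alloc] init];
        }
    }
    [pool release];
    return total;
}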
Comments:
The formula was just a wild guess which I then parametrized to fit the given numbers. The “roundness” of the factors is pure coincidence and I was quite astonished myself. Unfortunately I found out why this formula would have to be verified with different data to have any value: if you compute the part 1.5e-10*S^2 alone, you get these results (in ms):
150.000
1.500
0.015
0.000
0.000
0.000
0.000
This shows that effectively only the first value changes anything at all in the result, and that the linear part 1.5e-3*N takes over for all the other results. While this does not yet mean that the relationship is completely wrong (it sort of proves the constant time per object allocation), it strongly suggests that further investigation is necessary before applying it elsewhere. I fear that memory management has a lot more complex time / size curves (see http://ridiculousfish.com/blog/archives/.. for example).
Hmm. I was amazed though when I first saw the result.
Cheers,
Felix
As a final note, from the data so far it seems like rebuilding AR pools does speed you up significantly. If it does it consistently, it’s not a premature optimization – it’s a best practice. After all, it’s not like it adds hideous code bloat – it’s a couple of “by-rote” lines.
I’m not objecting to rebuilding autorelease pools. Quite the contrary, I’m advocating creating a new pool every time. Very frequently you will find posts that say you should make an autorelease pool in your loop, but that for efficiency you should only destroy and recreate the pool every N iterations, where N is some number larger than 1, because creating and destroying an autorelease pool is perceived as slow.
Here’s a random example pulled off the net (http://forums.macrumors.com/showthread.p..):
It is not necessary to create an autorelease pool at every step. It’s often only necessary to create one pool per project. The objects you create and destroy will still be cleared out once each pass through the event loop. Bear in mind every time you create and destroy a pool, there is some overhead involved, so you generally only want to create subpools when there is a real need to, such as when lots of memory is being allocated in a loop. Even in that case, you often won’t want to do a pool for each loop iteration.
I see this repeated constantly. But as my data shows, the only time you don’t want to do a pool for each loop iteration is when your loop’s work is utterly trivial, and if it’s so trivial then why are you even doing it, or why do you care if it’s fast?
I’m advocating that if you have a long-running loop that autoreleases objects and you need to make a pool, just make one pool per iteration. Don’t play tricks to reduce the number of pools created, because creating and destroying a pool is, as the title says, fast.
I’d add an “Initially,” to that sentence, but otherwise, I agree.
That is, unless you know you have a problem, don’t fiddle with it. Optimize these things later. I wouldn’t really advise against playing with the number of pools in general, like I wouldn’t advise against the use of IMP-caching or other optimizations if the code requires it. Sometimes it even makes sense to actually use different zones (just to further deviate from the topic).
Hope that’s clearer,
Kay
Iterations Objs/Iter t/sec
1e0 1e6 0.741
1e1 1e5 0.603
1e2 1e4 0.589
1e3 1e3 0.596
1e4 1e2 0.604
1e5 1e1 0.692
1e6 1e0 1.491
So, similar results then. Creating/destroying the ARP every time through the loop hurts, but when you drain the pool less often the overhead varies in a non-obvious way, and not by much. It almost certainly depends on the size of the objects and on whether they do anything special in -dealloc, too.
===== warming up whatever caches may exist, ignore this
Testing 1000 iterations, allocating 1 objects per iteration...done
total time: 0.010807 seconds
===== done, results follow
Testing 1 iterations, allocating 1000000 objects per iteration...done
total time: 9.243073 seconds
Testing 10 iterations, allocating 100000 objects per iteration...done
total time: 9.074250 seconds
Testing 100 iterations, allocating 10000 objects per iteration...done
total time: 9.706995 seconds
Testing 1000 iterations, allocating 1000 objects per iteration...done
total time: 8.522664 seconds
Testing 10000 iterations, allocating 100 objects per iteration...done
total time: 5.443504 seconds
Testing 100000 iterations, allocating 10 objects per iteration...done
total time: 6.312481 seconds
Testing 1000000 iterations, allocating 1 objects per iteration...done
total time: 11.958108 seconds
1) It's not always possible to avoid autoreleasing objects. Sometimes you have to return non-retained objects to the caller, and the frameworks will autorelease objects in many methods without your control or consent.
2) It's a lot more convenient to use factory methods instead of alloc/init/release. The only reason not to is to keep your memory high-water-mark reasonably low, and the whole point of this performance measurement is that you can achieve that with a liberal sprinkling of autorelease pools instead. (And achieve it better, since the framework may autorelease objects beyond your control.)
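As an illustration (a hypothetical loop; stringWithFormat: just stands in for any factory method), the two styles look like this, and with a per-iteration pool the convenient one costs essentially nothing extra:

#import <Foundation/Foundation.h>

// Factory method vs. alloc/init/release inside a loop. With a pool per
// iteration, the autoreleased version is freed just as promptly, so the
// main reason to prefer the manual version disappears.
static void LabelItems(unsigned count)
{
    unsigned i;
    for(i = 0; i < count; i++)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        // Convenient: autoreleased, freed when the pool is released.
        NSString *label = [NSString stringWithFormat:@"item %u", i];
        NSLog(@"%@", label);

        // Manual alternative: more typing for the same high-water mark.
        NSString *label2 = [[NSString alloc] initWithFormat:@"item %u", i];
        NSLog(@"%@", label2);
        [label2 release];

        [pool release];
    }
}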
Perhaps it's interesting to note that the line seems to have become flatter with more recent iterations of the OS (ObjC 2) and hardware, but the trend didn't change. This is on a mid-2009 13" MBP (5,5), 2.26 GHz/4 GB, running 10.6.8 (64-bit):
Pools      Objects per pool    Run time (seconds)
1          1000000             0.352932
10         100000              0.335275
100        10000               0.328813
1000       1000                0.328538
10000      100                 0.333225
100000     10                  0.343582
1000000    1                   0.395972
If you take the number of simultaneous objects S and the number of allocations N you can reproduce the measured time t spent on allocation and memory management with the formula T=1.5e-10*S^2+1.5e-3*N:
Objects    O/Pool     t (ms)   S          N          T (ms)   T/t
1000000    1000000    1650     1000001    1000001    1650     100.00%
1000000    100000     1590     100001     1000010    1502      94.43%
1000000    10000      1480     10001      1000100    1500     101.36%
1000000    1000       1460     1001       1001000    1501     102.84%
1000000    100        1430     101        1010000    1515     105.94%
1000000    10         1640     11         1100000    1650     100.61%
1000000    1          2910     2          2000000    3000     103.09%
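(To spell out one row: for the last line, S = 2 and N = 2000000, so T = 1.5e-10*2^2 + 1.5e-3*2000000 ≈ 0 + 3000 = 3000 ms, against the measured 2910 ms.)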
The constant factors 1.5e-10 and 1.5e-3 might be irrelevant for other cases than this, but the following result seems valid:
Total allocation time is directly proportional to N. Memory management time is proportional to S^2.
Felix