Tags: cocoa leopard objectivec performance
By popular demand, I have re-run my Performance Comparisons of Common Operations on the same hardware but running Leopard.
You can see the original article here. I used the exact same program as before, which you can get here. There is one change to the hardware: the computer now has 7GB of RAM instead of 3GB, which I don't expect to have much influence on the results. It still has the 2.66GHz CPUs and the stock 250GB hard drive. Here's the new chart:
Name | Iterations | Total time (sec) | Time per (ns) |
IMP-cached message send | 1000000000 | 0.7 | 0.7 |
C++ virtual method call | 1000000000 | 1.1 | 1.1 |
Integer division | 1000000000 | 2.4 | 2.4 |
Objective-C message send | 1000000000 | 4.9 | 4.9 |
Float division with int conversion | 100000000 | 0.9 | 9.0 |
Floating-point division | 100000000 | 0.9 | 9.2 |
16 byte memcpy | 100000000 | 2.9 | 28.9 |
16 byte malloc/free | 100000000 | 5.6 | 56.0 |
NSInvocation message send | 10000000 | 0.8 | 77.3 |
NSObject alloc/init/release | 10000000 | 2.9 | 290.5 |
NSAutoreleasePool alloc/init/release | 10000000 | 3.6 | 357.7 |
16MB malloc/free | 100000 | 0.4 | 4485.2 |
NSButtonCell creation | 1000000 | 6.6 | 6640.5 |
Read 16-byte file | 100000 | 2.1 | 21219.3 |
Zero-second delayed perform | 100000 | 4.2 | 42211.8 |
pthread create/join | 10000 | 0.6 | 56633.2 |
NSButtonCell draw | 100000 | 6.9 | 69400.5 |
1MB memcpy | 10000 | 1.2 | 123001.8 |
Write 16-byte file | 10000 | 4.9 | 492040.5 |
Write 16-byte file (atomic) | 10000 | 8.7 | 867380.7 |
NSTask process spawn | 1000 | 6.1 | 6096478.5 |
Read 16MB file | 100 | 2.9 | 28619582.6 |
Write 16MB file (atomic) | 30 | 10.7 | 356168718.8 |
Write 16MB file | 30 | 10.9 | 361767086.5 |
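Each "Time per" figure is simply the total time divided by the number of iterations. For readers who don't want to download the test program, the general shape of that kind of timing loop looks something like this sketch, which uses mach_absolute_time and is not the actual harness:

#import <Foundation/Foundation.h>
#include <mach/mach_time.h>

// Minimal timing-loop sketch: run an operation many times and report the
// average cost in nanoseconds. Not the actual test program, just the idea.
static void TimeOperation(NSString *name, uint64_t iterations, void (*op)(void))
{
    mach_timebase_info_data_t info;
    mach_timebase_info(&info);

    uint64_t start = mach_absolute_time();
    for(uint64_t i = 0; i < iterations; i++)
        op();
    uint64_t elapsed = mach_absolute_time() - start;

    // Convert mach time units to nanoseconds.
    double ns = (double)elapsed * info.numer / info.denom;
    NSLog(@"%@: %.1f ns per iteration", name, ns / iterations);
}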
As you would expect, the low-level stuff that doesn't touch the OS is pretty much unaffected, and any changes are well within the margin of error. Things like Objective-C message sends really can't get much faster than they already are, and are unchanged. However, there are some interesting changes from the Tiger numbers.
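For anyone unfamiliar with the fastest row in the chart, "IMP-cached message send" means looking the method implementation up once and then calling it through a plain C function pointer, so objc_msgSend's lookup is skipped on every call. A minimal sketch (the object and iteration count are whatever the caller supplies):

#import <Foundation/Foundation.h>

// Look up -self's implementation once, then call it directly in a loop.
// Each call is an ordinary C function call with no dynamic lookup.
static void IMPCachedSends(NSObject *obj, int count)
{
    SEL sel = @selector(self);
    IMP imp = [obj methodForSelector:sel];
    for(int i = 0; i < count; i++)
        ((id (*)(id, SEL))imp)(obj, sel);
}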
Sending a message with NSInvocation on Tiger took about 160ns per message, but on Leopard it took only 77ns. That's over a factor of two faster. It's still over ten times slower than a straight message send, but it's much better.
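For reference, sending a message through NSInvocation means building up and dispatching an invocation object rather than calling the method directly, roughly like this sketch (not the test program's exact code):

#import <Foundation/Foundation.h>

// Sketch of sending -self through NSInvocation instead of directly.
// The extra objects and bookkeeping are why it's so much slower than
// a plain message send.
static id SendViaInvocation(NSObject *obj)
{
    SEL sel = @selector(self);
    NSMethodSignature *sig = [obj methodSignatureForSelector:sel];
    NSInvocation *inv = [NSInvocation invocationWithMethodSignature:sig];
    [inv setTarget:obj];
    [inv setSelector:sel];
    [inv invoke];

    id result = nil;
    [inv getReturnValue:&result];
    return result;
}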
Allocating and destroying Objective-C objects has apparently become significantly slower. [[[NSObject alloc] init] release] on Tiger took under 190ns, but on Leopard it took 290ns. This is still well into ignorable territory in most situations, but taking 50% more time is not good. I'm not sure what changes would have been made to make this slower. The small malloc/free test was pretty much unaffected, so it's apparently something specific to Objective-C.
The 16MB malloc/free test ran in less than half the time on Leopard. At this size, malloc and free hit the kernel directly, so presumably there is some syscall or kernel memory-management optimization at work.
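Roughly speaking, an allocation this large bypasses malloc's small-block pools and is served straight from the Mach VM system, something like this sketch (the real malloc implementation is, of course, more involved):

#include <mach/mach.h>

// A 16MB malloc essentially asks the kernel for fresh pages.
// Sketch only, not what the malloc library literally does.
static void *AllocateBigBlock(vm_size_t size)
{
    vm_address_t addr = 0;
    kern_return_t kr = vm_allocate(mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE);
    return (kr == KERN_SUCCESS) ? (void *)addr : NULL;
}

// The matching free comes down to vm_deallocate(mach_task_self(), addr, size).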
Delayed performs are somewhat worse on Leopard, going from 30µs each to 42µs.
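A zero-second delayed perform is the usual run-loop-scheduled perform, presumably along these lines (-doSomething is a hypothetical method on self):

// Not called immediately: the perform is scheduled on the current run loop
// and fires on its next pass, even with a delay of zero.
[self performSelector:@selector(doSomething) withObject:nil afterDelay:0.0];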
NSButtonCell drawing got a bit faster, although not significantly. Leopard probably has various drawing optimizations, since that's something that the OS does quite a lot of.
Pthread creation is about twice as fast on Leopard. It's still annoyingly slow, but now it's something you can do about 20,000 times a second instead of 10,000 times a second.
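The operation being timed there is a full create-and-join round trip, presumably with a thread body that does essentially nothing, along these lines:

#include <pthread.h>
#include <stddef.h>

// Spin up a thread that does nothing, then wait for it to exit.
static void *DoNothing(void *arg)
{
    return NULL;
}

static void CreateAndJoin(void)
{
    pthread_t thread;
    if(pthread_create(&thread, NULL, DoNothing, NULL) == 0)
        pthread_join(thread, NULL);
}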
And lastly, the atomic 16-byte file write is a bit over 10% faster. Perhaps there are some filesystem optimizations affecting the atomic swap process that this uses.
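For reference, the atomic variant presumably comes down to the atomically: flag on a standard Cocoa write call such as NSData's (data and path are placeholders here):

// Non-atomic: write directly to the destination file.
[data writeToFile:path atomically:NO];

// Atomic: write to a temporary file first, then swap it into place.
// That swap is the step a filesystem optimization would presumably speed up.
[data writeToFile:path atomically:YES];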
Update: The above was compiled as 32-bit, and it was pointed out that it might be good to show 64-bit numbers as well:
Name | Iterations | Total time (sec) | Time per (ns) |
IMP-cached message send | 1000000000 | 0.8 | 0.8 |
C++ virtual method call | 1000000000 | 1.1 | 1.1 |
Integer division | 1000000000 | 2.4 | 2.4 |
Objective-C message send | 1000000000 | 8.6 | 8.6 |
Float division with int conversion | 100000000 | 0.9 | 9.0 |
Floating-point division | 100000000 | 0.9 | 9.0 |
16 byte memcpy | 100000000 | 2.9 | 29.3 |
16 byte malloc/free | 100000000 | 5.3 | 52.7 |
NSInvocation message send | 10000000 | 0.8 | 81.6 |
NSAutoreleasePool alloc/init/release | 10000000 | 1.7 | 169.5 |
NSObject alloc/init/release | 10000000 | 1.9 | 192.6 |
16MB malloc/free | 100000 | 0.3 | 2924.6 |
NSButtonCell creation | 1000000 | 6.1 | 6069.9 |
Read 16-byte file | 100000 | 1.7 | 17382.1 |
Zero-second delayed perform | 100000 | 3.4 | 33858.4 |
pthread create/join | 10000 | 0.6 | 56279.8 |
NSButtonCell draw | 100000 | 6.5 | 64812.5 |
1MB memcpy | 10000 | 1.2 | 122725.0 |
Write 16-byte file | 10000 | 4.9 | 486454.7 |
Write 16-byte file (atomic) | 10000 | 8.6 | 859274.8 |
NSTask process spawn | 1000 | 5.6 | 5551589.2 |
Read 16MB file | 100 | 2.8 | 27785650.2 |
Write 16MB file | 30 | 9.2 | 305342338.2 |
Write 16MB file (atomic) | 30 | 9.2 | 306369990.9 |
There are definitely some interesting differences here. Objective-C object allocation is back down to the Tiger numbers. I have even less of an idea why it would only be slower on Leopard 32-bit but not 64-bit. The 16MB malloc/free is even faster under 64-bit, and delayed performs are significantly faster as well. Everything else seems to be pretty much the same, within the margin of error.
Comments:
The link in the first paragraph is broken in the RSS feed. Probably this is some not-very-interesting issue in your software regarding URLs with queries in them.
I do not share your sanguineness, I'm afraid. This could easily skew results. The kernel used to allocate certain caches as a percentage of available RAM; I don't know if it still does that, but there is still kernel overhead in managing the extra RAM. Can you rerun these results using the kernel "maxmem" (or whatever it is) boot param to temporarily lock the usable memory to 3GB?
"I have even less of an idea why it would only be slower on Leopard 32-bit but not 64-bit"
It is possible that the 32-bit and 64-bit compilers have their own forks of the optimizer, memory management code, etc., and that the 64-bit compiler is heavily optimized while the 32-bit one is not...
Even if you had GC off for the test, the 10.5 frameworks might still generate a few little stubs that eat up that bit of extra time. It's just a guess, but it would explain why things got slower. It still doesn't explain the 32/64 question, but Voolek made a good point about optimization.
// C++ member-function-pointer call, timed with the test program's
// BEGIN()/END() macros. StubClass and its empty stub method are assumed
// to come from the same test code.
StubClass *obj = new StubClass;
void (StubClass::*func)() = &StubClass::stub;
BEGIN( 1000000000 )
    (obj->*func)();
END()
It'd be interesting to see what differences there are between GCC and Clang too.