mikeash.com pyblog/friday-qa-2016-04-15-performance-comparisons-of-common-operations-2016-edition.html comments

192.168.1.1 - 2017-06-22 15:31:15

Thu, 22 Jun 2017 15:31:15 GMT

Thanks for the informative post. I added the page to my bookmarks and I'll come back here later.

Alex - 2017-01-12 05:47:56

Thu, 12 Jan 2017 05:47:56 GMT

I'm actually surprised how slow NSView creation is. It's 30x slower than a plain memory allocation, and less than 2x faster than disk I/O. I wonder what it's doing in there.

I look forward to the Swift additions to this.

mikeash - 2017-01-08 03:44:37

Sun, 08 Jan 2017 03:44:37 GMT

John Wallace: The message cache is not an MRU cache, it's a persistent cache of all messages ever sent. This cache is cleared on certain occasions, such as runtime manipulation of classes that would invalidate it, or loading new binary images, but in most programs the cache persists for a long time. Hitting the cache is the common case, by far. Objective-C would be intolerably slow if it were not.
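
To make the cache behavior described above concrete, here is a minimal C sketch of a per-class method cache. It is not the Objective-C runtime's actual implementation: the names `Class`, `Method`, `lookup`, `flush_cache`, and `demo_slow_lookups`, the fixed 64-slot table, and the string-keyed selectors are all assumptions made for illustration. The point is only that the slow walk up the class hierarchy happens once per selector, after which repeated sends hit the cache until something flushes it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 64 /* hypothetical fixed size, chosen only for this sketch */

typedef int (*Imp)(void);

typedef struct Method {
    const char *sel; /* selector name; NULL marks an empty cache slot */
    Imp imp;
} Method;

typedef struct Class {
    struct Class *super;
    const Method *methods;      /* methods defined directly on this class */
    size_t method_count;
    Method cache[CACHE_SLOTS];  /* every selector resolved so far */
    unsigned long slow_lookups; /* counts cache misses, for illustration */
} Class;

static size_t hash_selector(const char *sel) {
    size_t h = 5381;
    while (*sel) h = h * 33 + (unsigned char)*sel++;
    return h;
}

/* Linear probing: returns the slot holding `sel`, the first empty slot,
   or NULL if the table is full and `sel` is not in it. */
static Method *find_slot(Class *cls, const char *sel) {
    size_t start = hash_selector(sel) % CACHE_SLOTS;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        Method *entry = &cls->cache[(start + i) % CACHE_SLOTS];
        if (entry->sel == NULL || strcmp(entry->sel, sel) == 0) return entry;
    }
    return NULL;
}

/* Slow path: walk the method lists up the superclass chain. */
static Imp lookup_slow(const Class *cls, const char *sel) {
    for (const Class *c = cls; c != NULL; c = c->super)
        for (size_t i = 0; i < c->method_count; i++)
            if (strcmp(c->methods[i].sel, sel) == 0) return c->methods[i].imp;
    return NULL;
}

Imp lookup(Class *cls, const char *sel) {
    assert(cls != NULL && sel != NULL);
    Method *entry = find_slot(cls, sel);
    if (entry != NULL && entry->sel != NULL) return entry->imp; /* cache hit */
    cls->slow_lookups++;
    Imp imp = lookup_slow(cls, sel);
    if (imp != NULL && entry != NULL) { /* remember the result for next time */
        entry->sel = sel;
        entry->imp = imp;
    }
    return imp;
}

/* What an invalidating event (for example, changing a class at run time)
   would have to do in this sketch. */
void flush_cache(Class *cls) {
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        cls->cache[i].sel = NULL;
        cls->cache[i].imp = NULL;
    }
}

static int description_imp(void) { return 42; }

/* Sends the same inherited selector `sends` times and reports how many of
   those sends had to take the slow path. */
unsigned long demo_slow_lookups(int sends) {
    static const Method root_methods[] = { { "description", description_imp } };
    Class root = { .super = NULL, .methods = root_methods, .method_count = 1 };
    Class sub = { .super = &root, .methods = NULL, .method_count = 0 };
    for (int i = 0; i < sends; i++) {
        Imp imp = lookup(&sub, "description");
        assert(imp != NULL && imp() == 42);
    }
    return sub.slow_lookups;
}
```

In this sketch, `demo_slow_lookups(1000)` returns 1: the first send walks the class hierarchy and the remaining 999 are cache hits.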

John Wallace - 2016-12-08 20:44:02

Thu, 08 Dec 2016 20:44:02 GMT

Is the Objective-C message sent in a tight loop? The reason I ask is that there is an MRU cache on Obj-C messages that significantly speeds up repeated calls to a method. Based on your numbers, I'm assuming your test code is hitting that cache. Missing that cache, which is the most common real-world usage pattern, would be much slower because of how it walks the method tables to find a method. If you ever update your tests, it would be interesting to add that test case.

TZ - 2016-06-22 12:12:47

Wed, 22 Jun 2016 12:12:47 GMT

Hi Mike, I tried running your benchmarks on my machine but I can't build them in release mode: clang crashes with a segfault. Have you perchance experienced a similar problem, and do you know how to fix it? Thanks.

Eric Wing - 2016-04-28 05:21:05

Thu, 28 Apr 2016 05:21:05 GMT

Re: Floating-point division vs. integer division

I'm not a CPU expert, so I would like to learn more from those who do know, but there are a few factors.

First, I have been told that while the algorithms for division in both float and integer are complex, because the floating point is split between sign/mantissa/exponent, these operations can actually be split to be done in parallel (in the underlying circuitry). Integer division cannot be split this way so it is a sequential algorithm, and also working on a larger number of bits since it is not split among sign/mantissa/exponent.

Second, integer division is not a common operation, whereas float division is usually more useful, so there may be fewer integer divider units on a processor. You may get several floating-point dividers (different ports per core), and this is not counting that each of these is usually SIMD/vectorized, so you are expected to do (4 | 8 | 16 | etc.) divisions in the same operation. I suspect this optimization level will not try to vectorize for SIMD, so we can throw out that difference. But particularly with out-of-order/reorder execution CPUs like Intel and, I think, the latest Apple chips, because there are multiple floating-point divider ports, your pipeline is less likely to stall waiting for a free unit.

On a more general note, since iPhone CPUs are now closing in on 2GHz and multi-core, the real performance differences we’ll see will be about I/O. Traditionally, people think of I/O as disk, and maybe GPU, but people need to remember that main system RAM is also I/O. The current problem with computing today is that the majority of the time, the CPU is sitting around idle waiting on memory or something else.

In real high performance situations, cache hits/misses usually make the biggest differences in performance. Assuming a well written/optimized program that understands things like this, I suspect this is where Mac/desktop will show its huge performance wins as they can sport bigger caches and faster buses. But the kind of benchmark done here won’t make those things show up. This is also the type of thing the compiler optimization flags can’t magically fix either.

Still the conclusion is correct that the iPhone CPU has considerably closed the gap and looks more similar than dissimilar to its desktop counterpart.
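
The point above about cache hits and misses can be illustrated with a small C sketch. Both calls below add up exactly the same elements; only the order in which memory is touched differs. The function names are hypothetical, and the suggestion that a stride of 16 four-byte elements lands on a new cache line each time assumes 64-byte cache lines. Timing the two on a large array is left to the reader, since the result depends heavily on the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sums every element exactly once, visiting them `stride` elements apart.
   stride == 1 is a plain sequential walk; a larger stride does the same
   arithmetic but jumps around in memory. */
uint64_t sum_strided(const uint32_t *data, size_t count, size_t stride) {
    assert(stride > 0);
    uint64_t total = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < count; i += stride)
            total += data[i];
    return total;
}

/* Fills an array with 0, 1, 2, ... and sums it with the given stride. */
uint64_t demo_sum(size_t count, size_t stride) {
    uint32_t *data = malloc(count * sizeof *data);
    assert(data != NULL);
    for (size_t i = 0; i < count; i++) data[i] = (uint32_t)i;
    uint64_t total = sum_strided(data, count, stride);
    free(data);
    return total;
}
```

`demo_sum(n, 1)` and `demo_sum(n, 16)` always return the same total, so any timing difference between them on an array much larger than the caches would come from the access pattern rather than from the arithmetic.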

Jens Ayton - 2016-04-23 08:21:43

Sat, 23 Apr 2016 08:21:43 GMT

I'm intrigued by the integer division being 2.6 times slower on this Mac than your old one.

MANIAK_dobrii - 2016-04-22 13:08:04

Fri, 22 Apr 2016 13:08:04 GMT

As always, great article.

I wonder why "Floating-point division with integer conversion" (double/int) is faster than "Integer division" (int/int). Could this somehow be related to the ARM64 instruction set?
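
For anyone who wants to try the comparison themselves, here is a rough C sketch of timing the two kinds of division. It is not the benchmark code from the article: the function names, the use of `clock()`, and the `volatile` variables that keep the compiler from folding the divisions away are all assumptions of this sketch. The volatile loads and stores add overhead of their own, so the numbers are only useful relative to each other.

```c
#include <assert.h>
#include <time.h>

static double seconds_now(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Approximate nanoseconds per iteration of an int/int division loop. */
double ns_per_int_div(long iterations) {
    assert(iterations > 0);
    volatile long divisor = 3; /* volatile so the division is not optimized out */
    volatile long sink = 0;
    double start = seconds_now();
    for (long i = 1; i <= iterations; i++)
        sink = (i + 1000) / divisor;
    double elapsed = seconds_now() - start;
    return elapsed * 1e9 / (double)iterations;
}

/* Same loop, but dividing as double and converting the result back to an integer. */
double ns_per_double_div_to_int(long iterations) {
    assert(iterations > 0);
    volatile double divisor = 3.0;
    volatile long sink = 0;
    double start = seconds_now();
    for (long i = 1; i <= iterations; i++)
        sink = (long)((double)(i + 1000) / divisor);
    double elapsed = seconds_now() - start;
    return elapsed * 1e9 / (double)iterations;
}
```

Each division here is independent of the previous one, so this measures something closer to throughput than latency, which may or may not match how the article's benchmark was set up.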

Robin Kunde - 2016-04-22 02:23:53

Fri, 22 Apr 2016 02:23:53 GMT

Thanks for putting this together!

The transformation of memcpy into a series of mov instructions despite -O0 happens through a feature in clang/llvm called intrinsic functions. Basically, the compiler can provide its own implementation for certain basic functions and this happens separately from and transparently to the optimizer. You can disable this behavior with -fno-builtin (or set "Recognize Built-in functions" to No in Xcode build settings).

In my test, it changed the speed of the 16-byte memcpy from 0.5ns to 2.7ns.
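
A minimal C sketch for reproducing this: the file below contains a fixed-size 16-byte copy, and the names `Block16`, `copy16`, and `copy16_check` are made up for the example. Assuming it is saved as a hypothetical `copy16.c`, comparing the assembly from `clang -O0 -S copy16.c` with that from `clang -O0 -fno-builtin -S copy16.c` should show whether the copy is emitted inline or as a call to `memcpy`.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    unsigned char bytes[16];
} Block16;

/* A copy whose size is known at compile time, which is the case the
   compiler's built-in handling of memcpy can take over. */
void copy16(Block16 *dst, const Block16 *src) {
    memcpy(dst, src, sizeof *dst);
}

/* Returns 1 if copy16 reproduces the source block exactly. */
int copy16_check(void) {
    Block16 src, dst;
    assert(sizeof(Block16) == 16);
    for (size_t i = 0; i < sizeof src.bytes; i++)
        src.bytes[i] = (unsigned char)i;
    memset(&dst, 0, sizeof dst);
    copy16(&dst, &src);
    return memcmp(&dst, &src, sizeof src) == 0;
}
```

The behavior of the program is the same either way; only the generated code, and therefore the timing, changes.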

Fernando - 2016-04-16 09:40:39

Sat, 16 Apr 2016 09:40:39 GMT

Typo: "zero-zecond"

Charles Parnot - 2016-04-15 20:42:25

Fri, 15 Apr 2016 20:42:25 GMT

Nice work, thanks a lot for these insights!

The NSView results really make it clear why NSCell should be on its way out, and is now deprecated for NSTableView.

Matt - 2016-04-15 17:45:38

Fri, 15 Apr 2016 17:45:38 GMT

Hey Mike, long-time fan/reader here.

Quick question: could you also put up the performance of accessing an instance variable directly? There are other sources out there that compare local variable access vs. objc_msgSend, but they're kind of old and I'm curious to see what you end up with.

I'm also aware it's possible that I'm misunderstanding something and that's something you can't measure.