Tags: cocoa iphone objectivec performance
I finally got a chance to run my performance comparison code on an iPhone, so we can see just how much horsepower this little device has. I am still unable to load my own code onto the device, so I want to thank an anonymous benefactor for adapting my code to the new environment and gathering the results for me.
For comparison, you may wish to see the original Performance Comparisons of Common Operations and its followup, Performance Comparisons of Common Operations, Leopard Edition. The source code used in this test can be obtained here.
Here are the times:
Name | Iterations | Total time (sec) | Time per (ns) |
C++ virtual method call | 1000000000 | 80.8 | 80.8 |
IMP-cached message send | 1000000000 | 85.4 | 85.4 |
Floating-point division | 100000000 | 13.4 | 134.4 |
Integer division | 1000000000 | 139.5 | 139.5 |
16 byte memcpy | 100000000 | 17.6 | 175.7 |
Objective-C message send | 1000000000 | 192.9 | 192.9 |
Float division with int conversion | 100000000 | 19.3 | 193.0 |
NSInvocation message send | 10000000 | 19.0 | 1899.0 |
16 byte malloc/free | 100000000 | 198.8 | 1988.4 |
NSObject alloc/init/release | 10000000 | 118.8 | 11883.6 |
NSAutoreleasePool alloc/init/release | 10000000 | 172.7 | 17272.9 |
16MB malloc/free | 100000 | 3.1 | 30754.5 |
Read 16-byte file | 100000 | 51.1 | 511041.3 |
Zero-second delayed perform | 100000 | 67.5 | 674994.5 |
pthread create/join | 10000 | 8.0 | 802160.2 |
Write 16-byte file (atomic) | 10000 | 51.5 | 5153943.7 |
Write 16-byte file | 10000 | 80.9 | 8089726.2 |
1MB memcpy | 10000 | 81.3 | 8130009.1 |
Read 16MB file | 100 | 137.6 | 1376092573.3 |
Write 16MB file (atomic) | 30 | 143.8 | 4793527088.9 |
Write 16MB file | 30 | 151.2 | 5038515361.1 |
Note that this test suite is somewhat reduced compared to the original. NSTask and NSButtonCell don't exist, so those tests were removed. Conceivably they could be replaced with substitutes, but I didn't bother.
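For readers who want to see how the "Time per (ns)" column relates to the other two, here is a minimal sketch of the idea: run an operation in a loop, measure the total, and divide by the iteration count. This is not the actual test source; `perIterationNs` and `timePerIterationNs` are names invented for illustration, and the use of `std::chrono::steady_clock` is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: average cost of one iteration, in nanoseconds,
// given the total run time in seconds and the iteration count.
double perIterationNs(double totalSeconds, std::uint64_t iterations) {
    assert(iterations > 0);
    return totalSeconds * 1e9 / static_cast<double>(iterations);
}

// Hypothetical harness: runs `op` `iterations` times and returns the
// average time per iteration in nanoseconds. Loop overhead is included
// in the result, so very cheap operations are overestimated.
template <typename Op>
double timePerIterationNs(std::uint64_t iterations, Op op) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        op();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;
    return perIterationNs(elapsed.count(), iterations);
}
```

Feeding the first row of the table into `perIterationNs(80.8, 1000000000)` gives back roughly 80.8 ns, which matches the last column.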
The first thing that stands out is the large speed difference for low-level operations compared to the Mac Pro used in the original tests. Of course I wouldn't expect a handheld device to compete against a modern desktop machine, but the contrast is still striking. The worst is the IMP-cached message send, which is over one hundred times slower on the iPhone.
It's also interesting to note that C++ virtual method calls post a better time than IMP-cached message sends. I'll assume that the difference is within the margin of error and that they are both actually the same speed. This is still an interesting result, since the C++ virtual method call involves more indirection than calling an IMP. I would guess that the ARM architecture includes an instruction which natively handles this indirection; anyone familiar with ARM care to comment?
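To make the difference in indirection concrete, here is a rough C++ sketch of the two call styles. It is only an analogy: a real IMP takes `self` and a selector, while `CachedCall` here is a plain function pointer standing in for one, and all of the names (`Counter`, `incrementDirect`, `runVirtual`, `runCached`) are made up for this example.

```cpp
#include <cassert>

struct Counter {
    int value = 0;
    virtual ~Counter() = default;
    virtual void increment() { ++value; }
};

// Stand-in for an IMP: a function pointer looked up once and then reused.
using CachedCall = void (*)(Counter*);

void incrementDirect(Counter* c) { ++c->value; }

// Virtual dispatch: each call conceptually loads the vtable pointer from
// the object, loads the method address from the vtable, then branches.
// An optimizing compiler may devirtualize this when it can see the type.
int runVirtual(Counter& c, int n) {
    for (int i = 0; i < n; ++i) {
        c.increment();
    }
    return c.value;
}

// Cached dispatch: the address is already in hand, so each call is just
// an indirect branch with no per-call table lookup.
int runCached(Counter& c, CachedCall fn, int n) {
    for (int i = 0; i < n; ++i) {
        fn(&c);
    }
    return c.value;
}
```

On paper the cached version does strictly less work per call, which is why the two coming out about even on the iPhone is a little surprising.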
Another interesting pairing is integer and floating-point division. Again these appear to be the same speed on the iPhone, whereas on the Mac Pro floating-point division is roughly 3.5 times slower than integer division. This makes floating-point division on the iPhone merely 15 times slower than on the Mac Pro.
The results also show the atomic file writes to be faster than the non-atomic ones. I have no explanation for this other than testing error, but the difference in timing for the 16-byte file is pretty huge. The 5-8ms it takes to write the 16-byte file is surprisingly long. At that size seek time should completely dominate, and flash memory has effectively no seek time, so I don't understand why this number is so large. Perhaps the slow CPU is what makes this operation so costly. The 16MB test shows about a 3MB/s sustained write speed, which is not too bad.
The 1MB memcpy test reveals roughly 120MB/s of available memory bandwidth. I'm a bit surprised that it's this low, given the on-die RAM, but this is roughly comparable to the rest of the system even so.
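The throughput figures quoted for the 16MB write and the 1MB memcpy come straight from the table: size divided by time per operation. A small sketch of that arithmetic follows; `megabytesPerSecond` is a hypothetical helper, and treating "MB" as a single consistent unit for both size and rate is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: throughput in MB/s for an operation that moves
// `megabytes` of data and takes `nanoseconds` per operation, as listed
// in the "Time per (ns)" column.
double megabytesPerSecond(double megabytes, double nanoseconds) {
    assert(nanoseconds > 0.0);
    return megabytes / (nanoseconds / 1e9);
}
```

With the table's numbers, `megabytesPerSecond(16, 5038515361.1)` comes to about 3.2 MB/s for the plain 16MB write, `megabytesPerSecond(16, 4793527088.9)` to about 3.3 MB/s for the atomic one, and `megabytesPerSecond(1, 8130009.1)` to about 123 MB/s for the memcpy.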
Overall, this little machine isn't going to be substituting for a Mac Pro anytime soon, but it's not bad for a pocket-sized computer with such constraints on cost, battery usage, and heat. Now if only Apple would let me put software on it.
Comments:
Assuming you have an object pointer in register r0:
; load the v-table pointer from the object, leaving the object pointer (this) in r0
LDR r1, [r0, #0]
; load the address of the method from the v-table
LDR r1, [r1, #8]
; call the method
BLX r1
LDR loads a 32-bit value from memory using base-plus-offset addressing.
BLX saves the return address in the link register (r14) and branches to the address contained in the register parameter. It can also change states from ARM to Thumb or Thumb to ARM. BLX will also flush the pipeline (the branch predictor probably won't be much help here).
For the function return, you copy the link register into the program counter.
You can see that it must perform two loads from memory to get the address of the method, which probably accounts for the biggest part of the overhead.
I'd be interested to see what the code looks like to call an IMP-cached method.
Since the C++ method call takes ~50 cycles to execute two loads and a branch, you get a sense of just how slow the memory system is.
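The cycle estimate above depends on the clock rate, which the comment does not state. As a hedged sketch of the conversion (both function names are invented here), the clock is left as a parameter:

```cpp
#include <cassert>

// Hypothetical helper: number of CPU cycles that elapse in `nanoseconds`
// at a clock rate of `clockMHz`.
double cyclesForNanoseconds(double nanoseconds, double clockMHz) {
    assert(clockMHz > 0.0);
    return nanoseconds * clockMHz / 1000.0;
}

// Hypothetical helper: the clock rate, in MHz, implied by a claim that
// an operation taking `nanoseconds` costs `cycles` cycles.
double impliedClockMHz(double nanoseconds, double cycles) {
    assert(nanoseconds > 0.0);
    return cycles / nanoseconds * 1000.0;
}
```

Working backwards, ~50 cycles for the measured 80.8 ns implies a clock of roughly 620 MHz. A different clock rate would give a proportionally different cycle count.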
As an aside, the ARM procedure call standard states that the first four parameters to a function are passed in registers r0-r3. If your function takes more than four parameters the additional parameters are passed on the stack (which is sloooow).
Since a C++ method has an implicit this pointer as the first parameter, you should keep C++ methods to three or fewer parameters.
Since an Objective-C method has implicit parameters for both self and the selector, you should keep methods to two or fewer parameters (when speed is of the utmost concern).
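The parameter-count advice above can be sketched as a little bookkeeping exercise. This assumes every argument is a single 32-bit word (an int or a pointer); 64-bit values, floats, and structs follow more involved rules that this ignores. `ArgPlacement` and `placeWordSizedArgs` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct ArgPlacement {
    int inRegisters;  // arguments passed in r0-r3
    int onStack;      // arguments spilled to the stack
};

// Hypothetical helper: splits word-sized arguments between the four
// argument registers and the stack. `implicitArgs` is 1 for a C++
// method (this) and 2 for an Objective-C method (self and the selector).
ArgPlacement placeWordSizedArgs(int implicitArgs, int explicitArgs) {
    assert(implicitArgs >= 0 && explicitArgs >= 0);
    const int argumentRegisters = 4;  // r0 through r3
    const int total = implicitArgs + explicitArgs;
    const int inRegisters = std::min(total, argumentRegisters);
    return ArgPlacement{inRegisters, total - inRegisters};
}
```

A C++ method with three explicit parameters, `placeWordSizedArgs(1, 3)`, fits entirely in registers, and so does an Objective-C method with two, `placeWordSizedArgs(2, 2)`. Add one more parameter to either and one argument lands on the stack.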
http://en.wikipedia.org/wiki/Flash_memory
NAND write operations occur at a block granularity. So a 16 byte write must write the size of an entire block. I don't know what block size is used on iPhone, but the article suggests 16, 128 or 256 kilobytes.
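Taking the block-granularity model in this comment at face value, the cost of a tiny write is easy to estimate: round the request up to whole blocks. The sketch below does only that; it ignores filesystem metadata, caching, and anything the flash controller does, and `bytesActuallyWritten` is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes physically written when a request of
// `requestBytes` is rounded up to a whole number of blocks of
// `blockBytes` each.
std::uint64_t bytesActuallyWritten(std::uint64_t requestBytes,
                                   std::uint64_t blockBytes) {
    assert(blockBytes > 0);
    const std::uint64_t blocks = (requestBytes + blockBytes - 1) / blockBytes;
    return blocks * blockBytes;
}
```

With a 128-kilobyte block, `bytesActuallyWritten(16, 128 * 1024)` is 131072 bytes, or 8192 times the 16 bytes requested. The 16 and 256 kilobyte sizes give 1024 times and 16384 times respectively.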
Here are a select few (Time per (ns)):
IMP-cached message send 0.1
C++ virtual method call 0.2
Objective-C message send 1.3
16 byte malloc/free 41.2
NSObject alloc/init/release 113.9
16MB malloc/free 244.2
C++ virtual method call 0.3
IMP-cached message send 0.4
Objective-C message send 0.8
16 byte malloc/free 134.1
NSObject alloc/init/release 191.8
16MB malloc/free 3725.5
So in some aspects, very competitive with the much more powerful desktop; in others (namely the 16MB malloc), still much slower.