Today I have the pleasure of presenting the follow-up to last week's guest post. Gwynne Raskind returns to complete her in-depth analysis of the assembly code generated by a small sample program.
In last week's article, I discussed the x86_64 architecture and the disassembly of the main function of Mike's example code. This is part 2, in which I look at the differences in optimized code, disassembly of the rest of the sample code, the start runtime function, and some functions that work with floating-point values. If you haven't yet read part 1, I strongly recommend it, since otherwise part 2 won't make much sense.
Optimization
In part 1, I purposely examined the unoptimized version of the assembly language produced by the compiler, under the theory that optimization would obscure the finer details of how the code works at the assembler level. Now it's time to see what optimized code looks like. Here's main in assembly again, this time compiled with -Os:
_main:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
callq _objc_autoreleasePoolPush
movq %rax, %r14
movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
leaq l_objc_msgSend_fixup_alloc(%rip), %rsi
callq *l_objc_msgSend_fixup_alloc(%rip)
movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
leaq L__unnamed_cfstring_26(%rip), %rdx
movq _objc_msgSend@GOTPCREL(%rip), %rbx
movq %rax, %rdi
movl $42, %ecx
callq *%rbx
movq %rax, %r15
movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
movq %r15, %rdi
callq *%rbx
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %rbx
movq %rbx, %rdi
callq _MyFunction
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %r12
movq %rbx, %rdi
callq _objc_release
leaq L__unnamed_cfstring_23(%rip), %rdi
movq %r12, %rsi
xorb %al, %al
callq _NSLog
movq %r12, %rdi
callq _objc_release
movq %r15, %rdi
callq _objc_release
movq %r14, %rdi
callq _objc_autoreleasePoolPop
xorl %eax, %eax
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
ret
The unoptimized version of main was 60 lines; this optimized code is only 49. The compiler managed to save 11 instructions. Expecting more is unreasonable; optimization, even when done for size savings, tends to be more concerned with making efficient use of the CPU and its abilities than with using the absolute minimum number of instructions. On almost any modern processor, a few extra simple instructions usually cost less than fewer, more complicated ones. Compiling with -O3, which optimizes heavily for speed over size, actually increases the code size to 65 instructions, mostly due to inlining.
Because I've already explained the meaning of all of the individual instructions involved (with one exception), in this breakdown I'll look purely at groups of instructions and how the compiler has optimized each section.
pushq %rbp
movq %rsp, %rbp
Look familiar? It should; this is exactly the same instruction sequence main started with before. Nothing's changed about the code which sets up the stack pointer; the stack frame has to be set up in a particular way and this is it (more on this later).
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
Instead of a bunch of values being stored to the stack, the optimizer has chosen to save the values of several registers to the stack so they can be used as scratch space during the function. The x86_64 ABI specifies which registers must be preserved across function calls and which can be freely used as scratch. All four of these must be preserved, so main has to save them before using them and restore them before returning; in exchange, whatever main keeps in them survives the calls it makes. Registers are faster to access than stack memory, sometimes dramatically so (the delay can stretch into seconds if the stack happens to be paged out to disk), so it's a win to use the stack once at the beginning and once at the end, and manipulate data in registers during the function's execution.
callq _objc_autoreleasePoolPush
movq %rax, %r14
objc_autoreleasePoolPush takes no arguments and returns a simple integer value in rax. The optimizer saves the return value in r14 instead of spilling it to the stack.
movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
leaq l_objc_msgSend_fixup_alloc(%rip), %rsi
callq *l_objc_msgSend_fixup_alloc(%rip)
Load the MyClass class object into rdi, load the address of l_objc_msgSend_fixup_alloc into rsi, and call the function. It's much the same sequence as the unoptimized code, but without the stack use and all in one place.
movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
leaq L__unnamed_cfstring_26(%rip), %rdx
movq _objc_msgSend@GOTPCREL(%rip), %rbx
movq %rax, %rdi
movl $42, %ecx
callq *%rbx
movq %rax, %r15
Load the [MyClass initWithName:number:] selector into rsi, load @"name" into rdx, load the address of objc_msgSend@GOTPCREL into rbx, load the return value from alloc into rdi, load 42 into ecx, call objc_msgSend@GOTPCREL, and save the return value (i.e. obj) in r15.
objc_msgSend@GOTPCREL? What in the world is that thing? Well, as it turns out, it's more than meets the eye. If you peek at the generated machine code with a disassembler, it turns out to not be a mov instruction at all, but rather a lea! GOTPCREL is a directive which allows the rip-relative address of a function to be inserted at link time so a direct call can be made, if that address can be calculated at link time. objc_msgSend is one of the functions for which this is true, and optimization lets the compiler make the attempt.
In other words, when optimization is on, the compiler generates code that makes a short, fast call to the function instead of making it go through the slower dynamic library call, potentially a "far jump" (a branch over a long distance of code, which, by necessity, is much slower).
Note: I'm not 100% sure of my facts on this one; I'd appreciate any insight anyone has on the specifics of @GOTPCREL.
movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
movq %r15, %rdi
callq *%rbx
Load the name selector into rsi, load obj from r15 into rdi, and call objc_msgSend again. This is a case where optimization really begins to show its use. The unoptimized version of this call saved and loaded to and from the stack and other registers, effectively redoing the entire setup for the second message send. The optimizer recognizes that the extra data copying is redundant and just loads everything directly - and even more importantly, avoids loading data that's already in a register into it again.
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
Grab the return value from the last message send and immediately pass it to objc_retainAutoreleasedReturnValue. This is the same sequence as the unoptimized code. In fact, in the Objective-C runtime, certain operations work differently based specifically on the existence of these exact two instructions.
movq %rax, %rbx
movq %rbx, %rdi
callq _MyFunction
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %r12
Call MyFunction(name), retain its return value, and save the result in r12. The extra store to rbx looks redundant, but it isn't, as we'll see further down.
movq %rbx, %rdi
callq _objc_release
See? Both rax and rdi have already been reused since [MyClass name]'s return value was saved off in rbx. Not redundant after all!
"But why didn't the compiler just leave it in rbx to begin with?" Remember that the first parameter to a function must be in rdi. The value had to be saved somewhere that wasn't about to be overwritten by the very next thing done.
leaq L__unnamed_cfstring_23(%rip), %rdi
movq %r12, %rsi
xorb %al, %al
callq _NSLog
Call NSLog(@"%@", return value of MyFunction) with no vector registers used - remember that variadic functions require the number of vector registers used as parameters to be in al. Nothing special here.
movq %r12, %rdi
callq _objc_release
movq %r15, %rdi
callq _objc_release
Release both objects (return value of MyFunction and obj) that are no longer in use. Technically, obj was already unused at the time of NSLog, but ARC's code flow analysis isn't that aggressive; releases are done at the end of the enclosing scope, not the instant the value is no longer used.
Note: The return from [MyClass name] had an effective enclosing scope of the MyFunction call itself; it was never assigned to a variable (specifically, to a __strong variable), and therefore was not considered potentially "live" after the function call.
movq %r14, %rdi
callq _objc_autoreleasePoolPop
xorl %eax, %eax
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
ret
Pop the autorelease pool, set eax to zero as the return value of main, restore the saved registers, and return.
And that is main in optimized code. The major effect of optimization visible here is much better utilization of registers; there's not a single use of the stack except for register saving, and there's not a single redundant or useless data copy anywhere to be found.
Do you think you can do better than the compiler did? It's possible that other optimization opportunities exist, but most of the ones that seem immediately obvious are actually prohibited by the CPU, the ABI, or the way Objective-C and ARC work.
Hint: The push and pop of rbp, as well as the copy of rsp to rbp, are unnecessary, because the optimization removed all references to rbp in the function body! Without those three instructions, main would still work, but the debugger might not! The debugger relies in some cases upon the presence of stack frames, which include a properly initialized base pointer register and the saved value of the base pointer on the stack. Certain other system functions can potentially rely upon the presence of a stack frame, though these rarely come up in normal use. On OS X, the switch which tells GCC and Clang to skip the use of stack frames (-fomit-frame-pointer) is disabled by default even at high optimization, suggesting that someone thought it wasn't worth saving three instructions per function. It probably isn't. The system frameworks are built with stack frames intact, for example. In general, you should always include stack frames unless you have a good reason not to.
The MyFunction Function
Next, let's look at the MyFunction function:
NSString *MyFunction(NSString *parameter)
{
NSString *string2 = [@"Prefix" stringByAppendingString: parameter];
NSLog(@"%@", string2);
return string2;
}
I'm going to take this function backwards. Instead of looking directly at the assembler the compiler produced, I'll construct it myself using what we've already learned from main about how the compiler does its thing. This function doesn't do anything that main didn't, after all. For bonus points, we'll even insert the necessary ARC calls.
Function prologue:
_MyFunction:
pushq %rbp
movq %rsp, %rbp
Every C function has a prologue. See the discussion about stack frames above. This is the stack frame for our new function, along with its label, for completeness' sake. On OS X, all C function names are prefixed with an underscore at the assembler stage as a matter of platform convention. A look at the name table of any executable or library will show that almost all of the symbols have at least one preceding underscore.
Save registers:
pushq %rbx
We'll only need one scratch register for this function, so let's use rbx.
Call stringByAppendingString:
movq %rdi, %rdx
leaq L_prefix_string_reference(%rip), %rdi
movq L_stringByAppendingString__selector_reference(%rip), %rsi
callq *_objc_msgSend@GOTPCREL(%rip)
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
First, I make the assumption that the string @"Prefix" appears somewhere under the label L_prefix_string_reference, which I just made up. Label names are arbitrary; the compiler's very official-looking names are just autogenerated. Even having L_ in front of them is just a convention I chose to follow to make it look more like the compiler's version. Likewise, I assume that L_stringByAppendingString__selector_reference points to the appropriate selector name. From there, I move rdi to rdx. Since parameter, being the first parameter to MyFunction, was in rdi, I've now made it the third parameter to whatever I'm about to call. I load @"Prefix" as the first argument and the -stringByAppendingString: selector as the second, then call the rip-relative version of objc_msgSend. Finally, I take the return value and pass it to objc_retainAutoreleasedReturnValue, per ARC's requirements. ARC functions only at the Objective-C compiler level; in assembler, it has to be invoked manually, like normal retain-release code but with stricter rules.
Call NSLog:
movq %rax, %rsi
leaq L_format_string_reference(%rip), %rdi
xorb %al, %al
callq _NSLog
I'll tell you right now that this code is wrong in one important respect: Because I know I'll need the return value from -stringByAppendingString: later, I've made the mistaken assumption that rax and rsi will not be changed by the call to NSLog. However, the x86_64 ABI explicitly specifies that neither register is preserved across function calls. We've already clobbered them several times during the course of this code without saving them, so we can hardly expect NSLog not to do the same. (Not only that, but the code itself zeroes out the low byte of rax as part of the call sequence!) The value in rax and rsi before this section of code must be preserved, or it will be lost during the call.
Note: Even if NSLog just so happened to preserve rsi, that's not an assumption the calling code can make safely. The only time you can assume registers are preserved by a function outside the specification of the ABI is when you have written every line of that function yourself, in assembly language, and have documented the requirement so you don't violate it later on.
The solution is to replace the first movq with these two lines:
movq %rax, %rbx
movq %rbx, %rsi
The value (known as string2 in the original Objective-C source) is now saved in rbx so we can use it. This is why I saved rbx at the beginning of the function.
Return from the function:
movq %rbx, %rdi
popq %rbx
popq %rbp
jmp _objc_autoreleaseReturnValue ## TAIL CALL
Whoa, whoa, wait, what's all this? What's a tail call?
In ARC mode, an object returned from a function not annotated as cf/ns_returns_retained must be passed to objc_autoreleaseReturnValue. Therefore, that has to be the very last thing the function does before returning.
"So," you ask, "why not movq %rbx, %rdi, then callq _objc_autoreleaseReturnValue, and let rax keep that return value while you popq and ret?" Answer: Because it's inefficient. When the very last thing a function does is return the identically-typed result of calling another function, a tail call can be used to save time, space, and effort. At the time of the first movq instruction, the stack looks something like this:
+----------------+
| RETURN ADDRESS | 16 <--- next instruction in main, pushed by `callq _MyFunction`
| Saved %rbp     |  8 <--- saved value of rbp, pushed by prologue
| Saved %rbx     |  0 <--- saved value of rbx, pushed by our code
+----------------+
If I were to simply callq _objc_autoreleaseReturnValue, the stack would then look like this:
+----------------+
| RETURN ADDRESS | 24 <--- next instruction in main, pushed by `callq _MyFunction`
| Saved %rbp     | 16 <--- saved value of rbp, pushed by prologue
| Saved %rbx     |  8 <--- saved value of rbx, pushed by our code
+----------------+
| RETURN ADDRESS |  0 <--- next instruction in MyFunction, pushed by `callq _objc_autoreleaseReturnValue`
+----------------+
When objc_autoreleaseReturnValue returned, the stack would be popped by the ret instruction and go back to exactly where it was, and then the same thing would immediately happen again. Wouldn't it be more efficient if objc_autoreleaseReturnValue could return directly to main, since MyFunction has absolutely nothing left to do?
This is what a tail call does. Instead of using call, which pushes a new return address to the stack, MyFunction restores the stack to having only main's return address, and then jumps directly to objc_autoreleaseReturnValue. The stack ends up looking like this:
+----------------+
| RETURN ADDRESS |  0 <--- next instruction in main, pushed by `callq _MyFunction`!
+----------------+
Now, when the ret in objc_autoreleaseReturnValue pops a return address off the stack into rip, it'll jump directly back to main, with rax containing the return value exactly as it should. We've saved a push, a pop, and less visibly, some extra work by the CPU. The jmp instruction is also potentially smaller than callq if it should happen that the target function is located nearby in memory.
Tail calls may look like a minor optimization from the assembly language point of view, but the savings of an entire extra stack frame can make or break recursive algorithms. Also, objc_msgSend is fundamentally designed around the use of a tail call; Cocoa programs would probably be something like an order of magnitude slower without them, and can you imagine loading a program in the debugger and seeing objc_msgSend before every single method call in the backtrace?
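As a rough C-level illustration of the same idea (a sketch, not code from the sample program): the helper below ends by returning the identically-typed result of another call, which is the shape that lets an optimizing compiler emit a jmp instead of a callq. The names sum_to and sum_to_acc are made up for this example, and the C standard does not require the compiler to perform this optimization, so the jmp is likely at -Os or -O2 but not guaranteed.

```c
#include <stdint.h>

/* Hypothetical helper: adds n + (n-1) + ... + 1 to acc.
   The recursive call is in tail position: once it returns, this
   function has nothing left to do except return the same value. */
static uint64_t sum_to_acc(uint64_t n, uint64_t acc)
{
    if (n == 0)
        return acc;
    return sum_to_acc(n - 1, acc + n);   /* candidate for a tail call */
}

/* Hypothetical wrapper: its only job is to forward to the helper,
   so this call is in tail position too. */
uint64_t sum_to(uint64_t n)
{
    return sum_to_acc(n, 0);
}
```

Compiling this with optimization and reading the assembler output for sum_to is a quick way to check whether the compiler turned the calls into jumps (or into a loop) on your toolchain.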
If you look at Clang's version of the assembler code, it's almost exactly the same as ours! There are three exceptions:
- Clang, of course, names the string and selector references differently.
- Clang moves the parameters around in a slightly different order; this has no effect on the execution of the code.
- For no immediately apparent reason, Clang saves the value of rax on the stack, only to ignore that value entirely in the function epilogue. What's actually happening is that Clang is aligning the stack to a 16-byte boundary, as required by both SSE instructions in particular and Cocoa in general. This leads to a total of 32 bytes (an even multiple of 16) on the stack for the function: the return address into main, saved rbp, saved rbx, and saved rax. The requirement of stack alignment is sufficient to overcome the desire to save instructions; the code would be incorrect without that alignment, and probably crash the very next time objc_msgSend was called.
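The arithmetic behind that extra push can be checked with a few lines of C. This is only a sketch: is_aligned16, frame_bytes, and sp_after are names invented here, and the starting address passed in stands for a 16-byte-aligned stack pointer just before main executes callq _MyFunction.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of 16. */
bool is_aligned16(uintptr_t addr)
{
    return (addr & (uintptr_t)15) == 0;
}

/* Bytes on the stack for one call: the 8-byte return address
   plus `pushes` 8-byte register pushes. */
uintptr_t frame_bytes(unsigned pushes)
{
    return (uintptr_t)8 * (1u + pushes);
}

/* Stack pointer after the call and the pushes (the stack grows down). */
uintptr_t sp_after(uintptr_t sp_before_call, unsigned pushes)
{
    return sp_before_call - frame_bytes(pushes);
}
```

With only rbp and rbx pushed, frame_bytes(2) is 24 and the stack pointer ends up 8 bytes off a 16-byte boundary; adding the rax push gives frame_bytes(3), which is 32, and alignment is restored.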
Here, then, is the final version of the function as we've written it, in one chunk and including an aligned stack:
_MyFunction:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
pushq %rax
movq %rdi, %rdx
leaq L_prefix_string_reference(%rip), %rdi
movq L_stringByAppendingString__selector_reference(%rip), %rsi
callq *_objc_msgSend@GOTPCREL(%rip)
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %rbx
movq %rbx, %rsi
leaq L_format_string_reference(%rip), %rdi
xorb %al, %al
callq _NSLog
movq %rbx, %rdi
addq $8, %rsp # ignore the saved rax
popq %rbx
popq %rbp
jmp _objc_autoreleaseReturnValue ## TAIL CALL
Simple Floating-Point
Next, I'll look at a new function as a simple example of dealing with non-integer values. Here is the Objective-C version:
float MyFPFunction(float parameter)
{
float x = parameter + 0.5;
x -= 0.3f;
return x;
}
The line in which I call it:
NSLog(@"%f", MyFPFunction(1.0));
And here is the assembler Clang produces:
LCPI7_0:
.long 1056964608 ## float 5.000000e-01
LCPI7_1:
.long 3197737370 ## float -3.000000e-01
_MyFPFunction: ## @MyFPFunction
pushq %rbp
movq %rsp, %rbp
addss LCPI7_0(%rip), %xmm0
addss LCPI7_1(%rip), %xmm0
popq %rbp
ret
(I've omitted the assembler for the actual function call, as it turns out to be extremely difficult to get Clang to actually emit such assembly under optimizing compilation without just inlining the function, and the unoptimized version is different. The only interesting note there in any case is the setting of al to 1 for the NSLog call, as it uses a vector register.)
The function is extremely simple:
- A standard prologue comes first.
- Then, since the ABI specifies that the first floating-point value is passed in the first vector register, xmm0, the function operates directly on that register. The addss instruction, in simple terms, adds two floating-point values ("add scalar single-precision"). The constants in the code, 0.5 and -0.3 (subtracting 0.3 is the same as adding -0.3), are stored as data in the executable, since neither assembly language nor the actual machine code has a way to express floating-point immediate values. The values themselves are stored as IEEE-754 single-precision numbers. It just so happens that a floating-point return value is also stored in the first vector register, so by operating directly on xmm0, the function has already done everything it needed to do.
- Finally, a standard function epilogue.
Wasn't that simple? It turns out that the only thing you have to do to use floating-point values is switch to the 128-bit vector registers and the SSE1 instruction set. The old mmx and st(n) registers, along with the x87 instruction set, are obsolete. They're also inefficient in comparison to SSE1 operations.
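To see that the two .long values in Clang's output really are the IEEE-754 encodings of 0.5 and -0.3, the bit patterns can be reinterpreted as floats in C. This is a sketch: float_from_bits and my_fp_function_sketch are names invented here, and the second function mirrors the two generated addss instructions rather than the original source line for line.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE-754 single-precision float.
   memcpy is the portable way to do this in C. */
float float_from_bits(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Hypothetical C rendering of the generated code: two single-precision
   additions using the constants stored at LCPI7_0 and LCPI7_1. */
float my_fp_function_sketch(float parameter)
{
    float x = parameter + float_from_bits(1056964608u); /* LCPI7_0: 0.5f  */
    x += float_from_bits(3197737370u);                  /* LCPI7_1: -0.3f */
    return x;
}
```

In hexadecimal the two constants are 0x3F000000 and 0xBE99999A; the second differs from the encoding of 0.3f only in its sign bit, which is why the subtraction can be expressed as an addition.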
The C runtime
Some things are going on behind the scenes when you launch your program. Did you know that main isn't the first function the system calls?
That's right! Once dyld has finished setting up your process' memory space, it branches to the standard entry point, a function called start which is copied verbatim from the C runtime library (libcrt) into your executable. It's written in pure assembly and will not appear in Clang's assembler output, as it doesn't exist in your program until linking is done. Here's a look at it. I've borrowed the source code from Apple's website. Per the terms of the APSL under which the code is licensed, I've included the APSL license header in the code listing.
dyld sees the LC_UNIXTHREAD load command in your binary and sets up the CPU state accordingly for the new process. A quick glance at the output of otool -l tells us that the rip register is initialized to the load address of the start symbol in the binary image! Clever, no?
The start function consists of the following code:
/*
* Copyright (c) 1999-2008 Apple Inc. All rights reserved.
*
* @APPLE_LICENSE_HEADER_START@
*
* Portions Copyright (c) 1999 Apple Computer, Inc. All Rights
* Reserved. This file contains Original Code and/or Modifications of
* Original Code as defined in and that are subject to the Apple Public
* Source License Version 1.1 (the "License"). You may not use this file
* except in compliance with the License. Please obtain a copy of the
* License at http://www.apple.com/publicsource and read it before using
* this file.
*
* The Original Code and all software distributed under the License are
* distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, EITHER
* EXPRESS OR IMPLIED, AND APPLE HEREBY DISCLAIMS ALL SUCH WARRANTIES,
* INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE OR NON- INFRINGEMENT. Please see the
* License for the specific language governing rights and limitations
* under the License.
*
* @APPLE_LICENSE_HEADER_END@
*/
start: pushq $0 # push a zero for debugger end of frames marker
movq %rsp,%rbp # pointer to base of kernel frame
andq $-16,%rsp # force SSE alignment
movq 8(%rbp),%rdi # put argc in %rdi
leaq 16(%rbp),%rsi # addr of arg[0], argv, into %rsi
movl %edi,%edx # copy argc into %rdx
addl $1,%edx # argc + 1 for zero word
sall $3,%edx # * sizeof(char *)
addq %rsi,%rdx # addr of env[0], envp, into %rdx
movq %rdx,%rcx
jmp Lapple2
Lapple: add $8,%rcx
Lapple2:cmpq $0,(%rcx) # look for NULL ending env[] array
jne Lapple
add $8,%rcx # once found, next pointer is "apple" parameter now in %rcx
call _main
movl %eax,%edi # pass result from main() to exit()
call _exit # need to use call to keep stack aligned
hlt
start doesn't work like a C function, since it isn't one. It's intended specifically to transition from a bare-bones executable state to one that C (and Objective-C) can work in. Even the function prologue is unusual.
pushq $0 - Push a zero on the stack. This is used by the debugger as a marker for 'end of stack frames', replacing the pushq %rbp in a normal function's prologue.
movq %rsp,%rbp - Grab hold of the stack pointer, since the stack is actually used in this function.
andq $-16,%rsp - Mask off the last four bits of the stack pointer. This aligns the initial stack to a 16-byte boundary, as SSE instructions and Cocoa in general require. It's probably an effective no-op, as the system will tend to give a properly aligned stack already, but the C runtime doesn't and can't make that assumption.
movq 8(%rbp),%rdi - The 'kernel frame' the comment mentions above is what exists on the stack when dyld calls start. The first (topmost) value is the familiar argc parameter to main. Putting it in rdi sets it up as the first argument for a function call.
leaq 16(%rbp),%rsi - The argv array begins at the second slot on the stack, so its address becomes the second function parameter.
movl %edi,%edx - Copy the low 4 bytes of argc into edx (a 32-bit move also zeroes the upper half of rdx).
addl $1,%edx - Add 1 to the copy of argc, accounting for the NULL pointer that terminates argv.
sall $3,%edx - Multiply the value by 8 (shifting left by 3 is equivalent). edx now contains the entire size in bytes of the argv array.
addq %rsi,%rdx - Add the address of argv to the calculated size, yielding a pointer to the end of argv. Why is this happening? On OS X, the little-used envp array passed as a third parameter to main occupies the space in memory immediately following argv. The third function parameter is now envp.
movq %rdx,%rcx - Now copy envp to the fourth function parameter.
jmp Lapple2
Lapple: add $8,%rcx
Lapple2:cmpq $0,(%rcx) # look for NULL ending env[] array
jne Lapple
These four lines constitute a simple loop which increases the value of rcx by 8 until the memory location it points to contains zero. In C terms, treating rcx as a uint64_t pointer, this would be while (*rcx != 0) rcx++;. The jne instruction means "jump if not equal", or equivalently, "jump if ZF is zero". ZF was set by the previous instruction, cmp, which says "set rflags based on the result of subtracting the two operands, discarding the result itself". This loop finds the end of the NULL-terminated envp array.
add $8,%rcx - Skip to the next pointer after the end of envp, which is exec_path, the fourth argument to main, though it's little-known and even more little-used.
call _main - Finally, call main itself.
movl %eax,%edi - Load main's 4-byte return value as the first parameter to a function call.
call _exit - Call the exit(3) function, passing it the value returned from main. exit(3) never returns, so no instructions following this one should ever be executed.
hlt - Just in case somehow execution gets here anyway, "halt" the CPU. hlt will cause a privilege violation exception if executed by non-kernel code, so it makes a fitting "you should not be here" epilogue. It's effectively the equivalent of "unreachable". On very old x86 processors, an application would execute hlt to stop the CPU, but with all the other hardware in a modern computer that needs to be shut down properly, a single instruction is simply inadequate to the purpose. It wouldn't turn off the power, for example.
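The pointer arithmetic in start can be sketched in C to make the layout of the kernel frame easier to follow. This is an illustration under assumptions, not Apple's code: struct main_args, align_down16, and parse_kernel_frame are names invented here, and the frame is modeled as an array of pointers whose first slot holds argc, followed by argv, a NULL, envp, a NULL, and then the "apple" pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four values start hands to main. */
struct main_args {
    int    argc;
    char **argv;
    char **envp;
    char **apple;
};

/* Round an address down to a 16-byte boundary, as andq $-16,%rsp does. */
uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* frame points at the argc slot:
   [argc][argv[0] ... argv[argc-1]][NULL][envp ...][NULL][apple ...] */
struct main_args parse_kernel_frame(char **frame)
{
    struct main_args args;
    char **p;

    args.argc = (int)(intptr_t)frame[0];    /* movq 8(%rbp),%rdi           */
    args.argv = frame + 1;                  /* leaq 16(%rbp),%rsi          */
    args.envp = args.argv + args.argc + 1;  /* (argc + 1) * 8 bytes onward */

    p = args.envp;                          /* movq %rdx,%rcx              */
    while (*p != NULL)                      /* the Lapple / Lapple2 loop   */
        p++;
    args.apple = p + 1;                     /* add $8,%rcx                 */
    return args;
}
```

Pointer addition on a char ** already scales by the pointer size, which is why the explicit multiply by 8 (the sall $3) disappears in the C version.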
Conclusion
There's no need to look at the rest of the sample code's disassembly; there's nothing in it that I haven't already explored elsewhere. If you can't make sense of it on your own by now, I've probably done a poor job of explaining! Therefore, I hereby mark the end of part 2.
I've gotten several requests since part 1 to explain these concepts in terms of the ARM architecture used by the iPhone and other iDevices. I haven't worked with ARM at this level before now, but I'm always willing to learn new things. So I've started studying the ARM architecture, and I'll be writing a part 3 to this series of articles based on what I learn and using the same sample code. Until then, good luck, and I hope you've enjoyed my work so far!
Comments:
Are you sure? I don’t see this.
> GOTPCREL is a directive which allows the rip-relative address of a function to be inserted at link time so a direct call can be made
I don’t think that’s quite right. See below.
> "far jump" (a branch over a long distance of code, which, by necessity, is much slower).
Again, I don’t think a “far jump” is necessarily much slower because it is over a long distance of code. It might be marginally slower because you might have to use more bytes to encode the instruction (but that won’t be *much* slower), and it will obviously be slower if the address you’re jumping to causes a page fault or a bit slower if it isn’t in the cache, but it’s not slower simply because it’s a branch over a long distance of code. Things might be a bit slower in your example because an indirect jump is being used (but you’d need to check the processor documentation to see by how much and I wouldn’t be surprised to find that if it’s cached, it’s nil).
> Note: I'm not 100% sure of my facts on this one; I'd appreciate any insight anyone has on the specifics of @GOTPCREL
@GOTPCREL allows you to load an address in a global offset table in a single instruction (using instruction relative addressing). The Global Offset Table (GOT) stores the address of objc_msgSend and any other global addresses that might be required. It is fixed up by dyld at runtime. The GOT is always at a fixed offset relative to code so you can use instruction relative addressing to load an address from it.
All of the above said, I’m no expert on this, so please don’t take my word on the above; I might be wrong.
__attribute__((noinline)) float MyFPFunction(float parameter)
The call and print sequence is:
movss LCPI1_0(%rip), %xmm0 # Load argument
callq _MyFPFunction
cvtss2sd %xmm0, %xmm0 # Promote to double
leaq L__unnamed_cfstring_(%rip), %rdi
movb $1, %al
callq _NSLog