In the course of a recent job interview, I had an opportunity to study some of the internals of dyld, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.
WARNING
Because the precise details of how dyld works are quite complicated and change frequently, and because I don't yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you're curious about the particulars, I strongly recommend dyld's source code, which is publicly available at http://opensource.apple.com.
Static linking
So, let's start by talking about static linking, generally referred to simply as 'linking'. This is the step that typically happens after compiling, where the object files (the machine code the compiler churned out from your source code) are 'linked' together into a single binary file.
Why does static linking matter to dynamic linking? Because the static linker, ld (and ld64), is responsible for transforming symbol references in your source code into indirect symbol lookups for dyld to use later. Here's a very simple example:
#include <stdio.h>

// This is the actual full declaration of main() on OS X. The "apple"
// parameter is an array of strings whose first entry is the path to the
// executable (see also _NSGetProgname()).
int main(int argc, char **argv, char **envp, char **apple)
{
    puts("Hello, world!"); // puts() appends the newline itself
    return 0;
}
The (optimized) assembly for this, as generated by clang -S test.c -o test.s -Os and stripped of a bit of debug info, is:
.section __TEXT,__text,regular,pure_instructions
.globl _main
_main: ## @main
pushq %rbp
movq %rsp, %rbp
leaq L_str(%rip), %rdi
callq _puts
xorl %eax, %eax
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_str: ## @str
.asciz "Hello, world!"
Seems straightforward enough. Let's compile it into an object file and dump the fully compiled version (clang -c test.c -o test.o -Os, otool -tv test.o):
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 leaq 0x00000000(%rip),%rdi
000000000000000b callq 0x00000010
0000000000000010 xorl %eax,%eax
0000000000000012 popq %rbp
0000000000000013 ret
Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the leaq instruction, the result is simply the current value of rip, which is the address of the next instruction. The callq instruction is a "signed offset" jump, which means that an offset of 0 calls the very next instruction in the code (address 0x10 in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (otool -r test.o):
Relocation information (__TEXT,__text) 2 entries
address  pcrel length extern type scattered symbolnum/value
0000000c 1     2      1      2    0         4
00000007 1     2      1      1    0         0
The first entry says, "At offset 0xc in the __TEXT,__text section, there is an unscattered, external, PC-relative X86_64_RELOC_BRANCH reference of length 'long word' to the symbol at index 4 in the symbol table." A peek at the symbol table (nm -ap) gives us:
0000000000000014 s L_str
0000000000000048 s EH_frame0
0000000000000000 T _main
0000000000000060 S _main.eh
U _puts
The symbol at index 4 (the fifth entry) is _puts. Similarly, the symbol at index 0 is L_str, which will be relocated at offset 0x7 of the __TEXT,__text section (three bytes into the leaq instruction). Finally, let's look at the result of linking this object into an executable (clang test.c -o test -Os, otool -tv test):
_main:
0000000100000f36 pushq %rbp
0000000100000f37 movq %rsp,%rbp
0000000100000f3a leaq 0x00000029(%rip),%rdi
0000000100000f41 callq 0x100000f4a
0000000100000f46 xorl %eax,%eax
0000000100000f48 popq %rbp
0000000100000f49 ret
ld has:
- Located the __TEXT segment at the standard executable load address for x86_64, 0x0000000100000000, and the __TEXT,__text section at 0xf36 after that. The first 0xf35 (actually, 0xa0f, since the larger offset doesn't account for the file's Mach-O header) bytes of __TEXT are zeroed out. This pushes the contents of the __TEXT segment flush up against the __DATA segment. I don't know exactly why this is done, though I assume it has something to do with cache efficiency.
- Replaced 0 with the actual offset from the leaq instruction to the L_str symbol, which in this case is 0x29. The offset is measured from the end of the instruction (0x100000f41), so the resulting address is 0x100000f6a, which a peek at the load commands (otool -l test) tells us is the exact beginning of the __TEXT,__cstring section.
- Replaced 0 with the offset to the symbol stub for puts(), which comes immediately after main. Another peek at the load commands puts this in the __TEXT,__stubs section, which we'll look at in detail later.
Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity's sake I won't go into it here.
What does dyld do, anyway?
dyld is actually responsible for quite a bit of work, all told. It (in roughly this order):
- Bootstraps itself based on the very simple raw stack set up for the process by the kernel.
- Recursively and cachingly loads all dependent dynamic libraries the executable links to into the process' memory space, including any necessary perusal of search paths from both the environment and the executable's "runpaths".
- Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.
- Runs static initializers for the executable.
- Sets up the parameters to the executable's main function and calls it.
- During the process' execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the dl*() API), and provides hooks for gdb and other debuggers to get critical information.
- Runs static terminator routines after main returns.
- In some scenarios, makes the required call to libSystem's _exit routine once main returns.
I'll examine each step roughly in order.
Bootstrap
dyld is the very first code run in a new process. In particular, a symbol by the very descriptive name of __dyld_start is called. This happens due to a bit of magic in the kernel which notices the LC_LOAD_DYLINKER load command in the main executable and uses the given dynamic linker's entry symbol as the process' initial instruction pointer. __dyld_start does the equivalent of the following pseudocode (the actual implementation is a compact bit of assembly code):
noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
{
    stack push 0 // debugger end of frames marker
    stack align 16 // SSE align stack
    uint64_t slide = __dyld_start - __dyld_start_static;
    void *glue = NULL;
    void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
    if (glue)
        push glue // pretend the return address is a glue routine in dyld
    else
        stack restore // undo stack stuff we did before
    goto *entry(argc, argv, envp, apple); // never returns
}
In retrospect, I'm not sure that pseudocode is any more sensible than the assembly would have been, but let's walk through it quickly:
- Push a 0 onto the stack, and align the stack to SSE requirements.
- Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of __dyld_start.
- Run dyld's actual bootstrap routine, which sets up some minimal state for dyld itself (such as pulling in certain functions from libSystem without actually linking to it and setting up Mach messaging) and then runs dyld's real main routine, which does loading, linking, and initializers.
- If dyld detected that the main executable uses the LC_MAIN load command to set up its entry point, it returns the address of a glue routine which is responsible for calling _exit when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it's the routine's return address; the ret instruction at the end of that function will jump to that glue code.
- If, on the other hand, dyld detected the executable using the older LC_UNIXTHREAD load command, it simply restores the stack to its original state and jumps to that entry point, which will be the start routine from crt1.o, the C runtime. The C runtime basically redoes all the work that __dyld_start just did, minus the actual dyld startup, which is one of the reasons it was replaced with the LC_MAIN command.
- Jump to the entry point.
Loading
Each time dyld has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, "linking" sure has a lot of different uses, doesn't it?)
Locating the correct binary on disk is usually fairly simple. The LC_LOAD_DYLIB command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path contains a special marker that tells dyld to look somewhere else:
- @executable_path: Up to OS X 10.3, this was the only marker dyld supported, and it had rather limited utility. dyld will replace this marker with the full path to the directory containing the main executable.
- @loader_path: Added in 10.4, this marker is replaced with the full path to the directory containing the binary which loaded the binary that is currently being loaded. This is not always the main executable, and primarily enabled frameworks to themselves embed frameworks without resorting to the "umbrella framework" mechanism, which Apple never made entirely public and actively discouraged the use of.
- @rpath: When this marker was added in 10.5, there was much rejoicing. This marker is replaced in sequence with each "run path" embedded in the binaries along the chain that loaded the current one (recursively), enabling frameworks and dynamic libraries to finally be built only once and be used for both system-wide installation and embedding without changes to their install names, and allowing applications to provide alternate locations for a given library, or even override the location specified for a deeply embedded library.
There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.
Linking
Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.
At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It's not complicated; a lazy symbol's binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.
Conceptually, binding a symbol is simple. In practice, it's rather interesting:
- Look up, in the binding information of the __LINKEDIT segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for _puts was at 0xf4a (plus some, I'm shortening for simplicity's sake!). If we were to disassemble the machine code at that address, we would get:
    Contents of (__TEXT,__stubs) section
    0000000100000f4a jmp *0x000000c0(%rip)
    Contents of (__TEXT,__stub_helper) section
    0000000100000f50 leaq 0x000000b1(%rip),%r11
    0000000100000f57 pushq %r11
    0000000100000f59 jmp *0x000000a1(%rip)
    0000000100000f5f nop
    0000000100000f60 pushq $0x00000000
    0000000100000f65 jmp 0x100000f50
  Wow, a nice simple jump instruction! Unfortunately, it's not quite as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is...
- Look up, also in the binding information, the address of the symbol pointer for puts in the __DATA,__nl_symbol_ptr section. If this is a lazy symbol, look it up in the __DATA,__la_symbol_ptr section instead. In our example executable, these sections look simply like this (using a hybrid of otool's output):
    Contents of (__DATA,__nl_symbol_ptr) section
    0000000100001000 dq 0x0000000000000000
    0000000100001008 dq 0x0000000000000000
    Contents of (__DATA,__la_symbol_ptr) section
    0000000100001010 dq 0x0000000100000f60
  In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!
- Update the symbol pointer in the appropriate __DATA section to hold the real address of the symbol in the loaded library. You're done!
So what, you may be asking, is the point of all this crazy indirection and all these extra sections?
Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can't put writable data in the __TEXT segment, which is executable code. This means you can't update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Secondly, you can't put executable code in the __DATA segment, which is writable data! So you can't just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.
All this is true of lazily-bound symbols as well, but with a few caveats. dyld does not immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn't a simple 0, but rather points to the "stub helper". The stub helper is a bit of code embedded in the __TEXT,__stub_helper section (really? who'd've guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for dyld's internal symbol binder. It doesn't show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to dyld. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.
Static initializers, static terminators, and runtime services
Most of the interesting stuff has already happened at this point. dyld will run any static initializers in the executable (most often constructors for global C++ objects and +load methods for Objective-C classes, though there are also __attribute__((constructor)) functions for plain C). A list of initializers is stored in a separate __DATA,__mod_init_func section in the binary, and is simply a set of addresses into the __TEXT,__text section which dyld calls in order of appearance. Initializer functions are passed the same arguments as main.
When the process exits, dyld will also run static terminators, which mostly means static destructors for C++ objects and __attribute__((destructor)) functions. These are handled just like static initializers, except that they're stored in __DATA,__mod_term_func and take no parameters. Static terminators run in the same context as an atexit() function.
Finally, dyld provides runtime services to binaries it has loaded. The dl*() APIs are the preferred interface to dyld's services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated):
- dlopen: Performs the load stage of loading a dynamic library, can optionally partially or completely perform the bind stage.
- dlsym: Look up a symbol in a dynamic library (or the entire process). At its simplest, this is no more than a "name to address" lookup.
- dladdr: The inverse of dlsym, transforming an address into a set of symbol information.
- dlclose: Unloads a dynamic library from the process, if no other handles to it are in use. Unloading invalidates all the symbols provided by the dynamic library and can be something of a touchy operation, particularly in an Objective-C environment.
What's missing
While I've gone over quite a bit, I've also left out a lot of information in this article:
- Two-level namespaces, which prevent trivial symbol collisions in dynamic libraries
- The dyld shared cache, which maintains a systemwide map of already-loaded dynamic libraries for fast binding
- Rebasing
- Code signing
- Dynamic library linking
- dyld's expansive set of environment variables
- "Restricted" binaries (particularly setuid binaries)
- Most of the kernel's interaction with dyld
- Compression and encryption in Mach-O binaries
- How dyld itself is built
- Symbol interposing
- dyld's operation on i386 and ARM, which is conceptually the same, but both architectures differ significantly in the details
- Details of the Mach-O binary format
- How "fat" binaries are handled
I've left these out for two reasons: One, I was a bit behind when writing this article and just didn't have time to put it all in, and two, there really isn't space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and dyld are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn't seem too interested in updating the documentation):
Apple's Mach-O documentation
Apple's Mach-O reference
The Mach-O "loader" header, a very good reference (also look at other files in the mach-o/ directory)
Apple's dyld Reference
The dlopen(3) manpage
dyld's Release Notes
dyld's source code as of 10.8.2
Kernel source code as of 10.8.2 (look at bsd/kern/kern_exec.c and bsd/kern/mach_loader.c in particular)
Conclusion
dyld is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and dyld has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of dyld, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you'll have a better idea of where to look for the problem. Then again, maybe not; ld can be pretty obstructive.
That's all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!