In the course of a recent job interview, I had an opportunity to study some of the internals of dyld, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.
Because the precise details of how dyld works are quite complicated and change frequently, and because I don't yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you're curious about the particulars, I strongly recommend dyld's source code, which is publicly available at http://opensource.apple.com.
So, let's start by talking about static linking, generally referred to simply as 'linking'. This is the step that typically happens after compiling, where the machine language the compiler churned out from your source code, the object files, are 'linked' together into a single binary file.
Why does static linking matter to dynamic linking? Because the static linker (ld64) is responsible for transforming symbol references in your source code into indirect symbol lookups for dyld to use later. Here's a very simple example:
#include <stdio.h>
// This is the actual full declaration of main() on OS X. The "apple"
// parameter is the path to the executable, i.e. _NSGetProgname().
int main(int argc, char **argv, char **envp, char **apple) {
    printf("Hello, world!\n"); return 0;
}
The (optimized) assembly for this, as generated by clang -S test.c -o test.s -Os and stripped of a bit of debug info, is:
_main:                          ## @main
    pushq   %rbp
    movq    %rsp, %rbp
    leaq    L_str(%rip), %rdi
    callq   _puts
    xorl    %eax, %eax
    popq    %rbp
    ret
L_str:                          ## @str
    .asciz  "Hello, world!"
Seems straightforward enough. Let's compile it into an object file and dump the fully compiled version (clang -c test.c -o test.o -Os, otool -tv test.o):
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 leaq 0x00000000(%rip),%rdi
000000000000000b callq 0x00000010
0000000000000010 xorl %eax,%eax
0000000000000012 popq %rbp
Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the leaq instruction, the result is a load from the current value of %rip. The callq instruction is a "signed offset" jump, which means that the offset of 0 calls the very next instruction in the code (address 0x10 in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (otool -r test.o):
Relocation information (__TEXT,__text) 2 entries
address pcrel length extern type scattered symbolnum/value
0000000c 1 2 1 2 0 4
00000007 1 2 1 1 0 0
The first entry says, "At offset 0xc in the __TEXT,__text section, there is an unscattered, external, PC-relative X86_64_RELOC_BRANCH reference of length 'long word' to the symbol at index 4 in the symbol table." A peek at the symbol table (nm -ap) gives us:
0000000000000014 s L_str
0000000000000048 s EH_frame0
0000000000000000 T _main
0000000000000060 S _main.eh
                 U _puts
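For readers curious where the columns of the relocation table come from, here is a minimal sketch of unpacking one entry. It assumes the little-endian bitfield layout of struct relocation_info from <mach-o/reloc.h>; the names reloc_fields, decode_reloc, and reloc_size_in_bytes are made up for this illustration, and the packed word 0x2d000004 used in the note below is simply the first row of the table re-encoded by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked form of one relocation entry. */
typedef struct {
    uint32_t address;    /* offset of the field to patch, within the section */
    uint32_t symbolnum;  /* symbol table index (when is_extern is 1) */
    unsigned pcrel;      /* 1 if the value is relative to the instruction pointer */
    unsigned length;     /* log2 of the field size: 2 means 4 bytes ("long word") */
    unsigned is_extern;  /* 1 if symbolnum refers to the symbol table */
    unsigned type;       /* on x86_64: 1 = SIGNED, 2 = BRANCH */
} reloc_fields;

/* Split the second 32-bit word of an entry into its bitfields.
   Assumed layout, low bits first: symbolnum:24, pcrel:1, length:2, extern:1, type:4. */
reloc_fields decode_reloc(uint32_t address, uint32_t packed)
{
    reloc_fields r;
    r.address   = address;
    r.symbolnum = packed & 0x00ffffffu;
    r.pcrel     = (packed >> 24) & 0x1u;
    r.length    = (packed >> 25) & 0x3u;
    r.is_extern = (packed >> 27) & 0x1u;
    r.type      = (packed >> 28) & 0xfu;
    return r;
}

/* Number of bytes the linker will rewrite for this entry. */
unsigned reloc_size_in_bytes(const reloc_fields *r)
{
    assert(r != 0);
    return 1u << r->length;
}
```

Decoding address 0xc with the packed word 0x2d000004 gives symbol index 4, pcrel 1, length 2, extern 1, and type 2, matching the first row of the otool -r output.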
The symbol at index 4 (the fifth entry) is _puts. Similarly, the symbol at index 0 is L_str, which will be relocated at offset 0x7 of the __TEXT,__text section (three bytes into the leaq instruction). Finally, let's look at the result of linking this object into an executable (clang test.c -o test -Os, otool -tv test):
0000000100000f36 pushq %rbp
0000000100000f37 movq %rsp,%rbp
0000000100000f3a leaq 0x00000029(%rip),%rdi
0000000100000f41 callq 0x100000f4a
0000000100000f46 xorl %eax,%eax
0000000100000f48 popq %rbp
The static linker has done three things here:
- Located the __TEXT segment at the standard executable load address for x86_64, 0x0000000100000000, and the __text section 0xf36 after that. The first 0xf36 (really 0xa0f, since the larger offset doesn't account for the file's Mach-O header) bytes of __TEXT are zeroed out. This aligns the contents of the __TEXT segment flush up against the __DATA segment. I don't know exactly why this is done, though I assume it has something to do with cache efficiency.
- Replaced the first 0 with the actual offset from the leaq instruction to the L_str symbol, which in this case is 0x29. The offset is added to the address of the next instruction (0x100000f41), so the resulting address is 0x100000f6a, which a peek at the load commands (otool -l test) tells us is the exact beginning of the __TEXT,__cstring section.
- Replaced the second 0 with the address of the symbol stub for puts(), which comes immediately after main. Another peek at the load commands puts this in the __TEXT,__stubs section, which we'll look at in detail later.
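The arithmetic behind those fix-ups can be sketched in a few lines of C. This is only an illustration of how a 32-bit PC-relative field is computed and stored, not ld64's actual code; the function names pcrel32 and apply_pcrel32 are invented for the example, and it assumes (as is the case for the leaq and callq above) that the 4-byte field is the last part of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the 32-bit displacement for a PC-relative field.
   field_addr is the final address of the 4-byte field itself. The CPU adds
   the displacement to the address of the *next* instruction, which here is
   the byte right after the field. */
int32_t pcrel32(uint64_t field_addr, uint64_t target_addr)
{
    int64_t delta = (int64_t)(target_addr - (field_addr + 4));
    assert(delta >= INT32_MIN && delta <= INT32_MAX); /* must fit in 32 bits */
    return (int32_t)delta;
}

/* Overwrite the placeholder zeroes at `offset` within a section's bytes.
   section_base is the address the section ends up at after linking. */
void apply_pcrel32(uint8_t *section, size_t offset,
                   uint64_t section_base, uint64_t target_addr)
{
    uint32_t v = (uint32_t)pcrel32(section_base + offset, target_addr);
    /* x86_64 is little-endian: least significant byte first. */
    section[offset + 0] = (uint8_t)(v & 0xff);
    section[offset + 1] = (uint8_t)((v >> 8) & 0xff);
    section[offset + 2] = (uint8_t)((v >> 16) & 0xff);
    section[offset + 3] = (uint8_t)((v >> 24) & 0xff);
}
```

With the section placed at 0x100000f36, patching offset 0xc with the stub address 0x100000f4a stores a displacement of 4; otool then displays that as the absolute target 0x100000f4a rather than the raw offset.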
Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity's sake I won't go into it here.
What does dyld do, anyway?
dyld is actually responsible for quite a bit of work, all told. It (in roughly this order):
- Bootstraps itself based on the very simple raw stack set up for the process by the kernel.
- Recursively loads (and caches) all the dynamic libraries the executable depends on into the process's memory space, searching paths from both the environment and the executable's "runpaths" as needed.
- Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.
- Runs static initializers for the executable.
- Sets up the parameters to the executable's main function and calls it.
- During the process' execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the dl*() API), and provides hooks for gdb and other debuggers to get critical information.
- Runs static terminator routines after main returns.
- In some scenarios, makes the required call to _exit.
I'll examine each step roughly in order.
dyld is the very first code run in a new process. In particular, a symbol by the very descriptive name of __dyld_start is called. This happens due to a bit of magic in the kernel which notices the LC_LOAD_DYLINKER load command in the main executable and uses the given dynamic linker's entry symbol as the process' initial instruction pointer.
__dyld_start performs the following pseudocode (the actual implementation is a compact bit of assembly code):
noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
{
    stack push 0 // debugger end of frames marker
    stack align 16 // SSE align stack
    uint64_t slide = __dyld_start - __dyld_start_static;
    void *glue = NULL;
    void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
    if (glue)
        push glue // pretend the return address is a glue routine in dyld
    else
        stack restore // undo stack stuff we did before
    goto *entry(argc, argv, envp, apple); // never returns
}
In retrospect, I'm not sure that pseudocode is any more sensible than the assembly would have been, but let's walk through it quickly:
- Push a 0 onto the stack, and align the stack to SSE requirements.
- Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of __dyld_start.
- Call dyld's actual bootstrap routine, which sets up some minimal state for dyld itself (such as pulling in certain functions from libSystem without actually linking to it and setting up Mach messaging) and then runs dyld's own main routine, which does loading, linking, and initializers.
- If dyld detected that the main executable uses the LC_MAIN load command to set up its entry point, it returns the address of a glue routine which is responsible for calling _exit when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it's the routine's return address; the ret instruction at the end of that function will jump to that glue code.
- If, on the other hand, dyld detected the executable using the older LC_UNIXTHREAD load command, it simply restores the stack to its original state and jumps to that entry point, which will be the start routine from crt1.o, the C runtime. The C runtime basically redoes all the work that __dyld_start just did, minus the actual dyld startup, which is one of the reasons it was replaced with the LC_MAIN mechanism.
- Jump to the entry point.
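The slide calculation in the second step is plain pointer arithmetic, and a small C sketch may make it concrete. The function names compute_slide and apply_slide are invented here, and the addresses in the note below are arbitrary; the real work happens in a few lines of assembly inside dyld.

```c
#include <assert.h>
#include <stdint.h>

/* Slide = where a symbol actually ended up, minus where the static linker
   assumed it would be. Everything else in the image moved by the same amount. */
intptr_t compute_slide(uintptr_t actual_addr, uintptr_t linked_addr)
{
    return (intptr_t)(actual_addr - linked_addr);
}

/* Translate any other link-time address into its run-time address. */
uintptr_t apply_slide(uintptr_t linked_addr, intptr_t slide)
{
    return linked_addr + (uintptr_t)slide;
}
```

For example, if a symbol linked at 0x1000 is found at 0x5000, the slide is 0x4000, and a second symbol linked at 0x1234 will be found at 0x5234.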
When dyld has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, "linking" sure has a lot of different uses, doesn't it?)
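As a taste of what "parse the Mach-O headers" starts with, here is a rough sketch of the very first check: looking at the magic number at the front of the file. The constants are the ones defined in <mach-o/loader.h> and <mach-o/fat.h>; the names image_kind and classify_image are made up for this example, and dyld itself naturally goes on to do far more than this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { IMG_UNKNOWN, IMG_MACHO_32, IMG_MACHO_64, IMG_FAT } image_kind;

/* Classify a file by the first four bytes of its contents. Each magic value
   is listed in both byte orders, since the file's byte order need not match
   the host's. */
image_kind classify_image(const unsigned char *bytes, size_t len)
{
    uint32_t magic;
    if (bytes == NULL || len < sizeof magic)
        return IMG_UNKNOWN;
    memcpy(&magic, bytes, sizeof magic);   /* read in host byte order */
    switch (magic) {
    case 0xfeedfaceu: case 0xcefaedfeu:    /* MH_MAGIC / MH_CIGAM */
        return IMG_MACHO_32;
    case 0xfeedfacfu: case 0xcffaedfeu:    /* MH_MAGIC_64 / MH_CIGAM_64 */
        return IMG_MACHO_64;
    case 0xcafebabeu: case 0xbebafecau:    /* FAT_MAGIC / FAT_CIGAM */
        return IMG_FAT;
    default:
        return IMG_UNKNOWN;
    }
}
```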
Locating the correct binary on disk is usually fairly simple. The LC_LOAD_DYLIB command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path contains a special marker that tells dyld to look somewhere else:
- @executable_path - Up to OS X 10.3, this was the only marker dyld supported, and it had rather limited utility. dyld will replace this marker with the path to the directory containing the main executable.
- @loader_path - Added in 10.4, this marker is replaced with the path to the directory containing the binary which loaded the binary that is currently being loaded. This is not always the main executable, and primarily enabled frameworks to themselves embed frameworks without resorting to the "umbrella framework" mechanism, which Apple never made entirely public and actively discouraged the use of.
- @rpath - When this marker was added in 10.5, there was much rejoicing. This marker is replaced in sequence with each "run path" embedded in the chain of binaries that led to this one being loaded, enabling frameworks and dynamic libraries to finally be built only once and be used for both system-wide installation and embedding without changes to their install names, and allowing applications to provide alternate locations for a given library, or even override the location specified for a deeply embedded library.
There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.
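To make the marker substitution concrete, here is a simplified sketch in C. It assumes each marker stands in for a directory, tries only a single run path, and never touches the filesystem, whereas dyld walks every run path in turn and checks whether the file exists. The function names replace_marker and expand_install_name, and all the paths in the note below, are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If `path` starts with `marker`, write `dir` plus the remainder into `out`.
   Returns 1 on success, 0 if the marker does not match or `out` is too small. */
int replace_marker(const char *path, const char *marker,
                   const char *dir, char *out, size_t out_size)
{
    size_t mlen = strlen(marker);
    if (strncmp(path, marker, mlen) != 0)
        return 0;
    /* The marker must be a whole path component. */
    if (path[mlen] != '/' && path[mlen] != '\0')
        return 0;
    int n = snprintf(out, out_size, "%s%s", dir, path + mlen);
    return n >= 0 && (size_t)n < out_size;
}

/* Expand one install name using the three directories supplied by the caller.
   Paths without a marker are copied through unchanged. */
int expand_install_name(const char *path, const char *exe_dir,
                        const char *loader_dir, const char *rpath,
                        char *out, size_t out_size)
{
    if (replace_marker(path, "@executable_path", exe_dir, out, out_size))
        return 1;
    if (replace_marker(path, "@loader_path", loader_dir, out, out_size))
        return 1;
    if (replace_marker(path, "@rpath", rpath, out, out_size))
        return 1;
    int n = snprintf(out, out_size, "%s", path);
    return n >= 0 && (size_t)n < out_size;
}
```

Given a run path of /opt/lib, the install name @rpath/libfoo.dylib expands to /opt/lib/libfoo.dylib, while an absolute path such as /usr/lib/libSystem.B.dylib comes back untouched.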
Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.
At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It's not complicated; a lazy symbol's binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.
Conceptually, binding a symbol is simple. In practice, it's rather interesting:
- Look up, in the binding information of the __LINKEDIT segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for puts is at 0xf4a (plus some, I'm shortening for simplicity's sake!). If we were to disassemble the machine code at that address, we would get:
Contents of (__TEXT,__stubs) section
0000000100000f4a jmp *0x000000c0(%rip)
Contents of (__TEXT,__stub_helper) section
0000000100000f50 leaq 0x000000b1(%rip),%r11
0000000100000f57 pushq %r11
0000000100000f59 jmp *0x000000a1(%rip)
0000000100000f5f nop
0000000100000f60 pushq $0x00000000
0000000100000f65 jmp 0x100000f50
Wow, a nice simple jump instruction! Unfortunately, it's not quite as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is...
- Look up, also in the binding information, the address of the symbol pointer for the symbol in the __DATA,__nl_symbol_ptr section. If this is a lazy symbol, look it up in the __DATA,__la_symbol_ptr section instead. In our example executable, these sections look simply like this (shown here in a hybrid notation):
Contents of (__DATA,__nl_symbol_ptr) section
0000000100001000 dq 0x0000000000000000
0000000100001008 dq 0x0000000000000000
Contents of (__DATA,__la_symbol_ptr) section
0000000100001010 dq 0x0000000100000f60
In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!
- Update the symbol pointer in the appropriate __DATA section so that it holds the real address of the symbol in the loaded library. You're done!
So what, you may be asking, are all this crazy indirection and all these extra sections all about?
Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can't put writable data in the __TEXT section, which is executable code. This means you can't update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Secondly, you can't put executable code in the __DATA section, which is writable data! So you can't just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.
All this is true of lazily-bound symbols as well, but with a few caveats. dyld does not immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn't a simple 0, but rather points to the "stub helper". The stub helper is a bit of code embedded in the __TEXT,__stub_helper section (really? who'd've guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for dyld's internal symbol binder. It doesn't show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to dyld. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.
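The same trick can be imitated in portable C with a function pointer, which may help show why the stub helper only ever runs once per symbol. This is a conceptual model only: lazy_ptr plays the part of a __la_symbol_ptr slot, stub_helper the part of the code in __TEXT,__stub_helper, and bind_symbol stands in for dyld's binder. All of these names are invented for the sketch, and strlen is just a convenient function to "bind".

```c
#include <assert.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

static size_t stub_helper(const char *s);

/* The "lazy symbol pointer": starts out pointing back at the stub helper,
   just as the static linker arranged in the example above. */
static strlen_fn lazy_ptr = stub_helper;
static int bind_count = 0;

/* Stand-in for dyld's binder: find the real address of the symbol. */
static strlen_fn bind_symbol(void)
{
    bind_count++;
    return strlen;
}

/* First call lands here: bind, overwrite the pointer, finish the call. */
static size_t stub_helper(const char *s)
{
    lazy_ptr = bind_symbol();
    return lazy_ptr(s);
}

/* The "stub": always an indirect jump through the pointer. */
size_t call_via_stub(const char *s)
{
    return lazy_ptr(s);
}

/* How many times the binder has been invoked so far. */
int times_bound(void)
{
    return bind_count;
}
```

Calling call_via_stub twice invokes the binder only on the first call; the second call goes straight through the updated pointer to the real function.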
Static initializers, static terminators, and runtime services
Most of the interesting stuff has already happened at this point. dyld will run any static initializers in the executable (most often constructors for global C++ objects and +load methods for Objective-C classes, though there are also __attribute__((constructor)) functions for plain C). A list of initializers is stored in a separate __DATA,__mod_init_func section in the binary, and is simply a set of addresses into the __TEXT,__text section which dyld calls in order of appearance. Initializer functions are passed the same arguments as main.
When the process exits, dyld will also run static terminators, which mostly means static destructors for C++ objects and __attribute__((destructor)) functions. These are handled just like static initializers, except that they're stored in __DATA,__mod_term_func and take no parameters. Static terminators run in the same context as an atexit() handler.
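For plain C, the initializer and terminator hooks look something like the sketch below. It relies on the GCC/Clang __attribute__ extensions mentioned above rather than anything in standard C, and the function names are arbitrary. The initializers here take no arguments, which is allowed even though arguments are available to them.

```c
#include <assert.h>
#include <stdio.h>

static int init_count = 0;

/* Runs before main(): the dynamic linker finds it via the initializer list. */
__attribute__((constructor))
static void first_initializer(void)
{
    init_count++;
}

/* A second initializer in the same file; both run before main() starts. */
__attribute__((constructor))
static void second_initializer(void)
{
    init_count++;
}

/* Runs while the process is exiting, after main() has returned. */
__attribute__((destructor))
static void terminator(void)
{
    fflush(stdout);   /* a typical bit of last-minute cleanup */
}

/* Lets the rest of the program see how many initializers have run. */
int initializers_run(void)
{
    return init_count;
}
```

By the time main begins, initializers_run() already reports 2, without main having called anything itself.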
dyld provides runtime services to binaries it has loaded. The dl*() APIs are the preferred interface to dyld's services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated):
- dlopen - Performs the load stage of loading a dynamic library, can optionally partially or completely perform the bind stage.
- dlsym - Look up a symbol in a dynamic library (or the entire process). At its simplest, this is no more than a "name to address" lookup.
- dladdr - The inverse of dlsym, transforming an address into a set of symbol information.
- dlclose - Unloads a dynamic library from the process, if no other handles to it are in use. Unloading invalidates all the symbols provided by the dynamic library and can be something of a touchy operation, particularly in an Objective-C environment.
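Here is a small sketch of the open, look up, close sequence. To stay self-contained it asks for a handle to the running process (by passing NULL to dlopen) instead of loading a separate library, and it looks up strlen purely because that symbol is sure to be present; the wrapper name strlen_via_dlsym is invented for the example.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

typedef size_t (*strlen_fn)(const char *);

/* Look up "strlen" by name at runtime and call it.
   Returns (size_t)-1 if the handle or the symbol could not be obtained. */
size_t strlen_via_dlsym(const char *s)
{
    /* NULL asks for a handle covering the main program and its libraries. */
    void *handle = dlopen(NULL, RTLD_LAZY);
    if (handle == NULL)
        return (size_t)-1;

    /* The "name to address" lookup. */
    strlen_fn fn = (strlen_fn)dlsym(handle, "strlen");
    size_t result = (fn != NULL) ? fn(s) : (size_t)-1;

    /* Drop our reference to the handle. */
    dlclose(handle);
    return result;
}
```

Loading a real library works the same way, with a path in place of NULL; on some non-Apple systems the program may also need to be linked with -ldl.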
While I've gone over quite a bit, I've also left out a lot of information in this article:
- Two-level namespaces, which prevent trivial symbol collisions in dynamic libraries
- The dyld shared cache, which maintains a systemwide map of already-loaded dynamic libraries for fast binding
- Code signing
- Dynamic library linking
- dyld's expansive set of environment variables
- "Restricted" binaries
- Most of the kernel's interaction with dyld
- Compression and encryption in Mach-O binaries
- How dyld itself is built
- Symbol interposing
- dyld's operation on i386 and ARM, which is conceptually the same, though both architectures differ significantly in the details
- Details of the Mach-O binary format
- How "fat" binaries are handled
I've left these out for two reasons: One, I was a bit behind when writing this article and just didn't have time to put it all in, and two, there really isn't space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and dyld are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn't seem too interested in updating the documentation):
Apple's Mach-O documentation
Apple's Mach-O reference
The Mach-O "loader" header, a very good reference (the other headers alongside it are also worth a look)
Apple's dyld Reference
The dlopen(3) manpage
dyld's Release Notes
dyld's source code as of 10.8.2
Kernel source code as of 10.8.2 (look at bsd/kern/mach_loader.c in particular)
dyld is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and dyld has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of dyld, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you'll have a better idea of where to look for the problem. Then again, maybe not; ld can be pretty obstructive.
That's all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!