mikeash.com: just this guy, you know?

Posted at 2012-11-09 15:51
Tags: assembly dyld fridayqna guest link linking macho
Friday Q&A 2012-11-09: dyld: Dynamic Linking On OS X
by Gwynne Raskind  

In the course of a recent job interview, I had an opportunity to study some of the internals of dyld, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.

Because the precise details of how dyld works are quite complicated and change frequently, and because I don't yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you're curious about the particulars, I strongly recommend dyld's source code, which is publicly available at http://opensource.apple.com.

Static linking
So, let's start by talking about static linking, generally referred to simply as 'linking'. This is the step that typically happens after compiling, where the object files (the machine language the compiler churned out from your source code) are 'linked' together into a single binary file.

Why does static linking matter to dynamic linking? Because the static linker, ld (and ld64) is responsible for transforming symbol references in your source code into indirect symbol lookups for dyld to use later. Here's a very simple example:

    #include <stdio.h>

    // This is the actual full declaration of main() on OS X. The "apple"
    //  parameter is the path to the executable, i.e. _NSGetProgname().
    int main(int argc, char **argv, char **envp, char **apple)
    {
        puts("Hello, world!"); // puts() appends the newline itself
        return 0;
    }

The (optimized) assembly for this, as generated by clang -S test.c -o test.s -Os and stripped of a bit of debug info, is:

            .section        __TEXT,__text,regular,pure_instructions
            .globl  _main
    _main:                                  ## @main
            pushq   %rbp
            movq    %rsp, %rbp
            leaq    L_str(%rip), %rdi
            callq   _puts
            xorl    %eax, %eax
            popq    %rbp
            ret
            .section        __TEXT,__cstring,cstring_literals
    L_str:                                  ## @str
            .asciz  "Hello, world!"

Seems straightforward enough. Let's compile it into an object file and dump the fully compiled version (clang -c test.c -o test.o -Os, otool -tv test.o):

    0000000000000000        pushq   %rbp
    0000000000000001        movq    %rsp,%rbp
    0000000000000004        leaq    0x00000000(%rip),%rdi
    000000000000000b        callq   0x00000010
    0000000000000010        xorl    %eax,%eax
    0000000000000012        popq    %rbp
    0000000000000013        ret

Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the leaq instruction, a displacement of 0 simply yields the current value of rip, which is the address of the next instruction. The callq instruction takes a signed offset relative to that same point, which means that an offset of 0 calls the very next instruction in the code (address 0x10 in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (otool -r test.o):

    Relocation information (__TEXT,__text) 2 entries
    address  pcrel length extern type    scattered symbolnum/value
    0000000c 1     2      1      2       0         4
    00000007 1     2      1      1       0         0

The first entry says, "At offset 0xc in the __TEXT,__text section, there is an unscattered, external, PC-relative X86_64_RELOC_BRANCH reference of length 'long word' to the symbol at index 4 in the symbol table." A peek at the symbol table (nm -ap) gives us:

    0000000000000014 s L_str
    0000000000000048 s EH_frame0
    0000000000000000 T _main
    0000000000000060 S _main.eh
                     U _puts

The symbol at index 4 (the fifth entry) is _puts. Similarly, the symbol at index 0 is L_str, which will be relocated at offset 0x7 of the object file (three bytes into the leaq instruction). Finally, let's look at the result of linking this object into an executable (clang test.c -o test -Os, otool -tv test):

    0000000100000f36        pushq   %rbp
    0000000100000f37        movq    %rsp,%rbp
    0000000100000f3a        leaq    0x00000029(%rip),%rdi
    0000000100000f41        callq   0x100000f4a
    0000000100000f46        xorl    %eax,%eax
    0000000100000f48        popq    %rbp
    0000000100000f49        ret

ld has:

  1. Located the __TEXT segment at the standard executable load address for x86_64, 0x0000000100000000, and the __TEXT,__text section at 0xf36 after that. The first 0xf35 (actually, 0xa0f, since the larger offset doesn't account for the file's Mach-O header) bytes of __TEXT are zeroed out. This aligns the __TEXT segment flush up against the __DATA segment. I don't know exactly why this is done, though I assume it has something to do with cache efficiency.
  2. Replaced 0 with the actual offset from the leaq instruction to the L_str symbol, which in this case is 0x29. RIP-relative offsets are measured from the address of the following instruction (0x100000f41), so the resulting address is 0x100000f6a, which a peek at the load commands (otool -l test) tells us is the exact beginning of the __TEXT,__cstring section.
  3. Replaced 0 with the address of the symbol stub for puts(), which comes immediately after main. Another peek at the load commands puts this in the __TEXT,__stubs section, which we'll look at in detail later.
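
As a rough illustration of what "replacing the zeroes" involves, here is a minimal sketch in C of applying one 4-byte PC-relative relocation to a section's bytes. The names simple_reloc and apply_pcrel32 are hypothetical and are not taken from ld's source; the sketch assumes the displacement is measured from the end of the 4-byte field, which holds for the callq and leaq instructions in this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one row of `otool -r` output. */
struct simple_reloc {
    uint32_t address;  /* offset of the field to patch within the section */
    unsigned length;   /* log2 of the field size; 2 means 4 bytes */
    int      pcrel;    /* nonzero if the field holds a PC-relative value */
};

/*
 * Patch one 4-byte PC-relative field so that it refers to `target`.
 * `section_vmaddr` is the address the section will occupy at runtime.
 * Returns 0 on success, -1 if the relocation is unsupported or out of range.
 */
int apply_pcrel32(uint8_t *section, size_t section_size,
                  uint64_t section_vmaddr,
                  struct simple_reloc r, uint64_t target)
{
    if (!r.pcrel || r.length != 2)
        return -1;
    if (r.address > section_size || section_size - r.address < 4)
        return -1;

    /* The CPU adds the displacement to the address just past the field. */
    uint64_t next_pc = section_vmaddr + r.address + 4;
    int64_t disp = (int64_t)(target - next_pc);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    /* x86-64 is little-endian; write byte by byte so any host works. */
    uint32_t bits = (uint32_t)(int32_t)disp;
    for (int i = 0; i < 4; i++)
        section[r.address + i] = (uint8_t)(bits >> (8 * i));
    return 0;
}
```

Applied to the callq at offset 0xb of __text (relocation address 0xc), with __text placed at 0x100000f36 and the stub at 0x100000f4a, the patched field holds 4: the stub sits four bytes past the end of the call instruction.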

Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity's sake I won't go into it here.

What does dyld do, anyway?
dyld is actually responsible for quite a bit of work, all told. It (in roughly this order):

  1. Bootstraps itself based on the very simple raw stack set up for the process by the kernel.
  2. Recursively and cachingly loads all dependent dynamic libraries the executable links to into the process' memory space, including any necessary perusal of search paths from both the environment and the executable's "runpaths".
  3. Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.
  4. Runs static initializers for the executable.
  5. Sets up the parameters to the executable's main function and calls it.
  6. During the process' execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the dl*() API), and provides hooks for gdb and other debuggers to get critical information.
  7. Runs static terminator routines after main returns.
  8. In some scenarios, makes the required call to libSystem's _exit routine once main returns.

I'll examine each step roughly in order.

dyld is the very first code run in a new process. In particular, a symbol by the very descriptive name of __dyld_start is called. This happens due to a bit of magic in the kernel which notices the LC_LOAD_DYLINKER load command in the main executable and uses the given dynamic linker's entry symbol as the process' initial instruction pointer. __dyld_start performs the following pseudocode (the actual implementation is a compact bit of assembly code):

    noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
    {
        stack push 0 // debugger end of frames marker
        stack align 16 // SSE align stack
        uint64_t slide = __dyld_start - __dyld_start_static;
        void *glue = NULL;
        void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
        if (glue)
            push glue // pretend the return address is a glue routine in dyld
        else
            stack restore // undo stack stuff we did before
        goto *entry(argc, argv, envp, apple); // never returns
    }

In retrospect, I'm not sure that pseudocode is any more sensible than the assembly would have been, but let's walk through it quickly:

  1. Push a 0 onto the stack, and align the stack to SSE requirements.
  2. Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of __dyld_start.
  3. Run dyld's actual bootstrap routine, which sets up some minimal state for dyld itself (such as pulling in certain functions from libSystem without actually linking to it and setting up Mach messaging) and then runs dyld's real main routine, which does loading, linking, and initializers.
  4. If dyld detected that the main executable uses the LC_MAIN load command to set up its entry point, it returns the address of a glue routine which is responsible for calling _exit when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it's the routine's return address; the ret instruction at the end of that function will jump to that glue code.
  5. If, on the other hand, dyld detected the executable using the older LC_UNIXTHREAD load command, it simply restores the stack to its original state and jumps to that entry point, which will be the start routine from crt1.o, the C runtime. The C runtime basically redoes all the work that __dyld_start just did, minus the actual dyld startup, which is one of the reasons it was replaced with the LC_MAIN command.
  6. Jump to the entry point.
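
The slide calculation in step 2 is plain pointer arithmetic, and a small sketch may make it more concrete. The function names below (image_slide, slid_address) are made up for illustration and do not appear in dyld.

```c
#include <stdint.h>

/* Slide = where the image actually landed minus where it was linked to land. */
intptr_t image_slide(uintptr_t runtime_addr, uintptr_t linked_addr)
{
    return (intptr_t)(runtime_addr - linked_addr);
}

/* Convert an address recorded at link time into one valid in this process. */
uintptr_t slid_address(uintptr_t linked_addr, intptr_t slide)
{
    return linked_addr + (uintptr_t)slide;
}
```

Once the slide of one known symbol has been computed, the same value can be added to every other link-time address in that image, which is why a single subtraction is enough to get dyld going.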

Each time dyld has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, "linking" sure has a lot of different uses, doesn't it?)

Locating the correct binary on disk is usually fairly simple. The LC_LOAD_DYLIB command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path begins with a special marker, such as @executable_path, @loader_path, or @rpath, that tells dyld to look somewhere else.

There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.
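
To give a feel for the path handling, here is a simplified sketch of expanding two of the markers dyld recognizes, @executable_path and @loader_path. The function name expand_dylib_path and its parameters are hypothetical, and the sketch deliberately leaves out @rpath, which requires searching a list of run paths, as well as the default and environment-supplied search paths.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand a leading "@executable_path/" or "@loader_path/" in `path`.
 * `exe_dir` is the directory of the main executable, `loader_dir` the
 * directory of the binary that contains the load command.
 * Any other path is copied through unchanged.
 * Returns 0 on success, -1 if the result does not fit in `out`.
 */
int expand_dylib_path(const char *path, const char *exe_dir,
                      const char *loader_dir, char *out, size_t out_size)
{
    static const char exe_marker[] = "@executable_path/";
    static const char loader_marker[] = "@loader_path/";
    const char *base = NULL;
    const char *rest = path;

    if (strncmp(path, exe_marker, sizeof exe_marker - 1) == 0) {
        base = exe_dir;
        rest = path + sizeof exe_marker - 1;
    } else if (strncmp(path, loader_marker, sizeof loader_marker - 1) == 0) {
        base = loader_dir;
        rest = path + sizeof loader_marker - 1;
    }

    int n = base ? snprintf(out, out_size, "%s/%s", base, rest)
                 : snprintf(out, out_size, "%s", path);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The two markers differ only in which directory they are relative to, which matters when a framework loads another library of its own: @loader_path follows the framework, while @executable_path always follows the application.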

Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.

At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It's not complicated; a lazy symbol's binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.

Conceptually, binding a symbol is simple. In practice, it's rather interesting:

  1. Look up, in the binding information of the __LINKEDIT segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for _puts was at 0xf4a (plus some, I'm shortening for simplicity's sake!). If we were to disassemble the machine code at that address, we would get:

        Contents of (__TEXT,__stubs) section
        0000000100000f4a        jmp     *0x000000c0(%rip)
        Contents of (__TEXT,__stub_helper) section
        0000000100000f50        leaq    0x000000b1(%rip),%r11
        0000000100000f57        pushq   %r11
        0000000100000f59        jmp     *0x000000a1(%rip)
        0000000100000f5f        nop
        0000000100000f60        pushq   $0x00000000
        0000000100000f65        jmp     0x100000f50

    Wow, a nice simple jump instruction! Unfortunately, it's not quite as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is...

  2. Look up, also in the binding information, the address of the symbol pointer for puts in the __DATA,__nl_symbol_ptr section. If this is a lazy symbol, look it up in the __DATA,__la_symbol_ptr section instead. In our example executable, these sections look simply like this (using a hybrid of otool's output):

        Contents of (__DATA,__nl_symbol_ptr) section
        0000000100001000        dq      0x0000000000000000
        0000000100001008        dq      0x0000000000000000
        Contents of (__DATA,__la_symbol_ptr) section
        0000000100001010        dq      0x0000000100000f60

    In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!

  3. Update the symbol pointer in the appropriate __DATA section to hold the real address of the symbol in the loaded library. You're done!
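
The RIP-relative operands in the stub listing can be checked by hand, and a tiny helper makes the arithmetic explicit. The function name rip_relative_target is made up for this sketch, and the instruction lengths passed to it are inferred from the gaps between addresses in the listing rather than decoded from the bytes.

```c
#include <stdint.h>

/*
 * Address referenced by a RIP-relative operand: the displacement is added
 * to the address of the instruction that follows the current one.
 */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

Feeding in the stub's jmp (address 0x100000f4a, 6 bytes, displacement 0xc0) gives 0x100001010, the lazy symbol pointer. The stub helper's leaq (0x100000f50, 7 bytes, 0xb1) and jmp (0x100000f59, 6 bytes, 0xa1) give 0x100001008 and 0x100001000, the two non-lazy symbol pointers.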

So what, you may be asking, are all this crazy indirection and all these extra sections all about?

Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can't put writable data in the __TEXT segment, which holds executable code. This means you can't update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Secondly, you can't put executable code in the __DATA segment, which holds writable data! So you can't just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.

All this is true of lazily-bound symbols as well, but with a few caveats. dyld does not immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn't a simple 0, but rather points to the "stub helper". The stub helper is a bit of code embedded in the __TEXT,__stub_helper section (really? who'd've guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for dyld's internal symbol binder. It doesn't show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to dyld. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.
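
The lazy binding dance can be imitated in plain C with a function pointer, which may help if the assembly is hard to follow. This is only an analogy: every name below is invented for the sketch, a single C function plays the roles of both the stub helper and dyld's binder, and the "library symbol" being bound is simply strlen.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

static int binder_calls = 0;
static size_t strlen_stub_helper(const char *s);

/* Plays the role of the __la_symbol_ptr entry: starts out aimed at the helper. */
static strlen_fn lazy_strlen_ptr = strlen_stub_helper;

/* Plays the role of the stub helper plus the binder: resolve, patch, forward. */
static size_t strlen_stub_helper(const char *s)
{
    binder_calls++;
    lazy_strlen_ptr = strlen;   /* update the symbol pointer */
    return lazy_strlen_ptr(s);  /* complete the original call */
}

/* Plays the role of the stub: an indirect jump through the symbol pointer. */
size_t call_strlen_via_stub(const char *s)
{
    return lazy_strlen_ptr(s);
}

/* How many times the slow path has run so far. */
int binder_call_count(void)
{
    return binder_calls;
}
```

The first call through call_strlen_via_stub takes the slow path and rewrites the pointer; every later call goes straight to strlen, so the count reported by binder_call_count stays at one.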

Static initializers, static terminators, and runtime services
Most of the interesting stuff has already happened at this point. dyld will run any static initializers in the executable (most often constructors for global C++ objects and +load methods for Objective-C classes, though there are also __attribute__((constructor)) functions for plain C). A list of initializers is stored in a separate __DATA,__mod_init_func section in the binary, and is simply a set of addresses into the __TEXT,__text section which dyld calls in order of appearance. Initializer functions are passed the same arguments as main.

When the process exits, dyld will also run static terminators, which mostly means static destructors for C++ objects and __attribute__((destructor)) functions. These are handled just like static initializers, except that they're stored in __DATA,__mod_term_func and take no parameters. Static terminators run in the same context as an atexit() function.
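
For plain C, the attribute syntax looks like the following sketch. The function and variable names are made up, and the functions take no parameters so the same code also builds with GCC or clang on other platforms.

```c
#include <stdio.h>

static int initializer_ran = 0;

/* Runs before main(), via the image's list of static initializers. */
__attribute__((constructor))
static void early_setup(void)
{
    initializer_ran = 1;
}

/* Runs at process exit, alongside the atexit() handlers. */
__attribute__((destructor))
static void late_teardown(void)
{
    fputs("static terminator ran\n", stderr);
}

/* Lets other code check whether the initializer has already fired. */
int initializer_has_run(void)
{
    return initializer_ran;
}
```

By the time main starts, initializer_has_run() already reports 1, which is the whole point: the work happens before any of the program's ordinary code gets a chance to run.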

Finally, dyld provides runtime services to binaries it has loaded. The dl*() APIs, namely dlopen(), dlsym(), dlclose(), dlerror(), and dladdr(), are the preferred interface to dyld's services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated).
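
As a small taste of the dl*() interface, the sketch below looks up strlen at runtime and calls it through the returned pointer. The wrapper name strlen_via_dlsym is invented for the example; passing NULL to dlopen() asks for a handle to the main program, whose search scope includes the libraries it was loaded with. On some systems this may need to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

typedef size_t (*strlen_fn)(const char *);

/* Returns the length of s, or (size_t)-1 if the lookup fails. */
size_t strlen_via_dlsym(const char *s)
{
    /* NULL means "the main program and everything loaded with it". */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL)
        return (size_t)-1;

    /* The POSIX-recommended way to store dlsym()'s result in a function pointer. */
    strlen_fn fn = NULL;
    *(void **)(&fn) = dlsym(self, "strlen");

    size_t result = fn ? fn(s) : (size_t)-1;
    dlclose(self);
    return result;
}
```

Note that the name passed to dlsym() is the C name, without the leading underscore that shows up in nm output.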

What's missing
While I've gone over quite a bit, I've also left out a lot of information in this article:

I've left these out for two reasons: One, I was a bit behind when writing this article and just didn't have time to put it all in, and two, there really isn't space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and dyld are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn't seem too interested in updating the documentation):

Apple's Mach-O documentation
Apple's Mach-O reference
The Mach-O "loader" header, a very good reference (also look at other files in the mach-o/ directory)
Apple's dyld Reference
The dlopen(3) manpage
dyld's Release Notes
dyld's source code as of 10.8.2
Kernel source code as of 10.8.2 (look at bsd/kern/kern_exec.c and bsd/kern/mach_loader.c in particular)

dyld is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and dyld has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of dyld, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you'll have a better idea of where to look for the problem. Then again, maybe not; ld can be pretty obstructive.

That's all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!



David Morgenstern at 2012-11-14 05:55:55:
BTW: I've used Nick Zitzmann's SymbolicLinker app …


david m.

mikeash at 2012-11-14 14:33:57:
David Morgenstern: Wrong kind of linking....
