<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>NSBlog</title><link>http://www.mikeash.com/pyblog/</link><description>Mac OS X and Cocoa programming</description><lastBuildDate>Sat, 18 May 2013 15:26:50 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Friday Q&amp;amp;A 2013-05-17: Let's Build stringWithFormat:
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-05-17-lets-build-stringwithformat.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-05-17: Let's Build stringWithFormat:
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 05 17  14 40"
                  tags="fridayqna letsbuild c objectivec cocoa"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-05-17: Let's Build stringWithFormat:
&lt;/div&gt;
              &lt;p&gt;Our long effort to rebuild Cocoa piece by piece continues. For today, reader Nate Heagy has suggested building &lt;code&gt;NSString&lt;/code&gt;'s &lt;code&gt;stringWithFormat:&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;String Formatting&lt;/b&gt;&lt;br&gt;It's hard to get very far in Cocoa without knowing about format strings, but just in case, here's a recap.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stringWithFormat:&lt;/code&gt;, as well as other calls like &lt;code&gt;NSLog&lt;/code&gt;, take strings that can use special format specifiers of the form &lt;code&gt;%x&lt;/code&gt;. The &lt;code&gt;%&lt;/code&gt; indicates that it's a format specifier, which reads an additional argument and adds it to the string. The character after it specifies what kind of data to display. For example:&lt;/p&gt;

&lt;pre&gt;    [NSString stringWithFormat: @"Hello, %@: %d %f", @"world", 42, 1.0]
&lt;/pre&gt;

&lt;p&gt;This produces the string:&lt;/p&gt;

&lt;pre&gt;    Hello, world: 42 1.000000
&lt;/pre&gt;

&lt;p&gt;This is useful for all sorts of things, from creating user-visible text, to making dictionary keys, to printing debug logs.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Variable Arguments&lt;/b&gt;&lt;br&gt;This method takes variable arguments, which is an odd corner of C. For more extensive coverage of how to write such methods, see &lt;a href="friday-qa-2009-08-21-writing-vararg-macros-and-functions.html"&gt;my article on vararg macros and functions&lt;/a&gt;. Here's a quick recap.&lt;/p&gt;

&lt;p&gt;You declare the function or method to take variable arguments by putting &lt;code&gt;...&lt;/code&gt; at the end of the parameter list. For a method, this ends up being slightly odd syntax:&lt;/p&gt;

&lt;pre&gt;    + (id)stringWithFormat: (NSString *)format, ...;
&lt;/pre&gt;

&lt;p&gt;That &lt;code&gt;, ...&lt;/code&gt; thing at the end is actually legal Objective-C.&lt;/p&gt;

&lt;p&gt;Once in the method, declare a variable of type &lt;code&gt;va_list&lt;/code&gt; to represent the variable argument list. The &lt;code&gt;va_start&lt;/code&gt; and &lt;code&gt;va_end&lt;/code&gt; macros initialize it and clean it up, respectively. Each call to the &lt;code&gt;va_arg&lt;/code&gt; macro extracts one argument from the list and returns it.&lt;/p&gt;
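
&lt;p&gt;As a minimal sketch of those macros working together, here's a plain C function that adds up its arguments. The name &lt;code&gt;sum_ints&lt;/code&gt; and its leading &lt;code&gt;count&lt;/code&gt; parameter are made up for illustration and aren't part of the code discussed in this article:&lt;/p&gt;

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: adds up `count` int arguments. */
int sum_ints(int count, ...)
{
    assert(count >= 0);

    va_list arguments;             /* represents the variable argument list */
    va_start(arguments, count);    /* initialize, naming the last fixed parameter */

    int total = 0;
    for(int i = 0; i < count; i++)
        total += va_arg(arguments, int);    /* extract one int per call */

    va_end(arguments);             /* clean up before returning */
    return total;
}
```

&lt;p&gt;Calling &lt;code&gt;sum_ints(3, 1, 2, 3)&lt;/code&gt; returns &lt;code&gt;6&lt;/code&gt;. The explicit count is there because C provides no way to ask a &lt;code&gt;va_list&lt;/code&gt; how many arguments it holds. A formatting method gets the same information from the format string instead.&lt;/p&gt;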

&lt;p&gt;&lt;b&gt;Code&lt;/b&gt;&lt;br&gt;As usual, I have posted the code on GitHub. You can view the repository here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mikeash/StringWithFormat"&gt;https://github.com/mikeash/StringWithFormat&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code supports an extremely limited subset of the full &lt;code&gt;NSString&lt;/code&gt; formatting functionality. &lt;code&gt;NSString&lt;/code&gt; supports a huge number of specifiers, as well as options such as field width, precision, and out-of-order arguments. My reimplementation sticks to a basic set that's enough to illustrate what's going on. In particular, it supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;%d&lt;/code&gt; - &lt;code&gt;int&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%ld&lt;/code&gt; - &lt;code&gt;long&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%lld&lt;/code&gt; - &lt;code&gt;long long&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%u&lt;/code&gt;, &lt;code&gt;%lu&lt;/code&gt;, and &lt;code&gt;%llu&lt;/code&gt;, for the unsigned variants of the above.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%f&lt;/code&gt; - &lt;code&gt;double&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%s&lt;/code&gt; - C strings&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%@&lt;/code&gt; - Objective-C objects&lt;/li&gt;
&lt;li&gt;&lt;code&gt;%%&lt;/code&gt; - Output a single &lt;code&gt;%&lt;/code&gt; character.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, no options are supported.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Interface&lt;/b&gt;&lt;br&gt;For my reimplementation, I wrote a function called &lt;code&gt;MAStringWithFormat&lt;/code&gt; that does the same thing as &lt;code&gt;[NSString stringWithFormat:]&lt;/code&gt;. However, I wrapped the meat of the implementation in a class to organize the various bits of state needed. That function just makes a &lt;code&gt;va_list&lt;/code&gt; for the arguments, instantiates a formatter, and asks it to do the work:&lt;/p&gt;

&lt;pre&gt;    NSString *MAStringWithFormat(NSString *format, ...)
    {
        va_list arguments;
        va_start(arguments, format);

        MAStringFormatter *formatter = [[MAStringFormatter alloc] init];
        NSString *result = [formatter format: format arguments: arguments];

        va_end(arguments);

        return result;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;MAStringFormatter&lt;/code&gt; class essentially carries out two tasks in parallel. First, it reads through the format string character-by-character, and secondly, it writes the resulting string. Accordingly, it has two groups of instance variables. The first group deals with reading through the format string:&lt;/p&gt;

&lt;pre&gt;    CFStringInlineBuffer _formatBuffer;
    NSUInteger _formatLength;
    NSUInteger _cursor;
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;CFStringInlineBuffer&lt;/code&gt; is a little-known API in &lt;code&gt;CFString&lt;/code&gt; that allows for efficiently iterating through the individual characters of a string. Making a function or method call for each character is slow, so &lt;code&gt;CFStringInlineBuffer&lt;/code&gt; allows fetching them in bulk for greater efficiency. The length of the format string is stored to avoid running off the end, and the current position within the format string is stored in &lt;code&gt;_cursor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The second group deals with collecting the output of the formatting operation. It consists of a buffer of characters, the current location within that buffer, and its total size:&lt;/p&gt;

&lt;pre&gt;    unichar *_outputBuffer;
    NSUInteger _outputBufferCursor;
    NSUInteger _outputBufferLength;
&lt;/pre&gt;

&lt;p&gt;This could be implemented using an &lt;code&gt;NSMutableData&lt;/code&gt; or &lt;code&gt;NSMutableString&lt;/code&gt;, but this is much more efficient. While this code isn't intended to be particularly fast in general, I just couldn't stand the thought of making each character run through a call to a string object.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reading&lt;/b&gt;&lt;br&gt;&lt;code&gt;MAStringFormatter&lt;/code&gt; has a &lt;code&gt;read&lt;/code&gt; method which fetches the next character from &lt;code&gt;_formatBuffer&lt;/code&gt;, and returns &lt;code&gt;-1&lt;/code&gt; once it reaches the end of the string. There isn't a whole lot to this, just an &lt;code&gt;if&lt;/code&gt; check and a call to &lt;code&gt;CFStringGetCharacterFromInlineBuffer&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (int)read
    {
        if(_cursor &amp;lt; _formatLength)
            return CFStringGetCharacterFromInlineBuffer(&amp;amp;_formatBuffer, _cursor++);
        else
            return -1;
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Writing&lt;/b&gt;&lt;br&gt;Writing is a little more complex, because the size of the output string isn't known ahead of time. First, there's a &lt;code&gt;doubleOutputBuffer&lt;/code&gt; method that increases the size of the output buffer. If the buffer is completely empty, it allocates it to hold &lt;code&gt;64&lt;/code&gt; characters. If it's already allocated, it doubles the size:&lt;/p&gt;

&lt;pre&gt;    - (void)doubleOutputBuffer
    {
        if(_outputBufferLength == 0)
            _outputBufferLength = 64;
        else
            _outputBufferLength *= 2;
&lt;/pre&gt;

&lt;p&gt;Once the new buffer length is computed, a simple call to &lt;code&gt;realloc&lt;/code&gt; allocates or reallocates the buffer:&lt;/p&gt;

&lt;pre&gt;        _outputBuffer = realloc(_outputBuffer, _outputBufferLength * sizeof(*_outputBuffer));
    }
&lt;/pre&gt;

&lt;p&gt;Next, there's a &lt;code&gt;write:&lt;/code&gt; method, which takes a single &lt;code&gt;unichar&lt;/code&gt; and appends it to the buffer. If the write cursor is already at the end of the buffer, it first increases the size of the buffer:&lt;/p&gt;

&lt;pre&gt;    - (void)write: (unichar)c
    {
        if(_outputBufferCursor &amp;gt;= _outputBufferLength)
            [self doubleOutputBuffer];
&lt;/pre&gt;

&lt;p&gt;Once sufficient storage is assured, it places &lt;code&gt;c&lt;/code&gt; at the current cursor position, and advances the cursor:&lt;/p&gt;

&lt;pre&gt;        _outputBuffer[_outputBufferCursor] = c;
        _outputBufferCursor++;
    }
&lt;/pre&gt;
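
&lt;p&gt;The same doubling strategy can be sketched in plain C outside of any class. This is only an illustration under a few assumptions: the &lt;code&gt;output_buffer&lt;/code&gt; struct and both function names are hypothetical, it stores &lt;code&gt;char&lt;/code&gt; rather than &lt;code&gt;unichar&lt;/code&gt;, and it simply aborts if &lt;code&gt;realloc&lt;/code&gt; fails, which the methods above don't check for:&lt;/p&gt;

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the formatter's three output variables. */
struct output_buffer {
    char *data;       /* NULL until the first write */
    size_t cursor;    /* number of characters written so far */
    size_t length;    /* allocated capacity, in characters */
};

/* Allocates 64 characters the first time, then doubles the capacity. */
void output_double(struct output_buffer *out)
{
    size_t newLength = out->length == 0 ? 64 : out->length * 2;
    char *grown = realloc(out->data, newLength * sizeof(*grown));
    if(grown == NULL)
        abort();    /* sketch only: give up on allocation failure */
    out->data = grown;
    out->length = newLength;
}

/* Appends one character, growing the buffer first if it is full. */
void output_write(struct output_buffer *out, char c)
{
    assert(out != NULL);
    if(out->cursor >= out->length)
        output_double(out);
    out->data[out->cursor++] = c;
}
```

&lt;p&gt;Starting from a zeroed struct, writing 100 characters triggers two allocations and leaves the capacity at 128. Doubling keeps the number of reallocations logarithmic in the final size, which is why appending one character at a time stays cheap on average.&lt;/p&gt;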

&lt;p&gt;&lt;b&gt;Formatting&lt;/b&gt;&lt;br&gt;The &lt;code&gt;format:arguments:&lt;/code&gt; method is the entry point to where the real work gets done. The first thing it does is fill out the format string instance variables using the &lt;code&gt;format&lt;/code&gt; argument:&lt;/p&gt;

&lt;pre&gt;    - (NSString *)format: (NSString *)format arguments: (va_list)arguments
    {
        _formatLength = [format length];
        CFStringInitInlineBuffer((__bridge CFStringRef)format, &amp;amp;_formatBuffer, CFRangeMake(0, _formatLength));
        _cursor = 0;
&lt;/pre&gt;

&lt;p&gt;It also initializes the output variables. This isn't necessary, strictly speaking, but leaves open the possibility of reusing a single formatter object:&lt;/p&gt;

&lt;pre&gt;        _outputBuffer = NULL;
        _outputBufferCursor = 0;
        _outputBufferLength = 0;
&lt;/pre&gt;

&lt;p&gt;After that, it loops through the format string until it runs off the end:&lt;/p&gt;

&lt;pre&gt;        int c;
        while((c = [self read]) &amp;gt;= 0)
        {
&lt;/pre&gt;

&lt;p&gt;All format specifiers begin with the &lt;code&gt;'%'&lt;/code&gt; character. If &lt;code&gt;c&lt;/code&gt; is not a &lt;code&gt;'%'&lt;/code&gt;, then just write the character directly to the output:&lt;/p&gt;

&lt;pre&gt;            if(c != '%')
            {
                [self write: c];
            }
&lt;/pre&gt;

&lt;p&gt;This comparison uses the character literal &lt;code&gt;'%'&lt;/code&gt; despite the fact that &lt;code&gt;read&lt;/code&gt; deals in &lt;code&gt;unichar&lt;/code&gt;. This works because the first 128 Unicode code points map directly to the 128 ASCII characters. When a &lt;code&gt;unichar&lt;/code&gt; contains a &lt;code&gt;%&lt;/code&gt;, it contains the same value as the ASCII &lt;code&gt;'%'&lt;/code&gt;, and the same is true for any other ASCII character. This is terribly convenient when working with ASCII data in &lt;code&gt;NSString&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;c&lt;/code&gt; is a &lt;code&gt;'%'&lt;/code&gt; character, then there's a format specifier to come. What happens at this point depends on what the next character is:&lt;/p&gt;

&lt;pre&gt;            else
            {
                int next = [self read];
&lt;/pre&gt;

&lt;p&gt;If the format specifier is a &lt;code&gt;'d'&lt;/code&gt;, then it reads an &lt;code&gt;int&lt;/code&gt; from the arguments and passes it to the &lt;code&gt;writeLongLong:&lt;/code&gt; method, which handles the actual work of formatting the value into the output. All signed integers pass through that method. Since &lt;code&gt;long long&lt;/code&gt; is the largest signed data type handled, a single method that prints those will work for all signed types:&lt;/p&gt;

&lt;pre&gt;                if(next == 'd')
                {
                    int value = va_arg(arguments, int);
                    [self writeLongLong: value];
                }
&lt;/pre&gt;

&lt;p&gt;If the format specifier is &lt;code&gt;'u'&lt;/code&gt;, then it does the same thing as above, but with &lt;code&gt;unsigned&lt;/code&gt;, and calling through to the &lt;code&gt;writeUnsignedLongLong:&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;                else if(next == 'u')
                {
                    unsigned value = va_arg(arguments, unsigned);
                    [self writeUnsignedLongLong: value];
                }
&lt;/pre&gt;

&lt;p&gt;Note that &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;unsigned&lt;/code&gt; are the smallest integer types handled here. There is no code to handle &lt;code&gt;char&lt;/code&gt; or &lt;code&gt;short&lt;/code&gt;. This is because of C promotion rules for functions that take variable arguments. When passed as variable arguments, values of type &lt;code&gt;char&lt;/code&gt; or &lt;code&gt;short&lt;/code&gt; are promoted to &lt;code&gt;int&lt;/code&gt;. On typical platforms, where &lt;code&gt;int&lt;/code&gt; can hold every value of the &lt;code&gt;unsigned&lt;/code&gt; variants, those are promoted to &lt;code&gt;int&lt;/code&gt; as well, and &lt;code&gt;va_arg&lt;/code&gt; allows reading such a non-negative value as &lt;code&gt;unsigned int&lt;/code&gt;. This means that the code for &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;unsigned&lt;/code&gt; handles the smaller data types as well, without any additional work.&lt;/p&gt;
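
&lt;p&gt;To see the promotion in action, here's a small plain C sketch. The &lt;code&gt;collect_ints&lt;/code&gt; function is a hypothetical helper written for this illustration. It reads every variable argument as an &lt;code&gt;int&lt;/code&gt;, even when the caller passes a &lt;code&gt;char&lt;/code&gt; or a &lt;code&gt;short&lt;/code&gt;:&lt;/p&gt;

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: reads `count` variable arguments as int and
   stores them in `out`. */
void collect_ints(int *out, int count, ...)
{
    assert(out != NULL || count == 0);

    va_list arguments;
    va_start(arguments, count);
    for(int i = 0; i < count; i++)
        out[i] = va_arg(arguments, int);    /* char and short arrive here as int */
    va_end(arguments);
}
```

&lt;p&gt;Passing a &lt;code&gt;char&lt;/code&gt; holding &lt;code&gt;'A'&lt;/code&gt;, a &lt;code&gt;short&lt;/code&gt; holding &lt;code&gt;-7&lt;/code&gt;, and the &lt;code&gt;int&lt;/code&gt; &lt;code&gt;100000&lt;/code&gt; fills the array with those three values unchanged. Asking &lt;code&gt;va_arg&lt;/code&gt; for a &lt;code&gt;char&lt;/code&gt; or &lt;code&gt;short&lt;/code&gt; directly would be an error, since no argument of that type is ever actually passed.&lt;/p&gt;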

&lt;p&gt;If the next character is &lt;code&gt;'l'&lt;/code&gt;, then we need to keep reading to figure out what to do:&lt;/p&gt;

&lt;pre&gt;                else if(next == 'l')
                {
                    next = [self read];
&lt;/pre&gt;

&lt;p&gt;If the character following the &lt;code&gt;'l'&lt;/code&gt; is &lt;code&gt;'d'&lt;/code&gt;, then the argument is a &lt;code&gt;long&lt;/code&gt;. Follow the same basic procedure as before:&lt;/p&gt;

&lt;pre&gt;                    if(next == 'd')
                    {
                        long value = va_arg(arguments, long);
                        [self writeLongLong: value];
                    }
&lt;/pre&gt;

&lt;p&gt;Likewise, if the next character is &lt;code&gt;'u'&lt;/code&gt;, it's an &lt;code&gt;unsigned long&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;                    else if(next == 'u')
                    {
                        unsigned long value = va_arg(arguments, unsigned long);
                        [self writeUnsignedLongLong: value];
                    }
&lt;/pre&gt;

&lt;p&gt;If the next character is &lt;code&gt;'l'&lt;/code&gt; again, then we need to read one character further:&lt;/p&gt;

&lt;pre&gt;                    else if(next == 'l')
                    {
                        next = [self read];
&lt;/pre&gt;

&lt;p&gt;Here, &lt;code&gt;'d'&lt;/code&gt; indicates a &lt;code&gt;long long&lt;/code&gt;, and &lt;code&gt;'u'&lt;/code&gt; indicates an &lt;code&gt;unsigned long long&lt;/code&gt;. These are handled in the same fashion as before:&lt;/p&gt;

&lt;pre&gt;                        if(next == 'd')
                        {
                            long long value = va_arg(arguments, long long);
                            [self writeLongLong: value];
                        }
                        else if(next == 'u')
                        {
                            unsigned long long value = va_arg(arguments, unsigned long long);
                            [self writeUnsignedLongLong: value];
                        }
                    }
                }
&lt;/pre&gt;

&lt;p&gt;That's it for the deep sequence of &lt;code&gt;'l'&lt;/code&gt; variants. Next comes a check for &lt;code&gt;'f'&lt;/code&gt;. In that case, the argument is a &lt;code&gt;double&lt;/code&gt;, and gets passed off to a method built to handle that:&lt;/p&gt;

&lt;pre&gt;                else if(next == 'f')
                {
                    double value = va_arg(arguments, double);
                    [self writeDouble: value];
                }
&lt;/pre&gt;

&lt;p&gt;Once again, promotion rules simplify things a bit. When a &lt;code&gt;float&lt;/code&gt; is passed as a variable argument, it's promoted to a &lt;code&gt;double&lt;/code&gt;, so no extra code is needed to handle &lt;code&gt;float&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the format specifier is an &lt;code&gt;'s'&lt;/code&gt;, then the argument is a C string:&lt;/p&gt;

&lt;pre&gt;                else if(next == 's')
                {
                    const char *value = va_arg(arguments, const char *);
&lt;/pre&gt;

&lt;p&gt;This is simple enough not to need a helper method. It iterates through the string until it reaches the terminating &lt;code&gt;0&lt;/code&gt;, writing each character as it goes. This assumes the string contains only ASCII:&lt;/p&gt;

&lt;pre&gt;                    while(*value)
                        [self write: *value++];
                }
&lt;/pre&gt;

&lt;p&gt;If the format specifier is a &lt;code&gt;'@'&lt;/code&gt;, then the argument is an Objective-C object:&lt;/p&gt;

&lt;pre&gt;                else if(next == '@')
                {
                    id value = va_arg(arguments, id);
&lt;/pre&gt;

&lt;p&gt;To find out what to output, ask the value for its description:&lt;/p&gt;

&lt;pre&gt;                    NSString *description = [value description];
&lt;/pre&gt;

&lt;p&gt;The length of the description is also handy:&lt;/p&gt;

&lt;pre&gt;                    NSUInteger length = [description length];
&lt;/pre&gt;

&lt;p&gt;Now, copy the contents of &lt;code&gt;description&lt;/code&gt; into the output buffer. I decided to get a bit fancy here. A simple loop could suffice, perhaps using &lt;code&gt;CFStringInlineBuffer&lt;/code&gt; for speed, but I wanted something nicer. An &lt;code&gt;NSString&lt;/code&gt; can put its contents into an arbitrary buffer, so why not ask &lt;code&gt;description&lt;/code&gt; to put its contents directly into the output buffer? To do that, the output buffer must first be made large enough to contain &lt;code&gt;length&lt;/code&gt; characters:&lt;/p&gt;

&lt;pre&gt;                    while(length &amp;gt; _outputBufferLength - _outputBufferCursor)
                        [self doubleOutputBuffer];
&lt;/pre&gt;

&lt;p&gt;Doing this in a &lt;code&gt;while&lt;/code&gt; loop is mildly inefficient if &lt;code&gt;description&lt;/code&gt; is larger than the buffer is already. However, that's an uncommon case, and the code is nicer by being able to share &lt;code&gt;doubleOutputBuffer&lt;/code&gt;, so I decided to use this approach.&lt;/p&gt;

&lt;p&gt;Now that the output buffer is sufficiently large, use &lt;code&gt;getCharacters:range:&lt;/code&gt; to dump the contents of &lt;code&gt;description&lt;/code&gt; into it, putting it at the location of the output cursor:&lt;/p&gt;

&lt;pre&gt;                    [description getCharacters: _outputBuffer + _outputBufferCursor range: NSMakeRange(0, length)];
&lt;/pre&gt;

&lt;p&gt;Finally, move the cursor past the newly written data:&lt;/p&gt;

&lt;pre&gt;                    _outputBufferCursor += length;
                }
&lt;/pre&gt;

&lt;p&gt;We're nearly at the end. If the character following the &lt;code&gt;'%'&lt;/code&gt; is another &lt;code&gt;'%'&lt;/code&gt;, that's the signal to write a literal &lt;code&gt;'%'&lt;/code&gt; character:&lt;/p&gt;

&lt;pre&gt;                else if(next == '%')
                {
                    [self write: '%'];
                }
            }
        }
&lt;/pre&gt;

&lt;p&gt;That's the last case handled by this miniature implementation. Once the loop terminates, the resulting &lt;code&gt;unichar&lt;/code&gt;s are located in &lt;code&gt;_outputBuffer&lt;/code&gt;, with &lt;code&gt;_outputBufferCursor&lt;/code&gt; indicating the number of &lt;code&gt;unichar&lt;/code&gt;s in the buffer. Create an &lt;code&gt;NSString&lt;/code&gt; from it and return the new string:&lt;/p&gt;

&lt;pre&gt;        NSString *output = [[NSString alloc] initWithCharactersNoCopy: _outputBuffer length: _outputBufferCursor freeWhenDone: YES];
        return output;
    }
&lt;/pre&gt;

&lt;p&gt;Using the &lt;code&gt;NoCopy&lt;/code&gt; variant makes this potentially more efficient, and removes the need to manually &lt;code&gt;free&lt;/code&gt; the buffer.&lt;/p&gt;
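
&lt;p&gt;The overall shape of that loop, read a character, copy it through unless it's a &lt;code&gt;'%'&lt;/code&gt;, otherwise dispatch on the next character, can be condensed into a plain C sketch. This is an illustration rather than a port: &lt;code&gt;mini_format&lt;/code&gt; and &lt;code&gt;put&lt;/code&gt; are hypothetical names, only &lt;code&gt;%d&lt;/code&gt;, &lt;code&gt;%s&lt;/code&gt;, and &lt;code&gt;%%&lt;/code&gt; are understood, the output goes into a fixed-size &lt;code&gt;char&lt;/code&gt; buffer supplied by the caller instead of a growing &lt;code&gt;unichar&lt;/code&gt; buffer, and the digits of &lt;code&gt;%d&lt;/code&gt; are produced with a simple remainder loop instead of the helper methods described in the rest of this article:&lt;/p&gt;

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Appends one character if there is room, always leaving space for
   the terminating 0. Extra characters are silently dropped. */
static void put(char *out, size_t size, size_t *cursor, char c)
{
    if(*cursor + 1 < size)
        out[(*cursor)++] = c;
}

/* Hypothetical miniature formatter: understands only %d, %s, and %%. */
void mini_format(char *out, size_t size, const char *format, ...)
{
    assert(out != NULL && size > 0);

    va_list arguments;
    va_start(arguments, format);

    size_t cursor = 0;
    for(const char *p = format; *p != '\0'; p++)
    {
        if(*p != '%')
        {
            put(out, size, &cursor, *p);    /* ordinary character: copy it through */
            continue;
        }

        char next = *++p;    /* the character after '%' picks the conversion */
        if(next == '\0')
            break;           /* lone '%' at the end of the format: stop */

        if(next == 'd')
        {
            int value = va_arg(arguments, int);
            unsigned magnitude = (unsigned)value;
            if(value < 0)
            {
                put(out, size, &cursor, '-');
                magnitude = -magnitude;    /* unsigned negation is well defined */
            }

            /* Collect digits least significant first, then emit them reversed. */
            char digits[24];
            int count = 0;
            do {
                digits[count++] = (char)('0' + magnitude % 10);
                magnitude /= 10;
            } while(magnitude > 0);
            while(count > 0)
                put(out, size, &cursor, digits[--count]);
        }
        else if(next == 's')
        {
            const char *value = va_arg(arguments, const char *);
            while(*value)
                put(out, size, &cursor, *value++);
        }
        else if(next == '%')
        {
            put(out, size, &cursor, '%');
        }
        /* Any other specifier is skipped: the chain has no final else. */
    }

    va_end(arguments);
    out[cursor] = '\0';
}
```

&lt;p&gt;With a 64-byte buffer, &lt;code&gt;mini_format(buf, sizeof(buf), "Hello, %s: %d%%", "world", -42)&lt;/code&gt; leaves &lt;code&gt;Hello, world: -42%&lt;/code&gt; in &lt;code&gt;buf&lt;/code&gt;.&lt;/p&gt;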

&lt;p&gt;That's the basic shell of the formatting code. To complete it, we need the code to print signed and unsigned &lt;code&gt;long long&lt;/code&gt;s, and code to print &lt;code&gt;double&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;code&gt;unsigned long long&lt;/code&gt;&lt;/b&gt;&lt;br&gt;Let's start with the most fundamental helper method, &lt;code&gt;writeUnsignedLongLong:&lt;/code&gt;. The others ultimately rely on this one for much of their work.&lt;/p&gt;

&lt;p&gt;The algorithm is simple: divide by successive powers of ten, producing a single digit each time. Convert the digit to a &lt;code&gt;unichar&lt;/code&gt; and write it.&lt;/p&gt;

&lt;p&gt;We'll store the power of ten in a variable called &lt;code&gt;cursor&lt;/code&gt; and start it at &lt;code&gt;1&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (void)writeUnsignedLongLong: (unsigned long long)value
    {
        unsigned long long cursor = 1;
&lt;/pre&gt;

&lt;p&gt;However, what we really want is the power of ten with as many digits as the input number. For example, for &lt;code&gt;42&lt;/code&gt;, we want &lt;code&gt;10&lt;/code&gt;. For &lt;code&gt;123456&lt;/code&gt;, we want &lt;code&gt;100000&lt;/code&gt;. To obtain this, we just keep multiplying &lt;code&gt;cursor&lt;/code&gt; by ten until it has the same number of digits as &lt;code&gt;value&lt;/code&gt;, which is easily tested by seeing if &lt;code&gt;value&lt;/code&gt; is less than ten times larger than &lt;code&gt;cursor&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        while(value / cursor &amp;gt;= 10)
            cursor *= 10;
&lt;/pre&gt;

&lt;p&gt;Now we just loop, dividing &lt;code&gt;cursor&lt;/code&gt; by ten each time, until we run out of &lt;code&gt;cursor&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        while(cursor &amp;gt; 0)
        {
&lt;/pre&gt;

&lt;p&gt;The current digit is obtained by dividing &lt;code&gt;value&lt;/code&gt; by &lt;code&gt;cursor&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            int digit = value / cursor;
&lt;/pre&gt;

&lt;p&gt;To compute the &lt;code&gt;unichar&lt;/code&gt; that corresponds with &lt;code&gt;digit&lt;/code&gt;, just add the literal &lt;code&gt;'0'&lt;/code&gt; character. ASCII (and therefore Unicode) lays out digits sequentially starting with &lt;code&gt;'0'&lt;/code&gt;, making this easy:&lt;/p&gt;

&lt;pre&gt;            [self write: '0' + digit];
&lt;/pre&gt;

&lt;p&gt;With the digit written, we remove it from &lt;code&gt;value&lt;/code&gt;, then move &lt;code&gt;cursor&lt;/code&gt; down:&lt;/p&gt;

&lt;pre&gt;            value -= digit * cursor;
            cursor /= 10;
        }
    }
&lt;/pre&gt;

&lt;p&gt;And just like that, the value flows into the output. This code even correctly handles zero, due to ensuring that &lt;code&gt;cursor&lt;/code&gt; is always at least &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
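
&lt;p&gt;For experimenting outside of the formatter class, the same algorithm can be sketched as a standalone C function. The name &lt;code&gt;format_ull&lt;/code&gt; is hypothetical, and instead of calling &lt;code&gt;write:&lt;/code&gt; it stores &lt;code&gt;char&lt;/code&gt; digits in a caller-supplied buffer, which is assumed to have room for 21 characters:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the decimal digits of `value` into `out`,
   followed by a terminating 0, and returns the number of digits. */
size_t format_ull(unsigned long long value, char *out)
{
    assert(out != NULL);

    /* Find the power of ten with as many digits as `value`. */
    unsigned long long cursor = 1;
    while(value / cursor >= 10)
        cursor *= 10;

    size_t count = 0;
    while(cursor > 0)
    {
        int digit = (int)(value / cursor);     /* leading digit of what remains */
        out[count++] = (char)('0' + digit);
        value -= digit * cursor;               /* remove that digit from value */
        cursor /= 10;                          /* move down one decimal place */
    }
    out[count] = '\0';
    return count;
}
```

&lt;p&gt;Passing &lt;code&gt;0&lt;/code&gt; produces the single digit &lt;code&gt;0&lt;/code&gt;, and passing &lt;code&gt;123456&lt;/code&gt; produces six digits. Testing &lt;code&gt;value / cursor&lt;/code&gt; rather than &lt;code&gt;cursor * 10&lt;/code&gt; in the first loop matters for large inputs, because the multiplication only happens once it's known that the result still fits.&lt;/p&gt;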

&lt;p&gt;&lt;b&gt;&lt;code&gt;long long&lt;/code&gt;&lt;/b&gt;&lt;br&gt;The &lt;code&gt;writeLongLong:&lt;/code&gt; method is simple. If the number is less than zero, write a &lt;code&gt;'-'&lt;/code&gt; and negate the number. For positive numbers, do nothing special. Pass the final non-negative number to &lt;code&gt;writeUnsignedLongLong:&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;    - (void)writeLongLong: (long long)value
    {
        unsigned long long unsignedValue = value;
        if(value &amp;lt; 0)
        {
            [self write: '-'];
            unsignedValue = -unsignedValue;
        }
        [self writeUnsignedLongLong: unsignedValue];
    }
&lt;/pre&gt;

&lt;p&gt;There's an odd corner case in here. Due to the nature of &lt;a href="http://en.wikipedia.org/wiki/Two's_complement"&gt;the two's complement representation of signed integers&lt;/a&gt;, the magnitude of the smallest representable &lt;code&gt;long long&lt;/code&gt; is one greater than the magnitude of the largest representable &lt;code&gt;long long&lt;/code&gt; on systems we're likely to encounter.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;long long&lt;/code&gt; on a typical system can hold numbers all the way down to &lt;code&gt;-9223372036854775808&lt;/code&gt;, but only up to &lt;code&gt;9223372036854775807&lt;/code&gt;. This means you can't negate the smallest possible negative number and get a positive number, because the data type can't hold the appropriate positive number. If you try to negate &lt;code&gt;-9223372036854775808&lt;/code&gt;, you get an overflow and undefined behavior, although the result is usually just &lt;code&gt;-9223372036854775808&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;However, negation is well defined on all &lt;code&gt;unsigned&lt;/code&gt; values, and it has the same bitwise result as negation on the bitwise-equivalent signed values. In other words, &lt;code&gt;-signedLongLong&lt;/code&gt; produces the same bits as &lt;code&gt;-(unsigned long long)signedLongLong&lt;/code&gt;. It also works on the bits that make up &lt;code&gt;-9223372036854775808&lt;/code&gt;, and produces &lt;code&gt;9223372036854775808&lt;/code&gt;. By moving &lt;code&gt;value&lt;/code&gt; into &lt;code&gt;unsignedValue&lt;/code&gt; and then negating that, the above code works around the problem of undefined behavior when negating the smallest representable &lt;code&gt;long long&lt;/code&gt;.&lt;/p&gt;
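
&lt;p&gt;Here's that trick isolated in a small plain C sketch. The function names are hypothetical, and the second check relies on the two's complement layout described above:&lt;/p&gt;

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns the magnitude of `value` without ever
   negating a signed number. */
unsigned long long magnitude_of(long long value)
{
    unsigned long long unsignedValue = (unsigned long long)value;
    if(value < 0)
        unsignedValue = -unsignedValue;    /* unsigned negation wraps, never overflows */
    return unsignedValue;
}

/* Usage sketch: ordinary values and the smallest long long both work. */
void magnitude_demo(void)
{
    assert(magnitude_of(42) == 42);
    assert(magnitude_of(-42) == 42);
    /* The magnitude of LLONG_MIN is one more than LLONG_MAX. */
    assert(magnitude_of(LLONG_MIN) == (unsigned long long)LLONG_MAX + 1);
}
```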

&lt;p&gt;&lt;b&gt;&lt;code&gt;double&lt;/code&gt;&lt;/b&gt;&lt;br&gt;Now it's time for the really fun one. Due to the nature of floating-point arithmetic, figuring out how to properly and accurately print the value of a &lt;code&gt;double&lt;/code&gt; was pretty tough. I did some research, even dove into an open-source implementation of &lt;code&gt;printf&lt;/code&gt; to see how they did it, but it was so crazy and incomprehensible that I didn't get too far. I finally settled on a technique which works fairly well, and I think is as accurate as the data type allows, although the output tends to have more digits than it strictly needs.&lt;/p&gt;

&lt;p&gt;The first step in solving the problem is to break it into two pieces. I split the double into the integer part and the fractional part, then deal with each one separately. Print each part in base 10, separate the two with a dot, and done.&lt;/p&gt;

&lt;p&gt;The trick, then, is how to print the integer and fractional parts in base 10. I didn't want to use the same technique of successive division that I used for &lt;code&gt;unsigned long long&lt;/code&gt;, because I was concerned that it would lose accuracy. There are integers that can be represented in a &lt;code&gt;double&lt;/code&gt;, but where the result of dividing the integer by ten can't be exactly represented in a &lt;code&gt;double&lt;/code&gt;. Similarly, I was afraid that the equivalent successive multiplication by ten for the fractional part would lose precision.&lt;/p&gt;

&lt;p&gt;However, dividing or multiplying a double by two is &lt;em&gt;always&lt;/em&gt; safe, unless it pushes it beyond the limits of what can be represented. If you only do this to push it closer to &lt;code&gt;1.0&lt;/code&gt;, then it will never lose precision. Furthermore, it's possible to chop off the fractional part of a double without losing precision in the integer part, and vice versa. Put together, these operations allow extracting information from a &lt;code&gt;double&lt;/code&gt; bit by bit, which is enough to compute an integer representation of its integer and fractional parts. With those in hand, the existing &lt;code&gt;writeUnsignedLongLong:&lt;/code&gt; method can be used to print the digits.&lt;/p&gt;

&lt;p&gt;With this in mind, I set off. The first step is to check for infinity and NaN, and short circuit the whole attempt for them:&lt;/p&gt;

&lt;pre&gt;    - (void)writeDouble: (double)value
    {
        if(isinf(value) || isnan(value))
        {
            const char *str = isinf(value) ? "INFINITY" : "NaN";
            while(*str)
                [self write: *str++];
            return;
        }
&lt;/pre&gt;

&lt;p&gt;Otherwise, check for negative values. If &lt;code&gt;value&lt;/code&gt; is negative, write a &lt;code&gt;'-'&lt;/code&gt;, and negate it:&lt;/p&gt;

&lt;pre&gt;        if(value &amp;lt; 0.0)
        {
            [self write: '-'];
            value = -value;
        }
&lt;/pre&gt;

&lt;p&gt;Unlike the &lt;code&gt;long long&lt;/code&gt; case, there are no &lt;code&gt;double&lt;/code&gt; values that can't be safely and correctly negated, so no shenanigans are needed here.&lt;/p&gt;

&lt;p&gt;Next, extract the integer and fractional parts.&lt;/p&gt;

&lt;pre&gt;        double intpart = trunc(value);
        double fracpart = value - intpart;
&lt;/pre&gt;

&lt;p&gt;With those in hand, call out to helper methods to write those two parts, separated by a dot:&lt;/p&gt;

&lt;pre&gt;        [self writeDoubleIntPart: intpart];
        [self write: '.'];
        [self writeDoubleFracPart: fracpart];
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Integer Part&lt;/b&gt;&lt;br&gt;Writing the integer part is the simpler of the two, conceptually. The strategy is to shift the &lt;code&gt;double&lt;/code&gt; value one bit to the right until the value becomes zero. Each bit that's extracted is added to an &lt;code&gt;unsigned long long&lt;/code&gt; accumulator. Once the &lt;code&gt;double&lt;/code&gt; becomes zero, the accumulator contains its integer value.&lt;/p&gt;

&lt;p&gt;The one tricky part is how to handle the case where the &lt;code&gt;double&lt;/code&gt; contains a value that's larger than an &lt;code&gt;unsigned long long&lt;/code&gt; can contain. To handle this, whenever the value of the current bit extracted from the &lt;code&gt;double&lt;/code&gt; threatens to overflow, the accumulator is divided by ten to shift it rightwards and allow more room. The total number of shifts is recorded, and the appropriate number of extra zeroes is printed at the end of the number. Dividing the accumulator by ten loses precision, but the 64 bits of an &lt;code&gt;unsigned long long&lt;/code&gt; exceed the &lt;code&gt;53&lt;/code&gt; bits of precision in a &lt;code&gt;double&lt;/code&gt;, so the lost precision should not actually result in incorrect output. At the least, while the output may not precisely match the integer value stored in the &lt;code&gt;double&lt;/code&gt;, it will be closer to that value than to any other representable &lt;code&gt;double&lt;/code&gt; value, which I'm calling close enough.&lt;/p&gt;

&lt;p&gt;In order to know when the accumulator threatens to overflow, the code needs to know the largest power of ten that can be represented in an &lt;code&gt;unsigned long long&lt;/code&gt;. This method computes it by just computing successive powers of ten until it gets close to &lt;code&gt;ULLONG_MAX&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (unsigned long long)ullongMaxPowerOf10
    {
        unsigned long long result = 1;
        while(ULLONG_MAX / result &amp;gt;= 10)
            result *= 10;
        return result;
    }
&lt;/pre&gt;
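
&lt;p&gt;To make the result concrete, here is the same loop as a plain C function, along with a couple of checks. The function names are hypothetical. Where &lt;code&gt;unsigned long long&lt;/code&gt; is 64 bits wide, the loop stops at 10&lt;sup&gt;19&lt;/sup&gt;, or &lt;code&gt;10000000000000000000&lt;/code&gt;:&lt;/p&gt;

```c
#include <assert.h>
#include <limits.h>

/* Largest power of ten that fits in an unsigned long long. */
unsigned long long ullong_max_power_of_10(void)
{
    unsigned long long result = 1;
    /* Dividing in the test, rather than multiplying, keeps the check
       itself from overflowing. */
    while(ULLONG_MAX / result >= 10)
        result *= 10;
    return result;
}

/* Properties the result satisfies, whatever the width of the type. */
void check_max_power_of_10(void)
{
    unsigned long long power = ullong_max_power_of_10();
    assert(power % 10 == 0);            /* it is a multiple of ten */
    assert(ULLONG_MAX / power < 10);    /* one more factor of ten would not fit */
}
```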

&lt;p&gt;The &lt;code&gt;writeDoubleIntPart:&lt;/code&gt; method starts off by initializing a &lt;code&gt;total&lt;/code&gt; variable to zero:&lt;/p&gt;

&lt;pre&gt;    - (void)writeDoubleIntPart: (double)intpart
    {
        unsigned long long total = 0;
&lt;/pre&gt;

&lt;p&gt;This is the accumulator that will hold the total computed value so far. It also keeps track of the value of the current bit:&lt;/p&gt;

&lt;pre&gt;        unsigned long long currentBit = 1;
&lt;/pre&gt;

&lt;p&gt;This is multiplied by two each time a bit is extracted from the &lt;code&gt;double&lt;/code&gt;, and represents the value of that bit.&lt;/p&gt;

&lt;p&gt;The maximum value that can be stored in &lt;code&gt;total&lt;/code&gt; before overflow threatens is cached:&lt;/p&gt;

&lt;pre&gt;        unsigned long long maxValue = [self ullongMaxPowerOf10] / 10;
&lt;/pre&gt;

&lt;p&gt;This is one digit less than the maximum representable power of ten, in order to make sure that it can never overflow the accumulator. There is a surplus of 11 bits of precision in the accumulator, so losing one digit doesn't hurt too much.&lt;/p&gt;

&lt;p&gt;The number of times that &lt;code&gt;total&lt;/code&gt; and &lt;code&gt;currentBit&lt;/code&gt; have been shifted to the right is recorded so that the appropriate number of trailing zeroes can be output later:&lt;/p&gt;

&lt;pre&gt;        unsigned surplusZeroes = 0;
&lt;/pre&gt;

&lt;p&gt;Setup is complete. Now it's time to loop until &lt;code&gt;intpart&lt;/code&gt; is exhausted:&lt;/p&gt;

&lt;pre&gt;        while(intpart)
        {
&lt;/pre&gt;

&lt;p&gt;A bit is extracted from &lt;code&gt;intpart&lt;/code&gt; by dividing it by two:&lt;/p&gt;

&lt;pre&gt;            intpart /= 2;
&lt;/pre&gt;

&lt;p&gt;Because &lt;code&gt;intpart&lt;/code&gt; contains an integer, dividing it by two produces a number with a fractional part that is either &lt;code&gt;.0&lt;/code&gt; or &lt;code&gt;.5&lt;/code&gt;. The &lt;code&gt;.5&lt;/code&gt; case represents a one bit that needs to be added to &lt;code&gt;total&lt;/code&gt;. The presence of &lt;code&gt;.5&lt;/code&gt; is checked by using the &lt;code&gt;fmod&lt;/code&gt; function, which computes the remainder when dividing by a number. Using &lt;code&gt;fmod&lt;/code&gt; with &lt;code&gt;1.0&lt;/code&gt; as the second argument just produces the fractional part of the number:&lt;/p&gt;

&lt;pre&gt;            if(fmod(intpart, 1.0))
            {
&lt;/pre&gt;

&lt;p&gt;If the bit is set, then &lt;code&gt;currentBit&lt;/code&gt; is added to &lt;code&gt;total&lt;/code&gt;, and the &lt;code&gt;.5&lt;/code&gt; is sliced off of &lt;code&gt;intpart&lt;/code&gt; using the &lt;code&gt;trunc&lt;/code&gt; function:&lt;/p&gt;

&lt;pre&gt;                total += currentBit;
                intpart = trunc(intpart);
            }
&lt;/pre&gt;

&lt;p&gt;Next, &lt;code&gt;currentBit&lt;/code&gt; is multiplied by two so that it holds the right value for the next bit to be extracted:&lt;/p&gt;

&lt;pre&gt;            currentBit *= 2;
&lt;/pre&gt;

&lt;p&gt;If &lt;code&gt;currentBit&lt;/code&gt; exceeds &lt;code&gt;maxValue&lt;/code&gt;, then both &lt;code&gt;currentBit&lt;/code&gt; and &lt;code&gt;total&lt;/code&gt; get divided by ten, and &lt;code&gt;surplusZeroes&lt;/code&gt; is incremented. Both are rounded when dividing by adding &lt;code&gt;5&lt;/code&gt; to them first, to aid in preserving as much precision as possible:&lt;/p&gt;

&lt;pre&gt;            if(currentBit &amp;gt; maxValue)
            {
                total = (total + 5) / 10;
                currentBit = (currentBit + 5) / 10;
                surplusZeroes++;
            }
        }
&lt;/pre&gt;

&lt;p&gt;Once &lt;code&gt;intpart&lt;/code&gt; is exhausted, &lt;code&gt;total&lt;/code&gt; contains an approximation of its original value, and &lt;code&gt;surplusZeroes&lt;/code&gt; indicates how many times it got shifted over. First, it prints &lt;code&gt;total&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        [self writeUnsignedLongLong: total];
&lt;/pre&gt;

&lt;p&gt;Finally, it prints the appropriate number of trailing zeroes:&lt;/p&gt;

&lt;pre&gt;        for(unsigned i = 0; i &amp;lt; surplusZeroes; i++)
            [self write: '0'];
    }
&lt;/pre&gt;
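
&lt;p&gt;Assembled into a single plain C function, the steps above might look like the following. This is only a sketch under some assumptions: &lt;code&gt;format_int_part&lt;/code&gt; and &lt;code&gt;max_power_of_10&lt;/code&gt; are hypothetical names, the digits go into a caller-supplied buffer instead of through &lt;code&gt;write:&lt;/code&gt;, the input is assumed to be finite, non-negative, and integer-valued, and &lt;code&gt;modf&lt;/code&gt; stands in for the &lt;code&gt;fmod&lt;/code&gt;/&lt;code&gt;trunc&lt;/code&gt; pair by returning the fractional part and storing the integer part in one call.&lt;/p&gt;

```c
#include <limits.h>
#include <math.h>
#include <stdio.h>

/* Largest power of ten that fits in an unsigned long long. */
static unsigned long long max_power_of_10(void)
{
    unsigned long long result = 1;
    while (ULLONG_MAX / result >= 10)
        result *= 10;
    return result;
}

/* Hypothetical helper: writes the decimal digits of a finite, non-negative,
   integer-valued double into buf, which must hold at least 330 bytes. */
static void format_int_part(double intpart, char *buf)
{
    unsigned long long total = 0;        /* accumulator */
    unsigned long long current_bit = 1;  /* value of the bit being extracted */
    unsigned long long max_value = max_power_of_10() / 10;
    unsigned surplus_zeroes = 0;         /* number of divide-by-ten shifts */

    while (intpart != 0.0) {
        /* Halve the value; a leftover .5 means the extracted bit was a one. */
        double fraction = modf(intpart / 2, &intpart);
        if (fraction != 0.0)
            total += current_bit;

        current_bit *= 2;
        if (current_bit > max_value) {
            /* Shift both values right by one decimal digit, with rounding. */
            total = (total + 5) / 10;
            current_bit = (current_bit + 5) / 10;
            surplus_zeroes++;
        }
    }

    /* Print the accumulator, then one trailing zero per shift. */
    int length = sprintf(buf, "%llu", total);
    for (unsigned i = 0; i < surplus_zeroes; i++)
        buf[length++] = '0';
    buf[length] = '\0';
}
```

&lt;p&gt;Small values such as &lt;code&gt;42.0&lt;/code&gt; never trigger the shift, so they come out digit for digit. The shift only starts once &lt;code&gt;current_bit&lt;/code&gt; passes &lt;code&gt;max_value&lt;/code&gt;.&lt;/p&gt;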

&lt;p&gt;&lt;b&gt;Fractional Part&lt;/b&gt;&lt;br&gt;The basic idea for printing the fractional part is similar to printing the integer part. The difference is that the accumulator can't directly represent the fractional value, because &lt;code&gt;unsigned long long&lt;/code&gt; doesn't do fractions. Instead, it holds the fractional value, scaled up by some large power of ten. For example, &lt;code&gt;100&lt;/code&gt; might represent &lt;code&gt;1.0&lt;/code&gt;, in which case the value of the first bit in the fractional part of a &lt;code&gt;double&lt;/code&gt; is &lt;code&gt;50&lt;/code&gt;, the second bit is &lt;code&gt;25&lt;/code&gt;, and so forth. The actual numbers used contain a lot more zeroes at the end.&lt;/p&gt;

&lt;p&gt;The accumulator for the integer part can overflow, while the accumulator for the fractional part can &lt;em&gt;underflow&lt;/em&gt;. If the &lt;code&gt;double&lt;/code&gt; contains an extremely small value, the accumulator will end up containing zero, which is no good. A similar strategy is used to deal with this problem, but in the opposite direction: whenever the accumulator and current bit become too small, they are multiplied by ten, and an extra leading zero is output.&lt;/p&gt;

&lt;p&gt;The method starts off with its accumulator initialized to zero:&lt;/p&gt;

&lt;pre&gt;    - (void)writeDoubleFracPart: (double)fracpart
    {
        unsigned long long total = 0;
&lt;/pre&gt;

&lt;p&gt;The value of the current bit is started at the largest power of ten that will fit into an &lt;code&gt;unsigned long long&lt;/code&gt;. This represents &lt;code&gt;1.0&lt;/code&gt;, and will be divided by &lt;code&gt;2&lt;/code&gt; right away to properly represent &lt;code&gt;0.5&lt;/code&gt; for the first bit extracted from the &lt;code&gt;double&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        unsigned long long currentBit = [self ullongMaxPowerOf10];
&lt;/pre&gt;

&lt;p&gt;The threshold for when the numbers become too small is the maximum representable power of ten, divided by ten. When this value is reached, there's a conceptual leading zero, and it's time to shift everything over:&lt;/p&gt;

&lt;pre&gt;        unsigned long long shiftThreshold = [self ullongMaxPowerOf10] / 10;
&lt;/pre&gt;

&lt;p&gt;Now it's time for the loop. Keep extracting bits from &lt;code&gt;fracpart&lt;/code&gt; until there's nothing left:&lt;/p&gt;

&lt;pre&gt;        while(fracpart)
        {
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;fracpart&lt;/code&gt; is shifted to the left by one bit, while &lt;code&gt;currentBit&lt;/code&gt; is simultaneously shifted to the right:&lt;/p&gt;

&lt;pre&gt;            currentBit /= 2;
            fracpart *= 2;
&lt;/pre&gt;

&lt;p&gt;The integer part of the resulting number will be either &lt;code&gt;1&lt;/code&gt; or &lt;code&gt;0&lt;/code&gt;. If it's &lt;code&gt;1&lt;/code&gt;, the corresponding bit is &lt;code&gt;1&lt;/code&gt;, so add &lt;code&gt;currentBit&lt;/code&gt; to &lt;code&gt;total&lt;/code&gt;, and chop the &lt;code&gt;1&lt;/code&gt; off &lt;code&gt;fracpart&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            if(fracpart &amp;gt;= 1.0)
            {
                total += currentBit;
                fracpart -= 1.0;
            }
&lt;/pre&gt;

&lt;p&gt;If both the accumulator and &lt;code&gt;currentBit&lt;/code&gt; are at or below &lt;code&gt;shiftThreshold&lt;/code&gt;, it's time to shift everything over, and write a leading zero. Note that the number of shifts doesn't need to be tracked like it was in the previous method, because the leading zeroes can be written out immediately:&lt;/p&gt;

&lt;pre&gt;            if(currentBit &amp;lt;= shiftThreshold &amp;amp;&amp;amp; total &amp;lt;= shiftThreshold)
            {
                [self write: '0'];
                currentBit *= 10;
                total *= 10;
            }
        }
&lt;/pre&gt;

&lt;p&gt;Once the loop exits, there's one more task to be done. &lt;code&gt;total&lt;/code&gt; now contains an integer representation of the decimal representation of the fractional part that was passed into the method (whew!), but with potentially a large number of redundant trailing zeroes. For example, if &lt;code&gt;fracpart&lt;/code&gt; contained &lt;code&gt;0.5&lt;/code&gt;, then &lt;code&gt;total&lt;/code&gt; now contains &lt;code&gt;5000000000000000000&lt;/code&gt;, but those trailing zeroes shouldn't be printed in the output. They're removed by just dividing &lt;code&gt;total&lt;/code&gt; by ten repeatedly to get rid of trailing zeroes:&lt;/p&gt;

&lt;pre&gt;        while(total != 0 &amp;amp;&amp;amp; total % 10 == 0)
            total /= 10;
&lt;/pre&gt;

&lt;p&gt;Once that's done, &lt;code&gt;total&lt;/code&gt; is ready to print, so it's passed to &lt;code&gt;writeUnsignedLongLong:&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;        [self writeUnsignedLongLong: total];
    }
&lt;/pre&gt;
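
&lt;p&gt;For reference, here is the same fractional-part logic as one plain C function. Treat it as a sketch rather than the definitive implementation: &lt;code&gt;format_frac_part&lt;/code&gt; and &lt;code&gt;max_power_of_10&lt;/code&gt; are hypothetical names, the output goes into a caller-supplied buffer instead of through &lt;code&gt;write:&lt;/code&gt;, and the input is assumed to lie in the range from &lt;code&gt;0.0&lt;/code&gt; up to, but not including, &lt;code&gt;1.0&lt;/code&gt;.&lt;/p&gt;

```c
#include <limits.h>
#include <stdio.h>

/* Largest power of ten that fits in an unsigned long long. */
static unsigned long long max_power_of_10(void)
{
    unsigned long long result = 1;
    while (ULLONG_MAX / result >= 10)
        result *= 10;
    return result;
}

/* Hypothetical helper: writes the digits that follow the decimal point for
   a fracpart in [0, 1) into buf, which must hold at least 512 bytes. */
static void format_frac_part(double fracpart, char *buf)
{
    unsigned long long total = 0;                        /* scaled accumulator */
    unsigned long long current_bit = max_power_of_10();  /* represents 1.0 */
    unsigned long long shift_threshold = max_power_of_10() / 10;
    int length = 0;

    while (fracpart != 0.0) {
        /* Move to the next bit: halve its scaled value, double the input. */
        current_bit /= 2;
        fracpart *= 2;
        if (fracpart >= 1.0) {
            total += current_bit;
            fracpart -= 1.0;
        }
        /* Both values are small: emit a leading zero and scale back up. */
        if (current_bit <= shift_threshold && total <= shift_threshold) {
            buf[length++] = '0';
            current_bit *= 10;
            total *= 10;
        }
    }

    /* Drop the redundant trailing zeroes before printing. */
    while (total != 0 && total % 10 == 0)
        total /= 10;
    sprintf(buf + length, "%llu", total);
}
```

&lt;p&gt;Tracing &lt;code&gt;0.0625&lt;/code&gt; through the loop shows the underflow handling at work: the single set bit is the fourth one, its scaled value has fallen below the threshold by then, so one leading zero is written and the remaining digits come out as &lt;code&gt;625&lt;/code&gt;.&lt;/p&gt;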

&lt;p&gt;That's the end of the adventure of printing a &lt;code&gt;double&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;&lt;code&gt;stringWithFormat:&lt;/code&gt; is an extremely useful method that is, at its heart, a straightforward function that takes variable arguments. There are a ton of subtleties in how to output all of the various data formats, such as the adventure in printing a &lt;code&gt;double&lt;/code&gt; above. There are further complications in supporting all of the various options available in format strings, which the above code doesn't even address. However, it's ultimately a big loop that looks for &lt;code&gt;'%'&lt;/code&gt; format specifiers, and uses &lt;code&gt;va_arg&lt;/code&gt; to extract the arguments passed in by the caller. Although &lt;code&gt;stringWithFormat:&lt;/code&gt; is considerably more complex, you now have a basic idea of how it's put together.&lt;/p&gt;

&lt;p&gt;That's it for today. Come back next time for more bitwise adventures. Friday Q&amp;amp;A is driven by reader suggestions, so until the next time, please keep &lt;a href="mailto:mike@mikeash.com"&gt;sending in your ideas for topics&lt;/a&gt;.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-05-17-lets-build-stringwithformat.html</guid><pubDate>Fri, 17 May 2013 14:40:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-05-03: Proper Use of Asserts
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-05-03-proper-use-of-asserts.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-05-03: Proper Use of Asserts
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 05 03  13 03"
                  tags="fridayqna c objectivec cocoa assert"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-05-03: Proper Use of Asserts
&lt;/div&gt;
              &lt;p&gt;Asserts are a powerful tool for building quality code, but they're often poorly understood. Today, I want to discuss the various options for writing asserts in Cocoa apps and the best ways to use them, a topic suggested by reader Ed Wynne.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;APIs&lt;/b&gt;&lt;br&gt;Fundamentally, an assert is just a call that takes an expression and indicates failure in some way if the expression isn't true. The basic idea is to check for conditions that should always be true so that you fail early and obviously, rather than failing later and confusingly. For example, an array dereference will fail in various weird ways if you give it a bad index:&lt;/p&gt;

&lt;pre&gt;    x = array[index]; // sure hope index is in range
&lt;/pre&gt;

&lt;p&gt;Using an assert can help make it obvious what went wrong:&lt;/p&gt;

&lt;pre&gt;    assert(index &amp;gt;= 0 &amp;amp;&amp;amp; index &amp;lt; arrayLength);
    x = array[index];
&lt;/pre&gt;

&lt;p&gt;This actually demonstrates an API for asserts. C provides the &lt;code&gt;assert&lt;/code&gt; macro if you &lt;code&gt;#include &amp;lt;assert.h&amp;gt;&lt;/code&gt;. It takes a single expression. If the expression is true, it does nothing. If it's false, it prints the expression, the file name and line number where the assert is located, and then calls &lt;code&gt;abort&lt;/code&gt;, terminating the program.&lt;/p&gt;

&lt;p&gt;Cocoa provides several assert functions as well. The most basic is &lt;code&gt;NSAssert&lt;/code&gt;. It takes an expression and a string description, which can be a format string:&lt;/p&gt;

&lt;pre&gt;    NSAssert(x != y, @"x and y were equal, this shouldn't happen");
    NSAssert(z &amp;gt; 3, @"z should be greater than 3, but was actually %d", z);
    NSAssert(str != nil, @"nil string while processing %@ of type %@", name, type);
&lt;/pre&gt;

&lt;p&gt;Like the &lt;code&gt;assert&lt;/code&gt; macro, this logs the assertion failure if the expression is false. It then throws an &lt;code&gt;NSInternalInconsistencyException&lt;/code&gt;, and what happens then depends on what exception handlers are present. In a typical Cocoa app, it will either be caught and logged by the runloop, or it will terminate the application.&lt;/p&gt;

&lt;p&gt;Unfortunately, the logging from &lt;code&gt;NSAssert&lt;/code&gt; is weak. It logs the fact that the assertion failed, as well as the method it was in and the filename and line number, but it doesn't actually log the expression that failed, nor does it log the reason string provided to the macro. The exception it throws does include the reason string, at least, so as long as the exception gets printed at some point, that will show up.&lt;/p&gt;

&lt;p&gt;There are a few variants of this call available in Cocoa. The &lt;code&gt;NSAssert&lt;/code&gt; call only works within an Objective-C method, so there's an equivalent &lt;code&gt;NSCAssert&lt;/code&gt; call that works in a C function. There's also &lt;code&gt;NSParameterAssert&lt;/code&gt;, which doesn't take a description string and is intended for quickly checking a parameter value, and an equivalent &lt;code&gt;NSCParameterAssert&lt;/code&gt; for C functions.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Build Your Own&lt;/b&gt;&lt;br&gt;The built-in options aren't great. The C &lt;code&gt;assert&lt;/code&gt; is decent, but doesn't allow for a customizable message. The Cocoa calls have bad logging, and their behavior in the event of a failed assertion depends too much on runtime context, and may not actually terminate the app.&lt;/p&gt;

&lt;p&gt;These things aren't hard to build, though, so let's build one that does things right! We'll want a call that takes an expression and optionally a description format string:&lt;/p&gt;

&lt;pre&gt;    MAAssert(x &amp;gt; 0);
    MAAssert(y &amp;gt; 3, @"Bad value for y");
    MAAssert(z &amp;gt; 12, @"Bad value for z: %d", z);
&lt;/pre&gt;

&lt;p&gt;It should log the expression, the format string if it exists, and the filename, line number, and function name where the problem occurred. Additionally, it should only evaluate the format string parameters if the assertion fails, to make things more efficient. All of this calls for a macro.&lt;/p&gt;

&lt;p&gt;&lt;a href="friday-qa-2010-12-31-c-macro-tips-and-tricks.html"&gt;Like all good multi-line macros&lt;/a&gt;, this macro is wrapped in a &lt;code&gt;do&lt;/code&gt;/&lt;code&gt;while&lt;/code&gt; construct:&lt;/p&gt;

&lt;pre&gt;    #define MAAssert(expression, ...) \
        do { \
&lt;/pre&gt;

&lt;p&gt;The first thing it does is check whether the expression is actually false:&lt;/p&gt;

&lt;pre&gt;            if(!(expression)) { \
&lt;/pre&gt;

&lt;p&gt;If it is, it uses &lt;code&gt;NSLog&lt;/code&gt; to log the details of the failure:&lt;/p&gt;

&lt;pre&gt;                NSLog(@"Assertion failure: %s in %s on line %s:%d. %@", #expression, __func__, __FILE__, __LINE__, [NSString stringWithFormat: @"" __VA_ARGS__]); \
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;#expression&lt;/code&gt; construct produces a string literal containing the text of the expression. For example, it will produce &lt;code&gt;"x &amp;gt; 0"&lt;/code&gt; for the first assert call above. The &lt;code&gt;__func__&lt;/code&gt; identifier produces the name of the current function. &lt;code&gt;__FILE__&lt;/code&gt; and &lt;code&gt;__LINE__&lt;/code&gt; should be self-explanatory. The dummy &lt;code&gt;@""&lt;/code&gt; in the &lt;code&gt;stringWithFormat:&lt;/code&gt; call ensures that the syntax is legal even when no reason string is provided.&lt;/p&gt;

&lt;p&gt;After logging the assertion failure, it then terminates the app by calling &lt;code&gt;abort&lt;/code&gt;, and the macro ends:&lt;/p&gt;

&lt;pre&gt;                abort(); \
            } \
        } while(0)
&lt;/pre&gt;

&lt;p&gt;This works perfectly. It allows an additional explanatory string, but doesn't require it for cases where the expression is enough to make it clear what's going wrong. It always calls &lt;code&gt;abort&lt;/code&gt; on failure, rather than throwing an exception that could potentially be caught. It logs all available details at the point of failure.&lt;/p&gt;
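
&lt;p&gt;The same structure carries over to plain C. The sketch below is a hypothetical analog, not the macro from this article: &lt;code&gt;MY_ASSERT&lt;/code&gt; is a made-up name, it writes to &lt;code&gt;stderr&lt;/code&gt; with &lt;code&gt;fprintf&lt;/code&gt; instead of calling &lt;code&gt;NSLog&lt;/code&gt;, and it prepends a one-space string literal to the optional format so the call stays legal when no message is given. Omitting the message entirely relies on the compiler accepting an empty variadic argument list, which GCC and Clang do.&lt;/p&gt;

```c
#include <stdio.h>
#include <stdlib.h>

/* MY_ASSERT is a hypothetical plain-C analog of the MAAssert macro.
   The message arguments are only evaluated when the expression is false. */
#define MY_ASSERT(expression, ...) \
    do { \
        if (!(expression)) { \
            fprintf(stderr, "Assertion failure: %s in %s on line %s:%d.", \
                    #expression, __func__, __FILE__, __LINE__); \
            fprintf(stderr, " " __VA_ARGS__); \
            fputc('\n', stderr); \
            abort(); \
        } \
    } while (0)
```

&lt;p&gt;Because the message arguments sit inside the &lt;code&gt;if&lt;/code&gt;, a call like &lt;code&gt;MY_ASSERT(n &amp;gt; 0, "bad n: %d", expensive())&lt;/code&gt; never calls the (hypothetical) &lt;code&gt;expensive&lt;/code&gt; function while the assertion holds.&lt;/p&gt;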

&lt;p&gt;&lt;b&gt;Application Specific Information&lt;/b&gt;&lt;br&gt;It would be great if we could get the failure message to show up in crash logs as well. Turns out, we can! &lt;a href="https://gist.github.com/ccgus/5020894"&gt;Wil Shipley demonstrated&lt;/a&gt; how to put custom data into the "Application Specific Information" section of a crash log. Put this somewhere in the source code:&lt;/p&gt;

&lt;pre&gt;    const char *__crashreporter_info__ = NULL;
    asm(".desc ___crashreporter_info__, 0x10");
&lt;/pre&gt;

&lt;p&gt;Any string written into this magic global variable will show up in that section of the crash log. This doesn't work everywhere (word is that it doesn't work on iOS), but it can be handy, and does no harm when it doesn't work. If you want to take advantage of this, a small modification to the assert macro will put the message into this variable as well as logging it:&lt;/p&gt;

&lt;pre&gt;    #define MAAssert(expression, ...) \
        do { \
            if(!(expression)) { \
                NSString *__MAAssert_temp_string = [NSString stringWithFormat: @"Assertion failure: %s in %s on line %s:%d. %@", #expression, __func__, __FILE__, __LINE__, [NSString stringWithFormat: @"" __VA_ARGS__]]; \
                NSLog(@"%@", __MAAssert_temp_string); \
                __crashreporter_info__ = [__MAAssert_temp_string UTF8String]; \
                abort(); \
            } \
        } while(0)
&lt;/pre&gt;

&lt;p&gt;And, as if by magic, the message appears in the crash log.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Philosophy&lt;/b&gt;&lt;br&gt;Now that you know how to write an assert in many different ways, just &lt;em&gt;what kind&lt;/em&gt; of asserts should you write?&lt;/p&gt;

&lt;p&gt;Asserts should be written for conditions that, according to your understanding of the program, should &lt;em&gt;never&lt;/em&gt; occur. Asserts should &lt;em&gt;not&lt;/em&gt; be used to check for errors that are actually expected to happen in some cases. For example, asserting that a filename is not &lt;code&gt;nil&lt;/code&gt; is good technique:&lt;/p&gt;

&lt;pre&gt;    assert(filename != nil);
&lt;/pre&gt;

&lt;p&gt;However, asserting that data could be read from that file is bad practice:&lt;/p&gt;

&lt;pre&gt;    NSData *data = [NSData dataWithContentsOfFile: filename];
    assert(data != nil);
&lt;/pre&gt;

&lt;p&gt;That call can legitimately fail due to real-world conditions, such as the file not existing on disk, or not having permissions to read it. Because of that, this code needs some actual error handling, not just an assert. Failing to read the file should result in taking an alternate approach or alerting the user that something went wrong, not just logging and terminating the app.&lt;/p&gt;
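
&lt;p&gt;A plain C sketch of the same split, assuming a hypothetical &lt;code&gt;file_size&lt;/code&gt; helper: the arguments are asserted because a &lt;code&gt;NULL&lt;/code&gt; there is a programmer error, while a file that can't be opened is reported to the caller as an ordinary failure.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: returns true and stores the size of the file in
   *out_size on success, or returns false if the file can't be read. */
static bool file_size(const char *filename, long *out_size)
{
    /* Programmer errors: callers must never pass NULL here. */
    assert(filename != NULL);
    assert(out_size != NULL);

    /* Expected runtime failure: the file may be missing or unreadable. */
    FILE *file = fopen(filename, "rb");
    if (file == NULL)
        return false;

    long size = -1;
    if (fseek(file, 0, SEEK_END) == 0)
        size = ftell(file);
    fclose(file);

    if (size < 0)
        return false;
    *out_size = size;
    return true;
}
```

&lt;p&gt;The caller checks the return value and decides what to do, such as trying another location or telling the user.&lt;/p&gt;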

&lt;p&gt;Typically, the most useful place for asserts is at the top of a function or method, to check constraints on the parameters that can't be expressed in the language directly. These asserts correspond directly to constraints expressed in the documentation. For example:&lt;/p&gt;

&lt;pre&gt;    // Flange an array of sprockets. The sprockets array must contain
    // at least two entries, and the index must lie within the array.
    - (void)flangeSprockets: (NSArray *)array fromIndex: (NSUInteger)index
    {
        assert([array count] &amp;gt;= 2);
        assert(index &amp;lt; [array count]);

        ...method body...
&lt;/pre&gt;

&lt;p&gt;The gap between a caller and a callee makes it easy to lose track of these constraints, making this an excellent place to double-check that everything is as it should be. Special attention should be paid to parameters that are easy to screw up, and to parameters where bad values will cause strange failures. For example, this assert checking for &lt;code&gt;NULL&lt;/code&gt;, while still useful, doesn't add much, since the resulting crash without it would still be fairly clear:&lt;/p&gt;

&lt;pre&gt;    assert(ptr != NULL);
    x = *ptr;
&lt;/pre&gt;

&lt;p&gt;It's not &lt;em&gt;bad&lt;/em&gt;, but your time may be better spent elsewhere. This assert checking for &lt;code&gt;nil&lt;/code&gt; is really handy, as a &lt;code&gt;nil&lt;/code&gt; value here will just result in a strangely built string, which could show up far away and much later:&lt;/p&gt;

&lt;pre&gt;    assert(name != nil);
    str = [NSString stringWithFormat: @"Hello, %@!", name];
&lt;/pre&gt;

&lt;p&gt;It can also be handy to add asserts in the middle of complex code which has clear pre- or post-conditions. For example, in the middle of modifying a data structure, you might check to make sure all of your variables have consistent values between themselves:&lt;/p&gt;

&lt;pre&gt;    assert(done + remaining == total);
&lt;/pre&gt;

&lt;p&gt;This will let you catch logic errors quickly.&lt;/p&gt;
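
&lt;p&gt;As a minimal illustration, here is a hypothetical &lt;code&gt;sum_all&lt;/code&gt; function that keeps two counters in step and asserts their relationship after every update. The function and its variable names are made up for this sketch.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sums an array while checking that the bookkeeping
   counters stay consistent with each other on every iteration. */
static long sum_all(const int *items, size_t total)
{
    /* Precondition: a NULL array is only acceptable when it is empty. */
    assert(items != NULL || total == 0);

    size_t done = 0;
    size_t remaining = total;
    long sum = 0;

    while (remaining > 0) {
        sum += items[done];
        done++;
        remaining--;
        /* Invariant: every item is either done or still remaining. */
        assert(done + remaining == total);
    }
    return sum;
}
```

&lt;p&gt;If a later edit forgets to update one of the counters, the assert fires on the first iteration instead of producing a wrong sum somewhere downstream.&lt;/p&gt;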

&lt;p&gt;Avoid asserts for obvious conditions that have little room for error. For example, these are pointless:&lt;/p&gt;

&lt;pre&gt;    int x = 1;
    assert(x == 1);

    for(int i = 0; i &amp;lt; 10; i++)
    {
        assert(i &amp;gt;= 0);
        ...
&lt;/pre&gt;

&lt;p&gt;There's no way these asserts will fire unless the computer is seriously malfunctioning, so they're basically a waste of time. Concentrate on things that "can't happen" if parts of your program work together as they should, but that you could conceivably miss.&lt;/p&gt;

&lt;p&gt;Finally, make sure that the conditions you're asserting are reasonably fast to evaluate. You don't want them bogging down your program. Don't loop through your million-element array asserting a complex condition on every entry just out of paranoia.&lt;/p&gt;

&lt;p&gt;In short, assert essential preconditions of your code, with an eye toward things that will cause you pain if not caught early. The goal is to get a leg up on debugging when things start to go wrong.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Disabling Asserts&lt;/b&gt;&lt;br&gt;If you search the web for information about asserts, you'll invariably turn up discussions of how to disable asserts in your release builds. Most assert systems have a way to disable asserts program-wide. For the C &lt;code&gt;assert&lt;/code&gt; call, setting the &lt;code&gt;NDEBUG&lt;/code&gt; macro disables it. For the Cocoa assert calls, setting the &lt;code&gt;NS_BLOCK_ASSERTIONS&lt;/code&gt; macro disables them. There are generally two reasons given for disabling asserts in release builds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asserts impose a runtime cost that you shouldn't make every user pay. In theory, if you've tested thoroughly, you shouldn't encounter any assertion failures in your release builds anyway.&lt;/li&gt;
&lt;li&gt;An assertion failure immediately terminates the app, which users don't like. By removing asserts, you give the program a chance to continue functioning in the face of a bug.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, I am firmly of the opinion that disabling asserts in release builds is a terrible idea. The runtime cost should be negligible, and if it's not, then you should redo your asserts to fix that. As for avoiding app termination, asserts should be written such that a failure always means that something has gone terribly wrong. It is &lt;em&gt;possible&lt;/em&gt; that the app will continue functioning in the face of that. It's more likely that it'll crash. It's also possible that it'll keep running, but corrupt your user's data. A clean crash is vastly preferable. No code is free of bugs, and crashing early and obviously when a bug is encountered is much better, even in a release build running on a user's machine. Generating a cleaner crash log will help you debug the failures more quickly.&lt;/p&gt;

&lt;p&gt;The example &lt;code&gt;MAAssert&lt;/code&gt; macro above doesn't have any built-in way to disable it for this reason. If you use a different assert facility, I strongly recommend that you avoid ever turning them off.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Asserts are a valuable tool for producing better code and making bugs easier to find and fix. Asserts should be used anywhere there's a constraint on a value that isn't enforced by the language. My general guideline is that if you document a restriction for callers, you should also assert it in the code. If you ever find yourself writing some code that gets you thinking a lot, and has variables whose values should relate to each other in a certain way not enforced by the language, assert that in the code.&lt;/p&gt;

&lt;p&gt;That's it for today. Friday Q&amp;amp;A is driven by reader suggestions as always, so if there's a topic that you'd like to see covered here, please &lt;a href="mailto:mike@mikeash.com"&gt;send it in&lt;/a&gt;. Come back soon for another exciting article.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-05-03-proper-use-of-asserts.html</guid><pubDate>Fri, 03 May 2013 13:03:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A Is On Vacation
</title><link>http://www.mikeash.com/pyblog/friday-qa-is-on-vacation.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A Is On Vacation
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 04 19  14 24"
                  tags=""
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A Is On Vacation
&lt;/div&gt;
              &lt;p&gt;Friday Q&amp;amp;A is on vacation at the moment, so there will be no article today. Never fear: although he's busy relaxing today, Friday Q&amp;amp;A will return, and soon. To keep my loyal readers from being too upset, here is a collection of links to interesting articles to keep you occupied.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/b/oldnewthing/archive/2012/08/10/10334565.aspx"&gt;How did real-mode Windows implement its LRU algorithm without hardware assistance?&lt;/a&gt; - A fascinating look at how a really old system pulled off a good chunk of virtual memory with no hardware assistance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/b/oldnewthing/archive/2012/09/28/10353944.aspx"&gt;Data in crash dumps are not a matter of opinion&lt;/a&gt; - A case study in challenging your assumptions and believing the evidence over your preconceptions when debugging code. This is another &lt;a href="http://blogs.msdn.com/b/oldnewthing/"&gt;Old New Thing&lt;/a&gt; link, and I could easily give dozens of links to great articles just on that blog. Don't mind the Windows-centric topics, it's well worth your time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://prog21.dadgum.com/116.html?0"&gt;Things That Turbo Pascal is Smaller Than&lt;/a&gt; - Pretty self-explanatory. Old software is small.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://stackoverflow.com/questions/6123024/synchronization-on-the-local-variables"&gt;Synchronization on the local variables&lt;/a&gt; - A construct that makes absolutely no sense in general ends up providing an important speed gain through the magic of JIT compilation and lock elision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.viva64.com/en/examples/"&gt;Errors detected in Open Source projects by the PVS-Studio developers through static analysis&lt;/a&gt; - Makers of a static code analyzer run their code on a lot of open source projects and summarize the results. Great list of common errors to watch out for in your own code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.cs.dartmouth.edu/~sergey/langsec/papers/Bratus.pdf"&gt;Weird Machines&lt;/a&gt; - Breaks down the process of exploiting security bugs, and describes "weird machines" that result: accidental general-purpose computers formed by manipulating programs in ways they weren't intended to be used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://svpow.com/2011/05/23/the-worlds-longest-cells-speculations-on-the-nervous-systems-of-sauropods/"&gt;The world's longest cells? Speculations on the nervous systems of sauropods&lt;/a&gt; - Without clicking the link, take a guess as to how long the longest cell that ever existed was. I won't spoil it, but if you're like me, your guess is off by many orders of magnitude.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-is-on-vacation.html</guid><pubDate>Fri, 19 Apr 2013 14:24:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-04-05: Windows and Window Controllers
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-04-05-windows-and-window-controllers.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-04-05: Windows and Window Controllers
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 04 05  13 36"
                  tags="fridayqna cocoa design window"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-04-05: Windows and Window Controllers
&lt;/div&gt;
              &lt;p&gt;It's time to take a turn to some lighter fare, but to a subject that's near and dear to my heart. The fundamental UI component of a Cocoa app is the NSWindow, and there are many different ways to instantiate and manage them, but there is only one correct way: for each type of window, there should be a separate nib file, and a specialized &lt;code&gt;NSWindowController&lt;/code&gt; subclass. I'll walk through what this means and how to do it, a topic suggested by reader Mike Shields.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Variants&lt;/b&gt;&lt;br&gt;It's common to see different ways of instantiating and managing windows. Xcode's templates, for example, put an &lt;code&gt;NSWindow&lt;/code&gt; instance in &lt;code&gt;MainMenu.xib&lt;/code&gt;, and treat the application delegate as the window's controller. It's common to pack multiple related windows into a single nib. People sometimes instantiate windows in code wherever they need them. Some will subclass NSWindow and put the controlling code in the subclass.&lt;/p&gt;

&lt;p&gt;The central thesis of this article is that all of the approaches listed above are wrong. Yes, even the Xcode template, the first thing people see when they check out this whole Cocoa thing, is wrong. This is the fundamental correct design:&lt;/p&gt;

&lt;pre&gt;    one window = one nib + one NSWindowController subclass
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Why&lt;/b&gt;&lt;br&gt;Fundamental separation of concerns makes this the best approach, and it ultimately costs no more time or effort than lesser approaches.&lt;/p&gt;

&lt;p&gt;Most windows need a lot of controller functionality. Even extremely basic windows can eventually grow to take on a lot of tasks. As the largest unit of Cocoa UI, each kind of window needs and deserves its own controller class. It's possible to cram the logic for multiple windows into a single controller, but this ultimately makes no more sense than cramming the logic for a string and an array into the same class, just because you happen to be using them at the same time.&lt;/p&gt;

&lt;p&gt;Most windows also function as independent units. It's rare to have a window that &lt;em&gt;always&lt;/em&gt; appears with another window. Even if it does now, it may not later as you evolve your UI. Because of this, each window should be in its own nib file, separate from any others. The only objects in a nib, other than &lt;code&gt;MainMenu.xib&lt;/code&gt;, should be File's Owner, which is an instance of your &lt;code&gt;NSWindowController&lt;/code&gt; subclass, the window itself, and any non-window objects related to the window, such as auxiliary views and controller objects. &lt;code&gt;MainMenu.xib&lt;/code&gt; is a special case: it should contain File's Owner, which is the &lt;code&gt;NSApplication&lt;/code&gt; instance, the menu bar, the application delegate, any other objects related to these, but &lt;em&gt;no &lt;code&gt;NSWindow&lt;/code&gt; instances&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;How&lt;/b&gt;&lt;br&gt;Start off by creating an &lt;code&gt;NSWindowController&lt;/code&gt; subclass. Give it a name like &lt;code&gt;MAImportantThingWindowController&lt;/code&gt; to make it obvious what it is.&lt;/p&gt;

&lt;p&gt;Next, create the nib. Give this one a name like &lt;code&gt;MAImportantThingWindow.xib&lt;/code&gt;. The Xcode template for a nib with a window in it will set things up well, so you can use that. If you prefer to build it yourself, create a new empty nib file, then add a new window to it from the library.&lt;/p&gt;

&lt;p&gt;Set up the nib with the &lt;code&gt;NSWindowController&lt;/code&gt; subclass. Set the class of File's Owner to the controller class. Once that's done, connect the controller's &lt;code&gt;window&lt;/code&gt; outlet to the window, and connect the window's &lt;code&gt;delegate&lt;/code&gt; outlet to the controller.&lt;/p&gt;

&lt;p&gt;That's it for nib setup, beyond whatever specific UI you want to build yourself. It's time to make the controller aware of the nib.&lt;/p&gt;

&lt;p&gt;Xcode will pre-populate some methods in the subclass for you, but these aren't important. The &lt;code&gt;windowDidLoad&lt;/code&gt; implementation it provides is useful, but doesn't contain anything interesting. The &lt;code&gt;initWithWindow:&lt;/code&gt; method it provides is pointless and can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NSWindowController&lt;/code&gt; provides an &lt;code&gt;initWithWindowNibName:&lt;/code&gt; method. However, your subclass is built to work with only a single nib, so it's pointless to make clients specify that nib name. Instead, we'll provide a plain &lt;code&gt;init&lt;/code&gt; method that does the right thing internally. Simply override it to call &lt;code&gt;super&lt;/code&gt; and provide the nib name:&lt;/p&gt;

&lt;pre&gt;    - (id)init
    {
        return [super initWithWindowNibName: @"MAImportantThingWindow"];
    }
&lt;/pre&gt;

&lt;p&gt;If your window controller needs parameters to set itself up, for example a model object that it's going to display and edit, then those parameters can be added to this &lt;code&gt;init&lt;/code&gt; method.&lt;/p&gt;
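
&lt;p&gt;As a minimal sketch of what that might look like, assume a hypothetical model class called &lt;code&gt;MAImportantThing&lt;/code&gt; and a matching &lt;code&gt;_thing&lt;/code&gt; instance variable, neither of which exists anywhere but in this example. The initializer still hardcodes the nib name and just stashes the extra parameter:&lt;/p&gt;

&lt;pre&gt;    // MAImportantThing and _thing are hypothetical names for illustration.
    - (id)initWithThing: (MAImportantThing *)thing
    {
        if((self = [super initWithWindowNibName: @"MAImportantThingWindow"]))
        {
            // Only store the model here. Outlets aren't connected yet.
            // Without ARC, this would need to be [thing retain].
            _thing = thing;
        }
        return self;
    }
&lt;/pre&gt;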

&lt;p&gt;Optionally, depending on your level of paranoia, you &lt;em&gt;may&lt;/em&gt; override &lt;code&gt;initWithWindowNibName:&lt;/code&gt; to guard against accidentally calling it from elsewhere:&lt;/p&gt;

&lt;pre&gt;    - (id)initWithWindowNibName: (NSString *)name
    {
        NSLog(@"External clients are not allowed to call -[%@ initWithWindowNibName:] directly!", [self class]);
        [self doesNotRecognizeSelector: _cmd];
        return nil; // not reached, doesNotRecognizeSelector: raises an exception
    }
&lt;/pre&gt;

&lt;p&gt;I don't personally bother with this sort of guard most of the time, but it can be comforting or potentially useful to have depending on your habits and those of the people you work with.&lt;/p&gt;
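
&lt;p&gt;A different technique, if you'd rather catch the mistake at compile time, is to mark the method as unavailable in the subclass's header using clang's &lt;code&gt;unavailable&lt;/code&gt; attribute. This is only a sketch, and it assumes you're building with clang. The message string is arbitrary:&lt;/p&gt;

&lt;pre&gt;    @interface MAImportantThingWindowController : NSWindowController

    - (id)init;

    // Callers that name this class and try to use this method
    // get a compile error showing the message below.
    - (id)initWithWindowNibName: (NSString *)name __attribute__((unavailable("use -init instead")));

    @end
&lt;/pre&gt;

&lt;p&gt;The call to &lt;code&gt;super&lt;/code&gt; inside &lt;code&gt;init&lt;/code&gt; should be unaffected, since that resolves against the declaration in &lt;code&gt;NSWindowController&lt;/code&gt;. Code that goes through a variable typed as &lt;code&gt;id&lt;/code&gt; or &lt;code&gt;NSWindowController *&lt;/code&gt; won't be caught this way, which is where the runtime guard can still earn its keep.&lt;/p&gt;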

&lt;p&gt;If you have instance-specific initialization to perform, that can be done in the &lt;code&gt;init&lt;/code&gt; method just like any other class. Note, however, that outlets are &lt;em&gt;not&lt;/em&gt; connected at this point, so you can't do anything that involves those. UI initialization comes later.&lt;/p&gt;

&lt;p&gt;After the nib loads, &lt;code&gt;NSWindowController&lt;/code&gt; calls &lt;code&gt;windowDidLoad&lt;/code&gt;, which is the perfect override point for UI initialization:&lt;/p&gt;

&lt;pre&gt;    - (void)windowDidLoad
    {
        [_myView setColor: ...];
        [_myButton setImage: ...];
    }
&lt;/pre&gt;

&lt;p&gt;The implementation in &lt;code&gt;NSWindowController&lt;/code&gt; is documented to do nothing, so it's not necessary to call &lt;code&gt;super&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NSWindowController&lt;/code&gt; loads its nib lazily. When initialized, it just remembers which nib it's supposed to use. Only when you ask for its window does it actually proceed to load the nib. Because of this, you need to be careful when accessing outlets in code that may run before the nib loads. For example, this method will silently fail if it's called before the window is loaded:&lt;/p&gt;

&lt;pre&gt;    - (void)setName: (NSString *)name
    {
        [_nameField setStringValue: name];
    }
&lt;/pre&gt;

&lt;p&gt;There are two ways to work around this. One is to simply force the window to load before using an outlet:&lt;/p&gt;

&lt;pre&gt;    - (void)setName: (NSString *)name
    {
        [self window];
        [_nameField setStringValue: name];
    }
&lt;/pre&gt;

&lt;p&gt;This has some unnecessary overhead if the window wouldn't otherwise be loaded at this point, but works well enough. Most of the time, a window controller is being used because it's going to display the window.&lt;/p&gt;

&lt;p&gt;The other way is to keep an instance variable for the data as well as setting it in the UI. The setter will both set the instance variable and manipulate the outlet:&lt;/p&gt;

&lt;pre&gt;    - (void)setName: (NSString *)name
    {
        _name = name;
        [_nameField setStringValue: _name];
    }
&lt;/pre&gt;

&lt;p&gt;This also needs a line of code in &lt;code&gt;windowDidLoad&lt;/code&gt; to sync up the UI when it does finally load:&lt;/p&gt;

&lt;pre&gt;    - (void)windowDidLoad
    {
        if(_name)
            [_nameField setStringValue: _name];
        // more setup code here
    }
&lt;/pre&gt;

&lt;p&gt;Everything else in the nib and the window controller is up to you. It all depends on what you want the window to do. At this point you can create outlets and views and controls as you normally would.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Using the Controller&lt;/b&gt;&lt;br&gt;With this stuff in place, using the controller is simple. First, allocate and initialize it:&lt;/p&gt;

&lt;pre&gt;    MAImportantThingWindowController *controller = [[MAImportantThingWindowController alloc] init];
&lt;/pre&gt;

&lt;p&gt;Perform any necessary setup:&lt;/p&gt;

&lt;pre&gt;    [controller setName: _name];
&lt;/pre&gt;

&lt;p&gt;Then show the window:&lt;/p&gt;

&lt;pre&gt;    [controller showWindow: nil];
&lt;/pre&gt;

&lt;p&gt;If there are multiple windows of this kind floating around, you usually want to add this controller to an array that holds all of the controllers of this type:&lt;/p&gt;

&lt;pre&gt;    [_importantThingControllers addObject: controller];
&lt;/pre&gt;

&lt;p&gt;If there's only one, then you'll probably want an instance variable to hold it:&lt;/p&gt;

&lt;pre&gt;    _importantThingController = controller;
&lt;/pre&gt;
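
&lt;p&gt;Putting those steps together for the single-window case, here's a rough sketch of an action method that might live in the application delegate. The name &lt;code&gt;showImportantThing:&lt;/code&gt; is made up for this example. It creates the controller the first time it's needed and reuses it afterwards:&lt;/p&gt;

&lt;pre&gt;    // showImportantThing: is a hypothetical action name.
    - (IBAction)showImportantThing: (id)sender
    {
        // Create the controller on first use. The nib isn't loaded here,
        // since NSWindowController loads it lazily.
        if(!_importantThingController)
            _importantThingController = [[MAImportantThingWindowController alloc] init];

        [_importantThingController setName: _name];

        // This is what actually loads the nib and puts the window on screen.
        [_importantThingController showWindow: sender];
    }
&lt;/pre&gt;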

&lt;p&gt;That's it! As you use it, you may find that you need to pass more data through from whatever is instantiating and manipulating the controller. To do this, just add setters to the controller class that manipulate the UI as needed, and call those setters as appropriate.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;There are a lot of ways to manage windows in Cocoa, but most of those ways are wrong. Sadly, there are a lot of different, incorrect techniques floating around out there, up to and including Apple's own Xcode templates. Now you know the proper way to do it. Just remember the principle:&lt;/p&gt;

&lt;pre&gt;    one window = one nib + one NSWindowController subclass
&lt;/pre&gt;

&lt;p&gt;That's it for today. Check back next time for more wacky shenanigans. Friday Q&amp;amp;A is driven by reader ideas, so if you have a topic you'd like to see here, please &lt;a href="mailto:mike@mikeash.com"&gt;send it to me&lt;/a&gt;.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-04-05-windows-and-window-controllers.html</guid><pubDate>Fri, 05 Apr 2013 13:36:00 GMT</pubDate></item><item><title>Objective-C Literals in Serbo-Croatian
</title><link>http://www.mikeash.com/pyblog/objective-c-literals-in-serbo-croatian.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Objective-C Literals in Serbo-Croatian
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 03 26  23 48"
                  tags="clang objectivec translation"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Objective-C Literals in Serbo-Croatian
&lt;/div&gt;
              &lt;p&gt;Reader Anja Skrba from &lt;a href="http://webhostinggeeks.com/"&gt;Webhostinggeeks.com&lt;/a&gt; has translated my &lt;a href="friday-qa-2012-06-22-objective-c-literals.html"&gt;Objective-C Literals&lt;/a&gt; article into Serbo-Croatian. It's always fun to see translations of my writing, even when I can't understand them at all. If you do understand Serbo-Croatian, or know someone who does, check it out. &lt;a href="http://science.webhostinggeeks.com/objective-c-literals"&gt;The Serbo-Croatian version of Objective-C Literals is available on Webhostinggeeks.com&lt;/a&gt;.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/objective-c-literals-in-serbo-croatian.html</guid><pubDate>Tue, 26 Mar 2013 23:48:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-03-22: Let's Build NSInvocation, Part II
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-03-22-lets-build-nsinvocation-part-ii.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-03-22: Let's Build NSInvocation, Part II
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 03 22  14 57"
                  tags="fridayqna letsbuild objectivec"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-03-22: Let's Build NSInvocation, Part II
&lt;/div&gt;
              &lt;p&gt;&lt;a href="friday-qa-2013-03-08-lets-build-nsinvocation-part-i.html"&gt;Last time on Friday Q&amp;amp;A&lt;/a&gt;, I began the reimplementation of parts of &lt;code&gt;NSInvocation&lt;/code&gt; as &lt;code&gt;MAInvocation&lt;/code&gt;. In that article, I discussed the basic theory, the architecture calling conventions, and presented the assembly language glue code needed for the implementation. Today, I present the Objective-C part of &lt;code&gt;MAInvocation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Recap&lt;/b&gt;&lt;br&gt;&lt;code&gt;MAInvocation&lt;/code&gt; is my reimplementation of a large chunk of &lt;code&gt;NSInvocation&lt;/code&gt;. For simplicity, it doesn't support floating point arguments or return values, and it also doesn't support &lt;code&gt;struct&lt;/code&gt; arguments. It only supports the &lt;code&gt;x86-64&lt;/code&gt; architecture. The code is on GitHub here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mikeash/MAInvocation"&gt;https://github.com/mikeash/MAInvocation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first six parameters to a function are passed in six registers: &lt;code&gt;rdi&lt;/code&gt;, &lt;code&gt;rsi&lt;/code&gt;, &lt;code&gt;rdx&lt;/code&gt;, &lt;code&gt;rcx&lt;/code&gt;, &lt;code&gt;r8&lt;/code&gt;, and &lt;code&gt;r9&lt;/code&gt;. Subsequent parameters, if any, are passed on the stack. Return values are returned in &lt;code&gt;rax&lt;/code&gt;. For the special case of a two-element struct, the second element is returned in &lt;code&gt;rdx&lt;/code&gt;. Larger structs are returned by having the caller allocate memory, and then a pointer to that memory is implicitly passed in as the first parameter in &lt;code&gt;rdi&lt;/code&gt;, with all explicit parameters bumped down. These are called &lt;code&gt;stret&lt;/code&gt; calls in the Objective-C world.&lt;/p&gt;

&lt;p&gt;Assembly language glue is used to translate between values held in a &lt;code&gt;struct&lt;/code&gt; and actual function calls. The &lt;code&gt;struct&lt;/code&gt; holds all of the registers in question, plus a pointer to stack arguments, plus some additional data:&lt;/p&gt;

&lt;pre&gt;    struct RawArguments
    {
        void *fptr;

        uint64_t rdi;
        uint64_t rsi;
        uint64_t rdx;
        uint64_t rcx;
        uint64_t r8;
        uint64_t r9;

        uint64_t stackArgsCount;
        uint64_t *stackArgs;

        uint64_t rax_ret;
        uint64_t rdx_ret;

        uint64_t isStretCall;
    };
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;MAInvocationCall&lt;/code&gt; function is written in assembly, and was explored &lt;a href="friday-qa-2013-03-08-lets-build-nsinvocation-part-i.html"&gt;in the previous article&lt;/a&gt;. It has this prototype:&lt;/p&gt;

&lt;pre&gt;    void MAInvocationCall(struct RawArguments *);
&lt;/pre&gt;

&lt;p&gt;Objective-C code can fill out a &lt;code&gt;struct RawArguments&lt;/code&gt; with a function pointer and the appropriate register values, then call this function. It will make the function call, and on return, the two return value register fields in the &lt;code&gt;struct&lt;/code&gt; will be filled out with whatever the function returned.&lt;/p&gt;
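
&lt;p&gt;To make that concrete, here's a minimal sketch of calling a plain C function through the glue. The choice of &lt;code&gt;strlen&lt;/code&gt; is purely for illustration, and the sketch assumes the glue behaves as described above:&lt;/p&gt;

&lt;pre&gt;    struct RawArguments raw;
    memset(&amp;amp;raw, 0, sizeof(raw));      // no stack arguments, not a stret call

    raw.fptr = (void *)strlen;           // the function to call
    raw.rdi = (uint64_t)"hello";         // first (and only) argument goes in rdi

    MAInvocationCall(&amp;amp;raw);

    // rax_ret should now hold whatever strlen returned.
    NSLog(@"%llu", (unsigned long long)raw.rax_ret);
&lt;/pre&gt;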

&lt;p&gt;There are also two forwarding handlers:&lt;/p&gt;

&lt;pre&gt;    void MAInvocationForward(void);
    void MAInvocationForwardStret(void);
&lt;/pre&gt;

&lt;p&gt;These are designed to be invoked by an arbitrary Objective-C method call. They both create a new &lt;code&gt;struct RawArguments&lt;/code&gt;, fill out the argument registers and stack arguments pointer, and then invoke a C function called &lt;code&gt;MAInvocationForwardC&lt;/code&gt;. When that returns, the handlers pass the &lt;code&gt;rax_ret&lt;/code&gt; and &lt;code&gt;rdx_ret&lt;/code&gt; values back to the caller. The only difference between these two handlers is whether they set the &lt;code&gt;isStretCall&lt;/code&gt; field to &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The stage is now set for the Objective-C implementation of &lt;code&gt;MAInvocation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Interface&lt;/b&gt;&lt;br&gt;The interface to &lt;code&gt;MAInvocation&lt;/code&gt; is the same as that of &lt;code&gt;NSInvocation&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    @interface MAInvocation : NSObject

    + (MAInvocation *)invocationWithMethodSignature:(NSMethodSignature *)sig;

    - (NSMethodSignature *)methodSignature;

    - (void)retainArguments;
    - (BOOL)argumentsRetained;

    - (id)target;
    - (void)setTarget:(id)target;

    - (SEL)selector;
    - (void)setSelector:(SEL)selector;

    - (void)getReturnValue:(void *)retLoc;
    - (void)setReturnValue:(void *)retLoc;

    - (void)getArgument:(void *)argumentLocation atIndex:(NSInteger)idx;
    - (void)setArgument:(void *)argumentLocation atIndex:(NSInteger)idx;

    - (void)invoke;
    - (void)invokeWithTarget:(id)target;

    @end
&lt;/pre&gt;
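
&lt;p&gt;Since the interface mirrors &lt;code&gt;NSInvocation&lt;/code&gt;, using it should look the same as well. Here's a small sketch that invokes &lt;code&gt;length&lt;/code&gt; on a string. The string and selector are arbitrary choices for the example, picked because the method takes no extra arguments and returns a plain integer:&lt;/p&gt;

&lt;pre&gt;    NSString *str = @"hello";
    SEL sel = @selector(length);

    // The signature tells the invocation about argument and return types.
    NSMethodSignature *sig = [str methodSignatureForSelector: sel];
    MAInvocation *inv = [MAInvocation invocationWithMethodSignature: sig];

    [inv setTarget: str];
    [inv setSelector: sel];
    [inv invoke];

    // The return value is copied out into a variable of the right type.
    NSUInteger length;
    [inv getReturnValue: &amp;amp;length];
    NSLog(@"%lu", (unsigned long)length);
&lt;/pre&gt;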

&lt;p&gt;&lt;b&gt;Instance Variables&lt;/b&gt;&lt;br&gt;The method signature is central to an invocation object. The method signature describes how many parameters the method takes, as well as what the types are. This is critical information to be able to figure out how to deal with method parameters and return types. This is why the only way to create an &lt;code&gt;MAInvocation&lt;/code&gt; is with an &lt;code&gt;NSMethodSignature&lt;/code&gt;, and that signature is stored in an instance variable:&lt;/p&gt;

&lt;pre&gt;    @implementation MAInvocation {
        NSMethodSignature *_sig;
&lt;/pre&gt;

&lt;p&gt;The invocation keeps a local &lt;code&gt;struct RawArguments&lt;/code&gt;. This &lt;code&gt;struct&lt;/code&gt; is manipulated directly when setting or getting arguments and return values. When the invocation is invoked, a pointer to the instance variable can be passed directly to the assembly language glue:&lt;/p&gt;

&lt;pre&gt;        struct RawArguments _raw;
&lt;/pre&gt;

&lt;p&gt;Invocations can retain their arguments. This sends &lt;code&gt;retain&lt;/code&gt; to all object arguments, and it also copies C string arguments. Whether arguments are currently retained needs to be tracked, so that they can be properly freed, and so that newly-set arguments can be retained, so there's a flag for that:&lt;/p&gt;

&lt;pre&gt;        BOOL _argumentsRetained;
&lt;/pre&gt;

&lt;p&gt;Finally, there needs to be a buffer to store the return value for &lt;code&gt;stret&lt;/code&gt; calls:&lt;/p&gt;

&lt;pre&gt;        void *_stretBuffer;
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Initialization&lt;/b&gt;&lt;br&gt;The factory method just calls an &lt;code&gt;init&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;    + (MAInvocation *)invocationWithMethodSignature: (NSMethodSignature *)sig
    {
        return [[[self alloc] initWithMethodSignature: sig] autorelease];
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;init&lt;/code&gt; method saves the method signature:&lt;/p&gt;

&lt;pre&gt;    - (id)initWithMethodSignature: (NSMethodSignature *)sig
    {
        if((self = [super init]))
        {
            _sig = [sig retain];
&lt;/pre&gt;

&lt;p&gt;It then populates the &lt;code&gt;isStretCall&lt;/code&gt; field of the &lt;code&gt;struct RawArguments&lt;/code&gt; by examining the method signature to determine whether it fits the requirements for a &lt;code&gt;stret&lt;/code&gt; call. This is done by calling another method. The code for that method is rather involved, and will come later:&lt;/p&gt;

&lt;pre&gt;            _raw.isStretCall = [self isStretReturn];
&lt;/pre&gt;

&lt;p&gt;Next, the stack arguments are set up. The first thing to do here is to get the total number of arguments from the method signature:&lt;/p&gt;

&lt;pre&gt;            NSUInteger argsCount = [sig numberOfArguments];
&lt;/pre&gt;

&lt;p&gt;Note that this count includes the two implicit arguments, &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt;, so this number is exactly equal to the number of function arguments being passed.&lt;/p&gt;

&lt;p&gt;If it's a &lt;code&gt;stret&lt;/code&gt; call, then there's effectively one more argument, because of the implicit pointer for the return value passed in &lt;code&gt;rdi&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            if(_raw.isStretCall)
                argsCount++;
&lt;/pre&gt;

&lt;p&gt;If there are more than six arguments (potentially including the implicit &lt;code&gt;stret&lt;/code&gt; parameter), then there are stack arguments. &lt;code&gt;stackArgsCount&lt;/code&gt; is set to the number of remaining arguments, and memory is allocated so that &lt;code&gt;stackArgs&lt;/code&gt; can hold them:&lt;/p&gt;

&lt;pre&gt;            if(argsCount &amp;gt; 6)
            {
                _raw.stackArgsCount = argsCount - 6;
                _raw.stackArgs = calloc(argsCount - 6, sizeof(*_raw.stackArgs));
            }
        }
        return self;
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Wrapper Methods&lt;/b&gt;&lt;br&gt;There are a few methods in the API that are simply small wrappers around other methods. I'll cover them here before we get to the real meat.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;target&lt;/code&gt; method just gets the value of the first argument in a slightly more convenient way. The method is just a small wrapper around the &lt;code&gt;getArgument:atIndex:&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;    - (id)target
    {
        id target;
        [self getArgument: &amp;amp;target atIndex: 0];
        return target;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;setTarget:&lt;/code&gt; method is an even simpler wrapper around the &lt;code&gt;setArgument:atIndex:&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;    - (void)setTarget: (id)target
    {
        [self setArgument: &amp;amp;target atIndex: 0];
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;selector&lt;/code&gt; and &lt;code&gt;setSelector:&lt;/code&gt; methods are virtually identical, but manipulate the second argument:&lt;/p&gt;

&lt;pre&gt;    - (SEL)selector
    {
        SEL sel;
        [self getArgument: &amp;amp;sel atIndex: 1];
        return sel;
    }

    - (void)setSelector: (SEL)selector
    {
        [self setArgument: &amp;amp;selector atIndex: 1];
    }
&lt;/pre&gt;

&lt;p&gt;Finally, the &lt;code&gt;invoke&lt;/code&gt; method calls &lt;code&gt;invokeWithTarget:&lt;/code&gt;, passing &lt;code&gt;[self target]&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (void)invoke
    {
        [self invokeWithTarget: [self target]];
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Getting Arguments&lt;/b&gt;&lt;br&gt;In order to get an argument out of the invocation, the code first needs to know &lt;em&gt;where&lt;/em&gt; an argument is stored. This small wrapper method handles that:&lt;/p&gt;

&lt;pre&gt;    - (uint64_t *)argumentPointerAtIndex: (NSInteger)idx
    {
        uint64_t *ptr = NULL;
        if(idx == 0)
            ptr = &amp;amp;_raw.rdi;
        if(idx == 1)
            ptr = &amp;amp;_raw.rsi;
        if(idx == 2)
            ptr = &amp;amp;_raw.rdx;
        if(idx == 3)
            ptr = &amp;amp;_raw.rcx;
        if(idx == 4)
            ptr = &amp;amp;_raw.r8;
        if(idx == 5)
            ptr = &amp;amp;_raw.r9;
        if(idx &amp;gt;= 6)
            ptr = _raw.stackArgs + idx - 6;
        return ptr;
    }
&lt;/pre&gt;

&lt;p&gt;This method takes a &lt;em&gt;raw&lt;/em&gt; argument index, which is to say that it's already been adjusted to take into account whether or not this is a &lt;code&gt;stret&lt;/code&gt; call. It then maps that index onto the appropriate register or stack slot.&lt;/p&gt;

&lt;p&gt;It's also handy to be able to get the size of a particular argument, so that the code can copy the right number of bytes for arguments that are smaller than 8 bytes. This method wraps the Foundation function &lt;code&gt;NSGetSizeAndAlignment&lt;/code&gt;, which takes an Objective-C type string and returns the size (and alignment!) of the type in question:&lt;/p&gt;

&lt;pre&gt;    - (NSUInteger)sizeOfType: (const char *)type
    {
        NSUInteger size;
        NSGetSizeAndAlignment(type, &amp;amp;size, NULL);
        return size;
    }
&lt;/pre&gt;

&lt;p&gt;Another small wrapper around &lt;em&gt;this&lt;/em&gt; method provides the size of a given argument:&lt;/p&gt;

&lt;pre&gt;    - (NSUInteger)sizeAtIndex: (NSInteger)idx
    {
        return [self sizeOfType: [_sig getArgumentTypeAtIndex: idx]];
    }
&lt;/pre&gt;

&lt;p&gt;To actually fetch an argument, the method first adjusts the requested index to take into account the possible &lt;code&gt;stret&lt;/code&gt; return:&lt;/p&gt;

&lt;pre&gt;    - (void)getArgument: (void *)argumentLocation atIndex: (NSInteger)idx
    {
        NSInteger rawArgumentIndex = idx;
        if(_raw.isStretCall)
            rawArgumentIndex++;
&lt;/pre&gt;

&lt;p&gt;Next, it grabs the pointer from the above method, and checks it for sanity:&lt;/p&gt;

&lt;pre&gt;        uint64_t *src = [self argumentPointerAtIndex: rawArgumentIndex];
        assert(src);
&lt;/pre&gt;

&lt;p&gt;Then it grabs the argument size and copies the appropriate number of bytes out of the argument location:&lt;/p&gt;

&lt;pre&gt;        NSUInteger size = [self sizeAtIndex: idx];
        memcpy(argumentLocation, src, size);
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Getting and Setting Return Values&lt;/b&gt;&lt;br&gt;To get and set the return value, the value's size is needed. This is easy to obtain by just getting the size of the return type obtained from the method signature:&lt;/p&gt;

&lt;pre&gt;    - (NSUInteger)returnValueSize
    {
        return [self sizeOfType: [_sig methodReturnType]];
    }
&lt;/pre&gt;

&lt;p&gt;It's also necessary to get a pointer to the location where the return value is stored. If the invocation is for a &lt;code&gt;stret&lt;/code&gt; call, then it returns &lt;code&gt;_stretBuffer&lt;/code&gt;. If the buffer isn't allocated yet, it allocates it:&lt;/p&gt;

&lt;pre&gt;    - (void *)returnValuePtr
    {
        if(_raw.isStretCall)
        {
            if(_stretBuffer == NULL)
                _stretBuffer = calloc(1, [self returnValueSize]);
            return _stretBuffer;
        }
&lt;/pre&gt;

&lt;p&gt;For regular calls, it just returns the address of the &lt;code&gt;rax_ret&lt;/code&gt; field in the raw arguments &lt;code&gt;struct&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        else
        {
            return &amp;amp;_raw.rax_ret;
        }
    }
&lt;/pre&gt;

&lt;p&gt;This takes care of the case where the return value uses both return registers. Since they're contiguous in the &lt;code&gt;struct&lt;/code&gt;, copying a sufficiently large value into this address will write to both register fields in the &lt;code&gt;struct&lt;/code&gt;.&lt;/p&gt;
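
&lt;p&gt;That trick depends entirely on the layout of the &lt;code&gt;struct&lt;/code&gt;. If you wanted to make the assumption explicit, one option is a compile-time check along these lines. This is just a sketch, and it assumes a compiler that supports C11's &lt;code&gt;_Static_assert&lt;/code&gt;, as clang does:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stddef.h&amp;gt;

    // Fails to compile if rdx_ret ever stops directly following rax_ret,
    // for example because someone inserts a field between them.
    _Static_assert(offsetof(struct RawArguments, rdx_ret) ==
                   offsetof(struct RawArguments, rax_ret) + sizeof(uint64_t),
                   "rdx_ret must directly follow rax_ret");
&lt;/pre&gt;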

&lt;p&gt;With these methods available, writing the methods to get and set the return value is easy. All they have to do is call &lt;code&gt;memcpy&lt;/code&gt; with the computed size and pointer:&lt;/p&gt;

&lt;pre&gt;    - (void)getReturnValue: (void *)retLoc
    {
        NSUInteger size = [self returnValueSize];
        memcpy(retLoc, [self returnValuePtr], size);
    }

    - (void)setReturnValue: (void *)retLoc
    {
        NSUInteger size = [self returnValueSize];
        memcpy([self returnValuePtr], retLoc, size);
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Type Classification&lt;/b&gt;&lt;br&gt;Determining whether a method's return type requires a &lt;code&gt;stret&lt;/code&gt; call requires classifying that type according to the &lt;code&gt;x86-64&lt;/code&gt; calling conventions. The &lt;code&gt;NSInvocation&lt;/code&gt; API allows for retaining the arguments to the invocation, which also requires classifying the argument types, so that all of the object types can be found.&lt;/p&gt;

&lt;p&gt;The different classifications get put into an &lt;code&gt;enum&lt;/code&gt; which combines the relevant parts of the &lt;code&gt;x86-64&lt;/code&gt; ABI with the distinctions necessary for retaining arguments. This boils down to objects, blocks, C strings, other integer types (including non-object pointers), a &lt;code&gt;struct&lt;/code&gt; containing &lt;em&gt;two&lt;/em&gt; integers, an empty &lt;code&gt;struct&lt;/code&gt;, any other &lt;code&gt;struct&lt;/code&gt;, and any other type not already covered:&lt;/p&gt;

&lt;pre&gt;    enum TypeClassification
    {
        TypeObject,
        TypeBlock,
        TypeCString,
        TypeInteger,
        TypeTwoIntegers,
        TypeEmptyStruct,
        TypeStruct,
        TypeOther
    };
&lt;/pre&gt;

&lt;p&gt;The classification process itself consists of two mutually-recursive methods: one that classifies arbitrary types, and one specialized to classify &lt;code&gt;struct&lt;/code&gt; types.&lt;/p&gt;

&lt;p&gt;The general method starts by creating the type strings for &lt;code&gt;id&lt;/code&gt;, blocks, and C strings by using the &lt;code&gt;@encode&lt;/code&gt; directive:&lt;/p&gt;

&lt;pre&gt;    - (enum TypeClassification)classifyType: (const char *)type
    {
        const char *idType = @encode(id);
        const char *blockType = @encode(void (^)(void));
        const char *charPtrType = @encode(char *);
&lt;/pre&gt;

&lt;p&gt;Note that all blocks have the same type string when it comes to &lt;code&gt;@encode&lt;/code&gt;, so the choice of block type here is completely arbitrary.&lt;/p&gt;
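
&lt;p&gt;If you want to see these strings for yourself, a quick sketch like the following will print them. The second block type is an arbitrary one, included only to show that it encodes the same way as the first:&lt;/p&gt;

&lt;pre&gt;    // The comment on each line shows the encoding string that gets logged.
    NSLog(@"%s", @encode(id));                // @
    NSLog(@"%s", @encode(void (^)(void)));    // @?
    NSLog(@"%s", @encode(int (^)(id, int)));  // @? as well
    NSLog(@"%s", @encode(char *));            // *
&lt;/pre&gt;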

&lt;p&gt;With these in hand, it compares &lt;code&gt;type&lt;/code&gt; against them, and returns the appropriate &lt;code&gt;enum&lt;/code&gt; value if there's a match:&lt;/p&gt;

&lt;pre&gt;        if(strcmp(type, idType) == 0)
            return TypeObject;
        if(strcmp(type, blockType) == 0)
            return TypeBlock;
        if(strcmp(type, charPtrType) == 0)
            return TypeCString;
&lt;/pre&gt;

&lt;p&gt;Next, it checks integer types. This crazy bit of code constructs a C string that contains the character for every integer type, plus the unknown-type code used for function pointers (which is just &lt;code&gt;?&lt;/code&gt;), plus any other pointer (which all start with &lt;code&gt;^&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;        char intTypes[] = { @encode(signed char)[0], @encode(unsigned char)[0], @encode(short)[0], @encode(unsigned short)[0], @encode(int)[0], @encode(unsigned int)[0], @encode(long)[0], @encode(unsigned long)[0], @encode(long long)[0], @encode(unsigned long long)[0], '?', '^', 0 };
&lt;/pre&gt;

&lt;p&gt;With that C string in hand, the &lt;code&gt;strchr&lt;/code&gt; function can be used to check the first character in &lt;code&gt;type&lt;/code&gt; against all of these characters. If there's a match, then the type is an integer type:&lt;/p&gt;

&lt;pre&gt;        if(strchr(intTypes, type[0]))
            return TypeInteger;
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;struct&lt;/code&gt; types begin with the &lt;code&gt;{&lt;/code&gt; character. If the type string starts with that character, then call into the struct classifier:&lt;/p&gt;

&lt;pre&gt;        if(type[0] == '{')
            return [self classifyStructType: type];
&lt;/pre&gt;

&lt;p&gt;If nothing matches, then return the "other" type:&lt;/p&gt;

&lt;pre&gt;        return TypeOther;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;struct&lt;/code&gt; classifier uses a helper method that takes the type string for the &lt;code&gt;struct&lt;/code&gt; and enumerates over all of its contents. It tracks the struct's classification at each point, and updates it with each new element in the &lt;code&gt;struct&lt;/code&gt;. It starts out with an empty &lt;code&gt;struct&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (enum TypeClassification)classifyStructType: (const char *)type
    {
        __block enum TypeClassification structClassification = TypeEmptyStruct;
&lt;/pre&gt;

&lt;p&gt;Then it enumerates and classifies each type within:&lt;/p&gt;

&lt;pre&gt;        [self enumerateStructElementTypes: type block: ^(const char *type) {
            enum TypeClassification elementClassification = [self classifyType: type];
&lt;/pre&gt;

&lt;p&gt;If the current classification is an empty &lt;code&gt;struct&lt;/code&gt;, then the new classification is the same as the element classification. A struct with one element is classified the same as the element it contains:&lt;/p&gt;

&lt;pre&gt;            if(structClassification == TypeEmptyStruct)
                structClassification = elementClassification;
&lt;/pre&gt;

&lt;p&gt;If the current classification is an integer type and the element classification is also an integer type, then the &lt;code&gt;struct&lt;/code&gt; gets the special classification of a &lt;code&gt;struct&lt;/code&gt; containing two integers:&lt;/p&gt;

&lt;pre&gt;            else if([self isIntegerClass: structClassification] &amp;amp;&amp;amp; [self isIntegerClass: elementClassification])
                structClassification = TypeTwoIntegers;
&lt;/pre&gt;

&lt;p&gt;In any other circumstance (&lt;code&gt;struct&lt;/code&gt; contains more than two elements, &lt;code&gt;struct&lt;/code&gt; contains floating-point elements, etc.) the classification is just a generic &lt;code&gt;struct&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            else
                structClassification = TypeStruct;
        }];
        return structClassification;
    }
&lt;/pre&gt;

&lt;p&gt;The method to enumerate over a &lt;code&gt;struct&lt;/code&gt; type string's elements is short. A &lt;code&gt;struct&lt;/code&gt; type string consists of the &lt;code&gt;struct&lt;/code&gt;'s name, the &lt;code&gt;=&lt;/code&gt; symbol, and then each element's type concatenated together, all contained within a pair of &lt;code&gt;{}&lt;/code&gt;. For example, &lt;code&gt;NSRange&lt;/code&gt; would look like:&lt;/p&gt;

&lt;pre&gt;    {NSRange=LL}
&lt;/pre&gt;

&lt;p&gt;The first thing the method does is find the &lt;code&gt;=&lt;/code&gt; and start scanning just beyond it:&lt;/p&gt;

&lt;pre&gt;    - (void)enumerateStructElementTypes: (const char *)type block: (void (^)(const char *type))block
    {
        const char *equals = strchr(type, '=');
        const char *cursor = equals + 1;
&lt;/pre&gt;

&lt;p&gt;Then it enumerates over each type contained within, taking advantage of &lt;code&gt;NSGetSizeAndAlignment&lt;/code&gt; to move the cursor to the end of each type encountered, even if the type contains more than one character. It does this until it encounters a closing brace:&lt;/p&gt;

&lt;pre&gt;        while(*cursor != '}')
        {
            block(cursor);
            cursor = NSGetSizeAndAlignment(cursor, NULL, NULL);
        }
    }
&lt;/pre&gt;
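
&lt;p&gt;To see the scanning loop in isolation, here is a minimal sketch in plain C. It assumes a well-formed type string, and the hypothetical &lt;code&gt;skip_one_type&lt;/code&gt; helper is a deliberately simplified stand-in for &lt;code&gt;NSGetSizeAndAlignment&lt;/code&gt; that only understands single-character types and nested structs. The name &lt;code&gt;count_struct_elements&lt;/code&gt; is likewise made up for illustration:&lt;/p&gt;

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the cursor-advancing role of
   NSGetSizeAndAlignment. It skips exactly one type encoding, but only
   understands single-character types and nested {...} structs. */
static const char *skip_one_type(const char *cursor)
{
    if (*cursor != '{')
        return cursor + 1;

    int depth = 0;
    do {
        if (*cursor == '{')
            depth++;
        else if (*cursor == '}')
            depth--;
        cursor++;
    } while (depth > 0);
    return cursor;
}

/* Counts the top-level elements of a struct type string such as
   "{NSRange=LL}". Assumes the string is well formed. */
static int count_struct_elements(const char *type)
{
    const char *equals = strchr(type, '=');
    assert(equals != NULL);

    const char *cursor = equals + 1;
    int count = 0;
    while (*cursor != '}') {
        count++;
        cursor = skip_one_type(cursor);
    }
    return count;
}
```

&lt;p&gt;With these definitions, &lt;code&gt;count_struct_elements("{NSRange=LL}")&lt;/code&gt; visits two elements, and a nested struct counts as a single element of its parent.&lt;/p&gt;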

&lt;p&gt;There's also a short helper method that determines whether a particular type classification is considered an integer. This just checks to see if the classification is an object, block, C string, or an actual integer or other pointer:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)isIntegerClass: (enum TypeClassification)classification
    {
        return classification == TypeObject || classification == TypeBlock || classification == TypeCString || classification == TypeInteger;
    }
&lt;/pre&gt;

&lt;p&gt;This finishes the type classification system. This is somewhat rudimentary compared to the full complexity of the &lt;code&gt;x86-64&lt;/code&gt; spec, but it's enough for &lt;code&gt;MAInvocation&lt;/code&gt;'s needs. With type classification available, we can finally implement the &lt;code&gt;isStretReturn&lt;/code&gt; method used in the initializer:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)isStretReturn
    {
        return [self classifyType: [_sig methodReturnType]] == TypeStruct;
    }
&lt;/pre&gt;
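
&lt;p&gt;The struct rules are easier to check when they're pulled out of the block-based enumeration. The following is a rough sketch in plain C, not the actual &lt;code&gt;MAInvocation&lt;/code&gt; code: the enum is redeclared so the example stands alone (its member order is an assumption), and the function names &lt;code&gt;is_integer_class&lt;/code&gt;, &lt;code&gt;classify_struct&lt;/code&gt;, and &lt;code&gt;is_stret_return&lt;/code&gt; are hypothetical:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Redeclared here so the sketch stands alone; the member order is an
   assumption and does not matter to the logic below. */
enum TypeClassification {
    TypeObject,
    TypeBlock,
    TypeCString,
    TypeInteger,
    TypeTwoIntegers,
    TypeEmptyStruct,
    TypeStruct
};

static int is_integer_class(enum TypeClassification c)
{
    return c == TypeObject || c == TypeBlock || c == TypeCString || c == TypeInteger;
}

/* Folds the classifications of a struct's elements into a single
   classification for the whole struct, one element at a time. */
static enum TypeClassification classify_struct(const enum TypeClassification *elements, size_t count)
{
    assert(elements != NULL || count == 0);

    enum TypeClassification result = TypeEmptyStruct;
    for (size_t i = 0; i < count; i++) {
        if (result == TypeEmptyStruct)
            result = elements[i];          /* first element: adopt its class */
        else if (is_integer_class(result) && is_integer_class(elements[i]))
            result = TypeTwoIntegers;      /* second integer-like element */
        else
            result = TypeStruct;           /* anything else is a generic struct */
    }
    return result;
}

/* Only generic structs use the struct-return calling convention. */
static int is_stret_return(enum TypeClassification returnClass)
{
    return returnClass == TypeStruct;
}
```

&lt;p&gt;Under these rules, a pair of integers folds to &lt;code&gt;TypeTwoIntegers&lt;/code&gt; and is not a &lt;code&gt;stret&lt;/code&gt; return, while a third integer pushes the result to &lt;code&gt;TypeStruct&lt;/code&gt;, which is.&lt;/p&gt;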

&lt;p&gt;&lt;b&gt;Setting Arguments&lt;/b&gt;&lt;br&gt;With type classification in place, it's finally time to implement setting arguments. The basic form of &lt;code&gt;setArgument:atIndex:&lt;/code&gt; is nearly identical to &lt;code&gt;getArgument:atIndex:&lt;/code&gt;, but the need to support retained arguments makes everything far more complicated.&lt;/p&gt;

&lt;p&gt;It's possible to create an &lt;code&gt;NSInvocation&lt;/code&gt;, set it up, and then keep it around for a while. In order for the &lt;code&gt;NSInvocation&lt;/code&gt; to remain valid, it needs to be able to do proper memory management on the arguments it contains. In a nod to flexibility, this is optional. A freshly-made &lt;code&gt;NSInvocation&lt;/code&gt; doesn't do any memory management on its arguments, but it can be enabled by sending it a &lt;code&gt;retainArguments&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MAInvocation&lt;/code&gt; mimics this functionality. When it receives &lt;code&gt;retainArguments&lt;/code&gt; it performs the following operations on its arguments:&lt;/p&gt;

&lt;pre&gt;    1. Block arguments are copied.
    2. Non-block object arguments are retained.
    3. C string arguments are copied.
    4. All others are left alone.
&lt;/pre&gt;

&lt;p&gt;In addition to doing this for &lt;code&gt;retainArguments&lt;/code&gt;, the &lt;code&gt;setArgument:atIndex:&lt;/code&gt; method needs to do this for each newly set argument as well. This is what makes it so much more complex than &lt;code&gt;getArgument:atIndex:&lt;/code&gt;.&lt;/p&gt;
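
&lt;p&gt;The four rules amount to a small decision table. As a sketch only, with a redeclared enum (member order assumed) and a hypothetical &lt;code&gt;ownership_action&lt;/code&gt; function that does not exist in &lt;code&gt;MAInvocation&lt;/code&gt;, the decision could be written in plain C like this:&lt;/p&gt;

```c
#include <assert.h>

/* Redeclared so the sketch stands alone; member order is an assumption. */
enum TypeClassification {
    TypeObject,
    TypeBlock,
    TypeCString,
    TypeInteger,
    TypeTwoIntegers,
    TypeEmptyStruct,
    TypeStruct
};

/* Hypothetical names for the four rules listed above. */
enum OwnershipAction {
    OwnershipNone,
    OwnershipRetain,
    OwnershipCopyBlock,
    OwnershipCopyCString
};

/* Decides what memory management a newly stored argument needs.
   Nothing is done unless the invocation is retaining its arguments. */
static enum OwnershipAction ownership_action(enum TypeClassification c, int argumentsRetained)
{
    if (!argumentsRetained)
        return OwnershipNone;

    switch (c) {
        case TypeBlock:   return OwnershipCopyBlock;
        case TypeObject:  return OwnershipRetain;
        case TypeCString: return OwnershipCopyCString;
        default:          return OwnershipNone;
    }
}
```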

&lt;p&gt;The method starts off by computing the raw argument index:&lt;/p&gt;

&lt;pre&gt;    - (void)setArgument: (void *)argumentLocation atIndex: (NSInteger)idx
    {
        NSInteger rawArgumentIndex = idx;
        if(_raw.isStretCall)
            rawArgumentIndex++;
&lt;/pre&gt;

&lt;p&gt;Next, it gets the argument pointer at that index:&lt;/p&gt;

&lt;pre&gt;        uint64_t *dest = [self argumentPointerAtIndex: rawArgumentIndex];
        assert(dest);
&lt;/pre&gt;

&lt;p&gt;Then it classifies the argument at this index:&lt;/p&gt;

&lt;pre&gt;        enum TypeClassification c = [self classifyArgumentAtIndex: idx];
&lt;/pre&gt;

&lt;p&gt;If arguments are retained, it then checks the classification of the argument to see if it's a block, a non-block object, or a C string. If it is, then it directly sets &lt;code&gt;dest&lt;/code&gt; to the value found at &lt;code&gt;argumentLocation&lt;/code&gt; using the appropriate &lt;code&gt;retain&lt;/code&gt; or &lt;code&gt;copy&lt;/code&gt; semantics. If the argument is of a different type, or if arguments aren't retained, it does a simple &lt;code&gt;memcpy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first case is for plain objects. This just does a fairly standard &lt;code&gt;retain&lt;/code&gt;/&lt;code&gt;release&lt;/code&gt; combination, with a bunch of casting to treat both pointers as object pointers. The &lt;code&gt;release&lt;/code&gt; is done at the end using the &lt;code&gt;old&lt;/code&gt; variable to avoid problems where releasing the old value invalidates the new one:&lt;/p&gt;

&lt;pre&gt;        if(_argumentsRetained &amp;amp;&amp;amp; c == TypeObject)
        {
            id old = *(id *)dest;
            *(id *)dest = [*(id *)argumentLocation retain];
            [old release];
        }
&lt;/pre&gt;

&lt;p&gt;Blocks get the same treatment, except they get a &lt;code&gt;copy&lt;/code&gt; rather than a &lt;code&gt;retain&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        else if(_argumentsRetained &amp;amp;&amp;amp; c == TypeBlock)
        {
            id old = *(id *)dest;
            *(id *)dest = [*(id *)argumentLocation copy];
            [old release];
        }
&lt;/pre&gt;

&lt;p&gt;C strings are similar, but use &lt;code&gt;strdup&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        else if(_argumentsRetained &amp;amp;&amp;amp; c == TypeCString)
        {
            char *old = *(char **)dest;

            char *cstr = *(char **)argumentLocation;
            if(cstr != NULL)
                cstr = strdup(cstr);
            *(char **)dest = cstr;

            free(old);
        }
&lt;/pre&gt;

&lt;p&gt;In all other cases, the appropriate number of bytes is copied over using &lt;code&gt;memcpy&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        else
        {
            NSUInteger size = [self sizeAtIndex: idx];
            memcpy(dest, argumentLocation, size);
        }
    }
&lt;/pre&gt;
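
&lt;p&gt;The ordering in the retained branches, where the new value is acquired before the old one is given up, matters most when the caller passes in the value that's already stored. Here's a minimal sketch of the C string case in plain C. The &lt;code&gt;set_owned_cstring&lt;/code&gt; function is hypothetical, and it uses &lt;code&gt;malloc&lt;/code&gt; and &lt;code&gt;memcpy&lt;/code&gt; in place of &lt;code&gt;strdup&lt;/code&gt; purely to stay within standard C:&lt;/p&gt;

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Replaces the C string owned by *slot with a private copy of value.
   The copy is made before the old string is freed, so the call stays
   safe even when value points at the string currently held in *slot. */
static void set_owned_cstring(char **slot, const char *value)
{
    char *old = *slot;

    char *copy = NULL;
    if (value != NULL) {
        size_t length = strlen(value) + 1;   /* include the terminator */
        copy = malloc(length);
        assert(copy != NULL);                /* sketch-level error handling */
        memcpy(copy, value, length);
    }
    *slot = copy;

    free(old);                               /* free(NULL) is a no-op */
}
```

&lt;p&gt;Freeing &lt;code&gt;old&lt;/code&gt; first would leave &lt;code&gt;value&lt;/code&gt; dangling in the self-assignment case, which is the same hazard the &lt;code&gt;retain&lt;/code&gt;/&lt;code&gt;release&lt;/code&gt; ordering avoids for objects.&lt;/p&gt;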

&lt;p&gt;The &lt;code&gt;classifyArgumentAtIndex:&lt;/code&gt; method is a small wrapper around &lt;code&gt;classifyType:&lt;/code&gt; that retrieves the argument type from the method signature and classifies it:&lt;/p&gt;

&lt;pre&gt;    - (enum TypeClassification)classifyArgumentAtIndex: (NSUInteger)idx
    {
        return [self classifyType: [_sig getArgumentTypeAtIndex: idx]];
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Retaining Arguments&lt;/b&gt;&lt;br&gt;In addition to retaining each argument as it arrives in &lt;code&gt;setArgument:atIndex:&lt;/code&gt;, &lt;code&gt;MAInvocation&lt;/code&gt; also needs to retain all existing arguments in the &lt;code&gt;retainArguments&lt;/code&gt; method. Only the first call does anything, so the first thing that method does is check to see if arguments are already retained, and bail out if so:&lt;/p&gt;

&lt;pre&gt;    - (void)retainArguments
    {
        if(_argumentsRetained)
            return;
&lt;/pre&gt;

&lt;p&gt;Next, it iterates over all retainable arguments, using a helper method. This method invokes a block for each retainable argument, passing in the argument index as well as the argument's value. The block takes three value parameters, and only one is set for any given call:&lt;/p&gt;

&lt;pre&gt;        [self iterateRetainableArguments: ^(NSUInteger idx, id obj, id block, char *cstr) {
&lt;/pre&gt;

&lt;p&gt;If it's an object argument, then that argument is retained:&lt;/p&gt;

&lt;pre&gt;            if(obj)
            {
                [obj retain];
            }
&lt;/pre&gt;

&lt;p&gt;If it's a block argument, the block is copied, and the new value set as the argument value. Note that &lt;code&gt;_argumentsRetained&lt;/code&gt; has not yet been set to &lt;code&gt;YES&lt;/code&gt;, so &lt;code&gt;setArgument:atIndex:&lt;/code&gt; won't try to do its own memory management, avoiding any conflict between the two:&lt;/p&gt;

&lt;pre&gt;            else if(block)
            {
                block = [block copy];
                [self setArgument: &amp;amp;block atIndex: idx];
            }
&lt;/pre&gt;

&lt;p&gt;If it's a C string argument, it uses &lt;code&gt;strdup&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            else if(cstr)
            {
                if(cstr != NULL)
                    cstr = strdup(cstr);
                [self setArgument: &amp;amp;cstr atIndex: idx];
            }
        }];
&lt;/pre&gt;

&lt;p&gt;Finally, it sets &lt;code&gt;_argumentsRetained&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        _argumentsRetained = YES;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;iterateRetainableArguments:&lt;/code&gt; method uses the type classification system to figure out what each argument is, then calls &lt;code&gt;getArgument:atIndex:&lt;/code&gt; to fetch the value. It first iterates over each argument and classifies it:&lt;/p&gt;

&lt;pre&gt;    - (void)iterateRetainableArguments: (void (^)(NSUInteger idx, id obj, id block, char *cstr))block
    {
        for(NSUInteger i = 0; i &amp;lt; [_sig numberOfArguments]; i++)
        {
            enum TypeClassification c = [self classifyArgumentAtIndex: i];
&lt;/pre&gt;

&lt;p&gt;Objects and blocks are both handled by the same branch. It first retrieves the argument into a local &lt;code&gt;id&lt;/code&gt; variable:&lt;/p&gt;

&lt;pre&gt;            if(c == TypeObject || c == TypeBlock)
            {
                id arg;
                [self getArgument: &amp;amp;arg atIndex: i];
&lt;/pre&gt;

&lt;p&gt;It then moves &lt;code&gt;arg&lt;/code&gt; into one of two other local variables depending on whether the type is a block or a plain object:&lt;/p&gt;

&lt;pre&gt;                id o = c == TypeObject ? arg : nil;
                id b = c == TypeBlock ? arg : nil;
&lt;/pre&gt;

&lt;p&gt;At this point, &lt;code&gt;o&lt;/code&gt; contains the argument value if it's a plain object, and &lt;code&gt;b&lt;/code&gt; contains the argument value if it's a block. The iteration block can then be called with these values:&lt;/p&gt;

&lt;pre&gt;                block(i, o, b, NULL);
            }
&lt;/pre&gt;

&lt;p&gt;C strings are similar, but less complex, because there's only one possible type here:&lt;/p&gt;

&lt;pre&gt;            else if(c == TypeCString)
            {
                char *arg;
                [self getArgument: &amp;amp;arg atIndex: i];

                block(i, nil, nil, arg);
            }
        }
    }
&lt;/pre&gt;

&lt;p&gt;While we're at it, here's a quick getter method for &lt;code&gt;argumentsRetained&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)argumentsRetained
    {
        return _argumentsRetained;
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Dealloc&lt;/b&gt;&lt;br&gt;The hardest part of &lt;code&gt;dealloc&lt;/code&gt; is freeing the retained arguments. The &lt;code&gt;iterateRetainableArguments:&lt;/code&gt; method takes care of most of the work:&lt;/p&gt;

&lt;pre&gt;    - (void)dealloc
    {
        if(_argumentsRetained)
        {
            [self iterateRetainableArguments: ^(NSUInteger idx, id obj, id block, char *cstr) {
                [obj release];
                [block release];
                free(cstr);
            }];
        }
&lt;/pre&gt;

&lt;p&gt;With that taken care of, all that remains is freeing the method signature, the &lt;code&gt;stackArgs&lt;/code&gt; pointer, and calling &lt;code&gt;super&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        [_sig release];
        free(_raw.stackArgs);

        [super dealloc];
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Invocation&lt;/b&gt;&lt;br&gt;The code so far has kept the &lt;code&gt;struct RawArguments&lt;/code&gt; almost completely up to date. Implementing &lt;code&gt;invokeWithTarget:&lt;/code&gt; is simply a matter of filling out the last details, then making a call to the &lt;code&gt;MAInvocationCall&lt;/code&gt; assembly glue function. The method starts out by setting the target value:&lt;/p&gt;

&lt;pre&gt;    - (void)invokeWithTarget: (id)target
    {
        [self setTarget: target];
&lt;/pre&gt;

&lt;p&gt;It then uses &lt;code&gt;methodForSelector:&lt;/code&gt; to get the function pointer for the invocation's selector and places that into the &lt;code&gt;fptr&lt;/code&gt; field. This is what the glue code will call:&lt;/p&gt;

&lt;pre&gt;        _raw.fptr = [target methodForSelector: [self selector]];
&lt;/pre&gt;

&lt;p&gt;If this is a &lt;code&gt;stret&lt;/code&gt; call, then &lt;code&gt;rdi&lt;/code&gt; needs to be set up to point to space that can hold the return value:&lt;/p&gt;

&lt;pre&gt;        if(_raw.isStretCall)
            _raw.rdi = (uint64_t)[self returnValuePtr];
&lt;/pre&gt;

&lt;p&gt;Finally, call the assembly glue:&lt;/p&gt;

&lt;pre&gt;        MAInvocationCall(&amp;amp;_raw);
    }
&lt;/pre&gt;

&lt;p&gt;With all of the register fields and the stack arguments pointer set up, and the function pointer field set to the target &lt;code&gt;IMP&lt;/code&gt;, the assembly glue is able to make the call. Upon return, the assembly glue copies &lt;code&gt;rax&lt;/code&gt; and &lt;code&gt;rdx&lt;/code&gt; into the return value fields of the &lt;code&gt;struct RawArguments&lt;/code&gt;. This means that the return value is already set when the assembly glue returns, and will be available from &lt;code&gt;getReturnValue:&lt;/code&gt; without any additional action in the Objective-C code.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Forwarding&lt;/b&gt;&lt;br&gt;The last major piece of &lt;code&gt;MAInvocation&lt;/code&gt; is the &lt;code&gt;MAInvocationForwardC&lt;/code&gt; function. The assembly language forwarding glue intercepts unknown message calls. It then constructs a &lt;code&gt;struct RawArguments&lt;/code&gt; on the stack from the function call, and then calls through to &lt;code&gt;MAInvocationForwardC&lt;/code&gt;, passing it a pointer to the &lt;code&gt;struct RawArguments&lt;/code&gt;. The remainder of the logic is implemented in Objective-C:&lt;/p&gt;

&lt;pre&gt;    void MAInvocationForwardC(struct RawArguments *r)
    {
&lt;/pre&gt;

&lt;p&gt;The first order of business is to get the object that the message was sent to, and the selector being sent. For a &lt;code&gt;stret&lt;/code&gt; call, the object is in &lt;code&gt;rsi&lt;/code&gt; and the selector is in &lt;code&gt;rdx&lt;/code&gt;. For a normal call, the object is in &lt;code&gt;rdi&lt;/code&gt; and the selector is in &lt;code&gt;rsi&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        id obj;
        SEL sel;

        if(r-&amp;gt;isStretCall)
        {
            obj = (id)r-&amp;gt;rsi;
            sel = (SEL)r-&amp;gt;rdx;
        }
        else
        {
            obj = (id)r-&amp;gt;rdi;
            sel = (SEL)r-&amp;gt;rsi;
        }
&lt;/pre&gt;

&lt;p&gt;A method signature is critical to creating an invocation object. With the object and selector available, a simple call to &lt;code&gt;methodSignatureForSelector:&lt;/code&gt; obtains that:&lt;/p&gt;

&lt;pre&gt;        NSMethodSignature *sig = [obj methodSignatureForSelector: sel];
&lt;/pre&gt;

&lt;p&gt;With the method signature in hand, the forwarding function can now create an &lt;code&gt;MAInvocation&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        MAInvocation *inv = [[MAInvocation alloc] initWithMethodSignature: sig];
&lt;/pre&gt;

&lt;p&gt;The next order of business is to copy all of the pertinent information from &lt;code&gt;r&lt;/code&gt; into the invocation's &lt;code&gt;_raw&lt;/code&gt; instance variable. First come the registers:&lt;/p&gt;

&lt;pre&gt;        inv-&amp;gt;_raw.rdi = r-&amp;gt;rdi;
        inv-&amp;gt;_raw.rsi = r-&amp;gt;rsi;
        inv-&amp;gt;_raw.rdx = r-&amp;gt;rdx;
        inv-&amp;gt;_raw.rcx = r-&amp;gt;rcx;
        inv-&amp;gt;_raw.r8 = r-&amp;gt;r8;
        inv-&amp;gt;_raw.r9 = r-&amp;gt;r9;
&lt;/pre&gt;

&lt;p&gt;After that, stack arguments are copied. Although &lt;code&gt;r&lt;/code&gt; always contains &lt;code&gt;0&lt;/code&gt; for &lt;code&gt;stackArgsCount&lt;/code&gt;, the invocation has now computed the number of actual stack arguments, so its &lt;code&gt;_raw&lt;/code&gt; variable can be consulted to get the count:&lt;/p&gt;

&lt;pre&gt;        memcpy(inv-&amp;gt;_raw.stackArgs, r-&amp;gt;stackArgs, inv-&amp;gt;_raw.stackArgsCount * sizeof(uint64_t));
&lt;/pre&gt;

&lt;p&gt;The invocation is now fully constructed and filled out. The object is sent &lt;code&gt;forwardInvocation:&lt;/code&gt; with the newly constructed invocation.&lt;/p&gt;

&lt;pre&gt;        [obj forwardInvocation: (id)inv];
&lt;/pre&gt;

&lt;p&gt;After that call returns, the return value from the invocation needs to be copied back into &lt;code&gt;r&lt;/code&gt;. The assembly glue will then pass the value back to the caller. It first copies the two return value registers over:&lt;/p&gt;

&lt;pre&gt;        r-&amp;gt;rax_ret = inv-&amp;gt;_raw.rax_ret;
        r-&amp;gt;rdx_ret = inv-&amp;gt;_raw.rdx_ret;
&lt;/pre&gt;

&lt;p&gt;If it's a &lt;code&gt;stret&lt;/code&gt; call and the invocation actually has a buffer to hold the return value, the value in the invocation's return value buffer is copied into the memory pointed to by &lt;code&gt;r-&amp;gt;rdi&lt;/code&gt;, which is where the caller specified that it wanted the return value placed:&lt;/p&gt;

&lt;pre&gt;        if(r-&amp;gt;isStretCall &amp;amp;&amp;amp; inv-&amp;gt;_stretBuffer)
        {
            memcpy((void *)r-&amp;gt;rdi, inv-&amp;gt;_stretBuffer, [inv returnValueSize]);
        }
&lt;/pre&gt;

&lt;p&gt;Everything is now complete, so the invocation is released, and control is returned to the assembly language glue:&lt;/p&gt;

&lt;pre&gt;        [inv release];
    }
&lt;/pre&gt;

&lt;p&gt;The glue code will now copy the &lt;code&gt;rax&lt;/code&gt; and &lt;code&gt;rdx&lt;/code&gt; fields back into the respective CPU registers, then return control to the original method caller, which will see the return value either in those registers or in the &lt;code&gt;stret&lt;/code&gt; buffer that it passed in &lt;code&gt;rdi&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;That wraps up the implementation of &lt;code&gt;MAInvocation&lt;/code&gt;. It's enormously complicated and involved, despite only supporting &lt;code&gt;x86-64&lt;/code&gt; and ignoring &lt;code&gt;struct&lt;/code&gt; parameters and floating point of all kinds, which are a large part of the &lt;code&gt;x86-64&lt;/code&gt; calling conventions. &lt;code&gt;NSInvocation&lt;/code&gt; not only supports all types of parameters and return values (aside from a few corner cases like &lt;code&gt;union&lt;/code&gt; parameters), it also supports them on at least three different architectures: &lt;code&gt;i386&lt;/code&gt;, &lt;code&gt;x86-64&lt;/code&gt;, and &lt;code&gt;ARM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, despite the complication, it's all very much doable. Covering all of the cases requires a lot of time and effort, but there's nothing mysterious or magical. It would require equivalents of the assembly glue functions for the other architectures, expanding the glue functions to cover the floating-point registers, and implementing all of the logic for which arguments go where in the &lt;code&gt;MAInvocation&lt;/code&gt; Objective-C code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MAInvocation&lt;/code&gt; was a lot of fun to build and gives great insight into just what &lt;code&gt;NSInvocation&lt;/code&gt; is doing. It should be obvious, but don't use &lt;code&gt;MAInvocation&lt;/code&gt; for any real work. &lt;code&gt;NSInvocation&lt;/code&gt; does all the same stuff and more, and no doubt does it better.&lt;/p&gt;

&lt;p&gt;That's it for today. Come back next time for another breathtaking adventure. Friday Q&amp;amp;A is driven by reader ideas, so until then, please keep &lt;a href="mailto:mike@mikeash.com"&gt;sending in your ideas for topics to cover&lt;/a&gt;.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-03-22-lets-build-nsinvocation-part-ii.html</guid><pubDate>Fri, 22 Mar 2013 14:57:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-03-08: Let's Build NSInvocation, Part I
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-03-08-lets-build-nsinvocation-part-i.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-03-08: Let's Build NSInvocation, Part I
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 03 08  14 33"
                  tags="fridayqna letsbuild objectivec"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-03-08: Let's Build NSInvocation, Part I
&lt;/div&gt;
              &lt;p&gt;It's time for another trip into the nether regions of the soul. Reader Robby Walker suggested an article about &lt;code&gt;NSInvocation&lt;/code&gt;, and I have obliged, implementing it from scratch for your amusement. Today I'll start on a guided tour down the hall of horrors that is &lt;code&gt;MAInvocation&lt;/code&gt;, my reimplementation of the &lt;code&gt;NSInvocation&lt;/code&gt; API. It's a big project, so today I'm going to focus on the basic principles and the assembly language glue code, with the rest of the implementation to follow.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Code&lt;/b&gt;&lt;br&gt;The code for &lt;code&gt;MAInvocation&lt;/code&gt; is available on GitHub here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mikeash/MAInvocation"&gt;https://github.com/mikeash/MAInvocation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Overview&lt;/b&gt;&lt;br&gt;An &lt;code&gt;NSInvocation&lt;/code&gt; object represents a single method invocation. A method invocation has a target, a selector, a set of arguments, and a return value.&lt;/p&gt;

&lt;p&gt;Just holding these values would be pretty boring. You can whip up a model class pretty easily to do that. Have a variable for the return value, an array for the arguments, and you're done. (The target and selector are just the first and second arguments.) Where &lt;code&gt;NSInvocation&lt;/code&gt; gets interesting is in its ability to actually capture and send the invocations that it represents.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;NSInvocation&lt;/code&gt; can be &lt;em&gt;invoked&lt;/em&gt; on a particular object. This does the equivalent of code like &lt;code&gt;[target message: argument]&lt;/code&gt;, except that the target, the message, and the arguments are all determined entirely at runtime. The &lt;code&gt;NSInvocation&lt;/code&gt; can be constructed in code using runtime introspection without knowing anything about the method ahead of time.&lt;/p&gt;

&lt;p&gt;Furthermore, an &lt;code&gt;NSInvocation&lt;/code&gt; can be constructed &lt;em&gt;from&lt;/em&gt; an attempted message send. If you write &lt;code&gt;[target message: argument]&lt;/code&gt;, and &lt;code&gt;target&lt;/code&gt; doesn't actually implement &lt;code&gt;message:&lt;/code&gt;, then it gets a &lt;code&gt;forwardInvocation:&lt;/code&gt; call, which is given an &lt;code&gt;NSInvocation *&lt;/code&gt; representing the invocation. It can then do whatever it wishes with that invocation, such as invoking it on another object, fiddling with the parameters, or setting an arbitrary return value which is passed back to the caller.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NSInvocation&lt;/code&gt; therefore has two complementary pieces of tricky business:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code that's able to take a set of arguments, use them to make a method call, and collect the return value.&lt;/li&gt;
&lt;li&gt;Code that's able to receive a method call, collect the arguments, then return an arbitrary return value to the caller.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both pieces require extensive knowledge of the CPU architecture's calling conventions encoded in the implementation, as well as assembly language glue code.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Calling Conventions&lt;/b&gt;&lt;br&gt;Because so much architecture-specific code is needed, I decided to focus on a single architecture. &lt;code&gt;x86-64&lt;/code&gt; is the most convenient one to use for us Mac types. To further simplify things, I decided not to support floating-point arguments or return values, and also gave up on &lt;code&gt;struct&lt;/code&gt; arguments, although I did implement support for &lt;code&gt;struct&lt;/code&gt; return values. The following discussion ignores those parts that I didn't implement.&lt;/p&gt;

&lt;p&gt;In order to implement even this limited &lt;code&gt;MAInvocation&lt;/code&gt;, it's necessary to understand the relevant parts of the &lt;code&gt;x86-64&lt;/code&gt; function calling conventions, and in order to understand that, you must first understand at least a bit of the &lt;code&gt;x86-64&lt;/code&gt; architecture in general.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;x86-64&lt;/code&gt; architecture is a 64-bit extension of Intel's 32-bit &lt;code&gt;x86&lt;/code&gt; architecture introduced with the 386 CPU. That is in turn an extension of the Intel 8086's 16-bit architecture which is in turn heavily based on the 8-bit architecture of the Intel 8080, generally considered to be the first microprocessor worth building a computer around. It could address a whopping 64kB of RAM, just enough to hold one medium-sized app icon these days.&lt;/p&gt;

&lt;p&gt;There are sixteen general-purpose registers: &lt;code&gt;rax&lt;/code&gt;, &lt;code&gt;rbx&lt;/code&gt;, &lt;code&gt;rcx&lt;/code&gt;, &lt;code&gt;rdx&lt;/code&gt;, &lt;code&gt;rbp&lt;/code&gt;, &lt;code&gt;rsp&lt;/code&gt;, &lt;code&gt;rsi&lt;/code&gt;, &lt;code&gt;rdi&lt;/code&gt;, &lt;code&gt;r8&lt;/code&gt;, &lt;code&gt;r9&lt;/code&gt;, &lt;code&gt;r10&lt;/code&gt;, &lt;code&gt;r11&lt;/code&gt;, &lt;code&gt;r12&lt;/code&gt;, &lt;code&gt;r13&lt;/code&gt;, &lt;code&gt;r14&lt;/code&gt;, and &lt;code&gt;r15&lt;/code&gt;. The first half are all inherited from Intel's 32-bit architecture, while the second half are new additions for &lt;code&gt;x86-64&lt;/code&gt;. Each register holds 64 bits.&lt;/p&gt;

&lt;p&gt;Pointers and integers are treated identically when it comes to these calling conventions. Both are simply 64-bit quantities. Smaller integers are extended to 64 bits in size.&lt;/p&gt;

&lt;p&gt;When calling a function, the first six parameters are passed by filling these registers in order: &lt;code&gt;rdi&lt;/code&gt;, &lt;code&gt;rsi&lt;/code&gt;, &lt;code&gt;rdx&lt;/code&gt;, &lt;code&gt;rcx&lt;/code&gt;, &lt;code&gt;r8&lt;/code&gt;, and &lt;code&gt;r9&lt;/code&gt;. Additional arguments, if any, are passed on the stack as 64-bit quantities, so subsequent parameters can be found in memory at &lt;code&gt;rsp&lt;/code&gt;, &lt;code&gt;rsp + 8&lt;/code&gt;, &lt;code&gt;rsp + 16&lt;/code&gt;, etc.&lt;/p&gt;
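
&lt;p&gt;As a rough illustration of that rule, here's a sketch in plain C that splits a list of integer or pointer arguments between the six registers and the stack. The &lt;code&gt;place_integer_args&lt;/code&gt; function and the &lt;code&gt;REGISTER_ARG_COUNT&lt;/code&gt; constant are hypothetical names, and the sketch ignores floating-point and &lt;code&gt;struct&lt;/code&gt; arguments entirely:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGISTER_ARG_COUNT 6u

/* Distributes integer or pointer arguments as described above: the
   first six go into regs (standing in for rdi, rsi, rdx, rcx, r8, r9,
   in that order) and the rest spill to stack in order.
   Returns the number of arguments that were spilled. */
static size_t place_integer_args(const uint64_t *args, size_t count,
                                 uint64_t regs[REGISTER_ARG_COUNT],
                                 uint64_t *stack, size_t stackCapacity)
{
    size_t stackUsed = 0;
    for (size_t i = 0; i < count; i++) {
        if (i < REGISTER_ARG_COUNT) {
            regs[i] = args[i];
        } else {
            assert(stackUsed < stackCapacity);
            stack[stackUsed++] = args[i];
        }
    }
    return stackUsed;
}
```

&lt;p&gt;Given eight arguments, the first six land in the register array and the last two become the first two stack slots.&lt;/p&gt;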

&lt;p&gt;If the function returns a value, that value is returned by storing it in &lt;code&gt;rax&lt;/code&gt;. If the function returns two values, such as when returning a struct like &lt;code&gt;NSRange&lt;/code&gt; that contains two values, &lt;code&gt;rdx&lt;/code&gt; is used for the second one. If the function returns a larger struct, this is handled by having the caller allocate enough memory to hold it, and then a pointer to that memory is passed as an implicit first argument to the function in &lt;code&gt;rdi&lt;/code&gt;, with all of the explicit parameters moved down by one.&lt;/p&gt;
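
&lt;p&gt;Restricted to integer and pointer data, the return rule can be approximated by size alone. The sketch below, in plain C, is that approximation and nothing more: the &lt;code&gt;return_location&lt;/code&gt; function and its enum are hypothetical, floating-point values are ignored, and &lt;code&gt;MAInvocation&lt;/code&gt; itself decides based on type classification rather than byte counts:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for where a return value travels. */
enum ReturnLocation {
    ReturnNone,      /* void: nothing to return */
    ReturnRax,       /* fits in one 64-bit register */
    ReturnRaxRdx,    /* fits in two 64-bit registers */
    ReturnMemory     /* caller-allocated buffer, pointer passed in rdi */
};

/* Size-based approximation for values made only of integers and
   pointers. Floating-point members would change the answer. */
static enum ReturnLocation return_location(size_t size)
{
    if (size == 0)
        return ReturnNone;
    if (size <= 8)
        return ReturnRax;
    if (size <= 16)
        return ReturnRaxRdx;
    return ReturnMemory;
}
```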

&lt;p&gt;Note that, for Objective-C methods, the first two parameters are &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt;, which are therefore passed in &lt;code&gt;rdi&lt;/code&gt; and &lt;code&gt;rsi&lt;/code&gt; (or, if the method returns a larger struct, in &lt;code&gt;rsi&lt;/code&gt; and &lt;code&gt;rdx&lt;/code&gt;). The explicit parameters, if any, come after those two.&lt;/p&gt;

&lt;p&gt;As far as I know, there's no particular fundamental reason for the number of registers used to pass parameters, or which ones are used. Calling conventions are a tradeoff between placing a burden on the caller, placing a burden on the callee, making parameter passing more efficient, and making surrounding code more efficient. These conventions presumably sit near some reasonable compromise between all of the competing desires.&lt;/p&gt;

&lt;p&gt;In order to make a function call, &lt;code&gt;MAInvocation&lt;/code&gt; needs to take the parameters to the function, place the first six in the appropriate registers, place any additional ones on the stack, then actually jump to the function's address. Upon return, it needs to record the values in the two return-value registers.&lt;/p&gt;

&lt;p&gt;In order to receive a function call, &lt;code&gt;MAInvocation&lt;/code&gt; needs to record the values of the six parameter-passing registers, as well as the location of the stack pointer, and use these to extract the argument values. Upon returning, it needs to place the desired return values into the two return-value registers. The logic of which values go into registers and the stack can be written in Objective-C, but the code that actually manipulates the registers and the stack needs to be written in assembly.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Data Structure&lt;/b&gt;&lt;br&gt;In order to cleanly communicate between the Objective-C and assembly code, I defined a &lt;code&gt;struct&lt;/code&gt; that contains all of the relevant data. When making a call, &lt;code&gt;MAInvocation&lt;/code&gt; will fill out the &lt;code&gt;struct&lt;/code&gt; as appropriate, then invoke the assembly language glue code. When receiving a call, the assembly language glue code will construct the &lt;code&gt;struct&lt;/code&gt; from the current state, then pass it over to the Objective-C code. Not all fields will be useful in both situations, but it's easier to use the same &lt;code&gt;struct&lt;/code&gt; for everything rather than try to specialize.&lt;/p&gt;

&lt;p&gt;The first thing this &lt;code&gt;struct&lt;/code&gt; contains is the address of the function to call:&lt;/p&gt;

&lt;pre&gt;    struct RawArguments
    {
        void *fptr;
&lt;/pre&gt;

&lt;p&gt;Next, it stores the values of the six 64-bit parameter-passing registers:&lt;/p&gt;

&lt;pre&gt;        uint64_t rdi;
        uint64_t rsi;
        uint64_t rdx;
        uint64_t rcx;
        uint64_t r8;
        uint64_t r9;
&lt;/pre&gt;

&lt;p&gt;It then stores the address of the arguments passed on the stack, as well as how many stack arguments there are:&lt;/p&gt;

&lt;pre&gt;        uint64_t stackArgsCount;
        uint64_t *stackArgs;
&lt;/pre&gt;

&lt;p&gt;After that, it stores the two return-value registers:&lt;/p&gt;

&lt;pre&gt;        uint64_t rax_ret;
        uint64_t rdx_ret;
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;rdx&lt;/code&gt; already exists in the parameter-passing section, but it's easier to make a separate entry for return values than to reuse that field.&lt;/p&gt;

&lt;p&gt;Finally, it keeps a flag that records whether or not the call uses &lt;code&gt;struct&lt;/code&gt; return conventions, i.e. whether the &lt;code&gt;rdi&lt;/code&gt; is used to store a pointer to space allocated for the return value. In Objective-C runtime terminology, such calls are called &lt;code&gt;stret&lt;/code&gt;, short for "&lt;code&gt;struct&lt;/code&gt; return":&lt;/p&gt;

&lt;pre&gt;        uint64_t isStretCall;
    };
&lt;/pre&gt;

&lt;p&gt;"Struct return" is something of a misnomer, since small structs are returned in registers, but that's how it is. When you see "struct return" or "stret", think "sufficiently large struct return".&lt;/p&gt;
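
&lt;p&gt;Since assembly code has to reach each field by a numeric offset rather than by name, it can be useful to pin the layout down. The following sketch gathers the pieces of the &lt;code&gt;struct&lt;/code&gt; shown above into one plain C declaration and checks the offsets at compile time. It assumes 64-bit pointers, so that every field is eight bytes wide with no padding, and the &lt;code&gt;_Static_assert&lt;/code&gt; checks are an addition for illustration only:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The fields shown piece by piece above, gathered into one declaration. */
struct RawArguments
{
    void *fptr;

    uint64_t rdi;
    uint64_t rsi;
    uint64_t rdx;
    uint64_t rcx;
    uint64_t r8;
    uint64_t r9;

    uint64_t stackArgsCount;
    uint64_t *stackArgs;

    uint64_t rax_ret;
    uint64_t rdx_ret;

    uint64_t isStretCall;
};

/* Assumption: a 64-bit target, where each field occupies 8 bytes. */
_Static_assert(sizeof(void *) == 8, "this sketch assumes 64-bit pointers");
_Static_assert(offsetof(struct RawArguments, fptr) == 0, "fptr comes first");
_Static_assert(offsetof(struct RawArguments, rdi) == 8, "rdi follows fptr");
_Static_assert(offsetof(struct RawArguments, r9) == 48, "r9 is the sixth register");
_Static_assert(offsetof(struct RawArguments, stackArgsCount) == 56, "count follows registers");
_Static_assert(offsetof(struct RawArguments, stackArgs) == 64, "pointer follows count");
_Static_assert(offsetof(struct RawArguments, rax_ret) == 72, "first return register");
_Static_assert(offsetof(struct RawArguments, rdx_ret) == 80, "second return register");
_Static_assert(offsetof(struct RawArguments, isStretCall) == 88, "flag comes last");
```

&lt;p&gt;With twelve eight-byte fields, the whole &lt;code&gt;struct&lt;/code&gt; comes to 96 bytes under these assumptions.&lt;/p&gt;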

&lt;p&gt;&lt;b&gt;Function Call Glue&lt;/b&gt;&lt;br&gt;The function call glue is a function with this C signature:&lt;/p&gt;

&lt;pre&gt;    void MAInvocationCall(struct RawArguments *);
&lt;/pre&gt;

&lt;p&gt;It is implemented in assembly, but with the above prototype, the Objective-C code can call it as if it were a C function. It will pass a filled-out &lt;code&gt;struct RawArguments&lt;/code&gt; and the assembly glue will make the call.&lt;/p&gt;

&lt;p&gt;The assembly code first declares the symbol. It's marked as global so it's accessible from other parts of the program. The leading underscore is due to ancient history involving Fortran, and every C symbol implicitly gets one. A non-C symbol that expects to be accessible from C code needs to have it as well:&lt;/p&gt;

&lt;pre&gt;    .globl _MAInvocationCall
    _MAInvocationCall:
&lt;/pre&gt;

&lt;p&gt;The first thing any well-behaved &lt;code&gt;x86-64&lt;/code&gt; function does is save the old frame pointer (stored in &lt;code&gt;rbp&lt;/code&gt;) and set up a new one by copying the stack pointer over:&lt;/p&gt;

&lt;pre&gt;    pushq %rbp
    movq %rsp, %rbp
&lt;/pre&gt;

&lt;p&gt;I'll use &lt;code&gt;r12&lt;/code&gt; through &lt;code&gt;r15&lt;/code&gt; in the following code. These registers are designated as callee-saved by the platform calling conventions, meaning that we're not allowed to just obliterate their contents. Instead, we save their values onto the stack so they can be restored later:&lt;/p&gt;

&lt;pre&gt;    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;struct RawArguments *&lt;/code&gt; parameter is stored in &lt;code&gt;rdi&lt;/code&gt;. It's the first parameter to the function, and the calling conventions state that the first parameter is passed in &lt;code&gt;rdi&lt;/code&gt;. We need to use &lt;code&gt;rdi&lt;/code&gt; for the first parameter to the function being called, so we save the current value into &lt;code&gt;r12&lt;/code&gt;. The various elements of the &lt;code&gt;struct RawArguments&lt;/code&gt; parameter can be accessed by loading various offsets from &lt;code&gt;r12&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    mov %rdi, %r12
&lt;/pre&gt;

&lt;p&gt;Now it's ready to start copying arguments where they need to go. Because this requires manipulating the stack pointer, it copies the stack pointer into &lt;code&gt;r15&lt;/code&gt; so it's easy to restore later:&lt;/p&gt;

&lt;pre&gt;    mov %rsp, %r15
&lt;/pre&gt;

&lt;p&gt;Stack arguments get copied first, for no particular reason. It does make the code to copy them slightly easier to write, as it can use the argument-passing registers as scratch space, since they don't contain anything important. The first thing it does is load the number of stack arguments, which is located at offset &lt;code&gt;56&lt;/code&gt; in the &lt;code&gt;struct RawArguments&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    movq 56(%r12), %r10
&lt;/pre&gt;

&lt;p&gt;If you're wondering where &lt;code&gt;56&lt;/code&gt; comes from, each member in this struct is &lt;code&gt;8&lt;/code&gt; bytes, and the number of stack arguments is the 8th element in the struct, meaning that it comes after space for &lt;code&gt;7&lt;/code&gt; other elements. &lt;code&gt;7 * 8 = 56&lt;/code&gt;. All the offsets in this code are computed in the same way.&lt;/p&gt;
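
&lt;p&gt;As a sanity check on those hand-computed offsets, here is a small C sketch that reassembles the &lt;code&gt;struct&lt;/code&gt; in one piece and asks the compiler to confirm each offset used by the assembly. It assumes a 64-bit target where pointers are 8 bytes; the static assertions are my addition and are not part of the original code.&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* The struct from the article, shown in one piece. */
struct RawArguments
{
    void *fptr;

    uint64_t rdi;
    uint64_t rsi;
    uint64_t rdx;
    uint64_t rcx;
    uint64_t r8;
    uint64_t r9;

    uint64_t stackArgsCount;
    uint64_t *stackArgs;

    uint64_t rax_ret;
    uint64_t rdx_ret;

    uint64_t isStretCall;
};

/* Assumption: 64-bit target with 8-byte pointers, so every member is 8 bytes.
   Each offset is (number of preceding members) * 8. */
_Static_assert(offsetof(struct RawArguments, fptr) == 0, "fptr offset");
_Static_assert(offsetof(struct RawArguments, rdi) == 8, "rdi offset");
_Static_assert(offsetof(struct RawArguments, r9) == 48, "r9 offset");
_Static_assert(offsetof(struct RawArguments, stackArgsCount) == 56, "stackArgsCount offset");
_Static_assert(offsetof(struct RawArguments, stackArgs) == 64, "stackArgs offset");
_Static_assert(offsetof(struct RawArguments, rax_ret) == 72, "rax_ret offset");
_Static_assert(offsetof(struct RawArguments, rdx_ret) == 80, "rdx_ret offset");
_Static_assert(offsetof(struct RawArguments, isStretCall) == 88, "isStretCall offset");
```

&lt;p&gt;If a field is ever added or reordered, a check like this fails at compile time instead of letting the assembly silently read the wrong slot.&lt;/p&gt;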

&lt;p&gt;&lt;code&gt;r10&lt;/code&gt; now contains the number of stack arguments that need to be copied. Next, it computes the amount of stack space needed for these arguments. This is equal to the number of arguments multiplied by 8 (each argument is 64 bits, or 8 bytes). It does this by copying the number of arguments into &lt;code&gt;r11&lt;/code&gt;, then shifting it left by three bits, which is equivalent to multiplying by 8:&lt;/p&gt;

&lt;pre&gt;    movq %r10, %r11
    shlq $3, %r11
&lt;/pre&gt;

&lt;p&gt;Next, it loads the stack argument pointer from offset &lt;code&gt;64&lt;/code&gt; in the &lt;code&gt;struct RawArguments&lt;/code&gt; into &lt;code&gt;r13&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    movq 64(%r12), %r13
&lt;/pre&gt;

&lt;p&gt;Let's take a moment to recap what the temporary registers contain at the moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r10&lt;/code&gt;: the number of stack arguments to copy.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r11&lt;/code&gt;: the number of bytes needed for stack arguments.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r13&lt;/code&gt;: the stack argument pointer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We don't get to give things convenient names in assembly, so it's essential to keep careful track of what contains what at any given moment.&lt;/p&gt;

&lt;p&gt;The next step is to move the stack pointer down to make room for the arguments, which is done by subtracting &lt;code&gt;r11&lt;/code&gt; from the stack pointer:&lt;/p&gt;

&lt;pre&gt;    subq %r11, %rsp
&lt;/pre&gt;

&lt;p&gt;The stack is also required to be 16-byte aligned before making a function call, and this is done by just doing a logical AND with a value that has the bottom four bits cleared:&lt;/p&gt;

&lt;pre&gt;    andq $-0x10, %rsp
&lt;/pre&gt;
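
&lt;p&gt;The same arithmetic can be sketched in C. The helper names below are hypothetical and exist only for illustration; the point is that shifting left by three multiplies by 8, and that ANDing with &lt;code&gt;-0x10&lt;/code&gt; rounds an address &lt;em&gt;down&lt;/em&gt; to a multiple of 16, which is the safe direction for a stack that grows downward.&lt;/p&gt;

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed for `count` 8-byte stack arguments.
   Mirrors `shlq $3, %r11`. */
static uint64_t bytes_for_stack_args(uint64_t count)
{
    return count << 3;
}

/* Hypothetical helper: clear the bottom four bits, rounding down to a
   multiple of 16. Mirrors `andq $-0x10, %rsp`. */
static uint64_t align_down_16(uint64_t address)
{
    return address & ~(uint64_t)0xF;
}

/* Hypothetical helper combining the subtraction and the alignment step. */
static uint64_t reserve_stack_args(uint64_t stackPointer, uint64_t count)
{
    return align_down_16(stackPointer - bytes_for_stack_args(count));
}
```

&lt;p&gt;Because the alignment can only move the stack pointer further down, the reserved region is always at least as large as the arguments need.&lt;/p&gt;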

&lt;p&gt;The stage is now set. At this point, we just execute a simple memory copy loop. The equivalent C code would be:&lt;/p&gt;

&lt;pre&gt;    for(int i = 0; i != r10; i++)
        rsp[i] = r13[i];
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;r14&lt;/code&gt; will serve as the loop counter. The first step is to initialize it to zero:&lt;/p&gt;

&lt;pre&gt;    movq $0, %r14
&lt;/pre&gt;

&lt;p&gt;The top of the loop needs a label so that later code can easily jump back to it:&lt;/p&gt;

&lt;pre&gt;    stackargs_loop:
&lt;/pre&gt;

&lt;p&gt;Next comes the check for &lt;code&gt;r14 != r10&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    cmpq %r14, %r10
    je done
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;cmp&lt;/code&gt; instruction compares the two registers and sets the contents of the &lt;code&gt;FLAGS&lt;/code&gt; register accordingly. The &lt;code&gt;je&lt;/code&gt; instruction then jumps to the &lt;code&gt;done&lt;/code&gt; label if the &lt;code&gt;FLAGS&lt;/code&gt; register indicates that the two are equal. This two-stage construct is a bit odd, but it's how &lt;code&gt;x86-64&lt;/code&gt; works.&lt;/p&gt;

&lt;p&gt;If the two aren't equal, the loop continues. The next step is to copy the current argument. This is done in two stages. First, the argument is copied from the memory pointed to by &lt;code&gt;r13&lt;/code&gt; into a temporary register, in this case &lt;code&gt;rdi&lt;/code&gt;. Next, the argument is copied from &lt;code&gt;rdi&lt;/code&gt; into the memory pointed to by &lt;code&gt;rsp&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    movq 0(%r13, %r14, 8), %rdi
    movq %rdi, 0(%rsp, %r14, 8)
&lt;/pre&gt;

&lt;p&gt;The parenthetical expressions are a little scary. &lt;code&gt;x86-64&lt;/code&gt; allows memory references with a bunch of different components, which makes it easier to do computed array dereferences like this. The general form of the expression looks like:&lt;/p&gt;

&lt;pre&gt;    offset(%r1, %r2, elementSize)
&lt;/pre&gt;

&lt;p&gt;This refers to this address:&lt;/p&gt;

&lt;pre&gt;    r1 + r2 * elementSize + offset
&lt;/pre&gt;

&lt;p&gt;This can be thought of as an array dereference. &lt;code&gt;r1&lt;/code&gt; is the array pointer, &lt;code&gt;r2&lt;/code&gt; is the index, &lt;code&gt;elementSize&lt;/code&gt; is the size of each element in the array, and &lt;code&gt;offset&lt;/code&gt; is just a final fixup to apply to the whole result. In short, &lt;code&gt;0(%r13, %r14, 8)&lt;/code&gt; is equivalent to &lt;code&gt;((uint64_t *)r13)[r14]&lt;/code&gt;.&lt;/p&gt;
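
&lt;p&gt;To make the address computation concrete, here is a minimal C sketch of it. The function names are hypothetical; they simply spell out the formula above and show that, with an element size of 8 and an offset of 0, it lands on the same element as an ordinary array subscript.&lt;/p&gt;

```c
#include <stdint.h>

/* Hypothetical helper: the address named by offset(%r1, %r2, elementSize). */
static uintptr_t effective_address(uintptr_t base, uintptr_t index,
                                   uintptr_t elementSize, intptr_t offset)
{
    return base + index * elementSize + (uintptr_t)offset;
}

/* Hypothetical helper: what `movq 0(%r13, %r14, 8), %rdi` loads,
   with `array` playing the role of r13 and `index` the role of r14. */
static uint64_t load_indexed(const uint64_t *array, uint64_t index)
{
    uintptr_t address = effective_address((uintptr_t)array, (uintptr_t)index,
                                          sizeof(uint64_t), 0);
    return *(const uint64_t *)address;
}
```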

&lt;p&gt;After that comes the &lt;code&gt;i++&lt;/code&gt;, which has a simple assembly equivalent:&lt;/p&gt;

&lt;pre&gt;    inc %r14
&lt;/pre&gt;

&lt;p&gt;Finally, a jump back to &lt;code&gt;stackargs_loop&lt;/code&gt; completes the loop, with the &lt;code&gt;done&lt;/code&gt; label following it so that execution resumes below once the loop exits:&lt;/p&gt;

&lt;pre&gt;    jmp stackargs_loop

    done:
&lt;/pre&gt;
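
&lt;p&gt;For readers who find the control flow easier to follow in C, the loop can be written with labels and &lt;code&gt;goto&lt;/code&gt; so that each line corresponds to one instruction above. This is only a sketch for comparison: the function name and parameters are stand-ins for the registers, not part of the real code.&lt;/p&gt;

```c
#include <stdint.h>

/* dst plays the role of rsp, src of r13, count of r10. */
static void copy_stack_args(uint64_t *dst, const uint64_t *src, uint64_t count)
{
    uint64_t i = 0;           /* movq $0, %r14 */
    uint64_t tmp;             /* stands in for rdi, used as scratch */

stackargs_loop:
    if (i == count)           /* cmpq %r14, %r10 */
        goto done;            /* je done */

    tmp = src[i];             /* movq 0(%r13, %r14, 8), %rdi */
    dst[i] = tmp;             /* movq %rdi, 0(%rsp, %r14, 8) */
    i++;                      /* inc %r14 */
    goto stackargs_loop;      /* jmp stackargs_loop */

done:
    return;
}
```

&lt;p&gt;Checking for equality at the top of the loop means a count of zero copies nothing, which is the common case for calls whose arguments all fit in registers.&lt;/p&gt;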

&lt;p&gt;The stack arguments are now ready to go. All that remains is to copy the register arguments into their actual registers. This is done by writing a sequence of move instructions:&lt;/p&gt;

&lt;pre&gt;    movq 8(%r12), %rdi
    movq 16(%r12), %rsi
    movq 24(%r12), %rdx
    movq 32(%r12), %rcx
    movq 40(%r12), %r8
    movq 48(%r12), %r9
&lt;/pre&gt;

&lt;p&gt;With everything ready, it's time to call the target function. The function pointer is conveniently located right at the location pointed to by &lt;code&gt;r12&lt;/code&gt;, since it's the first element in the &lt;code&gt;struct RawArguments&lt;/code&gt;. This instruction makes the call:&lt;/p&gt;

&lt;pre&gt;    callq *(%r12)
&lt;/pre&gt;

&lt;p&gt;Once the call returns, the return value (if any) is found in &lt;code&gt;rax&lt;/code&gt; and &lt;code&gt;rdx&lt;/code&gt;. The code immediately copies the contents of these registers into the &lt;code&gt;struct RawArguments&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    movq %rax, 72(%r12)
    movq %rdx, 80(%r12)
&lt;/pre&gt;

&lt;p&gt;It's just about done. The only thing that needs to be done, aside from returning, is to restore the values stored in &lt;code&gt;r12&lt;/code&gt;-&lt;code&gt;r15&lt;/code&gt; to whatever the caller had in them. First, the stack pointer needs to be restored to what it was after those registers were pushed onto the stack:&lt;/p&gt;

&lt;pre&gt;    mov %r15, %rsp
&lt;/pre&gt;

&lt;p&gt;Then they're popped off in the opposite order from which they were pushed:&lt;/p&gt;

&lt;pre&gt;    popq %r15
    popq %r14
    popq %r13
    popq %r12
&lt;/pre&gt;

&lt;p&gt;Finally, control is returned to the caller, using a magic combination of instructions which readjust the stack and frame pointer before jumping to the caller's address:&lt;/p&gt;

&lt;pre&gt;    leave
    ret
&lt;/pre&gt;

&lt;p&gt;That takes care of the glue code for function calls. The Objective-C code can now fill out a &lt;code&gt;struct RawArguments&lt;/code&gt; to suit the call being made, then call &lt;code&gt;MAInvocationCall&lt;/code&gt; and pass the pointer to the &lt;code&gt;struct&lt;/code&gt; to make the call.&lt;/p&gt;
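
&lt;p&gt;To give a feel for the calling side, here is a rough C sketch of filling out the &lt;code&gt;struct&lt;/code&gt; and making a call. Since the real glue is assembly, the sketch substitutes a simplified stand-in, &lt;code&gt;call_sketch&lt;/code&gt;, which only handles the six register arguments and the &lt;code&gt;rax&lt;/code&gt; return value. Both &lt;code&gt;call_sketch&lt;/code&gt; and &lt;code&gt;weighted_sum&lt;/code&gt; are hypothetical names invented for this example.&lt;/p&gt;

```c
#include <stdint.h>

struct RawArguments
{
    void *fptr;
    uint64_t rdi, rsi, rdx, rcx, r8, r9;
    uint64_t stackArgsCount;
    uint64_t *stackArgs;
    uint64_t rax_ret;
    uint64_t rdx_ret;
    uint64_t isStretCall;
};

typedef uint64_t (*SixArgFunction)(uint64_t, uint64_t, uint64_t,
                                   uint64_t, uint64_t, uint64_t);

/* Simplified stand-in for MAInvocationCall: no stack arguments, no rdx
   return value. Converting void * to a function pointer is not strict
   ISO C, but it is supported on POSIX systems. */
static void call_sketch(struct RawArguments *args)
{
    SixArgFunction function = (SixArgFunction)args->fptr;
    args->rax_ret = function(args->rdi, args->rsi, args->rdx,
                             args->rcx, args->r8, args->r9);
}

/* Hypothetical target function that makes argument order visible. */
static uint64_t weighted_sum(uint64_t a, uint64_t b, uint64_t c,
                             uint64_t d, uint64_t e, uint64_t f)
{
    return a * 1 + b * 2 + c * 3 + d * 4 + e * 5 + f * 6;
}

/* Fill out the struct the way the Objective-C side would, then call. */
static uint64_t example_call(void)
{
    struct RawArguments args = {0};
    args.fptr = (void *)weighted_sum;
    args.rdi = 1;
    args.rsi = 2;
    args.rdx = 3;
    args.rcx = 4;
    args.r8 = 5;
    args.r9 = 6;
    call_sketch(&args);
    return args.rax_ret;
}
```

&lt;p&gt;With the real glue, the only change on the calling side would be replacing &lt;code&gt;call_sketch&lt;/code&gt; with &lt;code&gt;MAInvocationCall&lt;/code&gt; and, where needed, pointing &lt;code&gt;stackArgs&lt;/code&gt; at any arguments that don't fit in registers.&lt;/p&gt;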

&lt;p&gt;&lt;b&gt;Forwarding Glue&lt;/b&gt;&lt;br&gt;Capturing a method invocation is called "forwarding" in Objective-C. The runtime has a special forwarding handler, which is called any time an implementation can't be found for a particular selector. In fact, there are two different forwarding handlers: one for normal calls, and one for &lt;code&gt;stret&lt;/code&gt; calls. The forwarding handler needs to know where to find the &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt; parameters, and the locations of those parameters change for a &lt;code&gt;stret&lt;/code&gt; call, so a bit of specialization is required.&lt;/p&gt;
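
&lt;p&gt;The way those locations change can be sketched in a few lines of C. In a &lt;code&gt;stret&lt;/code&gt; call, &lt;code&gt;rdi&lt;/code&gt; is taken up by the pointer to the return value space, so &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt; each shift over by one register. The type and function names here are hypothetical and only illustrate the lookup.&lt;/p&gt;

```c
#include <stdint.h>

/* Hypothetical pair type holding the two implicit method parameters. */
struct ReceiverAndSelector
{
    uint64_t self;
    uint64_t cmd;
};

/* Pick self and _cmd out of the captured argument registers. */
static struct ReceiverAndSelector find_self_and_cmd(uint64_t rdi, uint64_t rsi,
                                                    uint64_t rdx,
                                                    uint64_t isStretCall)
{
    struct ReceiverAndSelector result;
    if (isStretCall)
    {
        /* rdi holds the return value pointer, so everything shifts by one. */
        result.self = rsi;
        result.cmd = rdx;
    }
    else
    {
        result.self = rdi;
        result.cmd = rsi;
    }
    return result;
}
```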

&lt;p&gt;The strategy here is to have two entry points that call through to a common implementation after making a note of whether or not it's a &lt;code&gt;stret&lt;/code&gt; call. The common implementation then fills out a new &lt;code&gt;struct RawArguments&lt;/code&gt; accordingly and calls into an Objective-C function. Once that function returns, it copies the return value back out into the return value registers, then returns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;r10&lt;/code&gt; register doesn't contain anything in particular when a function is called, but neither is it required to save the value. This makes it a good spot to store the &lt;code&gt;stret&lt;/code&gt; flag temporarily. The normal forwarding handler will set it to &lt;code&gt;0&lt;/code&gt; before jumping to the common implementation, and the &lt;code&gt;stret&lt;/code&gt; handler will set it to &lt;code&gt;1&lt;/code&gt;. Here's the normal handler in its entirety:&lt;/p&gt;

&lt;pre&gt;    .globl _MAInvocationForward
    _MAInvocationForward:
    movq $0, %r10
    jmp _MAInvocationForwardCommon
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;stret&lt;/code&gt; handler is nearly identical:&lt;/p&gt;

&lt;pre&gt;    .globl _MAInvocationForwardStret
    _MAInvocationForwardStret:
    movq $1, %r10
    jmp _MAInvocationForwardCommon
&lt;/pre&gt;

&lt;p&gt;All the interesting stuff happens in the common handler:&lt;/p&gt;

&lt;pre&gt;    .globl _MAInvocationForwardCommon
    _MAInvocationForwardCommon:
&lt;/pre&gt;

&lt;p&gt;The first thing it does is calculate the location of the stack arguments passed in to the function. The stack arguments start at &lt;code&gt;rsp + 8&lt;/code&gt; from the callee's point of view. The &lt;code&gt;call&lt;/code&gt; instruction issued by the caller pushes the return address onto the stack, which is why stack arguments start right at &lt;code&gt;rsp&lt;/code&gt; from that side of things, but not here. &lt;code&gt;r11&lt;/code&gt; is another convenient register that neither contains anything useful nor needs to be saved, so the code computes the address in that register:&lt;/p&gt;

&lt;pre&gt;    movq %rsp, %r11
    addq $8, %r11
&lt;/pre&gt;

&lt;p&gt;Then the function performs the standard prologue of setting up the frame pointer:&lt;/p&gt;

&lt;pre&gt;    pushq %rbp
    movq %rsp, %rbp
&lt;/pre&gt;

&lt;p&gt;Now it's finally time to construct the &lt;code&gt;struct RawArguments&lt;/code&gt;. This is done by pushing values onto the stack. First, a quick recap of what the various registers contain right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r10&lt;/code&gt;: the &lt;code&gt;isStretCall&lt;/code&gt; flag.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r11&lt;/code&gt;: the pointer to the stack arguments.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rdi-r9&lt;/code&gt;: register arguments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The handler uses the &lt;code&gt;pushq&lt;/code&gt; instruction to construct the &lt;code&gt;struct&lt;/code&gt; on the stack. Because it's pushing onto the stack, it needs to push everything in reverse order. Because &lt;code&gt;isStretCall&lt;/code&gt; is the last thing in the &lt;code&gt;struct&lt;/code&gt;, it's the first thing to be pushed:&lt;/p&gt;

&lt;pre&gt;    pushq %r10
&lt;/pre&gt;

&lt;p&gt;The return value registers don't need to contain anything in particular, so it makes space for them by pushing zero twice:&lt;/p&gt;

&lt;pre&gt;    pushq $0
    pushq $0
&lt;/pre&gt;

&lt;p&gt;Next comes the &lt;code&gt;stackArgs&lt;/code&gt; pointer, whose value is currently in &lt;code&gt;r11&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    pushq %r11
&lt;/pre&gt;

&lt;p&gt;After that comes the number of stack arguments. This is not currently known, so the handler just pushes a zero to make room for it. That field will be filled out by the Objective-C code:&lt;/p&gt;

&lt;pre&gt;    pushq $0
&lt;/pre&gt;

&lt;p&gt;Next come the argument registers, which are pushed in reverse order:&lt;/p&gt;

&lt;pre&gt;    pushq %r9
    pushq %r8
    pushq %rcx
    pushq %rdx
    pushq %rsi
    pushq %rdi
&lt;/pre&gt;

&lt;p&gt;The very first field of the &lt;code&gt;struct&lt;/code&gt; is the function pointer. That's not used here, so another zero is pushed to make room for it:&lt;/p&gt;

&lt;pre&gt;    pushq $0
&lt;/pre&gt;
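
&lt;p&gt;The reverse-order trick is easy to check with a small simulation. The C sketch below models the stack as an array that fills from the top down, performs the same twelve pushes, and leaves the fields in &lt;code&gt;struct RawArguments&lt;/code&gt; order when read from the final stack pointer upward. The &lt;code&gt;FakeStack&lt;/code&gt; type, the slot names, and the helper functions are all hypothetical, made up for this illustration.&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Slot index = struct offset / 8. */
enum
{
    SLOT_FPTR = 0,
    SLOT_RDI, SLOT_RSI, SLOT_RDX, SLOT_RCX, SLOT_R8, SLOT_R9,
    SLOT_STACK_ARGS_COUNT,
    SLOT_STACK_ARGS,
    SLOT_RAX_RET,
    SLOT_RDX_RET,
    SLOT_IS_STRET,
    SLOT_COUNT
};

struct FakeStack
{
    uint64_t memory[SLOT_COUNT];
    size_t sp; /* index of the lowest used slot; moves down on each push */
};

/* Like pushq: move the stack pointer down, then store. */
static void push(struct FakeStack *stack, uint64_t value)
{
    stack->sp--;
    stack->memory[stack->sp] = value;
}

/* registers[] holds rdi, rsi, rdx, rcx, r8, r9 in that order. */
static void build_raw_arguments(struct FakeStack *stack, uint64_t isStretCall,
                                uint64_t stackArgsAddress,
                                const uint64_t registers[6])
{
    stack->sp = SLOT_COUNT;           /* empty stack */
    push(stack, isStretCall);         /* last field goes first */
    push(stack, 0);                   /* rdx_ret */
    push(stack, 0);                   /* rax_ret */
    push(stack, stackArgsAddress);    /* stackArgs */
    push(stack, 0);                   /* stackArgsCount, filled in later */
    for (int i = 5; i >= 0; i--)      /* r9 down to rdi */
        push(stack, registers[i]);
    push(stack, 0);                   /* fptr, unused here */
}
```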

&lt;p&gt;At this point, &lt;code&gt;rsp&lt;/code&gt; now contains a pointer to the newly-built &lt;code&gt;struct RawArguments&lt;/code&gt;. The goal is to call a C function with this prototype:&lt;/p&gt;

&lt;pre&gt;    void MAInvocationForwardC(struct RawArguments *r);
&lt;/pre&gt;

&lt;p&gt;The pointer to the &lt;code&gt;struct&lt;/code&gt; is its only parameter, so that address needs to be moved to &lt;code&gt;rdi&lt;/code&gt;, where the first parameter is passed:&lt;/p&gt;

&lt;pre&gt;    movq %rsp, %rdi
&lt;/pre&gt;

&lt;p&gt;The handler needs to consult the &lt;code&gt;struct&lt;/code&gt; afterwards to extract the return value registers. Since &lt;code&gt;rdi&lt;/code&gt; isn't saved across the function call, and &lt;code&gt;rsp&lt;/code&gt; may be changed when aligning the stack for the call, the handler also copies the address into &lt;code&gt;r12&lt;/code&gt; so it can be used afterwards:&lt;/p&gt;

&lt;pre&gt;    movq %rdi, %r12
&lt;/pre&gt;

&lt;p&gt;It's now time to align the stack and call into Objective-C:&lt;/p&gt;

&lt;pre&gt;    andq $-0x10, %rsp
    callq _MAInvocationForwardC
&lt;/pre&gt;

&lt;p&gt;The Objective-C code will now construct an &lt;code&gt;MAInvocation&lt;/code&gt; instance and invoke the object's &lt;code&gt;forwardInvocation:&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;Once control returns, the return value, if any, is found in the &lt;code&gt;struct&lt;/code&gt;. To make it visible to the caller, that value is copied out of the &lt;code&gt;struct&lt;/code&gt; and into the appropriate registers:&lt;/p&gt;

&lt;pre&gt;    movq 72(%r12), %rax
    movq 80(%r12), %rdx
&lt;/pre&gt;

&lt;p&gt;That's it! Return to the caller:&lt;/p&gt;

&lt;pre&gt;    leave
    ret
&lt;/pre&gt;

&lt;p&gt;The Objective-C runtime's forward handlers are, amazingly, configurable. To set them to this code, all you have to do is call this somewhere convenient:&lt;/p&gt;

&lt;pre&gt;    objc_setForwardHandler(MAInvocationForward, MAInvocationForwardStret);
&lt;/pre&gt;

&lt;p&gt;The runtime will then use these forward handlers for all unimplemented selectors.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;That wraps up the assembly language glue code and the basic knowledge of calling conventions. Much work remains, but the two glue functions here provide the necessary foundation that the Objective-C parts of &lt;code&gt;MAInvocation&lt;/code&gt; can be built on. &lt;code&gt;MAInvocation&lt;/code&gt; needs to manage a &lt;code&gt;struct RawArguments&lt;/code&gt; and translate between the contents of that &lt;code&gt;struct&lt;/code&gt; and the arguments and return values provided and requested by the clients of the API. To make a method call, it needs to arrange the &lt;code&gt;struct&lt;/code&gt; properly, then call into the above glue code. To receive a method call, it needs to construct a new &lt;code&gt;MAInvocation&lt;/code&gt; from the &lt;code&gt;struct&lt;/code&gt; contents.&lt;/p&gt;

&lt;p&gt;All this shall be covered next time. Until then, please &lt;a href="mailto:mike@mikeash.com"&gt;send in your ideas&lt;/a&gt; for topics to cover on Friday Q&amp;amp;A. The next article may be spoken for, but your suggestions for the future are always welcome.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-03-08-lets-build-nsinvocation-part-i.html</guid><pubDate>Fri, 08 Mar 2013 14:33:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-02-22: Let's Build UITableView
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-02-22-lets-build-uitableview.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-02-22: Let's Build UITableView
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 02 22  15 35"
                  tags="fridayqna iphone letsbuild"
            author="Matthew Elton"
            authorlink="http://obliquely.org.uk/"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-02-22: Let's Build UITableView
&lt;/div&gt;
              &lt;p&gt;Friday Q&amp;amp;A is driven by the readers, and that's especially true today. Reader &lt;a href="http://obliquely.org.uk/"&gt;Matthew Elton&lt;/a&gt; thought that "Let's Build UITableView" would make a good topic for Friday Q&amp;amp;A, but he decided he'd rather implement it himself and write it up rather than wait for me to do it (good move, Matthew). Without further ado, here is Matthew's article on building &lt;code&gt;UITableView&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Let's Build UITableView&lt;/b&gt;&lt;br&gt;&lt;code&gt;UITableView&lt;/code&gt; is a powerful and full featured class, but its internal workings can seem mysterious. Most of the time use of the class is straightforward: in return for following the prescribed practice in the documentation, the developer gets a responsive scrolling table that is frugal with memory even when the row count is high. But when pushing the class hard, for example with large tables where each row may have a different and varying height, it helps to have a deeper understanding of how the class works.&lt;/p&gt;

&lt;p&gt;In this article, I am going to implement a basic version of a table view class. This will show how the class works its magic and also show just why &lt;code&gt;UITableView&lt;/code&gt; asks what it does - and when - of its data source and delegate.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Implementation Strategy&lt;/b&gt;&lt;br&gt;&lt;code&gt;UITableView&lt;/code&gt; is a subclass of &lt;code&gt;UIScrollView&lt;/code&gt; and, with the power of that class in place, it takes only a little work to implement a basic version of a table view. Before diving into code, consider two key tasks needed to keep performance high and memory usage low.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The table view is going to need a pool of reusable views for displaying the rows in the table. Why is this?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Consider a table with, say, 1000 rows. To the user, it looks as if there are 1000 views all neatly stacked one upon the other. Although building views is fast and modern devices do have a lot of memory, if the table view has to build and store 1000 views, there is a serious risk of a performance hit. It might mean, for example, that there's a nasty lag before the table first appears. Not good.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fortunately, the table view does not need to take this approach. It only needs to behave &lt;em&gt;as if&lt;/em&gt; it has 1000 neatly stacked views. In fact, the table view only needs actual views for the rows in the table that are visible at any given time. Typically this is a fairly small number. And, in any case, it's reliably much smaller than 1000. To make the illusion work, the table view just needs to move a few views around, so they appear in the part of the scroll view that is visible to the user. Then it has to make sure their contents are updated according to the row they are currently representing. Recycling views from a pool, rather than making new views each time they are needed, ensures things happen fast enough for smooth scrolling even on older iOS devices.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The table view is also going to need to know the starting position and height of each row in the table. And, critically, it will need this information before it attempts any layout at all. Why is this?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, it needs to know how tall the table is so it can tell the scroll view the size of its contents and, thus, ensure that the scroll bars are the right size, that when the user gets to the bottom of the table they experience the pleasing elastic band effect, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Second, whenever the scroll view moves, the table view needs to figure out how to reposition its reusable views and whether it needs to refresh their contents.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It turns out that the reusable pool is very simple to implement, so we'll do that first. The mechanism for coordinating rows, their offsets and their contents, is only a little trickier. We'll do that second and build it up by stages.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;A Reusable Pool of Views&lt;/b&gt;&lt;br&gt;&lt;code&gt;UITableView&lt;/code&gt; keeps a pool of reusable views (Apple calls it a queue), with each view representing a single row of the table. Often every row of the table is similar, but sometimes tables have different types of rows. So &lt;code&gt;UITableView&lt;/code&gt; asks its data source to specify a reuse identifier when working with the pool of reusable views. A reuse identifier is an &lt;code&gt;NSString&lt;/code&gt; that is passed to the &lt;code&gt;UITableView&lt;/code&gt; method &lt;code&gt;dequeueReusableCellWithIdentifier:&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dequeueReusableCellWithIdentifier:&lt;/code&gt; method asks the &lt;code&gt;UITableView&lt;/code&gt; to return a view. With a fresh &lt;code&gt;UITableView&lt;/code&gt; the pool will be empty and the method will return nil. But once a &lt;code&gt;UITableView&lt;/code&gt; is up and running, it may well have views in its pool. If it does and if their reuse identifier matches that specified in the &lt;code&gt;dequeueReusableCellWithIdentifier:&lt;/code&gt; call, then this view is returned.&lt;/p&gt;

&lt;p&gt;If you've used &lt;code&gt;UITableView&lt;/code&gt; at all, you'll be familiar with the standard pattern for using &lt;code&gt;dequeueReusableCellWithIdentifier:&lt;/code&gt;. In your data source, you implement the &lt;code&gt;tableView:cellForRowAtIndexPath:&lt;/code&gt; method to return a view to represent a given row of your table. At the start of the method, you either grab a view from the pool or make a new one. Either way you populate the view with data for the row. Typical code looks like this:&lt;/p&gt;

&lt;pre&gt;    - (UITableViewCell*) tableView:(UITableView*) tableView cellForRowAtIndexPath:(NSIndexPath*) indexPath
    {
        UITableViewCell* cell = [tableView dequeueReusableCellWithIdentifier: @"standardRow"];
        if (!cell)
        {
            cell = [[UITableViewCell alloc] initWithStyle: UITableViewCellStyleDefault reuseIdentifier: @"standardRow"];
            [cell autorelease];
        }

        [self populateCell: cell forIndexPath: indexPath];
        return cell;
    }
&lt;/pre&gt;

&lt;p&gt;OK, so let's implement &lt;code&gt;dequeueReusableCellWithIdentifier:&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;    - (PGTableViewCell*) dequeueReusableCellWithIdentifier: (NSString*) reuseIdentifier
    {
        PGTableViewCell* poolCell = nil;

        for (PGTableViewCell* tableViewCell in [self reusePool])
        {
            if ([[tableViewCell reuseIdentifier] isEqualToString: reuseIdentifier])
            {
                poolCell = tableViewCell;
                break;
            }
        }

        if (poolCell)
        {
            [poolCell retain];
            [[self reusePool] removeObject: poolCell];
            [poolCell autorelease];
        }

        return poolCell;
    }
&lt;/pre&gt;

&lt;p&gt;In this implementation the reusePool property is an &lt;code&gt;NSMutableArray&lt;/code&gt;. And a &lt;code&gt;PGTableViewCell&lt;/code&gt; is simply a subclass of &lt;code&gt;UIView&lt;/code&gt; that has one additional property, an &lt;code&gt;NSString&lt;/code&gt; called reuseIdentifier. The real &lt;code&gt;UITableViewCell&lt;/code&gt; has lots of extra functionality, but this property is all that's needed for our basic implementation.&lt;/p&gt;

&lt;p&gt;The method assumes that any view in reusePool is available, i.e. it is not currently being used to display a visible row. Of course, for the method to do its job, the table view will need to make sure that relevant views are added to the pool. That is, it'll need to work out when a row is moved off screen and, at that point, add its view to the reuse pool.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Gathering Height and Vertical Offset Data&lt;/b&gt;&lt;br&gt;&lt;code&gt;UITableView&lt;/code&gt; cheerfully copes with fixed and variable row heights and our basic implementation will be no different.  If the delegate responds to the &lt;code&gt;tableView:heightForRowAtIndexPath:&lt;/code&gt; method, then a &lt;code&gt;UITableView&lt;/code&gt; will ask the delegate for the height of every row. It gets to know about the number of rows in the table by asking the data source using the &lt;code&gt;tableView:numberOfRowsInSection:&lt;/code&gt; method (which is required) and the optional &lt;code&gt;numberOfSectionsInTableView:&lt;/code&gt; method. (If this method isn't implemented, the table view assumes you have just one section.)&lt;/p&gt;

&lt;p&gt;For our implementation, we will simplify a little. Our table will not have any sections, so we just need to learn about the number of rows. We'll require our data source to implement &lt;code&gt;numberOfRowsInPgTableView:&lt;/code&gt;. And, following Apple, we'll offer an optional &lt;code&gt;pgTableView:heightForRow:&lt;/code&gt; method as part of the delegate protocol.&lt;/p&gt;

&lt;pre&gt;    - (void) generateHeightAndOffsetData
    {
        CGFloat currentOffsetY = 0.0;

        BOOL checkHeightForEachRow = [[self delegate] respondsToSelector: @selector(pgTableView:heightForRow:)];

        NSMutableArray* newRowRecords = [NSMutableArray array];

        NSInteger numberOfRows = [[self dataSource] numberOfRowsInPgTableView: self];

        for (NSInteger row = 0; row &amp;lt; numberOfRows; row++)
        {
            PGRowRecord* rowRecord = [[PGRowRecord alloc] init];

            CGFloat rowHeight = checkHeightForEachRow ? [[self delegate] pgTableView: self heightForRow: row] : [self rowHeight];

            [rowRecord setHeight: rowHeight + _pgRowMargin];
            [rowRecord setStartPositionY: currentOffsetY + _pgRowMargin];

            [newRowRecords insertObject: rowRecord atIndex: row];
            [rowRecord release];

            currentOffsetY = currentOffsetY + rowHeight + _pgRowMargin;
        }

        [self setRowRecords: newRowRecords];

        [self setContentSize: CGSizeMake([self bounds].size.width,  currentOffsetY)];
    }
&lt;/pre&gt;

&lt;p&gt;The code builds an array of &lt;code&gt;PGRowRecord&lt;/code&gt; instances that capture what we will need to perform our layout work. A &lt;code&gt;PGRowRecord&lt;/code&gt; records the starting position of a row, its height and - as we'll see later - a pointer to the view that represents the row if the row is visible. The &lt;code&gt;generateHeightAndOffsetData&lt;/code&gt; method has no idea what is visible or not, so it doesn't set the pointer to the view.&lt;/p&gt;

&lt;p&gt;As the code shows, we need to check to see if the delegate is up for providing height information. If it is then we ask it once for each row in the table.&lt;/p&gt;

&lt;p&gt;Arguably, there is room for greater efficiency here. The height of a given row could be derived by subtracting the current starting position from the next starting position (and having some record of the very last starting position, i.e. the start of the row after the last row). And, in addition, for the case of fixed height rows, we could pass on storing start positions and heights altogether, but simply calculate them when needed. But it looks as if we might need the array anyway, because we need to keep track of whether a given row is currently visible. So we'll leave this method as is for now.&lt;/p&gt;
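
&lt;p&gt;For the curious, the first of those alternatives might look something like the following plain C sketch. It stores only start positions, plus one extra entry marking the end of the last row, and derives each height by subtraction. The function names are hypothetical, and the sketch ignores the row margin that the method above adds.&lt;/p&gt;

```c
#include <stddef.h>

/* startPositions must have room for rowCount + 1 entries; the final entry
   is the start of the imaginary row after the last one, which is also the
   total content height. */
static void build_start_positions(const double *heights, size_t rowCount,
                                  double *startPositions)
{
    double offset = 0.0;
    for (size_t row = 0; row < rowCount; row++)
    {
        startPositions[row] = offset;
        offset += heights[row];
    }
    startPositions[rowCount] = offset;
}

/* A row's height is the gap between its start and the next row's start. */
static double height_for_row(const double *startPositions, size_t row)
{
    return startPositions[row + 1] - startPositions[row];
}
```

&lt;p&gt;This stores one number per row plus one extra instead of two per row, at the cost of a subtraction on each height lookup.&lt;/p&gt;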

&lt;p&gt;&lt;b&gt;Laying Out the Rows&lt;/b&gt;&lt;br&gt;Having gathered start positions and heights, the work of laying out the views is straightforward. The table view fetches its &lt;code&gt;contentOffset&lt;/code&gt;, a property of the &lt;code&gt;UIScrollView&lt;/code&gt; that indicates the start of the visible section of the view. It then figures out the first row that needs to be shown, stepping forward one row at a time until the visible section of the view is filled up.&lt;/p&gt;

&lt;p&gt;The only complexity here is keeping careful track of which rows are displayed so the table view can then check to see if any that were previously visible have now gone. In that case, it will want to put them back in the pool for reuse. The &lt;code&gt;returnNonVisibleRowsToThePool:&lt;/code&gt; method will do this work if required.&lt;/p&gt;

&lt;pre&gt;    - (void) layoutTableRows
    {
        CGFloat currentStartY = [self contentOffset].y;
        CGFloat currentEndY = currentStartY + [self frame].size.height;

        NSInteger rowToDisplay = [self findRowForOffsetY: currentStartY inRange: NSMakeRange(0, [[self rowRecords] count])];

        NSMutableIndexSet* newVisibleRows = [[NSMutableIndexSet alloc] init];

        CGFloat yOrigin;
        CGFloat rowHeight;
        do
        {
            [newVisibleRows addIndex: rowToDisplay];

            yOrigin = [self startPositionYForRow: rowToDisplay];
            rowHeight = [self heightForRow: rowToDisplay];

            PGTableViewCell* cell = [self cachedCellForRow: rowToDisplay];

            if (!cell)
            {
                cell = [[self dataSource] pgTableView: self cellForRow: rowToDisplay];
                [self setCachedCell: cell forRow: rowToDisplay];

                [cell setFrame: CGRectMake(0.0, yOrigin, [self bounds].size.width, rowHeight - _pgRowMargin)];
                [self addSubview: cell];
            }

            rowToDisplay++;
        }
        while (yOrigin + rowHeight &amp;lt; currentEndY &amp;amp;&amp;amp; rowToDisplay &amp;lt; [[self rowRecords] count]);

        [self returnNonVisibleRowsToThePool: newVisibleRows];

        [newVisibleRows release];
    }
&lt;/pre&gt;

&lt;p&gt;This method is going to get called a lot. Every time you scroll the table or, indeed, every time the system does, this method will need to be called. We ensure that it is by overriding the superclass &lt;code&gt;setContentOffset:&lt;/code&gt; method as follows.&lt;/p&gt;

&lt;pre&gt;    - (void) setContentOffset:(CGPoint)contentOffset
    {
        [super setContentOffset: contentOffset];
        [self layoutTableRows];
    }
&lt;/pre&gt;

&lt;p&gt;If you are playing with this code, you can put an &lt;code&gt;NSLog&lt;/code&gt; in here to get a feel for the frequency of calls, not least because this drives home the importance of ensuring that &lt;code&gt;layoutTableRows&lt;/code&gt; is fast.&lt;/p&gt;

&lt;p&gt;One thing that could really slow down &lt;code&gt;layoutTableRows&lt;/code&gt; would be inefficiency in the &lt;code&gt;findRowForOffsetY:inRange:&lt;/code&gt; method, so it seemed worth putting a little effort into this. Because the array of &lt;code&gt;rowRecords&lt;/code&gt; is already sorted, we can take advantage of the &lt;code&gt;NSArray&lt;/code&gt; method &lt;code&gt;indexOfObject:inSortedRange:options:usingComparator:&lt;/code&gt;. This performs a binary search to home in on the first row that is needed for the current vertical offset of the &lt;code&gt;UIScrollView&lt;/code&gt;. For a table of 6000 rows or so, this method can be 100 times faster than just cranking through the list of rows from the start. That said, after doing some measuring it became clear that even unoptimised iteration is fast enough most of the time. By 'fast enough' here I mean that doing it inefficiently has no discernible impact on the user experience, at least for tables up to 10,000 rows.&lt;/p&gt;

&lt;pre&gt;    - (NSInteger) findRowForOffsetY: (CGFloat) yPosition inRange: (NSRange) range
    {
        if ([[self rowRecords] count] == 0) return 0;

        PGRowRecord* rowRecord = [[PGRowRecord alloc] init];
        [rowRecord setStartPositionY: yPosition];

        NSInteger returnValue = [[self rowRecords] indexOfObject: rowRecord
                                                   inSortedRange: range
                                                         options: NSBinarySearchingInsertionIndex
                                                 usingComparator: ^NSComparisonResult(PGRowRecord* rowRecord1, PGRowRecord* rowRecord2){
                if ([rowRecord1 startPositionY] &amp;lt; [rowRecord2 startPositionY])
                    return NSOrderedAscending;
                return NSOrderedDescending;
        }];
        [rowRecord release];
        if (returnValue == 0) return 0;
        return returnValue - 1;
    }
&lt;/pre&gt;
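
&lt;p&gt;If you want to see the search idea in isolation, here is a hand-rolled sketch. It is not the &lt;code&gt;NSArray&lt;/code&gt; method used above, just a plain binary search over an array of start positions. The &lt;code&gt;RowForOffset&lt;/code&gt; function and the sample data are hypothetical, and the function assumes the start positions are sorted in ascending order.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    // Hypothetical stand-in for the row lookup: returns the last row whose
    // start position is at or before y, or 0 if there is no such row.
    static NSUInteger RowForOffset(const CGFloat *starts, NSUInteger count, CGFloat y)
    {
        NSUInteger low = 0;
        NSUInteger high = count;    // search the half-open range [low, high)
        while (high - low &amp;gt; 1)
        {
            NSUInteger mid = low + (high - low) / 2;
            if (starts[mid] &amp;lt;= y)
                low = mid;          // row mid starts at or before y, so keep it as a candidate
            else
                high = mid;         // row mid starts after y, so discard it and everything later
        }
        return low;
    }

    int main(void)
    {
        @autoreleasepool
        {
            CGFloat starts[] = { 0.0, 44.0, 104.0, 134.0 };
            CGFloat offsets[] = { 0.0, 50.0, 104.0, 1000.0 };

            for (NSUInteger i = 0; i &amp;lt; 4; i++)
                NSLog(@"offset %.1f falls in row %lu", (double)offsets[i], (unsigned long)RowForOffset(starts, 4, offsets[i]));
        }
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;Each pass halves the range, so a table of 6000 rows needs about 13 comparisons rather than up to 6000.&lt;/p&gt;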

&lt;p&gt;The final method used by &lt;code&gt;layoutTableRows&lt;/code&gt; is &lt;code&gt;returnNonVisibleRowsToThePool:&lt;/code&gt;. This makes use of some handy methods provided by the &lt;code&gt;NSMutableIndexSet&lt;/code&gt; class to figure out which rows, if any, are no longer visible. For each of those, it adds the view to the pool, removes it from its superview and clears the pointer to the view in the row's &lt;code&gt;PGRowRecord&lt;/code&gt; instance.&lt;/p&gt;

&lt;pre&gt;    - (void) returnNonVisibleRowsToThePool: (NSMutableIndexSet*) currentVisibleRows
    {
        [[self visibleRows] removeIndexes: currentVisibleRows];
        [[self visibleRows] enumerateIndexesUsingBlock:^(NSUInteger row, BOOL *stop)
         {
             PGTableViewCell* tableViewCell = [self cachedCellForRow: row];
             if (tableViewCell)
             {
                 [[self reusePool] addObject: tableViewCell];
                 [tableViewCell removeFromSuperview];
                 [self setCachedCell: nil forRow: row];
             }
         }];
        [self setVisibleRows: currentVisibleRows];
    }
&lt;/pre&gt;
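
&lt;p&gt;The index set arithmetic is easy to try on its own. This small sketch uses only Foundation; the row numbers are made up for illustration.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    int main(void)
    {
        @autoreleasepool
        {
            // Rows 3 through 7 were visible before the scroll; rows 5 through 9 are visible now.
            NSMutableIndexSet *previous = [NSMutableIndexSet indexSetWithIndexesInRange: NSMakeRange(3, 5)];
            NSIndexSet *current = [NSIndexSet indexSetWithIndexesInRange: NSMakeRange(5, 5)];

            // Removing the current rows leaves only those that scrolled out of view.
            [previous removeIndexes: current];

            [previous enumerateIndexesUsingBlock: ^(NSUInteger row, BOOL *stop) {
                NSLog(@"row %lu can go back to the pool", (unsigned long)row);
            }];
        }
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;Only rows 3 and 4 survive the removal, which are exactly the rows that were visible before and are not visible now.&lt;/p&gt;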

&lt;p&gt;&lt;b&gt;Nearly Done&lt;/b&gt;&lt;br&gt;That's all the hard work. The &lt;code&gt;reloadData&lt;/code&gt; method just uses code we've already written. Because the &lt;code&gt;generateHeightAndOffsetData&lt;/code&gt; method is going to discard the current record of visible cells, the &lt;code&gt;reloadData&lt;/code&gt; method takes care to remove all the currently visible views first.&lt;/p&gt;

&lt;pre&gt;    - (void) reloadData
    {
        [self returnNonVisibleRowsToThePool: nil];
        [self generateHeightAndOffsetData];
        [self layoutTableRows];
    }
&lt;/pre&gt;

&lt;p&gt;The rest is just housekeeping, such as setting up data source and delegate protocols and providing some convenience methods for accessing our array of PGRowRecords. The details are in the full source, along with a bonus &lt;code&gt;row:changedHeight:&lt;/code&gt; method which allows for a row to change height without forcing the delegate to provide new height information for every other row.&lt;/p&gt;

&lt;p&gt;The source includes a small test app so you can see &lt;code&gt;PGTableView&lt;/code&gt; working, showing the text of this article with three different row types: code, headings, and text. The test app lets you turn off the reuse pool so you can see the difference in performance and also lets you run measurements for two variants of &lt;code&gt;findRowForOffsetY:inRange:&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/Obliquely/Let-s-Build-UITableView"&gt;https://github.com/Obliquely/Let-s-Build-UITableView&lt;/a&gt; for the source, including the test app. The code here is for learning purposes and is not production tested. If you want to make use of any of the code, please feel free.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;One of the things this exercise reveals is just why &lt;code&gt;UITableView&lt;/code&gt; has to ask for the height of every row when you call the &lt;code&gt;reloadData&lt;/code&gt; method. When tables have many rows &lt;em&gt;and&lt;/em&gt; when the cost of calculating the height of a row is high, this requirement can be a burden. In such cases, it can make sense to cache row heights so that they don't all have to be recalculated when you or the table view calls &lt;code&gt;reloadData&lt;/code&gt;, e.g. to add an extra row or because the height of one row has changed. In addition, if your table needs to cope with a change of orientation that then calls for row height adjustments, you can do work in the background calculating the heights in the orientation you're not in, so that if or when the change comes, the work is already done.&lt;/p&gt;

&lt;p&gt;Given that we know the &lt;code&gt;UITableView&lt;/code&gt; must be keeping its own cache of row heights, it is perhaps mildly annoying that this caching work may need to be done twice. After all, given what we have seen here it seems likely that any implementation would have scope to implement insert, delete, and move methods in such a way that they didn't need to trigger a call to &lt;code&gt;tableView:heightForRowAtIndexPath:&lt;/code&gt; on every row. And it seems likely that it would be easy for Apple to implement a &lt;code&gt;rowAtIndexPath:changedHeight:&lt;/code&gt; method to cope with growing or shrinking rows. Still, the fully featured &lt;code&gt;UITableView&lt;/code&gt; is a mighty class. So perhaps it's not seemly to carp about such a minor detail.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Matthew Elton</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-02-22-lets-build-uitableview.html</guid><pubDate>Fri, 22 Feb 2013 15:35:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-02-08: Let's Build Key-Value Coding
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-02-08-lets-build-key-value-coding.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-02-08: Let's Build Key-Value Coding
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 02 08  14 17"
                  tags="fridayqna letsbuild objectivec"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-02-08: Let's Build Key-Value Coding
&lt;/div&gt;
              &lt;p&gt;&lt;a href="friday-qa-2013-01-25-lets-build-nsobject.html"&gt;Last time&lt;/a&gt;, I showed how to build the basic functionality of &lt;code&gt;NSObject&lt;/code&gt;. I left out key-value coding, because the implementation of &lt;code&gt;valueForKey:&lt;/code&gt; and &lt;code&gt;setValue:forKey:&lt;/code&gt; is complex enough to need its own article. This is that article.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Basics&lt;/b&gt;&lt;br&gt;Key-value coding (KVC) is an API that allows string-based access to object properties. &lt;code&gt;NSObject&lt;/code&gt; implements the methods to look up accessor methods or instance variables based on the key name, and fetch or set the value using those.&lt;/p&gt;

&lt;p&gt;Two methods form the basis of KVC.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;valueForKey:&lt;/code&gt; method searches for a getter method with the same name as the key. If found, it calls the method and returns its return value. If none is found, it searches for an instance variable with the same name as the key. Failing those, it looks for an instance variable with the same name as the key, but prefixed with an underscore. If an instance variable is found, it returns the value it currently holds.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;setValue:forKey:&lt;/code&gt; method performs the same search, except that it searches for a setter method rather than a getter. It then either calls the setter or sets the instance variable directly.&lt;/p&gt;

&lt;p&gt;An interesting feature of both of these methods is that they work with primitive values by automatically boxing and unboxing them into instances of &lt;code&gt;NSNumber&lt;/code&gt; or &lt;code&gt;NSValue&lt;/code&gt;. You can use &lt;code&gt;valueForKey:&lt;/code&gt; to invoke a method that returns &lt;code&gt;int&lt;/code&gt;, and the result will be an &lt;code&gt;NSNumber&lt;/code&gt; object containing the return value. Likewise, you can use &lt;code&gt;setValue:forKey:&lt;/code&gt; to invoke a method that takes &lt;code&gt;int&lt;/code&gt;, pass it an &lt;code&gt;NSNumber&lt;/code&gt;, and it will automatically extract the integer value.&lt;/p&gt;
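
&lt;p&gt;Here is a quick sketch of that boxing behavior using Cocoa's own KVC rather than the reimplementation built below. The &lt;code&gt;Counter&lt;/code&gt; class is hypothetical, and the example assumes it is compiled with ARC so that it doesn't need an explicit &lt;code&gt;release&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    // Hypothetical class used only for this illustration.
    @interface Counter : NSObject
    @property (nonatomic) int count;
    @end

    @implementation Counter
    @end

    int main(void)
    {
        @autoreleasepool
        {
            Counter *counter = [[Counter alloc] init];

            // The setter takes an int; KVC unboxes the NSNumber before calling it.
            [counter setValue: @42 forKey: @"count"];

            // The getter returns an int; KVC boxes it into an NSNumber.
            id boxed = [counter valueForKey: @"count"];
            NSLog(@"value %@, is NSNumber: %d", boxed, [boxed isKindOfClass: [NSNumber class]]);
        }
        return 0;
    }
&lt;/pre&gt;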

&lt;p&gt;KVC also has the concept of key paths, which are sequences of keys put together with periods, like:&lt;/p&gt;

&lt;pre&gt;    foo.bar.baz
&lt;/pre&gt;

&lt;p&gt;There are corresponding methods to work with key paths: &lt;code&gt;valueForKeyPath:&lt;/code&gt; and &lt;code&gt;setValue:forKeyPath:&lt;/code&gt;. These simply call the more primitive methods recursively.&lt;/p&gt;

&lt;p&gt;There are a bunch of other KVC features for managing collections, but these are less interesting and I'm going to skip over them here.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Code&lt;/b&gt;&lt;br&gt;Today's code is available on GitHub as part of the &lt;code&gt;MAObject&lt;/code&gt; project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mikeash/MAObject"&gt;https://github.com/mikeash/MAObject&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's get to it.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;valueForKey:&lt;/b&gt;&lt;br&gt;The first thing that &lt;code&gt;valueForKey:&lt;/code&gt; does is check for a getter method with the same name as the key.&lt;/p&gt;

&lt;pre&gt;    - (id)valueForKey: (NSString *)key
    {
        SEL getterSEL = NSSelectorFromString(key);
        if([self respondsToSelector: getterSEL])
        {
&lt;/pre&gt;

&lt;p&gt;If the object responds to that selector, it will use the accessor to get the value. Exactly how that's done will depend on the accessor's return type. To get ready, it fetches the return type and &lt;code&gt;IMP&lt;/code&gt; for the method:&lt;/p&gt;

&lt;pre&gt;            NSMethodSignature *sig = [self methodSignatureForSelector: getterSEL];
            char type = [sig methodReturnType][0];
            IMP imp = [self methodForSelector: getterSEL];
&lt;/pre&gt;

&lt;p&gt;If the return type is an object or a class, then the code is simple: cast the &lt;code&gt;IMP&lt;/code&gt; to the right function pointer type, call it, and return what it returns:&lt;/p&gt;

&lt;pre&gt;            if(type == @encode(id)[0] || type == @encode(Class)[0])
            {
                return ((id (*)(id, SEL))imp)(self, getterSEL);
            }
&lt;/pre&gt;

&lt;p&gt;Otherwise, the method returns a primitive, which is where things get interesting.&lt;/p&gt;

&lt;p&gt;There is no convenient way to take a function pointer with an arbitrary type, call it, and box up the result. We have to do things the brute-force way, by enumerating all of the possibilities one by one and writing code to handle each possible type. I created a small macro to help with this:&lt;/p&gt;

&lt;pre&gt;            else
            {
                #define CASE(ctype, selectorpart) \
                    if(type == @encode(ctype)[0]) \
                        return [NSNumber numberWith ## selectorpart: ((ctype (*)(id, SEL))imp)(self, getterSEL)];
&lt;/pre&gt;

&lt;p&gt;The idea is that each type gets a single line. You pass the type name as one parameter, and a selector part that fits in with &lt;code&gt;[NSNumber numberWithType:]&lt;/code&gt; as the other parameter. The macro uses these to construct code that checks for the type and calls the &lt;code&gt;IMP&lt;/code&gt; with the right function pointer type if it matches. With this macro, it's just a matter of writing out every supported primitive type:&lt;/p&gt;

&lt;pre&gt;                CASE(char, Char);
                CASE(unsigned char, UnsignedChar);
                CASE(short, Short);
                CASE(unsigned short, UnsignedShort);
                CASE(int, Int);
                CASE(unsigned int, UnsignedInt);
                CASE(long, Long);
                CASE(unsigned long, UnsignedLong);
                CASE(long long, LongLong);
                CASE(unsigned long long, UnsignedLongLong);
                CASE(float, Float);
                CASE(double, Double);
&lt;/pre&gt;

&lt;p&gt;Let's not forget to undefine the &lt;code&gt;CASE&lt;/code&gt; macro so we can reuse the name later:&lt;/p&gt;

&lt;pre&gt;                #undef CASE
&lt;/pre&gt;

&lt;p&gt;If a matching case was found, then the method returned immediately. If the method is still running at this point, then the type isn't known. Rather than try to handle this gracefully somehow, the method just throws an exception to complain:&lt;/p&gt;

&lt;pre&gt;                [NSException raise: NSInternalInconsistencyException format: @"Class %@ key %@ don't know how to interpret method return type from getter, signature is %@", [isa description], key, sig];
            }
        }
&lt;/pre&gt;

&lt;p&gt;That was the code to handle the case where a getter method exists. If no getter exists, then KVC falls back to instance variables. First, it tries to get an instance variable with the same name as the key:&lt;/p&gt;

&lt;pre&gt;        Ivar ivar = class_getInstanceVariable(isa, [key UTF8String]);
&lt;/pre&gt;

&lt;p&gt;If that fails, it tries again with a leading underscore:&lt;/p&gt;

&lt;pre&gt;        if(!ivar)
            ivar = class_getInstanceVariable(isa, [[@"_" stringByAppendingString: key] UTF8String]);
&lt;/pre&gt;

&lt;p&gt;If either of those found an instance variable, it proceeds to actually fetching its value. In order to fetch the contents of the variable, we need to know where it's stored. This is done by getting the variable's offset, and adding it to the value of &lt;code&gt;self&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        if(ivar)
        {
            ptrdiff_t offset = ivar_getOffset(ivar);
            char *ptr = (char *)self;
            ptr += offset;
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;self&lt;/code&gt; is cast to &lt;code&gt;char *&lt;/code&gt; first, because the offset is in bytes, and operating on a &lt;code&gt;char *&lt;/code&gt; ensures that the &lt;code&gt;+=&lt;/code&gt; operation does what we need.&lt;/p&gt;
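
&lt;p&gt;As an aside, the same pointer arithmetic can be tried from outside an object. This is a standalone sketch, not part of &lt;code&gt;MAObject&lt;/code&gt;: the &lt;code&gt;Sample&lt;/code&gt; class is hypothetical, the property is assumed to be backed by a synthesized instance variable named &lt;code&gt;_x&lt;/code&gt;, and the &lt;code&gt;__bridge&lt;/code&gt; cast assumes the file is compiled with ARC.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;
    #import &amp;lt;objc/runtime.h&amp;gt;

    // Hypothetical class for illustration; the property is backed by an ivar named _x.
    @interface Sample : NSObject
    @property (nonatomic) int x;
    @end

    @implementation Sample
    @end

    int main(void)
    {
        @autoreleasepool
        {
            Sample *sample = [[Sample alloc] init];
            [sample setX: 7];

            Ivar ivar = class_getInstanceVariable([Sample class], "_x");
            ptrdiff_t offset = ivar_getOffset(ivar);

            // Cast to char * so that adding the offset moves by bytes.
            char *ptr = (char *)(__bridge void *)sample;
            ptr += offset;

            NSLog(@"value %d, offset %td, type %s", *(int *)ptr, offset, ivar_getTypeEncoding(ivar));
        }
        return 0;
    }
&lt;/pre&gt;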

&lt;p&gt;We also need to know the type of the variable:&lt;/p&gt;

&lt;pre&gt;            const char *type = ivar_getTypeEncoding(ivar);
&lt;/pre&gt;

&lt;p&gt;If the type is an object or class, then it just extracts the value directly and returns it:&lt;/p&gt;

&lt;pre&gt;            if(type[0] == @encode(id)[0] || type[0] == @encode(Class)[0])
            {
                return *(id *)ptr;
            }
&lt;/pre&gt;

&lt;p&gt;Otherwise, it falls back to special cases again. This code uses a slightly different &lt;code&gt;CASE&lt;/code&gt; macro. This one checks the type and then extracts the value from &lt;code&gt;ptr&lt;/code&gt; if there's a match:&lt;/p&gt;

&lt;pre&gt;            else
            {
                #define CASE(ctype, selectorpart) \
                    if(strcmp(type, @encode(ctype)) == 0) \
                        return [NSNumber numberWith ## selectorpart: *(ctype *)ptr];
&lt;/pre&gt;

&lt;p&gt;Once again, there's a long list of supported types:&lt;/p&gt;

&lt;pre&gt;                CASE(char, Char);
                CASE(unsigned char, UnsignedChar);
                CASE(short, Short);
                CASE(unsigned short, UnsignedShort);
                CASE(int, Int);
                CASE(unsigned int, UnsignedInt);
                CASE(long, Long);
                CASE(unsigned long, UnsignedLong);
                CASE(long long, LongLong);
                CASE(unsigned long long, UnsignedLongLong);
                CASE(float, Float);
                CASE(double, Double);
&lt;/pre&gt;

&lt;p&gt;Followed by macro cleanup:&lt;/p&gt;

&lt;pre&gt;                #undef CASE
&lt;/pre&gt;

&lt;p&gt;This code falls back to creating a generic &lt;code&gt;NSValue&lt;/code&gt; with the contents of &lt;code&gt;ptr&lt;/code&gt; if there's no match. Because the data is already laid out in memory, it's trivial to have a fallback here, rather than just throwing an exception like the getter code above does:&lt;/p&gt;

&lt;pre&gt;                return [NSValue valueWithBytes: ptr objCType: type];
            }
        }
&lt;/pre&gt;

&lt;p&gt;Finally, if no getter or instance variable was found, the method throws an exception. The dummy return statement at the end is just to ensure that the compiler doesn't complain about not returning a value:&lt;/p&gt;

&lt;pre&gt;        [NSException raise: NSInternalInconsistencyException format: @"Class %@ is not key-value compliant for key %@", [isa description], key];
        return nil;
    }
&lt;/pre&gt;

&lt;p&gt;That takes care of &lt;code&gt;valueForKey:&lt;/code&gt;.&lt;/p&gt;
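
&lt;p&gt;To see the getter path in isolation, here is a condensed sketch that performs the same three steps (fetch the signature, check the return type, call through a cast &lt;code&gt;IMP&lt;/code&gt;) on an ordinary &lt;code&gt;NSString&lt;/code&gt;. It handles only the single type it expects, &lt;code&gt;NSUInteger&lt;/code&gt;, and is an illustration rather than a replacement for the code above.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    int main(void)
    {
        @autoreleasepool
        {
            NSString *string = @"hello";
            SEL getterSEL = NSSelectorFromString(@"length");

            // Fetch the return type encoding and the implementation.
            NSMethodSignature *sig = [string methodSignatureForSelector: getterSEL];
            const char *type = [sig methodReturnType];
            IMP imp = [string methodForSelector: getterSEL];

            // length returns NSUInteger, so the IMP is cast to a matching function pointer type.
            if(strcmp(type, @encode(NSUInteger)) == 0)
            {
                NSUInteger length = ((NSUInteger (*)(id, SEL))imp)(string, getterSEL);
                NSLog(@"boxed: %@", [NSNumber numberWithUnsignedInteger: length]);
            }
            else
            {
                NSLog(@"unexpected return type %s", type);
            }
        }
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;Calling through a function pointer of the wrong type is undefined behavior, which is why every supported return type needs its own case.&lt;/p&gt;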

&lt;p&gt;&lt;b&gt;setValue:forKey:&lt;/b&gt;&lt;br&gt;The &lt;code&gt;setValue:forKey:&lt;/code&gt; method works similarly, but there are some differences due to the fact that it has to set values rather than retrieve them.&lt;/p&gt;

&lt;p&gt;The first thing it does is construct the name of the setter method to search for. &lt;code&gt;valueForKey:&lt;/code&gt; can simply translate the key directly to a selector, but this method needs to do a bit of work. The setter method is generated by capitalizing the first letter of the key, then adding "set" to the beginning, and a colon at the end:&lt;/p&gt;

&lt;pre&gt;    - (void)setValue: (id)value forKey: (NSString *)key
    {
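        // Uppercase only the first letter (assumes a non-empty key); capitalizedString would also lowercase the rest.
        NSString *firstLetter = [[key substringToIndex: 1] uppercaseString];
        NSString *setterName = [NSString stringWithFormat: @"set%@%@:", firstLetter, [key substringFromIndex: 1]];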
&lt;/pre&gt;

&lt;p&gt;It then turns that into a selector and checks to see if the object responds:&lt;/p&gt;

&lt;pre&gt;        SEL setterSEL = NSSelectorFromString(setterName);
        if([self respondsToSelector: setterSEL])
        {
&lt;/pre&gt;

&lt;p&gt;If it does, it fetches the method's argument type and &lt;code&gt;IMP&lt;/code&gt; much like the getter code above:&lt;/p&gt;

&lt;pre&gt;            NSMethodSignature *sig = [self methodSignatureForSelector: setterSEL];
            char type = [sig getArgumentTypeAtIndex: 2][0];
            IMP imp = [self methodForSelector: setterSEL];
&lt;/pre&gt;

&lt;p&gt;If the type is an object or class, it simply calls the setter, passing &lt;code&gt;value&lt;/code&gt;, and returns:&lt;/p&gt;

&lt;pre&gt;            if(type == @encode(id)[0] || type == @encode(Class)[0])
            {
                ((void (*)(id, SEL, id))imp)(self, setterSEL, value);
                return;
            }
&lt;/pre&gt;

&lt;p&gt;Otherwise, it's once again time for a &lt;code&gt;CASE&lt;/code&gt; macro. This one calls the &lt;code&gt;IMP&lt;/code&gt;, passing &lt;code&gt;[value typeValue]&lt;/code&gt; as the parameter, when a match is found:&lt;/p&gt;

&lt;pre&gt;            else
            {
                #define CASE(ctype, selectorpart) \
                    if(type == @encode(ctype)[0]) { \
                        ((void (*)(id, SEL, ctype))imp)(self, setterSEL, [value selectorpart ## Value]); \
                        return; \
                    }
&lt;/pre&gt;

&lt;p&gt;Here is the big list of cases:&lt;/p&gt;

&lt;pre&gt;                CASE(char, char);
                CASE(unsigned char, unsignedChar);
                CASE(short, short);
                CASE(unsigned short, unsignedShort);
                CASE(int, int);
                CASE(unsigned int, unsignedInt);
                CASE(long, long);
                CASE(unsigned long, unsignedLong);
                CASE(long long, longLong);
                CASE(unsigned long long, unsignedLongLong);
                CASE(float, float);
                CASE(double, double);
&lt;/pre&gt;

&lt;p&gt;Followed by macro cleanup:&lt;/p&gt;

&lt;pre&gt;                #undef CASE
&lt;/pre&gt;

&lt;p&gt;Last, if the type is unknown, it throws an exception:&lt;/p&gt;

&lt;pre&gt;                [NSException raise: NSInternalInconsistencyException format: @"Class %@ key %@ set from incompatible object %@", [isa description], key, value];
            }
        }
&lt;/pre&gt;

&lt;p&gt;If no setter method is found, then it searches for instance variables. No string manipulation is needed, since the instance variable's name doesn't change the way the setter's name does. This code does the same check for instance variables with a leading underscore:&lt;/p&gt;

&lt;pre&gt;        Ivar ivar = class_getInstanceVariable(isa, [key UTF8String]);
        if(!ivar)
            ivar = class_getInstanceVariable(isa, [[@"_" stringByAppendingString: key] UTF8String]);
&lt;/pre&gt;

&lt;p&gt;If the instance variable exists, it creates a pointer to it and gets its type just like &lt;code&gt;valueForKey:&lt;/code&gt; does:&lt;/p&gt;

&lt;pre&gt;        if(ivar)
        {
            ptrdiff_t offset = ivar_getOffset(ivar);
            char *ptr = (char *)self;
            ptr += offset;

            const char *type = ivar_getTypeEncoding(ivar);
&lt;/pre&gt;

&lt;p&gt;If the variable is an object or class pointer, the code can set it directly. Well, nearly directly. There's a minor &lt;code&gt;retain&lt;/code&gt; &lt;code&gt;release&lt;/code&gt; dance to be done in order to ensure that memory management is correct:&lt;/p&gt;

&lt;pre&gt;            if(type[0] == @encode(id)[0] || type[0] == @encode(Class)[0])
            {
                value = [value retain];
                [*(id *)ptr release];
                *(id *)ptr = value;
                return;
            }
&lt;/pre&gt;

&lt;p&gt;Otherwise, &lt;code&gt;value&lt;/code&gt; is boxed, and the primitive value needs to be extracted. If &lt;code&gt;value&lt;/code&gt; is an &lt;code&gt;NSValue&lt;/code&gt; with the &lt;em&gt;exact&lt;/em&gt; same type as the instance variable, the &lt;code&gt;getValue:&lt;/code&gt; method can be used to simply copy the value over directly:&lt;/p&gt;

&lt;pre&gt;            else if(strcmp([value objCType], type) == 0)
            {
                [value getValue: ptr];
                return;
            }
&lt;/pre&gt;

&lt;p&gt;If that doesn't work, it's time to fall back to the last long list of cases. This version of the &lt;code&gt;CASE&lt;/code&gt; macro sets the value at &lt;code&gt;ptr&lt;/code&gt;, appropriately cast, to &lt;code&gt;[value typeValue]&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;            else
            {
                #define CASE(ctype, selectorpart) \
                    if(strcmp(type, @encode(ctype)) == 0) { \
                        *(ctype *)ptr = [value selectorpart ## Value]; \
                        return; \
                    }
&lt;/pre&gt;

&lt;p&gt;The traditional exhaustive enumeration of primitive types follows:&lt;/p&gt;

&lt;pre&gt;                CASE(char, char);
                CASE(unsigned char, unsignedChar);
                CASE(short, short);
                CASE(unsigned short, unsignedShort);
                CASE(int, int);
                CASE(unsigned int, unsignedInt);
                CASE(long, long);
                CASE(unsigned long, unsignedLong);
                CASE(long long, longLong);
                CASE(unsigned long long, unsignedLongLong);
                CASE(float, float);
                CASE(double, double);
&lt;/pre&gt;

&lt;p&gt;Macro cleanup:&lt;/p&gt;

&lt;pre&gt;                #undef CASE
&lt;/pre&gt;

&lt;p&gt;Finally, if none of the cases were hit, throw an exception:&lt;/p&gt;

&lt;pre&gt;                [NSException raise: NSInternalInconsistencyException format: @"Class %@ key %@ set from incompatible object %@", [isa description], key, value];
            }
        }
&lt;/pre&gt;

&lt;p&gt;If neither setter method nor instance variable was found, throw an exception to complain:&lt;/p&gt;

&lt;pre&gt;        [NSException raise: NSInternalInconsistencyException format: @"Class %@ is not key-value compliant for key %@", [isa description], key];
    }
&lt;/pre&gt;
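
&lt;p&gt;One detail worth trying on its own is the setter name. The sketch below builds it by uppercasing only the first character of the key. The &lt;code&gt;SetterNameForKey&lt;/code&gt; helper is hypothetical and assumes a non-empty key. It also logs what &lt;code&gt;capitalizedString&lt;/code&gt; produces for comparison, since that method lowercases the remaining letters of each word.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    // Hypothetical helper: uppercases only the first character of a non-empty key.
    static NSString *SetterNameForKey(NSString *key)
    {
        NSString *first = [[key substringToIndex: 1] uppercaseString];
        return [NSString stringWithFormat: @"set%@%@:", first, [key substringFromIndex: 1]];
    }

    int main(void)
    {
        @autoreleasepool
        {
            NSString *key = @"timeZone";

            NSLog(@"%@", [key capitalizedString]);    // "Timezone"
            NSLog(@"%@", SetterNameForKey(key));      // "setTimeZone:"

            // NSCalendar declares setTimeZone:, so the constructed selector can be checked against it.
            SEL setterSEL = NSSelectorFromString(SetterNameForKey(key));
            NSLog(@"responds: %d", [[NSCalendar currentCalendar] respondsToSelector: setterSEL]);
        }
        return 0;
    }
&lt;/pre&gt;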

&lt;p&gt;&lt;b&gt;Key Paths&lt;/b&gt;&lt;br&gt;To round out the implementation of KVC, I'll implement &lt;code&gt;valueForKeyPath:&lt;/code&gt; and &lt;code&gt;setValue:forKeyPath:&lt;/code&gt; as well.&lt;/p&gt;

&lt;p&gt;The first thing that &lt;code&gt;valueForKeyPath:&lt;/code&gt; does is look for a &lt;code&gt;.&lt;/code&gt; in the key path. If it doesn't exist, then it's treated as a plain key and passed to &lt;code&gt;valueForKey:&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;    - (id)valueForKeyPath: (NSString *)keyPath
    {
        NSRange range = [keyPath rangeOfString: @"."];
        if(range.location == NSNotFound)
            return [self valueForKey: keyPath];
&lt;/pre&gt;

&lt;p&gt;Otherwise, the key is split into two pieces. The piece up to the &lt;code&gt;.&lt;/code&gt; is the local key, and the following piece is the remainder of the key path:&lt;/p&gt;

&lt;pre&gt;        NSString *key = [keyPath substringToIndex: range.location];
        NSString *rest = [keyPath substringFromIndex: NSMaxRange(range)];
&lt;/pre&gt;

&lt;p&gt;The key is passed to &lt;code&gt;valueForKey:&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;        id next = [self valueForKey: key];
&lt;/pre&gt;

&lt;p&gt;Then &lt;code&gt;valueForKeyPath:&lt;/code&gt; is sent recursively to the &lt;code&gt;next&lt;/code&gt; object:&lt;/p&gt;

&lt;pre&gt;        return [next valueForKeyPath: rest];
    }
&lt;/pre&gt;

&lt;p&gt;Its implementation will decompose &lt;code&gt;rest&lt;/code&gt; further until every &lt;code&gt;.&lt;/code&gt; is consumed. The result is a chain of &lt;code&gt;valueForKey:&lt;/code&gt; calls, returning the result of the very last call.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setValue:forKeyPath:&lt;/code&gt; works similarly. If there's no &lt;code&gt;.&lt;/code&gt; in the key path, call &lt;code&gt;setValue:forKey:&lt;/code&gt; and return:&lt;/p&gt;

&lt;pre&gt;    - (void)setValue: (id)value forKeyPath: (NSString *)keyPath
    {
        NSRange range = [keyPath rangeOfString: @"."];
        if(range.location == NSNotFound)
        {
            [self setValue: value forKey: keyPath];
            return;
        }
&lt;/pre&gt;

&lt;p&gt;Otherwise, extract the key and remainder:&lt;/p&gt;

&lt;pre&gt;        NSString *key = [keyPath substringToIndex: range.location];
        NSString *rest = [keyPath substringFromIndex: NSMaxRange(range)];
&lt;/pre&gt;

&lt;p&gt;Grab the next object using &lt;code&gt;valueForKey:&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;        id next = [self valueForKey: key];
&lt;/pre&gt;

&lt;p&gt;Then recursively send &lt;code&gt;setValue:forKeyPath:&lt;/code&gt; to &lt;code&gt;next&lt;/code&gt;, passing &lt;code&gt;rest&lt;/code&gt; as the key path:&lt;/p&gt;

&lt;pre&gt;        [next setValue: value forKeyPath: rest];
    }
&lt;/pre&gt;

&lt;p&gt;The result is a chain of &lt;code&gt;valueForKey:&lt;/code&gt; calls, culminating in a call to &lt;code&gt;setValue:forKey:&lt;/code&gt; on the last object in the chain.&lt;/p&gt;
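
&lt;p&gt;Nested dictionaries make a convenient playground for key paths. This sketch uses Cocoa's own KVC rather than the code above, with made-up keys and values, and shows the first split of the key path followed by a read and a write through the whole chain.&lt;/p&gt;

&lt;pre&gt;    #import &amp;lt;Foundation/Foundation.h&amp;gt;

    int main(void)
    {
        @autoreleasepool
        {
            NSMutableDictionary *inner = [NSMutableDictionary dictionaryWithObject: @"old" forKey: @"baz"];
            NSDictionary *middle = [NSDictionary dictionaryWithObject: inner forKey: @"bar"];
            NSDictionary *outer = [NSDictionary dictionaryWithObject: middle forKey: @"foo"];

            // The first split of the key path: the local key and the remainder.
            NSString *keyPath = @"foo.bar.baz";
            NSRange range = [keyPath rangeOfString: @"."];
            NSLog(@"key = %@, rest = %@", [keyPath substringToIndex: range.location], [keyPath substringFromIndex: NSMaxRange(range)]);

            // Reading walks foo, then bar, then baz.
            NSLog(@"%@", [outer valueForKeyPath: keyPath]);

            // Writing walks foo and bar by reading, then sets baz on the innermost dictionary.
            [outer setValue: @"new" forKeyPath: keyPath];
            NSLog(@"%@", [inner objectForKey: @"baz"]);
        }
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;Only the innermost dictionary needs to be mutable, because every step before the last is a read.&lt;/p&gt;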

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;You can now see how key-value coding works on the inside. There isn't anything particularly complicated. It's largely just a long list of different things to try. Cocoa's implementation is a bit smarter, and can leverage things like &lt;code&gt;NSInvocation&lt;/code&gt; for more comprehensive coverage, but that's the basic idea. A large part of &lt;code&gt;NSInvocation&lt;/code&gt; is simply baked-in knowledge of all the different cases that need to be handled as well.&lt;/p&gt;

&lt;p&gt;That's it for today. May you code your keys and values in peace. Until next time, since Friday Q&amp;amp;A is driven by reader suggestions, please &lt;a href="mailto:mike@mikeash.com"&gt;send in your ideas for topics&lt;/a&gt;!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-02-08-lets-build-key-value-coding.html</guid><pubDate>Fri, 08 Feb 2013 14:17:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-01-25: Let's Build NSObject
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-01-25-lets-build-nsobject.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-01-25: Let's Build NSObject
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 01 25  15 32"
                  tags="fridayqna letsbuild objectivec"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-01-25: Let's Build NSObject
&lt;/div&gt;
              &lt;p&gt;The &lt;code&gt;NSObject&lt;/code&gt; class lies at the root of (almost) all classes we build and use as part of Cocoa programming. What does it actually do, though, and how does it do it? Today, I'm going to rebuild &lt;code&gt;NSObject&lt;/code&gt; from scratch, as suggested by friend of the blog and occasional guest author Gwynne Raskind.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Components of a Root Class&lt;/b&gt;&lt;br&gt;What exactly does a root class do? In terms of Objective-C itself, there is precisely one requirement: the root class's first instance variable must be &lt;code&gt;isa&lt;/code&gt;, which is a pointer to the object's class. The &lt;code&gt;isa&lt;/code&gt; is used to figure out what class an object is when dispatching messages. That's all there has to be, from a strict language standpoint.&lt;/p&gt;

&lt;p&gt;A root class that only provides that wouldn't be very useful, of course. &lt;code&gt;NSObject&lt;/code&gt; provides a lot more. The functionality it provides can be broken down into three categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Memory management:&lt;/strong&gt; standard memory management methods like &lt;code&gt;retain&lt;/code&gt; and &lt;code&gt;release&lt;/code&gt; are implemented in &lt;code&gt;NSObject&lt;/code&gt;. The &lt;code&gt;alloc&lt;/code&gt; method is also implemented there.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Introspection:&lt;/strong&gt; &lt;code&gt;NSObject&lt;/code&gt; provides a bunch of methods that are essentially wrappers around Objective-C runtime functionality, such as &lt;code&gt;class&lt;/code&gt;, &lt;code&gt;respondsToSelector:&lt;/code&gt;, and &lt;code&gt;isKindOfClass:&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default implementations of miscellaneous methods:&lt;/strong&gt; there are a bunch of methods that we count on every object implementing, such as &lt;code&gt;isEqual:&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;. In order to ensure that every object has an implementation, &lt;code&gt;NSObject&lt;/code&gt; provides a default implementation that every subclass gets if it doesn't bring its own.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;b&gt;Code&lt;/b&gt;&lt;br&gt;I'll be reimplementing &lt;code&gt;NSObject&lt;/code&gt; functionality as &lt;code&gt;MAObject&lt;/code&gt;. I've posted the full code for this article on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mikeash/MAObject"&gt;https://github.com/mikeash/MAObject&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that this code is built without ARC. Although ARC is great and should be used whenever possible, it really gets in the way when implementing a root class, because a root class needs to implement memory management and ARC prefers that you leave memory management up to the compiler.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Instance Variables&lt;/b&gt;&lt;br&gt;&lt;code&gt;MAObject&lt;/code&gt; has two instance variables. The first is the &lt;code&gt;isa&lt;/code&gt; pointer. The second is the object's reference count:&lt;/p&gt;

&lt;pre&gt;    @implementation MAObject {
        Class isa;
        volatile int32_t retainCount;
    }
&lt;/pre&gt;

&lt;p&gt;The reference count will be managed using functions from &lt;code&gt;OSAtomic.h&lt;/code&gt; to ensure thread safety, which is why it has a somewhat unusual definition rather than just using &lt;code&gt;NSUInteger&lt;/code&gt; or similar.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NSObject&lt;/code&gt; actually holds reference counts externally. There's a global table which maps an object's address to its reference count. This saves memory, because the table represents the common reference count of &lt;code&gt;1&lt;/code&gt; by not having an entry in the table at all. However, this technique is complex and a bit slow, so I opted not to follow it for my own version.&lt;/p&gt;
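
&lt;p&gt;As a rough picture of the side-table idea, here is a minimal C sketch. It is only an illustration, not the runtime's actual data structure: the names &lt;code&gt;side_retain&lt;/code&gt;, &lt;code&gt;side_release&lt;/code&gt;, and &lt;code&gt;side_count&lt;/code&gt; are hypothetical, the table is a small fixed-size array searched linearly, and there is no locking. What it shows is that an object with no entry is treated as having a reference count of &lt;code&gt;1&lt;/code&gt;:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical side table: one slot per object that has EXTRA retains.
   An object with no slot has the implicit reference count of 1. */
#define TABLE_SLOTS 64

struct slot { const void *object; unsigned extra; };
static struct slot table[TABLE_SLOTS];

static struct slot *find_slot(const void *object)
{
    for (size_t i = 0; i < TABLE_SLOTS; i++)
        if (table[i].object == object)
            return &table[i];
    return NULL;
}

/* Current reference count: 1 plus any extra retains recorded. */
unsigned side_count(const void *object)
{
    struct slot *s = find_slot(object);
    return s ? 1 + s->extra : 1;
}

/* Record one more retain, claiming a free slot on first use. */
void side_retain(const void *object)
{
    struct slot *s = find_slot(object);
    if (!s) {
        s = find_slot(NULL);          /* an empty slot has object == NULL */
        assert(s != NULL);            /* sketch only: the table is fixed-size */
        s->object = object;
        s->extra = 0;
    }
    s->extra++;
}

/* Returns 1 when the caller should destroy the object, 0 otherwise. */
int side_release(const void *object)
{
    struct slot *s = find_slot(object);
    if (!s)
        return 1;                     /* count was the implicit 1 */
    if (--s->extra == 0)
        s->object = NULL;             /* back to the implicit count of 1 */
    return 0;
}
```

&lt;p&gt;With this layout, an object that is retained twice and released twice ends up with no entry at all, and one further call to &lt;code&gt;side_release&lt;/code&gt; reports that it should be destroyed.&lt;/p&gt;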

&lt;p&gt;&lt;b&gt;Memory Management&lt;/b&gt;&lt;br&gt;The first thing that &lt;code&gt;MAObject&lt;/code&gt; needs to be able to do is to create instances. This is done by implementing the &lt;code&gt;+alloc&lt;/code&gt; method. (I'm skipping the deprecated and rarely used &lt;code&gt;+allocWithZone:&lt;/code&gt;, which these days does the same thing and ignores its parameter anyway.)&lt;/p&gt;

&lt;p&gt;Subclasses rarely override &lt;code&gt;+alloc&lt;/code&gt;, and rely on the root class to allocate memory for them. That means that &lt;code&gt;MAObject&lt;/code&gt; needs to be able to allocate instances not only of &lt;code&gt;MAObject&lt;/code&gt;, but of any subclass. This is done by taking advantage of the fact that the value of &lt;code&gt;self&lt;/code&gt; in a class method is the class the message was actually sent to. If code does &lt;code&gt;[SomeSubclass alloc]&lt;/code&gt;, then &lt;code&gt;self&lt;/code&gt; holds a pointer to &lt;code&gt;SomeSubclass&lt;/code&gt;. That class can then be used to query the runtime to figure out how much memory to allocate, and to set the &lt;code&gt;isa&lt;/code&gt; pointer correctly. The retain count is also initialized to &lt;code&gt;1&lt;/code&gt;, as suits a newly allocated object:&lt;/p&gt;

&lt;pre&gt;    + (id)alloc
    {
        MAObject *obj = calloc(1, class_getInstanceSize(self));
        obj-&amp;gt;isa = self;
        obj-&amp;gt;retainCount = 1;
        return obj;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;retain&lt;/code&gt; method simply uses &lt;code&gt;OSAtomicIncrement32&lt;/code&gt; to bump up the retain count, and returns &lt;code&gt;self&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (id)retain
    {
        OSAtomicIncrement32(&amp;amp;retainCount);
        return self;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;release&lt;/code&gt; method does a bit more. It first decrements the retain count. If the retain count was decremented to &lt;code&gt;0&lt;/code&gt;, then the object needs to be destroyed, so the code calls &lt;code&gt;dealloc&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (oneway void)release
    {
        int32_t newCount = OSAtomicDecrement32(&amp;amp;retainCount);
        if(newCount == 0)
            [self dealloc];
    }
&lt;/pre&gt;

&lt;p&gt;The implementation of &lt;code&gt;autorelease&lt;/code&gt; calls &lt;code&gt;NSAutoreleasePool&lt;/code&gt; to add &lt;code&gt;self&lt;/code&gt; to the current autorelease pool. Autorelease pools are part of the runtime these days, so this is a somewhat indirect route, but the autorelease APIs in the runtime are private, so this is the best we can do for now:&lt;/p&gt;

&lt;pre&gt;    - (id)autorelease
    {
        [NSAutoreleasePool addObject: self];
        return self;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;retainCount&lt;/code&gt; method simply returns the value held in the ivar:&lt;/p&gt;

&lt;pre&gt;    - (NSUInteger)retainCount
    {
        return retainCount;
    }
&lt;/pre&gt;

&lt;p&gt;Finally, there's the &lt;code&gt;dealloc&lt;/code&gt; method. In normal classes, &lt;code&gt;dealloc&lt;/code&gt; needs to clean up any instance variables and then call &lt;code&gt;super&lt;/code&gt;. The root class has to actually dispose of the memory occupied by the object itself. In this case, it's just a simple call to &lt;code&gt;free&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (void)dealloc
    {
        free(self);
    }
&lt;/pre&gt;
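
&lt;p&gt;The same allocate, retain, release, and destroy lifecycle can be sketched in portable C. This is a hedged illustration rather than the &lt;code&gt;MAObject&lt;/code&gt; code: it swaps the &lt;code&gt;OSAtomic.h&lt;/code&gt; calls for C11 &lt;code&gt;stdatomic.h&lt;/code&gt;, and &lt;code&gt;struct object&lt;/code&gt;, the &lt;code&gt;object_*&lt;/code&gt; functions, and the &lt;code&gt;destroyed_objects&lt;/code&gt; counter are hypothetical names. One difference to watch for: &lt;code&gt;atomic_fetch_sub&lt;/code&gt; returns the value from before the decrement, whereas &lt;code&gt;OSAtomicDecrement32&lt;/code&gt; returns the new value:&lt;/p&gt;

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object header: just a reference count. */
struct object {
    atomic_int retain_count;
};

static int destroyed_objects;   /* bookkeeping for the sketch only */

struct object *object_alloc(void)
{
    struct object *obj = calloc(1, sizeof *obj);
    assert(obj != NULL);
    atomic_init(&obj->retain_count, 1);   /* new objects start at 1 */
    return obj;
}

struct object *object_retain(struct object *obj)
{
    atomic_fetch_add(&obj->retain_count, 1);
    return obj;
}

void object_release(struct object *obj)
{
    /* fetch_sub returns the value BEFORE the decrement, so 1 means the
       count just reached 0 and this caller must destroy the object. */
    if (atomic_fetch_sub(&obj->retain_count, 1) == 1) {
        destroyed_objects++;
        free(obj);
    }
}
```

&lt;p&gt;Because the decrement and the zero check are a single atomic step, only one thread can ever observe the transition to zero, so the object is freed exactly once.&lt;/p&gt;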

&lt;p&gt;There are a couple of helper methods as well. &lt;code&gt;NSObject&lt;/code&gt; provides a do-nothing &lt;code&gt;init&lt;/code&gt; method for consistency, so that subclasses can always call &lt;code&gt;[super init]&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (id)init
    {
        return self;
    }
&lt;/pre&gt;

&lt;p&gt;There's also a &lt;code&gt;new&lt;/code&gt; method, which is just a wrapper around &lt;code&gt;alloc&lt;/code&gt; and &lt;code&gt;init&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    + (id)new
    {
        return [[self alloc] init];
    }
&lt;/pre&gt;

&lt;p&gt;There's also an empty &lt;code&gt;finalize&lt;/code&gt; method. &lt;code&gt;NSObject&lt;/code&gt; implements this as part of its garbage collection support. &lt;code&gt;MAObject&lt;/code&gt; doesn't support garbage collection in the first place, but I included this just because &lt;code&gt;NSObject&lt;/code&gt; has it:&lt;/p&gt;

&lt;pre&gt;    - (void)finalize
    {
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Introspection&lt;/b&gt;&lt;br&gt;Many of the introspection methods are just wrappers around runtime functions. Since that's not too interesting, I'll give a brief discussion of what the runtime function is doing behind the scenes as well.&lt;/p&gt;

&lt;p&gt;The simplest introspection method is &lt;code&gt;class&lt;/code&gt;, which just returns the value of &lt;code&gt;isa&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (Class)class
    {
        return isa;
    }
&lt;/pre&gt;

&lt;p&gt;Technically, this method will fail on tagged pointers. A proper implementation should call &lt;code&gt;object_getClass&lt;/code&gt;, which behaves correctly for tagged pointers, and extracts the &lt;code&gt;isa&lt;/code&gt; from a normal pointer. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;superclass&lt;/code&gt; instance method is equivalent to just invoking the &lt;code&gt;superclass&lt;/code&gt; class method on the object's class, so that's exactly what the method does:&lt;/p&gt;

&lt;pre&gt;    - (Class)superclass
    {
        return [[self class] superclass];
    }
&lt;/pre&gt;

&lt;p&gt;There are also class methods for these. The &lt;code&gt;+class&lt;/code&gt; method just returns &lt;code&gt;self&lt;/code&gt;, which is the class object. This is a little weird, but it's how &lt;code&gt;NSObject&lt;/code&gt; does things. &lt;code&gt;[obj class]&lt;/code&gt; returns the object's class, but &lt;code&gt;[MyClass class]&lt;/code&gt; just returns a pointer to &lt;code&gt;MyClass&lt;/code&gt; itself. It's not consistent, as &lt;code&gt;MyClass&lt;/code&gt; also has a class, which is the &lt;code&gt;MyClass&lt;/code&gt; metaclass, but it's how things are done:&lt;/p&gt;

&lt;pre&gt;    + (Class)class
    {
        return self;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;+superclass&lt;/code&gt; method does what it says. This is implemented by calling &lt;code&gt;class_getSuperclass&lt;/code&gt;, which just grovels around inside the class structure maintained by the runtime and pulls out the pointer to the superclass.&lt;/p&gt;

&lt;pre&gt;    + (Class)superclass
    {
        return class_getSuperclass(self);
    }
&lt;/pre&gt;

&lt;p&gt;There are also methods for querying whether an object's class matches a particular class. The simple one is &lt;code&gt;isMemberOfClass:&lt;/code&gt;, which does a strict check, ignoring subclasses. Its implementation is simple:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)isMemberOfClass: (Class)aClass
    {
        return isa == aClass;
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;isKindOfClass:&lt;/code&gt; method checks subclasses too, so that &lt;code&gt;[subclassInstance isKindOfClass: [Superclass class]]&lt;/code&gt; returns &lt;code&gt;YES&lt;/code&gt;. The output of this method is essentially the same as that of the class method &lt;code&gt;isSubclassOfClass:&lt;/code&gt;, so it just calls through:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)isKindOfClass: (Class)aClass
    {
        return [isa isSubclassOfClass: aClass];
    }
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;isSubclassOfClass:&lt;/code&gt; method itself is a bit more interesting. Starting from &lt;code&gt;self&lt;/code&gt;, it walks up the class hierarchy, comparing with the target class at each level. If it finds a match, it returns &lt;code&gt;YES&lt;/code&gt;. If it runs off the top of the class hierarchy without ever finding a match, it returns &lt;code&gt;NO&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    + (BOOL)isSubclassOfClass: (Class)aClass
    {
        for(Class candidate = self; candidate != nil; candidate = [candidate superclass])
            if (candidate == aClass)
                return YES;

        return NO;
    }
&lt;/pre&gt;

&lt;p&gt;It's interesting to note that this check is not particularly efficient. If you call this method on a class that's deep in the class hierarchy, it can take a lot of loop iterations before it returns &lt;code&gt;NO&lt;/code&gt;. Because of that, &lt;code&gt;isKindOfClass:&lt;/code&gt; checks can be quite a lot slower than message sends, and can actually be substantial bottlenecks in certain cases. Just one more reason to avoid them when possible.&lt;/p&gt;
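
&lt;p&gt;To see where the cost comes from, here is a small C model of the walk. It assumes a hypothetical &lt;code&gt;struct class_info&lt;/code&gt; that holds nothing but a name and a superclass link, and an invented four-level hierarchy; the &lt;code&gt;steps&lt;/code&gt; counter exists only to make the number of comparisons visible:&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down class record: only a superclass link. */
struct class_info {
    const char *name;
    const struct class_info *superclass;   /* NULL for a root class */
};

/* Walk from cls toward the root, counting comparisons in *steps. */
bool is_subclass_of(const struct class_info *cls,
                    const struct class_info *target, int *steps)
{
    *steps = 0;
    for (const struct class_info *c = cls; c != NULL; c = c->superclass) {
        (*steps)++;
        if (c == target)
            return true;
    }
    return false;
}

/* Example hierarchy: Root <- View <- Control <- Button, plus an
   unrelated String class that also descends from Root. */
static const struct class_info root_class    = { "Root", NULL };
static const struct class_info view_class    = { "View", &root_class };
static const struct class_info control_class = { "Control", &view_class };
static const struct class_info button_class  = { "Button", &control_class };
static const struct class_info string_class  = { "String", &root_class };
```

&lt;p&gt;Asking whether &lt;code&gt;button_class&lt;/code&gt; is a subclass of itself takes one comparison, while asking about the unrelated &lt;code&gt;string_class&lt;/code&gt; takes four before the walk falls off the top. A negative answer always costs the full depth of the hierarchy.&lt;/p&gt;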

&lt;p&gt;The &lt;code&gt;respondsToSelector:&lt;/code&gt; method just calls through to the runtime function &lt;code&gt;class_respondsToSelector&lt;/code&gt;. That, in turn, looks up the selector in the class's method table to see if it has an entry:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)respondsToSelector: (SEL)aSelector
    {
        return class_respondsToSelector(isa, aSelector);
    }
&lt;/pre&gt;

&lt;p&gt;There's a class method, &lt;code&gt;instancesRespondToSelector:&lt;/code&gt;, which is nearly identical. The only difference is passing &lt;code&gt;self&lt;/code&gt;, which is the class in this context, rather than &lt;code&gt;isa&lt;/code&gt;, which would be the metaclass here:&lt;/p&gt;

&lt;pre&gt;    + (BOOL)instancesRespondToSelector: (SEL)aSelector
    {
        return class_respondsToSelector(self, aSelector);
    }
&lt;/pre&gt;

&lt;p&gt;There are also two &lt;code&gt;conformsToProtocol:&lt;/code&gt; methods, one for instances and one for classes. These also just wrap a runtime function, which in this case just consults a table of every protocol that the class conforms to in order to see if the given protocol is present:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)conformsToProtocol: (Protocol *)aProtocol
    {
        return class_conformsToProtocol(isa, aProtocol);
    }

    + (BOOL)conformsToProtocol: (Protocol *)protocol
    {
        return class_conformsToProtocol(self, protocol);
    }
&lt;/pre&gt;

&lt;p&gt;Next is &lt;code&gt;methodForSelector:&lt;/code&gt;, and its classy cousin &lt;code&gt;instanceMethodForSelector:&lt;/code&gt;. These both call through to &lt;code&gt;class_getMethodImplementation&lt;/code&gt;, which looks up the selector in the class's method table and returns the corresponding &lt;code&gt;IMP&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (IMP)methodForSelector: (SEL)aSelector
    {
        return class_getMethodImplementation(isa, aSelector);
    }

    + (IMP)instanceMethodForSelector: (SEL)aSelector
    {
        return class_getMethodImplementation(self, aSelector);
    }
&lt;/pre&gt;

&lt;p&gt;An interesting aspect of these methods is that &lt;code&gt;class_getMethodImplementation&lt;/code&gt; always returns an &lt;code&gt;IMP&lt;/code&gt;, even for unknown selectors. When the class doesn't actually implement a method, it returns a special forwarding &lt;code&gt;IMP&lt;/code&gt; which wraps up the message arguments and starts down the path to invoking &lt;code&gt;forwardInvocation:&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;methodSignatureForSelector:&lt;/code&gt; method just wraps the equivalent class method:&lt;/p&gt;

&lt;pre&gt;    - (NSMethodSignature *)methodSignatureForSelector: (SEL)aSelector
    {
        return [isa instanceMethodSignatureForSelector: aSelector];
    }
&lt;/pre&gt;

&lt;p&gt;The class method in turn wraps some runtime calls. It first fetches the &lt;code&gt;Method&lt;/code&gt; for the given selector. If it can't be found, then the class doesn't implement that method, and this code returns &lt;code&gt;nil&lt;/code&gt;. Otherwise, it extracts the C string representing the method's types, and wraps that in an &lt;code&gt;NSMethodSignature&lt;/code&gt; object:&lt;/p&gt;

&lt;pre&gt;    + (NSMethodSignature *)instanceMethodSignatureForSelector: (SEL)aSelector
    {
        Method method = class_getInstanceMethod(self, aSelector);
        if(!method)
            return nil;

        const char *types = method_getTypeEncoding(method);
        return [NSMethodSignature signatureWithObjCTypes: types];
    }
&lt;/pre&gt;

&lt;p&gt;Finally, there's &lt;code&gt;performSelector:&lt;/code&gt;, and the two &lt;code&gt;withObject:&lt;/code&gt; variants that take arguments. These aren't strictly introspection, but they fall in the same general category of wrapping lower-level runtime functionality. They simply retrieve the &lt;code&gt;IMP&lt;/code&gt; for the given selector, cast it to the appropriate function pointer type, and call it:&lt;/p&gt;

&lt;pre&gt;    - (id)performSelector: (SEL)aSelector
    {
        IMP imp = [self methodForSelector: aSelector];
        return ((id (*)(id, SEL))imp)(self, aSelector);
    }

    - (id)performSelector: (SEL)aSelector withObject: (id)object
    {
        IMP imp = [self methodForSelector: aSelector];
        return ((id (*)(id, SEL, id))imp)(self, aSelector, object);
    }

    - (id)performSelector: (SEL)aSelector withObject: (id)object1 withObject: (id)object2
    {
        IMP imp = [self methodForSelector: aSelector];
        return ((id (*)(id, SEL, id, id))imp)(self, aSelector, object1, object2);
    }
&lt;/pre&gt;
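
&lt;p&gt;The look-up, cast, and call pattern does not depend on Objective-C. The following C sketch models it with a hypothetical method table keyed by selector name; &lt;code&gt;generic_fn&lt;/code&gt; plays the role of &lt;code&gt;IMP&lt;/code&gt;, and &lt;code&gt;lookup&lt;/code&gt;, &lt;code&gt;perform&lt;/code&gt;, and &lt;code&gt;perform_with_int&lt;/code&gt; are invented names, not runtime functions:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic "implementation" type, playing the role of IMP. */
typedef void (*generic_fn)(void);

struct method { const char *selector; generic_fn imp; };

/* Two "methods" with different real signatures. */
static int answer(void *self) { (void)self; return 42; }
static int add_to(void *self, int n) { return *(int *)self + n; }

static const struct method methods[] = {
    { "answer", (generic_fn)answer },
    { "addTo:", (generic_fn)add_to },
};

/* Look up the implementation for a selector name, or NULL if absent. */
generic_fn lookup(const char *selector)
{
    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++)
        if (strcmp(methods[i].selector, selector) == 0)
            return methods[i].imp;
    return NULL;
}

/* Cast back to the true signature before calling. */
int perform(void *self, const char *selector)
{
    generic_fn imp = lookup(selector);
    assert(imp != NULL);
    return ((int (*)(void *))imp)(self);
}

int perform_with_int(void *self, const char *selector, int n)
{
    generic_fn imp = lookup(selector);
    assert(imp != NULL);
    return ((int (*)(void *, int))imp)(self, n);
}
```

&lt;p&gt;The cast is what makes the call correct: the stored pointer has a generic type, and calling it through anything other than the function's real signature is undefined behavior. That is why each &lt;code&gt;performSelector:&lt;/code&gt; variant casts to a function pointer type matching its own argument count.&lt;/p&gt;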

&lt;p&gt;&lt;b&gt;Default Implementations&lt;/b&gt;&lt;br&gt;&lt;code&gt;MAObject&lt;/code&gt; provides default implementations of a bunch of methods. We'll start off with default implementations of &lt;code&gt;isEqual:&lt;/code&gt; and &lt;code&gt;hash&lt;/code&gt;, which just use the object's pointer for identity purposes:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)isEqual: (id)object
    {
        return self == object;
    }

    - (NSUInteger)hash
    {
        return (NSUInteger)self;
    }
&lt;/pre&gt;

&lt;p&gt;Any subclasses with a more expansive notion of equality will have to override these methods, but any subclass where an object is only ever equal to itself can just use these implementations.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;description&lt;/code&gt; method is another handy one to have a default implementation for. This implementation just generates a string of the form &lt;code&gt;&amp;lt;MAObject: 0xdeadbeef&amp;gt;&lt;/code&gt;, containing the object's class and pointer value:&lt;/p&gt;

&lt;pre&gt;    - (NSString *)description
    {
        return [NSString stringWithFormat: @"&amp;lt;%@: %p&amp;gt;", [self class], self];
    }
&lt;/pre&gt;

&lt;p&gt;The standard for classes is to just return the class name from their own &lt;code&gt;description&lt;/code&gt;, so there's a class method as well that fetches that name from the runtime and returns it:&lt;/p&gt;

&lt;pre&gt;    + (NSString *)description
    {
        return [NSString stringWithUTF8String: class_getName(self)];
    }
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;doesNotRecognizeSelector:&lt;/code&gt; is a lesser-known utility method. It throws an exception to make it look like the object doesn't actually respond to the given selector. This is useful for things like creating override points where subclasses have to implement a particular method:&lt;/p&gt;

&lt;pre&gt;    - (void)subclassesMustOverride
    {
        // pretend we don't actually implement this here
        [self doesNotRecognizeSelector: _cmd];
    }
&lt;/pre&gt;

&lt;p&gt;The code is fairly simple. The only really tricky bit is formatting the method name. We want to display something like &lt;code&gt;-[Class method]&lt;/code&gt;, but class methods need a &lt;code&gt;+&lt;/code&gt; at the front, as in &lt;code&gt;+[Class classMethod]&lt;/code&gt;. To figure out which context it's in, the code checks to see whether &lt;code&gt;isa&lt;/code&gt; is a metaclass. If it is, then &lt;code&gt;self&lt;/code&gt; is a class, and the &lt;code&gt;+&lt;/code&gt; variant should be used. Otherwise, &lt;code&gt;self&lt;/code&gt; is an instance, and the &lt;code&gt;-&lt;/code&gt; variant is used. The rest of the code just raises the appropriate &lt;code&gt;NSException&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (void)doesNotRecognizeSelector: (SEL)aSelector
    {
        const char *methodTypeString = class_isMetaClass(isa) ? "+" : "-";
        [NSException raise: NSInvalidArgumentException format: @"%s[%@ %@]: unrecognized selector sent to instance %p", methodTypeString, [[self class] description], NSStringFromSelector(aSelector), self];
    }
&lt;/pre&gt;

&lt;p&gt;Finally, there are a bunch of little methods that either provide obvious answers to obvious questions (e.g. the &lt;code&gt;self&lt;/code&gt; method), exist to let subclasses always safely call &lt;code&gt;super&lt;/code&gt; (e.g. the empty &lt;code&gt;+initialize&lt;/code&gt; method), or exist as override points (e.g. the &lt;code&gt;copy&lt;/code&gt; implementation that throws an exception). None of these are particularly interesting, but I include them for completeness:&lt;/p&gt;

&lt;pre&gt;    - (id)self
    {
        return self;
    }

    - (BOOL)isProxy
    {
        return NO;
    }

    + (void)load
    {
    }

    + (void)initialize
    {
    }

    - (id)copy
    {
        [self doesNotRecognizeSelector: _cmd];
        return nil;
    }

    - (id)mutableCopy
    {
        [self doesNotRecognizeSelector: _cmd];
        return nil;
    }

    - (id)forwardingTargetForSelector: (SEL)aSelector
    {
        return nil;
    }

    - (void)forwardInvocation: (NSInvocation *)anInvocation
    {
        [self doesNotRecognizeSelector: [anInvocation selector]];
    }

    + (BOOL)resolveClassMethod:(SEL)sel
    {
        return NO;
    }

    + (BOOL)resolveInstanceMethod:(SEL)sel
    {
        return NO;
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;&lt;code&gt;NSObject&lt;/code&gt; is a big bundle of different functionality, but nothing too strange. Its main function is to handle memory allocation and management so that you can actually create objects. It also provides a bunch of handy override points for methods that every object is expected to support, and wraps a bunch of runtime functions in a nicer API.&lt;/p&gt;

&lt;p&gt;I've skipped over a big piece of functionality provided by &lt;code&gt;NSObject&lt;/code&gt;: key-value coding. This is complex enough that it deserves its own article, so I will come back to that another time.&lt;/p&gt;

&lt;p&gt;That's it for today. Friday Q&amp;amp;A is driven by reader ideas, in case you somehow didn't already know, so please &lt;a href="mailto:mike@mikeash.com"&gt;send in your topic suggestions&lt;/a&gt;. Until next time, don't code anything I wouldn't code.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-01-25-lets-build-nsobject.html</guid><pubDate>Fri, 25 Jan 2013 15:32:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2013-01-11: Mach Exception Handlers
</title><link>http://www.mikeash.com/pyblog/friday-qa-2013-01-11-mach-exception-handlers.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2013-01-11: Mach Exception Handlers
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2013 01 11  14 44"
                  tags="fridayqna guest signal evil mach exception mig debugging"
            author="Landon Fuller"
            authorlink="http://landonf.bikemonkey.org"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2013-01-11: Mach Exception Handlers
&lt;/div&gt;
              &lt;p&gt;This is my first guest Friday Q&amp;amp;A article, dear readers, and I hope it will withstand your scrutiny. Today's topic is on Mach exception handlers, something I've recently spent some time exploring on Mac OS X and iOS for the purpose of &lt;a href="http://code.google.com/p/plcrashreporter"&gt;crash reporting&lt;/a&gt;. 
While there is surprisingly little documentation available about Mach exception handlers, and they're considered by some to be a mystical source of mystery and power, the fact is that they're actually pretty simple to understand at a high level - something I hope to elucidate here. Unfortunately, they're also partially private API on iOS, despite being used in a number of new crash reporting solutions - something I'll touch on in the conclusion.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Signals vs. Exceptions&lt;/b&gt;&lt;br&gt;On most UNIX systems, the only mechanisms available for handling crashes (such as dereferencing &lt;code&gt;NULL&lt;/code&gt;, or writing to an unwritable page) are the standard UNIX &lt;a href="http://www.mikeash.com/pyblog/friday-qa-2011-04-01-signal-handling.html"&gt;signal handlers&lt;/a&gt;. When a fatal machine exception is generated, it is caught by the kernel, which then runs a user-space &lt;a href="http://en.wikipedia.org/wiki/Trampoline_(computing%29"&gt;trampoline&lt;/a&gt; within the failing process, executing any function previously registered by that process via &lt;code&gt;sigaction(2)&lt;/code&gt; or &lt;code&gt;signal(3)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On OS X, however, a much more versatile API exists: Mach exceptions. Dating back to Avie Tevanian's work on the &lt;a href="http://en.wikipedia.org/wiki/Mach_(kernel%29"&gt;Mach&lt;/a&gt; OS (yes, &lt;em&gt;that&lt;/em&gt; &lt;a href="http://en.wikipedia.org/wiki/Avie_Tevanian"&gt;Avie Tevanian&lt;/a&gt;), Mach exceptions build on &lt;a href="https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KernelProgramming/boundaries/boundaries.html#//apple_ref/doc/uid/TP30000905-CH217-BABDECEG"&gt;Mach IPC/RPC&lt;/a&gt; to provide an alternative to the UNIX signal handler API. The original design of the Mach exception handling facility was first described, as far as I'm aware, in a &lt;a href="ftp://ftp.cs.cmu.edu/project/mach/doc/unpublished/exception.ps"&gt;1988 paper&lt;/a&gt; authored by Avie Tevanian, among others. It remains fairly accurate to this day, and I'd recommend reading it for more details (after finishing this post, of course).&lt;/p&gt;

&lt;p&gt;Mach exceptions differ from UNIX signals in three significant ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exception information is delivered as a Mach message via a Mach IPC port, rather than by the kernel calling into a userspace trampoline.&lt;/li&gt;
&lt;li&gt;Exception handlers may be registered by any process that has the appropriate Mach port rights for the target process.&lt;/li&gt;
&lt;li&gt;Exception handlers may be registered for a specific thread, a specific task (process), or for the entire host. The kernel will search for handlers in that order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These differences introduce a number of properties that can be useful when implementing debuggers and crash reporters, and are what make the Mach API interesting as an alternative to BSD signals.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Exceptions are Messages&lt;/b&gt;&lt;br&gt;The Mach exception API is based on Mach RPC (which is itself based on Mach IPC). There's a lot of confusion around Mach IPC, but at a high level, it's not too dissimilar to UNIX sockets or other well-known IPC mechanisms that let processes exchange messages. Mach IPC communication occurs over Mach ports, rather than via sockets or other traditional UNIX mechanisms; Mach ports have unique names, and can be shared with other processes. They can be used to send and receive messages containing arbitrary data. There's a bit more complexity involved in their actual use, but conceptually, that's about all you need to know.&lt;/p&gt;

&lt;p&gt;To write a Mach exception handler using raw Mach IPC, you would need to wait for a new exception message by calling &lt;code&gt;mach_msg()&lt;/code&gt; on a Mach port previously registered as an exception handler (how to do this is covered below). The call to &lt;code&gt;mach_msg()&lt;/code&gt; will block until an exception message is received, or the thread is interrupted. Once a message is received, you are free to introspect it for the state of the thread that generated the exception. You can even correct the cause of the crash and restart the failing thread, if you feel like hacking register state at runtime.&lt;/p&gt;

&lt;p&gt;Since exceptions are provided as &lt;em&gt;messages&lt;/em&gt;, rather than by calling a local function, exception messages can be forwarded to the previously registered Mach exception handler, even if that existing handler is completely out-of-process. This means that you can insert an exception handler without disturbing an existing one, whether it's the debugger or Apple's crash reporter. To forward the message to an existing handler, you again use &lt;code&gt;mach_msg()&lt;/code&gt;, this time with the &lt;code&gt;MACH_SEND_MSG&lt;/code&gt; flag, to send the original message to the previously registered handler's Mach port.&lt;/p&gt;

&lt;p&gt;However, if you wish to respond to the Mach RPC request yourself, rather than forwarding it, you would need to &lt;em&gt;reply&lt;/em&gt; to the message, informing the sender whether or not you &lt;em&gt;handled&lt;/em&gt; the exception. Mach considers an exception &lt;em&gt;handled&lt;/em&gt; if the crashing thread's state has been corrected such that its execution can be resumed. In this case, the kernel does not attempt to find any other exception handler, and considers the matter settled. However, if you reply to the RPC request informing the sender (usually the kernel) that the exception has not been handled, the sender will then try to find the next applicable Mach exception handler. Remember that the kernel attempts to send exceptions to thread-specific, task-specific, and host-global exception handlers, in that order.&lt;/p&gt;
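
&lt;p&gt;The search order can be modelled in a few lines of plain C. Nothing here is Mach API: the &lt;code&gt;level&lt;/code&gt; enum, the &lt;code&gt;handler_fn&lt;/code&gt; type, and &lt;code&gt;deliver&lt;/code&gt; are hypothetical names, and a handler returning &lt;code&gt;true&lt;/code&gt; simply stands in for a reply saying the exception was handled:&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the lookup order; none of this is Mach API. */
enum level { LEVEL_THREAD, LEVEL_TASK, LEVEL_HOST, LEVEL_COUNT };

/* A handler returns true if it "handled" the exception. */
typedef bool (*handler_fn)(int exception);

/* Try each registered handler in thread, task, host order. Returns the
   level that handled the exception, or -1 if nobody did. */
int deliver(const handler_fn handlers[LEVEL_COUNT], int exception)
{
    for (int level = LEVEL_THREAD; level < LEVEL_COUNT; level++) {
        if (handlers[level] != NULL && handlers[level](exception))
            return level;
    }
    return -1;
}

/* Example handlers for the sketch. */
bool decline(int exception) { (void)exception; return false; }
bool accept_any(int exception) { (void)exception; return true; }
```

&lt;p&gt;If the thread-level handler declines and the task-level handler accepts, &lt;code&gt;deliver&lt;/code&gt; stops at the task level and the host-level handler is never consulted, which mirrors the behavior described above.&lt;/p&gt;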

&lt;p&gt;The fact that a reply is expected from the exception request can be used for interesting purposes. For example, if a debugger has its exception handler called when a breakpoint is hit, it can simply wait to reply to the Mach exception message until (and only if) you request that the debugger continue execution. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Mach RPC, not IPC&lt;/b&gt;&lt;br&gt;While above I described how one might implement Mach exception handling with raw Mach IPC, the fact is that this is not how the interfaces are defined in Mach. Instead, Mach RPC uses an interface description language (called &lt;em&gt;matchmaker&lt;/em&gt; in the &lt;a href="http://www.cs.cmu.edu/afs/cs/project/mach/public/doc/unpublished/mig.ps"&gt;original 1989 paper&lt;/a&gt;) to describe the format of Mach RPC requests (and their replies), and to automatically generate code that handles received messages and generates a reply.&lt;/p&gt;

&lt;p&gt;On OS X, the Mach RPC interface descriptions for exception handling - &lt;code&gt;mach_exc.defs&lt;/code&gt; and &lt;code&gt;exc.defs&lt;/code&gt; - are available via &lt;code&gt;/usr/include/mach&lt;/code&gt;. If you include these files in your Xcode project, it will automatically run the &lt;code&gt;mig(1)&lt;/code&gt; tool (Mach Interface Generator), generating headers and C source files necessary to receive and handle Mach exception messages. The &lt;code&gt;exc.defs&lt;/code&gt; file provides an API for working with 32-bit exceptions, whereas the &lt;code&gt;mach_exc.defs&lt;/code&gt; file provides an API for working with 64-bit exceptions. Unfortunately, the Mach RPC defs are not provided on iOS, and only a subset of the necessary generated headers are provided. As a result, it's not possible to implement a fully correct Mach exception handler on iOS without relying on undocumented functionality.&lt;/p&gt;

&lt;p&gt;The code generated by MIG handles two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interpreting incoming RPC messages and calling out to an existing handler function with the decoded data.&lt;/li&gt;
&lt;li&gt;Initializing a response to each RPC message using the return values from the handler function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generated code does not handle registering a Mach exception handler, receiving the Mach message, or actually sending the reply. That is the implementor's responsibility. In addition, there are multiple supported exception "behaviors" that provide different sets of information about an exception; it is the implementor's responsibility to provide callback functions for all of them.&lt;/p&gt;

&lt;p&gt;This is best illustrated in the following 64-bit safe code, intended to work with RPC code generated by &lt;code&gt;mach_exc.defs&lt;/code&gt; (I've left out error handling for simplicity):&lt;/p&gt;

&lt;pre&gt;    // Handle EXCEPTION_DEFAULT behavior
    kern_return_t catch_mach_exception_raise (mach_port_t exception_port,
                                               mach_port_t thread,
                                               mach_port_t task, 
                                               exception_type_t exception,
                                               mach_exception_data_t code,
                                               mach_msg_type_number_t codeCnt)
    {
        // Do smart stuff here.
        fprintf(stderr, "My exception handler was called by mach_exception_raise()\n");

        // Inform the kernel that we haven't handled the exception, and the
        // next handler should be called.
        return KERN_FAILURE;
    }

    extern boolean_t mach_exc_server (mach_msg_header_t *msg, mach_msg_header_t *reply);
    static void exception_server (mach_port_t exceptionPort) {
        mach_msg_return_t rt;
        mach_msg_header_t *msg;
        mach_msg_header_t *reply;

        msg = malloc(sizeof(union __RequestUnion__mach_exc_subsystem));
        reply = malloc(sizeof(union __ReplyUnion__mach_exc_subsystem));

        while (1) {
             rt = mach_msg(msg, MACH_RCV_MSG, 0, sizeof(union __RequestUnion__mach_exc_subsystem), exceptionPort, 0, MACH_PORT_NULL);
             assert(rt == MACH_MSG_SUCCESS);

             // Call out to the mach_exc_server generated by mig and mach_exc.defs.
             // This will in turn invoke one of:
             // catch_mach_exception_raise()
             // catch_mach_exception_raise_state()
             // catch_mach_exception_raise_state_identity()
             // .. depending on the behavior specified when registering the Mach exception port.
             mach_exc_server(msg, reply);

             // Send the now-initialized reply
             rt = mach_msg(reply, MACH_SEND_MSG, reply-&amp;gt;msgh_size, 0, MACH_PORT_NULL, 0, MACH_PORT_NULL);
             assert(rt == MACH_MSG_SUCCESS);
        }
    }
&lt;/pre&gt;

&lt;p&gt;You'll note from the example code that our exception handler is called a &lt;em&gt;server&lt;/em&gt;. In Mach RPC parlance, the kernel would be the &lt;em&gt;client&lt;/em&gt;: it issues RPC requests to our exception server, and waits for our reply.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Exception Behaviors&lt;/b&gt;&lt;br&gt;As described above, exception messages come in multiple formats, containing varying types of data. It's the implementor's responsibility to register for the correct behavior; the &lt;code&gt;mig&lt;/code&gt;-generated RPC code will interpret the messages and hand them off to a user-defined function for the specific type. There are three basic behaviors defined by the Mach Exception API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;EXCEPTION_DEFAULT&lt;/code&gt;: Exception messages will contain a reference to the thread that triggered the exception. Handled by &lt;code&gt;catch_exception_raise()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EXCEPTION_STATE&lt;/code&gt;: Exception messages will contain the register state of the triggering thread, but not a reference to the thread itself. Handled by &lt;code&gt;catch_exception_raise_state()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EXCEPTION_STATE_IDENTITY&lt;/code&gt;: Exception messages will contain the register state of the triggering thread, as well as a reference to the triggering thread. Handled by &lt;code&gt;catch_exception_raise_state_identity()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to the above behaviors, a variant was added in later OS X releases to support 64-bit safety. The &lt;code&gt;MACH_EXCEPTION_CODES&lt;/code&gt; flag may be set by OR'ing it with any of the listed behaviors, in which case 64-bit safe exception messages will be provided. This flag is used by LLDB/GDB even when targeting 32-bit processes. When using the &lt;code&gt;MACH_EXCEPTION_CODES&lt;/code&gt; flag, one must also use the RPC functions generated by &lt;code&gt;mach_exc.defs&lt;/code&gt;; these add &lt;code&gt;mach_&lt;/code&gt; to the names of the functions and types (for example, &lt;code&gt;catch_mach_exception_raise()&lt;/code&gt; and &lt;code&gt;mach_exception_data_t&lt;/code&gt;).&lt;/p&gt;
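
&lt;p&gt;As a rough sketch of how the flag combines with a behavior (the same bitwise OR as in the &lt;code&gt;EXCEPTION_DEFAULT|MACH_EXCEPTION_CODES&lt;/code&gt; expression in the registration code later on), the following portable C uses stand-in constants. The &lt;code&gt;DEMO_&lt;/code&gt; names and helper functions are hypothetical, and the values are assumed to mirror those in &lt;code&gt;mach/exception_types.h&lt;/code&gt;; check the header in your SDK before relying on them:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stdint.h&amp;gt;

    /* Stand-in constants for this sketch only. They are assumed to mirror
       the values in mach/exception_types.h; verify against your SDK. */
    #define DEMO_EXCEPTION_DEFAULT    1u
    #define DEMO_MACH_EXCEPTION_CODES 0x80000000u

    /* Combine a behavior with the 64-bit codes flag. */
    static uint32_t demo_behavior_with_codes(uint32_t behavior) {
        return behavior | DEMO_MACH_EXCEPTION_CODES;
    }

    /* Recover the plain behavior by masking the flag back out. */
    static uint32_t demo_plain_behavior(uint32_t combined) {
        return combined &amp;amp; ~DEMO_MACH_EXCEPTION_CODES;
    }

    /* Non-zero if the 64-bit codes flag is set. */
    static int demo_wants_64bit_codes(uint32_t combined) {
        return (combined &amp;amp; DEMO_MACH_EXCEPTION_CODES) != 0;
    }
&lt;/pre&gt;

&lt;p&gt;Note that a bitwise AND of the two stand-in values is zero, since they share no bits, which is why the combination has to be an OR.&lt;/p&gt;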

&lt;p&gt;Generally speaking, &lt;code&gt;EXCEPTION_DEFAULT&lt;/code&gt; or &lt;code&gt;EXCEPTION_STATE_IDENTITY&lt;/code&gt; is sufficient for most purposes. Since &lt;code&gt;EXCEPTION_DEFAULT&lt;/code&gt; behavior provides a reference to the triggering thread, you can also fetch the thread state that would normally be provided by &lt;code&gt;EXCEPTION_STATE_IDENTITY&lt;/code&gt; using the Mach &lt;code&gt;thread_get_state()&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;When registering your exception handler, you are responsible for setting or omitting the &lt;code&gt;MACH_EXCEPTION_CODES&lt;/code&gt; flag so that the requested behavior matches the RPC implementation you intend to use: omit it for &lt;code&gt;exc.defs&lt;/code&gt;, set it for &lt;code&gt;mach_exc.defs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Putting it Together&lt;/b&gt;&lt;br&gt;It's time to get down to brass tacks: actually registering a Mach port to receive exception messages. As noted above, handlers can be registered for threads, tasks, and the host, and each has its own, otherwise identical, set of APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;(thread|task|host)_get_exception_ports&lt;/code&gt;: Returns the currently registered set of exception ports.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(thread|task|host)_set_exception_ports&lt;/code&gt;: Sets the exception port that will be used for all future exceptions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(thread|task|host)_swap_exception_ports&lt;/code&gt;: Atomically sets a new exception port and returns the previously registered ports. This can be used to avoid race conditions that could otherwise occur if multiple handlers are registered concurrently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To register your handler, you'll first need to allocate a Mach port to receive the messages, insert a "send right" to permit sending responses, and then call one of the exception port &lt;code&gt;set()&lt;/code&gt; or &lt;code&gt;swap()&lt;/code&gt; functions to register it as a receiver of exception messages.&lt;/p&gt;

&lt;p&gt;For example (error handling again elided for conciseness):&lt;/p&gt;

&lt;pre&gt;    mach_port_t server_port;
    kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &amp;amp;server_port);
    assert(kr == KERN_SUCCESS);

    kr = mach_port_insert_right(mach_task_self(), server_port, server_port, MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);

    kr = task_set_exception_ports(task, EXC_MASK_BAD_ACCESS, server_port, EXCEPTION_DEFAULT|MACH_EXCEPTION_CODES, THREAD_STATE_NONE);
&lt;/pre&gt;

&lt;p&gt;If you wish to preserve the previous exception handlers, &lt;code&gt;task_swap_exception_ports()&lt;/code&gt; should be used in place of &lt;code&gt;task_set_exception_ports()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Mach exception handlers are a very useful tool. Using them involves a fair number of moving pieces, but hopefully they don't seem dauntingly complex. At the end of the day, a Mach exception is just a simple exception message, coupled with a reply, sent over Mach ports.&lt;/p&gt;

&lt;p&gt;There are some significant advantages of the Mach API over signal handlers, including the ability to forward exceptions out-of-process, and to handle all exceptions on a completely different stack - something that can be useful when handling an exception triggered by a stack overflow on the target thread.&lt;/p&gt;

&lt;p&gt;If you plan on implementing your own Mach exception handler, there are certainly more details worth further investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When forwarding Mach exceptions, you need to send an exception message that matches the previously registered handler's exception behavior and thread state flavor. This may mean populating a new Mach exception message with additional thread state.&lt;/li&gt;
&lt;li&gt;It's not strictly necessary to use the MIG-generated &lt;code&gt;exc_server()&lt;/code&gt; or &lt;code&gt;mach_exc_server()&lt;/code&gt; functions for interpreting Mach messages (though it is probably a good idea). Since &lt;code&gt;mig(1)&lt;/code&gt; generates structures that may be used to directly interpret the Mach exception messages, you can do so directly.&lt;/li&gt;
&lt;li&gt;If you forward exception messages for exceptions that occur in your own process, you need to be sure that the target for the reply is not also your own process. Single-stepping debuggers will only resume the thread they wish to step; that means that they won't resume your exception handler's thread, you'll never receive the reply, and the interrupted thread will never resume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lastly, I should highlight that the headers and Mach interfaces required to implement a correct Mach exception handler on iOS are not available (though they are available and public on Mac OS X). I filed a radar requesting their addition (&lt;code&gt;rdar://12939497&lt;/code&gt;), as well as an Apple DTS support incident to clarify the situation. The radar is still open, but DTS provided the following guidance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our engineers have reviewed your request and have determined that this would be best handled as a bug report, which you have already filed. There is no documented way of accomplishing this, nor is there a workaround possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the meantime, as far as I can determine through my own work, and as per DTS's feedback, it's not possible to implement Mach exception handling on iOS using only public API. Hopefully this will be resolved in a future release of iOS, such that we can safely adopt Mach exceptions.&lt;/p&gt;

&lt;p&gt;Thus concludes my first contribution to Friday Q&amp;amp;A. If you have any questions, &lt;a href="mailto:landonf@bikemonkey.org"&gt;feel free to drop me an e-mail&lt;/a&gt;. If I got anything terribly wrong, feel free to roast me in the comments.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Landon Fuller</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2013-01-11-mach-exception-handlers.html</guid><pubDate>Fri, 11 Jan 2013 14:44:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-12-28: What Happens When You Load a Byte of Memory
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-12-28-what-happens-when-you-load-a-byte-of-memory.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-12-28: What Happens When You Load a Byte of Memory
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 12 28  14 45"
                  tags="fridayqna hardware memory"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-12-28: What Happens When You Load a Byte of Memory
&lt;/div&gt;
              &lt;p&gt;The hardware and software that our apps run on are almost frighteningly complicated, and there's no better place to see that than in the contortions that the system goes through when we load data from memory. What exactly happens when we load a byte of memory? Reader and friend of the blog Guy English suggested I dedicate an article to answering that question.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Code&lt;/b&gt;&lt;br&gt;Let's start with the code that loads the byte of memory. In C, it would look something like this:&lt;/p&gt;

&lt;pre&gt;    char *addr = ...;
    char value = *addr;
&lt;/pre&gt;

&lt;p&gt;On &lt;code&gt;x86-64&lt;/code&gt;, this compiles to something like:&lt;/p&gt;

&lt;pre&gt;    movsbl (%rdi),%eax
&lt;/pre&gt;

&lt;p&gt;This instructs the CPU to load the byte located at the address stored in &lt;code&gt;%rdi&lt;/code&gt; into the &lt;code&gt;%eax&lt;/code&gt; register. On ARM, the compiler produces:&lt;/p&gt;

&lt;pre&gt;    ldrsb.w r0, [r0]
&lt;/pre&gt;

&lt;p&gt;Although the instruction name is different, the effect is basically the same. It loads the byte located at the address stored in &lt;code&gt;r0&lt;/code&gt;, and puts the value into &lt;code&gt;r0&lt;/code&gt;. (The compiler is reusing &lt;code&gt;r0&lt;/code&gt; here, since the address isn't needed anymore.)&lt;/p&gt;

&lt;p&gt;Now that the CPU has its instruction, the software is done. Well, maybe.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Instruction Decoding and Execution&lt;/b&gt;&lt;br&gt;I don't want to go too in depth with how the CPU actually executes code in general. In short, the CPU loads the above instruction from memory and decodes it to figure out the opcode and operands. Once it sees that it's a load instruction, it issues the memory load at the appropriate address.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Virtual Memory&lt;/b&gt;&lt;br&gt;On most hardware you're likely to program for today, and on any Apple platform from the past couple of decades, the system uses virtual memory. In short, virtual memory disconnects the memory addresses seen by your program from the physical memory addresses of the actual RAM in your computer. In other words, when your program accesses address &lt;code&gt;42&lt;/code&gt;, that might actually access the physical RAM address &lt;code&gt;977305&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This mapping is done by &lt;em&gt;page&lt;/em&gt;. Each page is a 4kB chunk of memory. The overhead of tracking virtual address mappings for every byte in memory would be far too great, so pages are mapped instead. They're small enough to provide decent granularity, but large enough to not incur too much overhead in maintaining the mapping.&lt;/p&gt;

&lt;p&gt;Modern virtual memory systems also have the ability to set permissions on a page. A page may be readable, writeable, or executable, or some combination thereof. If the program tries to do something with a page that it isn't allowed, or tries to access a page that has no mapping at all, the program is suspended and a fault is raised with the operating system. The OS can then take further action, such as killing the program and generating a crash report, which is what happens when you experience the common &lt;code&gt;EXC_BAD_ACCESS&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;The hardware that handles this work is called the Memory Management Unit, or MMU. The MMU intercepts all memory accesses and remaps the address according to the current page mappings.&lt;/p&gt;

&lt;p&gt;The first thing that happens when the CPU loads a byte of memory is to hand the address to the MMU for translation. (This is not always true. On some CPUs, there is a layer of cache that comes before the MMU. However, the overall principle remains.)&lt;/p&gt;

&lt;p&gt;The first thing the MMU does with the address is slice off the bottom 12 bits, leaving a plain page address. 2&lt;sup&gt;12&lt;/sup&gt; equals 4096, so the bottom 12 bits describe the address's location within its page. Once the rest of the address is remapped, the bottom 12 bits can be added on to generate the full physical address.&lt;/p&gt;
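
&lt;p&gt;As a minimal sketch of that slicing, assuming the 4kB pages described above (the macro and function names are hypothetical, and real hardware does this in silicon rather than in code):&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stdint.h&amp;gt;

    #define DEMO_PAGE_SHIFT 12  /* 2^12 = 4096-byte pages */
    #define DEMO_PAGE_MASK  (((uint64_t)1 &amp;lt;&amp;lt; DEMO_PAGE_SHIFT) - 1)  /* bottom 12 bits */

    /* Virtual page number: the address with the bottom 12 bits sliced off. */
    static uint64_t demo_page_number(uint64_t addr) {
        return addr &amp;gt;&amp;gt; DEMO_PAGE_SHIFT;
    }

    /* Offset of the address within its page. */
    static uint64_t demo_page_offset(uint64_t addr) {
        return addr &amp;amp; DEMO_PAGE_MASK;
    }

    /* Rebuild the full physical address from the remapped page and the offset. */
    static uint64_t demo_physical_address(uint64_t physical_page, uint64_t offset) {
        return (physical_page &amp;lt;&amp;lt; DEMO_PAGE_SHIFT) | offset;
    }
&lt;/pre&gt;

&lt;p&gt;For example, the address &lt;code&gt;0x12345&lt;/code&gt; splits into page &lt;code&gt;0x12&lt;/code&gt; and offset &lt;code&gt;0x345&lt;/code&gt;. If that page were mapped to physical page &lt;code&gt;0xABC&lt;/code&gt;, the physical address would be &lt;code&gt;0xABC345&lt;/code&gt;.&lt;/p&gt;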

&lt;p&gt;With the page address in hand, the MMU consults the Translation Lookaside Buffer, or TLB. The TLB is a cache for page mappings. If the page in question has been accessed recently, the TLB will remember the mapping, and quickly return the physical page address, at which point the MMU's work is done.&lt;/p&gt;

&lt;p&gt;When the TLB does not contain an entry for the given page, this is called a TLB miss, and the entry must be found by searching the entire page table. The page table is a chunk of memory that describes every page mapping in the current process. Most commonly, the page table is laid out in memory by the OS in a special format that the MMU can understand directly. Following a TLB miss, the MMU searches the page table for the appropriate entry. If it finds one, it loads it into the TLB and performs the remapping.&lt;/p&gt;
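
&lt;p&gt;The hit and miss paths can be modeled in a few lines of C. This is a toy sketch under loud assumptions: a flat, 16-entry page table and a 4-entry direct-mapped TLB, with every name invented for illustration. Real page tables are multi-level structures and real TLBs are more sophisticated, but the control flow is similar in spirit:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stdint.h&amp;gt;

    #define DEMO_PAGE_SHIFT  12
    #define DEMO_NUM_PAGES   16          /* size of the toy page table */
    #define DEMO_TLB_ENTRIES 4           /* size of the toy TLB */
    #define DEMO_NO_MAPPING  UINT64_MAX  /* marks an unmapped page */

    typedef struct {
        int      valid;
        uint64_t virtual_page;
        uint64_t physical_page;
    } demo_tlb_entry;

    typedef struct {
        uint64_t       page_table[DEMO_NUM_PAGES];  /* indexed by virtual page */
        demo_tlb_entry tlb[DEMO_TLB_ENTRIES];
        unsigned       tlb_hits, tlb_misses;
    } demo_mmu;

    /* Translate a virtual address; returns DEMO_NO_MAPPING on a page fault. */
    static uint64_t demo_translate(demo_mmu *mmu, uint64_t vaddr) {
        uint64_t vpage  = vaddr &amp;gt;&amp;gt; DEMO_PAGE_SHIFT;
        uint64_t offset = vaddr &amp;amp; ((1u &amp;lt;&amp;lt; DEMO_PAGE_SHIFT) - 1);
        demo_tlb_entry *slot = &amp;amp;mmu-&amp;gt;tlb[vpage % DEMO_TLB_ENTRIES];

        if (slot-&amp;gt;valid &amp;amp;&amp;amp; slot-&amp;gt;virtual_page == vpage) {
            mmu-&amp;gt;tlb_hits++;               /* fast path: TLB hit */
        } else {
            mmu-&amp;gt;tlb_misses++;             /* slow path: consult the page table */
            if (vpage &amp;gt;= DEMO_NUM_PAGES || mmu-&amp;gt;page_table[vpage] == DEMO_NO_MAPPING)
                return DEMO_NO_MAPPING;       /* no entry: the OS must decide */
            slot-&amp;gt;valid = 1;               /* remember the mapping for next time */
            slot-&amp;gt;virtual_page = vpage;
            slot-&amp;gt;physical_page = mmu-&amp;gt;page_table[vpage];
        }
        return (slot-&amp;gt;physical_page &amp;lt;&amp;lt; DEMO_PAGE_SHIFT) | offset;
    }
&lt;/pre&gt;

&lt;p&gt;With virtual page 2 mapped to physical page 7, the first translation of an address in that page counts as a miss and fills the TLB; a second translation in the same page is then a hit.&lt;/p&gt;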

&lt;p&gt;On some architectures, the page table mapping is left entirely up to the OS. When a TLB miss occurs, the CPU passes control to the OS, which is then responsible for looking up the mapping and filling the TLB with it. This is more flexible but much slower, and isn't found much in modern hardware.&lt;/p&gt;

&lt;p&gt;If no entry is found in the page table, that means the given address doesn't exist in RAM at all. The CPU informs the OS, which then decides how to handle the situation. If the OS doesn't think that address is valid, it terminates the program and you get an &lt;code&gt;EXC_BAD_ACCESS&lt;/code&gt;. In some cases, the OS does think the address is valid, but just doesn't have the data in RAM. This can happen if the data has been swapped out to disk, is part of a memory mapped file, or is freshly allocated with backing memory being provided on demand. In these cases, the OS loads the appropriate data into RAM, adds an entry to the page table, and then lets the MMU translate the virtual address into a physical address now that the backing data is available.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Cache&lt;/b&gt;&lt;br&gt;With the address in hand, the CPU consults its memory cache. In days of yore, the CPU would talk directly to RAM. However, CPU speeds have increased faster than memory speeds, and that's no longer practical. If a modern CPU had to talk directly to modern RAM for every memory access, our computers would slow to a relative crawl.&lt;/p&gt;

&lt;p&gt;The cache is a hardware map from a set of memory addresses to memory contents. Caches are organized into cache lines, which are typically in the region of 32-128 bytes each. Each entry in the cache holds an address and a single cache line corresponding to that address. When data is loaded, the cache checks whether the requested address is present, and if so, returns the appropriate data from that address's cache line.&lt;/p&gt;

&lt;p&gt;There are typically several levels of cache. Due to hardware design constraints, larger caches are necessarily slower. By having multiple levels, a small, fast cache can be checked first, with slower, larger caches used later to avoid the cost of fetching from RAM. The CPU first checks with the L1 cache, which is the first level. This cache is small, typically around 16-64kB. If it contains the data in question, then the memory load is complete! Since that's boring, we'll assume the caches don't contain the data being loaded here.&lt;/p&gt;

&lt;p&gt;Next up is the L2 cache. This is bigger, generally anywhere from 256kB to several megabytes. In some CPUs, the L2 cache is the last level, and these typically have larger L2 caches. Other CPUs have an L3 cache as well, in which case the L2 is usually smaller, and it's supplemented by a large L3 cache, usually several megabytes, with some high performance chips having up to 20MB of L3 cache.&lt;/p&gt;

&lt;p&gt;Once all levels of cache have been tried, if none of them contain the necessary data, it's time to try main memory. Because caches work with entire cache lines, the entire cache line is loaded from main memory at once, even though we're only loading a single byte. This greatly increases efficiency in the common case of accessing other nearby memory, since subsequent nearby loads can come from cache, at the cost of wasting time when memory use is scattered.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Memory&lt;/b&gt;&lt;br&gt;It's finally time to start querying RAM. The CPU has been waiting quite a while by this point, and will have to wait a long time more before it gets the data it wants.&lt;/p&gt;

&lt;p&gt;The load is handed off to the memory controller, which is the bit of hardware that actually knows how to talk to RAM. On a lot of modern hardware, the memory controller is integrated directly into the CPU, while on some systems it's part of a separate chip called the "northbridge".&lt;/p&gt;

&lt;p&gt;The memory controller then starts loading data from RAM. Modern SDRAM transfers 64 bits of data at a time, so several transfers have to be done to fill the entire cache line being requested.&lt;/p&gt;

&lt;p&gt;The memory controller places the load address on the address pins of the RAM and waits for the data to be returned. Internally, the RAM uses the values on the address pins to activate a row of memory cells, whose contents are then exposed on the RAM's output pins.&lt;/p&gt;

&lt;p&gt;RAM is not instantaneous, and there's an appreciable delay between when the memory controller requests an address and when the data is available, on the order of 10 nanoseconds in current hardware. It takes more time to perform the subsequent loads needed for the cache line, but the loads can be pipelined, so total transfer time is maybe 50% more.&lt;/p&gt;

&lt;p&gt;As the memory controller obtains data from RAM, it hands that data back to the caches, which store it in case other data from the same cache line is needed soon. Finally, the requested byte is handed to the CPU, which places the data into the register requested by the instruction. At last, after all of this work, the CPU can get on with running the code that needed that byte of data.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Consequences&lt;/b&gt;&lt;br&gt;There are a lot of practical consequences that result from how all of this stuff works. In particular, memory access is &lt;em&gt;slow&lt;/em&gt;, relatively speaking. It's amazing that your computer can do all of the above work literally tens of millions of times per second, but it can do other things literally billions of times per second. Everything is relative.&lt;/p&gt;

&lt;p&gt;The total time required for all of this, assuming a TLB hit (the fast case for the MMU) is a couple of dozen nanoseconds. On a 2GHz CPU, that could mean something like 50 clock cycles with the potential to execute perhaps 150 instructions in that time. That's a lot. A TLB miss may double or triple this latency number.&lt;/p&gt;
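
&lt;p&gt;The arithmetic behind those numbers can be sketched in a few lines of C. The function names are mine, and the inputs (25 nanoseconds for "a couple of dozen", and three instructions per cycle) are assumptions chosen to match the rough figures above:&lt;/p&gt;

&lt;pre&gt;    /* Clock cycles that elapse during a delay, given the clock rate in MHz.
       1 MHz is one cycle per microsecond, so cycles = ns * MHz / 1000. */
    static unsigned long demo_cycles_for_delay(unsigned long nanoseconds,
                                               unsigned long clock_mhz) {
        return nanoseconds * clock_mhz / 1000;
    }

    /* Instructions that could have been issued in that many cycles, for a
       CPU that sustains the given number of instructions per cycle. */
    static unsigned long demo_instructions_forgone(unsigned long cycles,
                                                   unsigned long per_cycle) {
        return cycles * per_cycle;
    }
&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;demo_cycles_for_delay(25, 2000)&lt;/code&gt; gives 50 cycles, and &lt;code&gt;demo_instructions_forgone(50, 3)&lt;/code&gt; gives 150 instructions.&lt;/p&gt;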

&lt;p&gt;Modern CPUs are pipelined and parallelized. This means that they will likely see the need for the memory read ahead of time and initiate the load at that point, softening the blow. Parallel execution means that the CPU will probably be able to continue executing some code &lt;em&gt;after&lt;/em&gt; the load instruction while waiting for the load, especially code that doesn't depend on the loaded value. However, this stuff has limits, and finding 150 instructions that can be executed while waiting for RAM is a tall order. You're almost certain to hit a point where program execution has to stop and wait for the memory load to complete.&lt;/p&gt;

&lt;p&gt;Incidentally, this is where hyperthreading gains its advantage. Instead of having an entire CPU core just idle while waiting for RAM, hyperthreading lets it switch over to a completely different thread of execution and run code from &lt;em&gt;that&lt;/em&gt; instead, so that it can still get useful work done while it waits.&lt;/p&gt;

&lt;p&gt;Access patterns are &lt;em&gt;key&lt;/em&gt; to performance. Discussions about micro-optimization tend to center on using some instructions rather than others, avoiding divisions, etc. Relatively few talk about memory access patterns. However, it doesn't matter how optimized your individual instructions are if they're operating on memory that's loaded in a way that isn't kind to the memory system. Saving a few cycles here and there is meaningless if you're waiting dozens of cycles for every new piece of data to load. For example, this is why, although it's the more natural way to express it, you should never write loops to access image data like this:&lt;/p&gt;

&lt;pre&gt;    for(int x = 0; x &amp;lt; width; x++)
        for(int y = 0; y &amp;lt; height; y++)
            // use the pixel at x, y
&lt;/pre&gt;

&lt;p&gt;Images are typically laid out in contiguous rows, and this loop does &lt;em&gt;not&lt;/em&gt; take advantage of that fact. It accesses columns, only coming back to the next pixel in the first row after loading the entire first column. This causes cache and TLB misses. This loop will be vastly slower than if you iterate over rows first, then columns:&lt;/p&gt;

&lt;pre&gt;    for(int y = 0; y &amp;lt; height; y++)
        for(int x = 0; x &amp;lt; width; x++)
            // use the pixel at x, y
&lt;/pre&gt;

&lt;p&gt;In many cases, the top loop with fast code in the loop body will be massively outperformed by the bottom loop with slow code in the loop body, simply because memory access delays can be so punishing.&lt;/p&gt;
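
&lt;p&gt;To get a feel for the difference, here is a toy model rather than a measurement: a hypothetical direct-mapped cache with 64-byte lines and 4kB of total capacity, counting misses for both loop orders over a one-byte-per-pixel, row-major image. All names and sizes are made up for illustration, and real caches are larger and set-associative, so real ratios will differ:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stddef.h&amp;gt;
    #include &amp;lt;stdint.h&amp;gt;

    #define DEMO_LINE_SIZE 64   /* bytes per cache line */
    #define DEMO_NUM_LINES 64   /* 64 lines * 64 bytes = 4kB toy cache */

    typedef struct {
        int           valid[DEMO_NUM_LINES];
        uint64_t      tag[DEMO_NUM_LINES];
        unsigned long misses;
    } demo_cache;

    /* Record one byte access, counting a miss if its line is not cached. */
    static void demo_access(demo_cache *c, uint64_t addr) {
        uint64_t line = addr / DEMO_LINE_SIZE;
        size_t   slot = (size_t)(line % DEMO_NUM_LINES);
        if (!c-&amp;gt;valid[slot] || c-&amp;gt;tag[slot] != line) {
            c-&amp;gt;valid[slot] = 1;
            c-&amp;gt;tag[slot] = line;
            c-&amp;gt;misses++;
        }
    }

    /* Count misses when visiting every pixel, either row by row
       (rows_first != 0) or column by column (rows_first == 0). */
    static unsigned long demo_count_misses(int width, int height, int rows_first) {
        demo_cache c = {{0}, {0}, 0};
        int outer = rows_first ? height : width;
        int inner = rows_first ? width : height;
        for (int i = 0; i &amp;lt; outer; i++) {
            for (int j = 0; j &amp;lt; inner; j++) {
                int x = rows_first ? j : i;
                int y = rows_first ? i : j;
                demo_access(&amp;amp;c, (uint64_t)y * (uint64_t)width + (uint64_t)x);
            }
        }
        return c.misses;
    }
&lt;/pre&gt;

&lt;p&gt;For a 256x256 image, this model reports 1024 misses for the row-by-row order (one per cache line) and 65536 for the column-by-column order (one per pixel).&lt;/p&gt;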

&lt;p&gt;To make things even worse, profilers, such as Apple's Time Profiler in Instruments, aren't good at showing these delays. They'll tell you what instructions took time, but because of the pipelined, parallel nature of modern CPUs, the instruction that takes the hit of the memory load may not be the actual load instruction. The CPU will hit the load instruction, mark its destination register as not having its data yet, and move on. When the CPU hits an instruction that actually needs that register's value, &lt;em&gt;then&lt;/em&gt; it will stop and wait. The clue here is when the first instruction in a sequence of manipulations on the same value takes &lt;em&gt;far&lt;/em&gt; longer than the rest, and far longer than it should. For example, if you have code that does &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;mul&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, and the profiler says that the first &lt;code&gt;add&lt;/code&gt; takes the vast majority of the time, this is likely to be a memory delay, not actually a slow &lt;code&gt;add&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Modern computers operate on time scales that are difficult to envision. To a human, the time required for a single CPU cycle and the time required to perform a hard disk seek are both indistinguishably instantaneous, yet they vary by many orders of magnitude. The computer is an incredibly complicated system that requires a huge number of things to happen in order to load a single chunk of data from memory. Knowing what goes on in the hardware when this happens is fascinating and can even help you write better code. It's even more incredible once you think that this complicated set of operations happens literally millions of times every second in the computer you're using to read this.&lt;/p&gt;

&lt;p&gt;That's it for today. Check back next time for another exploration of the trans-mundane. If you somehow didn't already know, Friday Q&amp;amp;A is driven by reader submissions. By "reader" I mean &lt;em&gt;you&lt;/em&gt;, so if you have a topic that you'd like to see covered, please &lt;a href="mailto:mike@mikeash.com"&gt;send it in&lt;/a&gt;.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-12-28-what-happens-when-you-load-a-byte-of-memory.html</guid><pubDate>Fri, 28 Dec 2012 14:45:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-12-14: Objective-C Pitfalls
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-12-14-objective-c-pitfalls.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-12-14: Objective-C Pitfalls
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 12 14  14 38"
                  tags="fridayqna objectivec cocoa"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-12-14: Objective-C Pitfalls
&lt;/div&gt;
              &lt;p&gt;Objective-C is a powerful and extremely useful language, but it's also a bit dangerous. For today's article, my colleague Chris Denter suggested that I talk about pitfalls in Objective-C and Cocoa, inspired by Cay S. Horstmann's &lt;a href="http://www.horstmann.com/cpp/pitfalls.html"&gt;article on C++ pitfalls&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Introduction&lt;/b&gt;&lt;br&gt;I'll use the same definition as Horstmann: a pitfall is code that compiles, links, and runs, but doesn't do what you might expect it to. He provides this example, which is just as problematic in Objective-C as it is in C++:&lt;/p&gt;

&lt;pre&gt;    if (-0.5 &amp;lt;= x &amp;lt;= 0.5) return 0;
&lt;/pre&gt;

&lt;p&gt;A naive reading of this code would be that it checks to see whether &lt;code&gt;x&lt;/code&gt; is in the range [-0.5, 0.5]. However, that's not the case. Instead, the comparison gets evaluated like this:&lt;/p&gt;

&lt;pre&gt;    if ((-0.5 &amp;lt;= x) &amp;lt;= 0.5)
&lt;/pre&gt;

&lt;p&gt;In C, the value of a comparison expression is an &lt;code&gt;int&lt;/code&gt;, either &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;, a legacy from when C had no built-in boolean type. It is that &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;, not the value of &lt;code&gt;x&lt;/code&gt;, that is compared with 0.5. In effect, the second comparison works as an extremely weirdly phrased negation operator, such that the if statement's body will execute if and only if &lt;code&gt;x&lt;/code&gt; is less than -0.5.&lt;/p&gt;
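
&lt;p&gt;A small sketch makes the difference concrete. The function names here are mine; the first spells out what the compiler actually evaluates, and the second is what was presumably intended:&lt;/p&gt;

&lt;pre&gt;    /* What the pitfall expression computes: the inner comparison yields
       the int 0 or 1, and that int is what gets compared with 0.5. */
    static int demo_chained_check(double x) {
        return (-0.5 &amp;lt;= x) &amp;lt;= 0.5;
    }

    /* What was intended: two separate comparisons joined with a logical AND. */
    static int demo_in_range(double x) {
        return -0.5 &amp;lt;= x &amp;amp;&amp;amp; x &amp;lt;= 0.5;
    }
&lt;/pre&gt;

&lt;p&gt;For an &lt;code&gt;x&lt;/code&gt; of &lt;code&gt;0.0&lt;/code&gt;, &lt;code&gt;demo_chained_check&lt;/code&gt; returns &lt;code&gt;0&lt;/code&gt; while &lt;code&gt;demo_in_range&lt;/code&gt; returns &lt;code&gt;1&lt;/code&gt;. For an &lt;code&gt;x&lt;/code&gt; of &lt;code&gt;-1.0&lt;/code&gt;, the results are reversed.&lt;/p&gt;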

&lt;p&gt;&lt;b&gt;Nil Comparison&lt;/b&gt;&lt;br&gt;Objective-C is highly unusual in that sending messages to &lt;code&gt;nil&lt;/code&gt; does nothing and simply returns &lt;code&gt;0&lt;/code&gt;. In nearly every other language you're likely to encounter, the equivalent is either prohibited by the type system or produces a runtime error. This can be both good and bad. Given the subject of the article, we'll concentrate on the bad.&lt;/p&gt;

&lt;p&gt;First, let's look at equality testing:&lt;/p&gt;

&lt;pre&gt;    [nil isEqual: @"string"]
&lt;/pre&gt;

&lt;p&gt;Messaging &lt;code&gt;nil&lt;/code&gt; returns &lt;code&gt;0&lt;/code&gt;, which in this case is equivalent to &lt;code&gt;NO&lt;/code&gt;. That happens to be the correct answer here, so we're off to a good start! However, consider this:&lt;/p&gt;

&lt;pre&gt;    [nil isEqual: nil]
&lt;/pre&gt;

&lt;p&gt;This &lt;em&gt;also&lt;/em&gt; returns &lt;code&gt;NO&lt;/code&gt;. It doesn't matter that the argument is the exact same value. The argument's value doesn't matter &lt;em&gt;at all&lt;/em&gt;, because messages to &lt;code&gt;nil&lt;/code&gt; always return &lt;code&gt;0&lt;/code&gt; no matter what. So going by &lt;code&gt;isEqual:&lt;/code&gt;, &lt;code&gt;nil&lt;/code&gt; never equals anything, including itself. Mostly right, but not always.&lt;/p&gt;

&lt;p&gt;Finally, consider one more permutation with &lt;code&gt;nil&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    [@"string" isEqual: nil]
&lt;/pre&gt;

&lt;p&gt;What does this do? Well, we can't be sure. It may return &lt;code&gt;NO&lt;/code&gt;. It may throw an exception. It may simply crash. Passing &lt;code&gt;nil&lt;/code&gt; to a method that doesn't explicitly say it's allowed is a bad idea, and &lt;code&gt;isEqual:&lt;/code&gt; doesn't say that it accepts &lt;code&gt;nil&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Many Cocoa classes also include a &lt;code&gt;compare:&lt;/code&gt; method. This takes another object of the same class and returns either &lt;code&gt;NSOrderedAscending&lt;/code&gt;, &lt;code&gt;NSOrderedSame&lt;/code&gt;, or &lt;code&gt;NSOrderedDescending&lt;/code&gt;, to indicate less than, equal, or greater than.&lt;/p&gt;

&lt;p&gt;What happens if we compare with &lt;code&gt;nil&lt;/code&gt;?&lt;/p&gt;

&lt;pre&gt;    [nil compare: nil]
&lt;/pre&gt;

&lt;p&gt;This returns &lt;code&gt;0&lt;/code&gt;, which happens to be equal to &lt;code&gt;NSOrderedSame&lt;/code&gt;. Unlike &lt;code&gt;isEqual:&lt;/code&gt;, &lt;code&gt;compare:&lt;/code&gt; thinks &lt;code&gt;nil&lt;/code&gt; equals &lt;code&gt;nil&lt;/code&gt;. Handy! However:&lt;/p&gt;

&lt;pre&gt;    [nil compare: @"string"]
&lt;/pre&gt;

&lt;p&gt;This &lt;em&gt;also&lt;/em&gt; returns &lt;code&gt;NSOrderedSame&lt;/code&gt;, which is definitely the wrong answer. &lt;code&gt;compare:&lt;/code&gt; will consider &lt;code&gt;nil&lt;/code&gt; to be equal to anything and everything.&lt;/p&gt;

&lt;p&gt;Finally, just like &lt;code&gt;isEqual:&lt;/code&gt;, passing &lt;code&gt;nil&lt;/code&gt; as the parameter is a bad idea:&lt;/p&gt;

&lt;pre&gt;    [@"string" compare: nil]
&lt;/pre&gt;

&lt;p&gt;In short, be careful with &lt;code&gt;nil&lt;/code&gt; and comparisons. It really just doesn't work right. If there's any chance your code will encounter &lt;code&gt;nil&lt;/code&gt;, you &lt;em&gt;must&lt;/em&gt; check for and handle it separately before you start doing &lt;code&gt;isEqual:&lt;/code&gt; or &lt;code&gt;compare:&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Hashing&lt;/b&gt;&lt;br&gt;You write a little class to contain some data. You have multiple equivalent instances of this class, so you implement &lt;code&gt;isEqual:&lt;/code&gt; so that those instances will be treated as equal. Then you start adding your objects to an &lt;code&gt;NSSet&lt;/code&gt; and things start behaving strangely. The set claims to hold multiple objects after you just added one. It can't find stuff you just added. It may even crash or corrupt memory.&lt;/p&gt;

&lt;p&gt;This can happen if you implement &lt;code&gt;isEqual:&lt;/code&gt; but don't implement &lt;code&gt;hash&lt;/code&gt;. A lot of Cocoa code requires that if two objects compare as equal, they will also have the same hash. If you only override &lt;code&gt;isEqual:&lt;/code&gt;, you violate that requirement. Any time you override &lt;code&gt;isEqual:&lt;/code&gt;, &lt;em&gt;always&lt;/em&gt; override &lt;code&gt;hash&lt;/code&gt; at the same time. For more information, see my article on &lt;a href="friday-qa-2010-06-18-implementing-equality-and-hashing.html"&gt;Implementing Equality and Hashing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Macros&lt;/b&gt;&lt;br&gt;Imagine you're writing some unit tests. You have a method that's supposed to return an array containing a single object, so you write a test to verify that:&lt;/p&gt;

&lt;pre&gt;    STAssertEqualObjects([obj method], @[ @"expected" ], @"Didn't get the expected array");
&lt;/pre&gt;

&lt;p&gt;This uses the new literals syntax to keep things short. Nice, right?&lt;/p&gt;

&lt;p&gt;Now we have another method that returns &lt;em&gt;two&lt;/em&gt; objects, so we write a test for that:&lt;/p&gt;

&lt;pre&gt;    STAssertEqualObjects([obj methodTwo], @[ @"expected1", @"expected2" ], @"Didn't get the expected array");
&lt;/pre&gt;

&lt;p&gt;Suddenly, the code fails to compile and produces completely bizarre errors. What's going on?&lt;/p&gt;

&lt;p&gt;What's going on is that &lt;code&gt;STAssertEqualObjects&lt;/code&gt; is a macro. Macros are expanded by the preprocessor, and the preprocessor is an ancient and fairly dumb program that doesn't know anything about modern Objective-C syntax, or for that matter modern C syntax. The preprocessor splits macro arguments on commas. It's smart enough to know that parentheses can nest, so this is seen as three arguments:&lt;/p&gt;

&lt;pre&gt;    Macro(a, (b, c), d)
&lt;/pre&gt;

&lt;p&gt;Where the first argument is &lt;code&gt;a&lt;/code&gt;, the second is &lt;code&gt;(b, c)&lt;/code&gt;, and the third is &lt;code&gt;d&lt;/code&gt;. However, the preprocessor has no idea that it should do the same thing for &lt;code&gt;[]&lt;/code&gt; and &lt;code&gt;{}&lt;/code&gt;. With the above macro, the preprocessor sees &lt;em&gt;four&lt;/em&gt; arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;[obj methodTwo]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@[ @"expected1"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@"expected2" ]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@"Didn't get the expected array"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This results in completely mangled code that not only doesn't compile, but confuses the compiler beyond the ability to provide understandable diagnostics. The solution is easy, once you know what the problem is. Just parenthesize the literal so the preprocessor treats it as one argument:&lt;/p&gt;

&lt;pre&gt;    STAssertEqualObjects([obj methodTwo], (@[ @"expected1", @"expected2" ]), @"Didn't get the expected array");
&lt;/pre&gt;

&lt;p&gt;Unit tests are where I've run into this most frequently, but it can pop up any time there's a macro. Objective-C literals will fall victim, as will C compound literals. Blocks can also be problematic if you use the comma operator within them, which is rare but legal. You can see that Apple thought about this problem with their &lt;code&gt;Block_copy&lt;/code&gt; and &lt;code&gt;Block_release&lt;/code&gt; macros in &lt;code&gt;/usr/include/Block.h&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    #define Block_copy(...) ((__typeof(__VA_ARGS__))_Block_copy((const void *)(__VA_ARGS__)))
    #define Block_release(...) _Block_release((const void *)(__VA_ARGS__))
&lt;/pre&gt;

&lt;p&gt;These macros conceptually take a single argument, but they're declared to take variable arguments to avoid this problem. By taking &lt;code&gt;...&lt;/code&gt; and using &lt;code&gt;__VA_ARGS__&lt;/code&gt; to refer to "the argument", multiple "arguments" with commas are reproduced in the macro's output. You can take the same approach to make your own macros safe from this problem, although it only works on the last argument of a multi-argument macro.&lt;/p&gt;
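
&lt;p&gt;To see the trick in isolation, here's a minimal sketch in plain C. The &lt;code&gt;sum_ints&lt;/code&gt; helper and both macro names are invented for this example, and a C compound literal stands in for the Objective-C array literal, since both contain unprotected commas:&lt;/p&gt;

```c
#include <stddef.h>

/* Hypothetical helper for this example: sums `count` ints. */
static int sum_ints(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Fragile: exactly one named parameter, so the comma inside a compound
   literal splits the argument in two and the call fails to compile:
       SUM_PAIR_FRAGILE((int[]){ 1, 2 })    <- "too many arguments" */
#define SUM_PAIR_FRAGILE(array) sum_ints((array), 2)

/* Comma-safe: `...` accepts every comma-separated piece, and
   __VA_ARGS__ pastes them back together with the commas intact. */
#define SUM_PAIR(...) sum_ints((__VA_ARGS__), 2)

int sum_pair_example(void)
{
    /* Expands to sum_ints(((int[]){ 40, 2 }), 2). */
    return SUM_PAIR((int[]){ 40, 2 });
}
```

&lt;p&gt;The variadic version compiles because the preprocessor never has to decide where "the argument" ends. The fragile version only works if the caller remembers the extra parentheses.&lt;/p&gt;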

&lt;p&gt;&lt;b&gt;Property Synthesis&lt;/b&gt;&lt;br&gt;Take the following class:&lt;/p&gt;

&lt;pre&gt;    @interface MyClass : NSObject {
        NSString *_myIvar;
    }

    @property (copy) NSString *myIvar;

    @end

    @implementation MyClass

    @synthesize myIvar;

    @end
&lt;/pre&gt;

&lt;p&gt;Nothing wrong with this, right? The ivar declaration and &lt;code&gt;@synthesize&lt;/code&gt; are a little redundant in this modern age, but do no harm.&lt;/p&gt;

&lt;p&gt;Unfortunately, this code will &lt;em&gt;silently&lt;/em&gt; ignore &lt;code&gt;_myIvar&lt;/code&gt; and synthesize a &lt;em&gt;new&lt;/em&gt; variable called &lt;code&gt;myIvar&lt;/code&gt;, without the leading underscore. If you have code that uses the ivar directly, it will see a different value from code that uses the property. Confusion!&lt;/p&gt;

&lt;p&gt;The rules for &lt;code&gt;@synthesize&lt;/code&gt; variable names are a little weird. If you specify a variable name with &lt;code&gt;@synthesize myIvar = _myIvar;&lt;/code&gt;, then of course it uses whatever you specify. If you leave out the variable name, then it synthesizes a variable with the same name as the property. If you leave out &lt;code&gt;@synthesize&lt;/code&gt; altogether, then it synthesizes a variable with the same name as the property, &lt;em&gt;but with a leading underscore&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Unless you need to support 32-bit Mac, your best bet these days is to just avoid explicitly declaring backing ivars for properties. Let &lt;code&gt;@synthesize&lt;/code&gt; create the variable, and if you get the name wrong, you'll get a nice compiler error instead of mysterious behavior.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Interrupted System Calls&lt;/b&gt;&lt;br&gt;Cocoa code usually sticks to higher level constructs, but sometimes it's useful to drop down a bit and do some &lt;code&gt;POSIX&lt;/code&gt;. For example, this code will write some data to a file descriptor:&lt;/p&gt;

&lt;pre&gt;    int fd;
    NSData *data = ...;

    const char *cursor = [data bytes];
    NSUInteger remaining = [data length];

    while(remaining &amp;gt; 0) {
        ssize_t result = write(fd, cursor, remaining);
        if(result &amp;lt; 0)
        {
            NSLog(@"Failed to write data: %s (%d)", strerror(errno), errno);
            return;
        }
        remaining -= result;
        cursor += result;
    }
&lt;/pre&gt;

&lt;p&gt;However, this can fail, and it will fail strangely and intermittently. POSIX calls like this can be interrupted by signals. Even harmless signals handled elsewhere in the app like &lt;code&gt;SIGCHLD&lt;/code&gt; or &lt;code&gt;SIGINFO&lt;/code&gt; can cause this. &lt;code&gt;SIGCHLD&lt;/code&gt; can occur if you're using &lt;code&gt;NSTask&lt;/code&gt; or are otherwise working with subprocesses. When &lt;code&gt;write&lt;/code&gt; is interrupted by a signal, it returns &lt;code&gt;-1&lt;/code&gt; and sets &lt;code&gt;errno&lt;/code&gt; to &lt;code&gt;EINTR&lt;/code&gt; to indicate that the call was interrupted. The above code treats all errors as fatal and will bail out, even though the call just needs to be tried again. The correct code checks for that separately and just retries the call:&lt;/p&gt;

&lt;pre&gt;    while(remaining &amp;gt; 0) {
        ssize_t result = write(fd, cursor, remaining);
        if(result &amp;lt; 0 &amp;amp;&amp;amp; errno == EINTR)
        {
            continue;
        }
        else if(result &amp;lt; 0)
        {
            NSLog(@"Failed to write data: %s (%d)", strerror(errno), errno);
            return;
        }
        remaining -= result;
        cursor += result;
    }
&lt;/pre&gt;
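
&lt;p&gt;If you need this in more than one place, the retry logic packs neatly into a function. Here's a minimal sketch in plain C. The name &lt;code&gt;write_all&lt;/code&gt; is invented for this example (it is not a POSIX call), and it leaves error reporting to the caller:&lt;/p&gt;

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes all `length` bytes to `fd`, retrying when
   write() is interrupted by a signal. Returns 0 on success, or -1 with
   errno left as set by the failing write(). */
int write_all(int fd, const void *bytes, size_t length)
{
    const char *cursor = bytes;
    size_t remaining = length;

    while (remaining > 0) {
        ssize_t result = write(fd, cursor, remaining);
        if (result < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: just try again */
            return -1;      /* a real error: let the caller report it */
        }
        remaining -= (size_t)result;
        cursor += result;
    }
    return 0;
}
```

&lt;p&gt;A caller would then check &lt;code&gt;write_all(fd, [data bytes], [data length])&lt;/code&gt; for a nonzero result and log &lt;code&gt;errno&lt;/code&gt; itself.&lt;/p&gt;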

&lt;p&gt;&lt;b&gt;String Lengths&lt;/b&gt;&lt;br&gt;The same string, represented differently, can have different lengths. This is a relatively common but incorrect pattern:&lt;/p&gt;

&lt;pre&gt;    write(fd, [string UTF8String], [string length]);
&lt;/pre&gt;

&lt;p&gt;The problem is that &lt;code&gt;NSString&lt;/code&gt; computes length in terms of UTF-16 code units, while &lt;code&gt;write&lt;/code&gt; wants a count of bytes. While the two numbers are equal when the string only contains ASCII (which is why people so frequently get away with writing this incorrect code), they're no longer equal once the string contains non-ASCII characters such as accented characters. Always compute the length of the same representation you're manipulating:&lt;/p&gt;

&lt;pre&gt;    const char *cStr = [string UTF8String];
    write(fd, cStr, strlen(cStr));
&lt;/pre&gt;
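
&lt;p&gt;To make the mismatch concrete, here's a small plain-C sketch that counts both lengths for the same UTF-8 text. Both function names are invented for this example, and the code assumes its input is valid UTF-8:&lt;/p&gt;

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the UTF-16 code units needed for a valid,
   NUL-terminated UTF-8 string, which is the unit NSString's length uses. */
size_t utf16_unit_count(const char *utf8)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                   /* continuation byte: already counted */
        units += (*p >= 0xF0) ? 2 : 1;  /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/* The number of bytes write() needs for the same string. */
size_t utf8_byte_count(const char *utf8)
{
    return strlen(utf8);
}
```

&lt;p&gt;For &lt;code&gt;"caf\xC3\xA9"&lt;/code&gt; (the word "café" in UTF-8) the first function returns 4 and the second returns 5, so passing the UTF-16 length to &lt;code&gt;write&lt;/code&gt; would drop the last byte.&lt;/p&gt;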

&lt;p&gt;&lt;b&gt;Casting to BOOL&lt;/b&gt;&lt;br&gt;Take this bit of code that just checks to see whether an object pointer is &lt;code&gt;nil&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)hasObject
    {
        return (BOOL)_object;
    }
&lt;/pre&gt;

&lt;p&gt;This works... usually. However, roughly 6% of the time, it will return &lt;code&gt;NO&lt;/code&gt; even though &lt;code&gt;_object&lt;/code&gt; is not &lt;code&gt;nil&lt;/code&gt;. What gives?&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;BOOL&lt;/code&gt; type is, unfortunately, not a boolean. Here's how it's defined:&lt;/p&gt;

&lt;pre&gt;    typedef signed char BOOL;
&lt;/pre&gt;

&lt;p&gt;This is another bit of unfortunate legacy from the days when C had no boolean type. Cocoa predates C99's &lt;code&gt;_Bool&lt;/code&gt;, so it defines its "boolean" type as a &lt;code&gt;signed char&lt;/code&gt;, which is just an 8-bit integer. When you cast a pointer to an integer, you just get the numeric value of that pointer. When you cast a pointer to a small integer, you just get the numeric value of the lower bits of that pointer. When the pointer looks like this:&lt;/p&gt;

&lt;pre&gt;    ....110011001110000
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;BOOL&lt;/code&gt; gets this:&lt;/p&gt;

&lt;pre&gt;               01110000
&lt;/pre&gt;

&lt;p&gt;This is not &lt;code&gt;0&lt;/code&gt;, meaning that it evaluates as true, so what's the problem? The problem is when the pointer looks like this:&lt;/p&gt;

&lt;pre&gt;    ....110011000000000
&lt;/pre&gt;

&lt;p&gt;Then the &lt;code&gt;BOOL&lt;/code&gt; gets this:&lt;/p&gt;

&lt;pre&gt;               00000000
&lt;/pre&gt;

&lt;p&gt;This is &lt;code&gt;0&lt;/code&gt;, also known as &lt;code&gt;NO&lt;/code&gt;, even though the pointer wasn't &lt;code&gt;nil&lt;/code&gt;. Oops!&lt;/p&gt;

&lt;p&gt;How often does this happen? There are &lt;code&gt;256&lt;/code&gt; possible values in the &lt;code&gt;BOOL&lt;/code&gt;, only one of which is &lt;code&gt;NO&lt;/code&gt;, so we'd naively expect it to happen about 1/256 of the time. However, Objective-C objects are allocated aligned, normally to &lt;code&gt;16&lt;/code&gt; bytes. This means that the bottom four bits of the pointer are always zero (something that &lt;a href="friday-qa-2012-07-27-lets-build-tagged-pointers.html"&gt;tagged pointers&lt;/a&gt; take advantage of) and there are only four bits of freedom in the resulting &lt;code&gt;BOOL&lt;/code&gt;. The odds of getting all zeroes there are about 1/16, or about 6%.&lt;/p&gt;
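
&lt;p&gt;The truncation and the 1/16 figure can both be checked with a few lines of plain C. This is only a sketch: &lt;code&gt;LegacyBOOL&lt;/code&gt; and the two function names are invented for the example, and addresses are passed as integers so no real objects are needed:&lt;/p&gt;

```c
#include <stdint.h>

/* Stand-in for the legacy typedef shown above, renamed so it can't clash
   with the real BOOL. */
typedef signed char LegacyBOOL;

/* What (BOOL)_object boils down to: only the low 8 bits of the address
   survive the conversion (on every mainstream compiler). */
LegacyBOOL legacy_bool_from_address(uintptr_t address)
{
    return (LegacyBOOL)address;
}

/* Counts how many of `count` consecutive 16-byte-aligned addresses,
   starting at `start`, would be misreported as NO. */
unsigned misreported_as_no(uintptr_t start, unsigned count)
{
    unsigned misses = 0;
    for (unsigned i = 0; i < count; i++) {
        uintptr_t address = start + 16u * i;
        if (legacy_bool_from_address(address) == 0)
            misses++;
    }
    return misses;
}
```

&lt;p&gt;An address ending in &lt;code&gt;0x70&lt;/code&gt; converts to a nonzero value, while one ending in &lt;code&gt;0x00&lt;/code&gt; converts to &lt;code&gt;0&lt;/code&gt;. Over any 256 consecutive aligned addresses, &lt;code&gt;misreported_as_no&lt;/code&gt; counts 16 misses, which is the 1/16 above.&lt;/p&gt;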

&lt;p&gt;To safely implement this method, perform an explicit comparison against &lt;code&gt;nil&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)hasObject
    {
        return _object != nil;
    }
&lt;/pre&gt;

&lt;p&gt;If you want to get clever and unreadable, you can also use the &lt;code&gt;!&lt;/code&gt; operator twice. This &lt;code&gt;!!&lt;/code&gt; construct is sometimes referred to as C's "convert to boolean" operator, although it's just built from parts:&lt;/p&gt;

&lt;pre&gt;    - (BOOL)hasObject
    {
        return !!_object;
    }
&lt;/pre&gt;

&lt;p&gt;The first &lt;code&gt;!&lt;/code&gt; produces &lt;code&gt;1&lt;/code&gt; or &lt;code&gt;0&lt;/code&gt; depending on whether &lt;code&gt;_object&lt;/code&gt; is &lt;code&gt;nil&lt;/code&gt;, but backwards. The second &lt;code&gt;!&lt;/code&gt; then puts it right, resulting in &lt;code&gt;1&lt;/code&gt; if &lt;code&gt;_object&lt;/code&gt; is not &lt;code&gt;nil&lt;/code&gt;, and &lt;code&gt;0&lt;/code&gt; if it is.&lt;/p&gt;

&lt;p&gt;You should probably stick to the &lt;code&gt;!= nil&lt;/code&gt; version.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Missing Method Argument&lt;/b&gt;&lt;br&gt;Let's say you're implementing a table view data source. You add this to your class's methods:&lt;/p&gt;

&lt;pre&gt;    - (id)tableView:(NSTableView *) objectValueForTableColumn:(NSTableColumn *)aTableColumn row:(NSInteger)rowIndex
    {
        return [dataArray objectAtIndex: rowIndex];
    }
&lt;/pre&gt;

&lt;p&gt;Then you run your app and &lt;code&gt;NSTableView&lt;/code&gt; complains that you haven't implemented this method. But it's right there!&lt;/p&gt;

&lt;p&gt;As usual, the computer is correct. The computer is your friend.&lt;/p&gt;

&lt;p&gt;Look closer. The first parameter is &lt;em&gt;missing&lt;/em&gt;. Why does this even &lt;em&gt;compile?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It turns out that Objective-C allows empty selector segments. The above does not declare a method named &lt;code&gt;tableView:objectValueForTableColumn:row:&lt;/code&gt; with a missing argument name. It declares a method named &lt;code&gt;tableView::row:&lt;/code&gt;, and the first argument is named &lt;code&gt;objectValueForTableColumn&lt;/code&gt;. This is a particularly nasty way to typo the name of a method, and if you do it in a context where the compiler can't warn you about the missing method, you may be trying to debug it for a long time.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Objective-C and Cocoa have plenty of pitfalls ready to trap the unwary programmer. The above is just a sampling. However, it's a good list of things to be careful of.&lt;/p&gt;

&lt;p&gt;That's it for today! Check back next time for more wacky advice. Friday Q&amp;amp;A is driven by user ideas, in case you didn't already know, so until next time, please &lt;a href="mailto:mike@mikeash.com"&gt;send in your ideas for articles&lt;/a&gt;!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-12-14-objective-c-pitfalls.html</guid><pubDate>Fri, 14 Dec 2012 14:38:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-11-30: Let's Build A Mach-O Executable
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-11-30-lets-build-a-mach-o-executable.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-11-30: Let's Build A Mach-O Executable
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 11 30  17 59"
                  tags="assembly macho fridayqna guest evil dwarf letsbuild"
            author="Gwynne Raskind"
            authorlink="http://www.darkrainfall.org/"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-11-30: Let's Build A Mach-O Executable
&lt;/div&gt;
              &lt;p&gt;This is something of a followup to my last article, &lt;a href="friday-qa-2012-11-09-dyld-dynamic-linking-on-os-x.html"&gt;dyld: Dynamic Linking On OS X&lt;/a&gt;, in which I explored how the dynamic linker &lt;code&gt;dyld&lt;/code&gt; does its job. This week, I'm going to recreate the function of both the compiler and the &lt;em&gt;static&lt;/em&gt; linker, building a Mach-O binary completely from scratch with only the help of the assembler.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The Right Tool For the Right Job&lt;/b&gt;&lt;br&gt;The best tool on OS X for producing binary files from assembly-language inputs is, of course, the assembler, &lt;code&gt;as&lt;/code&gt;. But, if you try to build a raw binary from this, you'll find that &lt;code&gt;as&lt;/code&gt; also functions as a static linker in its own right. This isn't what we're after.&lt;/p&gt;

&lt;p&gt;A more flexible tool, in this particular respect, is &lt;code&gt;nasm&lt;/code&gt;, the &lt;a href="http://nasm.us/"&gt;Netwide Assembler&lt;/a&gt;. &lt;code&gt;nasm&lt;/code&gt; is installed by the Xcode command-line tools, but unfortunately, Apple ships a horrifyingly outdated version, 0.98.40, which dates back to 2007 in terms of bug fixes, and to 1999 for features. The most recent version at the time of this writing is 2.10.05, which can be installed with &lt;code&gt;port install nasm&lt;/code&gt;, &lt;code&gt;brew install nasm&lt;/code&gt;, or the package manager of your choice. If you don't use a package manager, you can download and compile the source yourself.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nasm&lt;/code&gt; 2.x includes a number of useful things, like 64-bit support, and Mach-O output. We won't be using &lt;code&gt;nasm&lt;/code&gt;'s Mach-O support, since the point of all this is to do it by hand, but it'd be kind of nice to build a 64-bit binary using 64-bit instructions instead of split 32-bit words!&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reinserting the Prime Program&lt;/b&gt;&lt;br&gt;Here's the C source code for which we'll build our Mach-O binary. To keep the resulting binary relatively simple, I've written it to avoid importing more than the bare minimum of information:&lt;/p&gt;

&lt;pre&gt;    #define NULL ((void *)0L)
    extern int printf(const char * restrict format, ...);
    typedef long time_t;
    extern time_t time(time_t *sloc);

    int main(void)
    {
        printf("Hello, world #%ld!\n", time(NULL));
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;Some things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rather than &lt;code&gt;#include &amp;amp;lt;stdio.h&amp;gt;&lt;/code&gt; and &lt;code&gt;#include &amp;amp;lt;time.h&amp;gt;&lt;/code&gt;, I've manually declared &lt;code&gt;printf()&lt;/code&gt; and &lt;code&gt;time()&lt;/code&gt;, defined the &lt;code&gt;time_t&lt;/code&gt; type, and macroed &lt;code&gt;NULL&lt;/code&gt;. This avoids emitting extra debug information for the various stuff defined in the standard headers.&lt;/li&gt;
&lt;li&gt;I've defined &lt;code&gt;main()&lt;/code&gt; as taking no parameters. The C standard explicitly allows this form, and because of C's calling conventions, it works correctly even though the runtime still passes the usual arguments.&lt;/li&gt;
&lt;li&gt;I've used a format string that actually does a format replacement so that the compiler with which I produced my test files doesn't get all efficient and replace it with a &lt;code&gt;puts()&lt;/code&gt; call instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This generates the following assembly (built with Clang 3.3svn at &lt;code&gt;-Os&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;            .section        __TEXT,__text,regular,pure_instructions
            .globl  _main
    _main:                                  ## @main
            .cfi_startproc
    ## BB#0:                                ## %entry
            pushq   %rbp
    Ltmp2:
            .cfi_def_cfa_offset 16
    Ltmp3:
            .cfi_offset %rbp, -16
            movq    %rsp, %rbp
    Ltmp4:
            .cfi_def_cfa_register %rbp
            xorl    %edi, %edi
            callq   _time
            leaq    L_.str(%rip), %rdi
            movq    %rax, %rsi
            xorb    %al, %al
            callq   _printf
            xorl    %eax, %eax
            popq    %rbp
            ret
            .cfi_endproc

            .section        __TEXT,__cstring,cstring_literals
    L_.str:                                 ## @.str
            .asciz   "Hello, world #%ld!\n"

    .subsections_via_symbols
&lt;/pre&gt;

&lt;p&gt;The code itself is very straightforward: Inside the &lt;code&gt;__TEXT,__text&lt;/code&gt; section, set up a stack frame, call &lt;code&gt;time()&lt;/code&gt;, load the &lt;code&gt;L_.str&lt;/code&gt; string, set &lt;code&gt;al&lt;/code&gt; to zero, call &lt;code&gt;printf&lt;/code&gt;, zero &lt;code&gt;eax&lt;/code&gt;, tear down the stack frame, and return. Then, in the &lt;code&gt;__TEXT,__cstring&lt;/code&gt; section, define the &lt;code&gt;L_.str&lt;/code&gt; label to point to a zero-terminated ASCII string. Finally, declare that no symbols in this file occur inside basic blocks, which the linker uses during dead code stripping.&lt;/p&gt;

&lt;p&gt;The rest of the directives are related to Call Frame Information, which is used for unwinding data (&lt;code&gt;.unwind_info&lt;/code&gt; and &lt;code&gt;.eh_frame&lt;/code&gt;, exception handling support) and debug information (&lt;code&gt;.debug_frame&lt;/code&gt;). We'll be building the first two by hand.&lt;/p&gt;

&lt;p&gt;For sanity's sake, I'll be omitting the full DWARF debugging information. Even for this very simple program it would represent a considerable addition to this already overlong article.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The Start of a Mach-O Executable&lt;/b&gt;&lt;br&gt;Our &lt;code&gt;nasm&lt;/code&gt; input file will be used to generate a Mach-O file, so we need to start it with a Mach-O header. We'll use the 64-bit Mach-O little-endian format, whose header looks like this:&lt;/p&gt;

&lt;pre&gt;    struct mach_header_64 {
        uint32_t    magic;      /* mach magic number identifier */
        cpu_type_t  cputype;    /* cpu specifier */
        cpu_subtype_t   cpusubtype; /* machine specifier */
        uint32_t    filetype;   /* type of file */
        uint32_t    ncmds;      /* number of load commands */
        uint32_t    sizeofcmds; /* the size of all the load commands */
        uint32_t    flags;      /* flags */
        uint32_t    reserved;   /* reserved */
    };

    /* Constant for the magic field of the mach_header_64 (64-bit architectures) */
    #define MH_MAGIC_64 0xfeedfacf /* the 64-bit mach magic number */
    #define MH_CIGAM_64 0xcffaedfe /* NXSwapInt(MH_MAGIC_64) */
&lt;/pre&gt;

&lt;p&gt;Here's the &lt;code&gt;nasm&lt;/code&gt; input for our Mach-O header:&lt;/p&gt;

&lt;pre&gt;    bits 64
    cpu x64

    __mh_execute_header:
        dd 0xfeedfacf   ; MH_MAGIC_64
        dd 16777223     ; CPU_TYPE_X86 | CPU_ARCH_ABI64
        dd 0x80000003   ; CPU_SUBTYPE_I386_ALL | CPU_SUBTYPE_LIB64
        dd 2            ; MH_EXECUTE
        dd 16           ; number of load commands
        dd ___loadcmdsend - ___loadcmdsstart    ; size of load commands
        dd 0x00200085   ; MH_NOUNDEFS | MH_DYLDLINK | MH_TWOLEVEL | MH_PIE
        dd 0            ; reserved
    ___loadcmdsstart:
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;bits&lt;/code&gt; and &lt;code&gt;cpu&lt;/code&gt; directives just tell &lt;code&gt;nasm&lt;/code&gt; to run in 64-bit mode.&lt;/p&gt;
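
&lt;p&gt;The magic-looking numbers in that listing are just the named constants from the comments ORed together. Here's a sketch in C that rebuilds them. The struct is a local mirror of &lt;code&gt;mach_header_64&lt;/code&gt; and the function names are invented for this example; the constant values are copied by hand from &lt;code&gt;mach/machine.h&lt;/code&gt; and &lt;code&gt;mach-o/loader.h&lt;/code&gt; so the code builds without Apple's headers:&lt;/p&gt;

```c
#include <stdint.h>

/* Local mirror of mach_header_64. Apple's headers declare cputype and
   cpusubtype as 32-bit integers; unsigned is used here to keep the bit
   patterns simple. */
struct sketch_mach_header_64 {
    uint32_t magic;
    uint32_t cputype;
    uint32_t cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
    uint32_t reserved;
};

/* Values copied by hand from Apple's headers. */
#define MH_MAGIC_64          0xfeedfacfu
#define CPU_TYPE_X86         7u
#define CPU_ARCH_ABI64       0x01000000u
#define CPU_SUBTYPE_I386_ALL 3u
#define CPU_SUBTYPE_LIB64    0x80000000u
#define MH_EXECUTE           2u
#define MH_NOUNDEFS          0x1u
#define MH_DYLDLINK          0x4u
#define MH_TWOLEVEL          0x80u
#define MH_PIE               0x200000u

/* Builds the same header the nasm listing emits; the caller supplies the
   two values that depend on the load commands. */
struct sketch_mach_header_64 make_header(uint32_t ncmds, uint32_t sizeofcmds)
{
    struct sketch_mach_header_64 header = {
        .magic      = MH_MAGIC_64,
        .cputype    = CPU_TYPE_X86 | CPU_ARCH_ABI64,
        .cpusubtype = CPU_SUBTYPE_I386_ALL | CPU_SUBTYPE_LIB64,
        .filetype   = MH_EXECUTE,
        .ncmds      = ncmds,
        .sizeofcmds = sizeofcmds,
        .flags      = MH_NOUNDEFS | MH_DYLDLINK | MH_TWOLEVEL | MH_PIE,
        .reserved   = 0
    };
    return header;
}
```

&lt;p&gt;The struct is eight 32-bit fields, so the header occupies 32 bytes and the first load command starts at file offset 32.&lt;/p&gt;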

&lt;p&gt;Immediately after the Mach-O header come the load commands. There's a whole list of commands which are required for an executable, and a huge pile more which &lt;em&gt;might&lt;/em&gt; be in one. Clang produces 16 load commands for this executable. A load command looks like this:&lt;/p&gt;

&lt;pre&gt;    struct load_command {
        uint32_t cmd;       /* type of load command */
        uint32_t cmdsize;   /* total size of command in bytes */
    };
&lt;/pre&gt;

&lt;p&gt;Each load command is actually larger than this; the &lt;code&gt;cmd&lt;/code&gt; field tells the loader how to interpret the following data. Load commands &lt;strong&gt;must&lt;/strong&gt; be aligned to an 8-byte boundary for 64-bit Mach-O files.&lt;/p&gt;
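
&lt;p&gt;Because every command starts with the same two fields, a reader can hop from one command to the next without understanding any of them. The following C sketch shows that walk along with the 8-byte rounding. It is an illustration only, not how &lt;code&gt;dyld&lt;/code&gt; is actually written, and the struct and function names are invented for the example:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct load_command. */
struct sketch_load_command {
    uint32_t cmd;
    uint32_t cmdsize;
};

/* Rounds `size` up to the next multiple of 8. */
uint32_t align_to_8(uint32_t size)
{
    return (size + 7u) & ~7u;
}

/* Walks `sizeofcmds` bytes of load commands and returns how many there
   are, or -1 if a command is too small, misaligned, or runs past the end. */
int count_load_commands(const unsigned char *bytes, uint32_t sizeofcmds)
{
    uint32_t offset = 0;
    int count = 0;

    while (offset < sizeofcmds) {
        struct sketch_load_command lc;
        if (sizeofcmds - offset < sizeof lc)
            return -1;
        memcpy(&lc, bytes + offset, sizeof lc);  /* avoids unaligned reads */
        if (lc.cmdsize < sizeof lc || lc.cmdsize % 8 != 0
            || lc.cmdsize > sizeofcmds - offset)
            return -1;
        offset += lc.cmdsize;   /* cmdsize covers the whole command */
        count++;
    }
    return count;
}
```

&lt;p&gt;The walk only ever looks at &lt;code&gt;cmdsize&lt;/code&gt;, which is why a wrong size in one command makes everything after it unreadable.&lt;/p&gt;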

&lt;p&gt;&lt;b&gt;Segments and Sections&lt;/b&gt;&lt;br&gt;Segments are the blocks of data and code which &lt;code&gt;dyld&lt;/code&gt; actually maps into memory at runtime. Sections are subdivisions of segments. Segments and sections both have names, and quite a few are standard and predefined.&lt;/p&gt;

&lt;p&gt;Here's our first segment command:&lt;/p&gt;

&lt;pre&gt;    ___pagezerostart:
        dd 0x19         ; LC_SEGMENT_64
        dd ___pagezeroend - ___pagezerostart    ; command size
        db '__PAGEZERO',0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0            ; VM address
        dq 0x100000000  ; VM size
        dq 0            ; file offset
        dq 0            ; file size
        dd 0x0          ; VM_PROT_NONE (maximum protection)
        dd 0x0          ; VM_PROT_NONE (initial protection)
        dd 0            ; number of sections
        dd 0x0          ; flags
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___pagezeroend:
&lt;/pre&gt;

&lt;p&gt;This is the &lt;code&gt;__PAGEZERO&lt;/code&gt; segment, which predefines the entire lower 4GB of the 64-bit virtual memory space as inaccessible. Because of this segment, which is marked unreadable, unwriteable, and nonexecutable, dereferencing &lt;code&gt;NULL&lt;/code&gt; pointers causes an immediate segmentation fault.&lt;/p&gt;
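
&lt;p&gt;The "pad to 16 bytes" comment in the listing reflects the fact that segment and section names are fixed 16-byte fields rather than C strings. In C, &lt;code&gt;strncpy&lt;/code&gt; happens to produce exactly that layout. A small sketch, with &lt;code&gt;fill_name16&lt;/code&gt; as an invented name:&lt;/p&gt;

```c
#include <string.h>

/* Fills a 16-byte Mach-O name field. strncpy() zero-fills whatever the
   name doesn't use, and a name of exactly 16 characters is stored with
   no terminator at all. */
void fill_name16(char field[16], const char *name)
{
    strncpy(field, name, 16);
}
```

&lt;p&gt;For &lt;code&gt;"__PAGEZERO"&lt;/code&gt; this yields the ten name bytes followed by six zero bytes, matching the &lt;code&gt;db&lt;/code&gt; line by hand. Code that reads these fields back must not assume they are NUL-terminated.&lt;/p&gt;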

&lt;p&gt;The next segment command is more complicated:&lt;/p&gt;

&lt;pre&gt;    ___TEXTstart:
        dd 0x19         ; LC_SEGMENT_64
        dd ___TEXTend - ___TEXTstart    ; command size
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000  ; VM address
        dq 0x1000       ; VM size
        dq 0            ; file offset
        dq 0x1000       ; file size
        dd 0x7          ; VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE
        dd 0x5          ; VM_PROT_READ | VM_PROT_EXECUTE
        dd 6            ; number of sections
        dd 0x0          ; flags
    ___TEXTtextstart:
        db '__text',0,0,0,0,0,0,0,0,0,0 ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___codestart - ___TEXTload ; address
        dq ___codeend - ___codestart    ; size
        dd ___codestart ; offset
        dd 0            ; alignment as power of 2 (1)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x80000400   ; S_REGULAR | S_ATTR_PURE_INSTRUCTIONS | S_ATTR_SOME_INSTRUCTIONS
        dd 0            ; reserved1
        dd 0            ; reserved2
        dd 0            ; reserved3
    ___TEXTstubsstart:
        db '__stubs',0,0,0,0,0,0,0,0,0  ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___stubstart - ___TEXTload ; address
        dq ___stubend - ___stubstart    ; size
        dd ___stubstart ; offset
        dd 1            ; alignment as power of 2 (2)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x80000408   ; S_SYMBOL_STUBS | S_ATTR_PURE_INSTRUCTIONS | S_ATTR_SOME_INSTRUCTIONS
        dd 0            ; reserved1 (index into indirect symbol table)
        dd 6            ; reserved2 (size per stub)
        dd 0            ; reserved3
    ___TEXTstubhelperstart:
        db '__stub_helper',0,0,0    ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___stubhelpstart - ___TEXTload ; address
        dq ___stubhelpend - ___stubhelpstart    ; size
        dd ___stubhelpstart ; offset
        dd 2            ; alignment as power of 2 (4)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x80000400   ; S_REGULAR | S_ATTR_PURE_INSTRUCTIONS | S_ATTR_SOME_INSTRUCTIONS
        dd 0            ; reserved1
        dd 0            ; reserved2
        dd 0            ; reserved3
    ___TEXTcstringstart:
        db '__cstring',0,0,0,0,0,0,0    ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___strsstart - ___TEXTload ; address
        dq ___strsend - ___strsstart    ; size
        dd ___strsstart ; offset
        dd 0            ; alignment as power of 2 (1)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x00000002   ; S_CSTRING_LITERALS
        dd 0            ; reserved1
        dd 6            ; reserved2
        dd 0            ; reserved3
    ___TEXTunwindinfostart:
        db '__unwind_info',0,0,0    ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___uwstart - ___TEXTload   ; address
        dq ___uwend - ___uwstart    ; size
        dd ___uwstart   ; offset
        dd 0            ; alignment as power of 2 (1)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x00000000   ; no flags
        dd 0            ; reserved1
        dd 0            ; reserved2
        dd 0            ; reserved3
    ___TEXTehframestart:
        db '__eh_frame',0,0,0,0,0,0 ; section name (pad to 16 bytes)
        db '__TEXT',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100000000 + ___ehstart - ___TEXTload   ; address
        dq ___ehend - ___ehstart    ; size
        dd ___ehstart   ; offset
        dd 3            ; alignment as power of 2 (8)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x00000000   ; no flags
        dd 0            ; reserved1
        dd 0            ; reserved2
        dd 0            ; reserved3
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___TEXTend:
&lt;/pre&gt;

&lt;p&gt;So, this is the &lt;code&gt;__TEXT&lt;/code&gt; segment, which covers all the executable code and a good bit of other data. It contains six sections. Each section is aligned according to its section information, and all the sections are shoved together at the end of the segment, such that the first quite-a-few bytes of &lt;code&gt;__TEXT&lt;/code&gt; are zeroed. However, because of how the linker maps segments, &lt;code&gt;__TEXT&lt;/code&gt; actually includes all the Mach-O headers. As we'll see later, the symbol table even has its own entry for &lt;code&gt;__mh_execute_header&lt;/code&gt;. Here are the sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;__text&lt;/code&gt; - The actual &lt;em&gt;code&lt;/em&gt; code of the executable, where all the functions are. In this case, just one function - &lt;code&gt;main()&lt;/code&gt;. It's marked as &lt;code&gt;S_REGULAR&lt;/code&gt;, which means "it's a plain old section", and flagged as containing both "some instructions" (at least some executable code) and "pure instructions" (&lt;em&gt;only&lt;/em&gt; executable code).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;__stubs&lt;/code&gt; - The jump table which redirects into the lazy and non-lazy symbol sections. See my previous article for an explanation of the contents of this section. It's marked as &lt;code&gt;S_SYMBOL_STUBS&lt;/code&gt;, the meaning of which is fairly obvious.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;__stub_helper&lt;/code&gt; - The helper function for lazy dynamically bound symbols.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;__cstring&lt;/code&gt; - A section containing the read-only C string literals used within the code.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;__unwind_info&lt;/code&gt; - The compact unwind information for the executable's code. Generated for exception handling on OS X.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;__eh_frame&lt;/code&gt; - The DWARF2 unwind information for the executable's code. Generated for exception handling and debugging.&lt;/li&gt;
&lt;/ol&gt;
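
&lt;p&gt;The section flags words used above pack two things into 32 bits: the low byte is the section type, and the upper 24 bits are attribute flags. Here's a C sketch that pulls them apart. The two function names are invented for this example, and the constants are copied by hand from &lt;code&gt;mach-o/loader.h&lt;/code&gt; so it builds without Apple's headers:&lt;/p&gt;

```c
#include <stdint.h>

/* Values copied by hand from mach-o/loader.h. */
#define SECTION_TYPE               0x000000ffu
#define SECTION_ATTRIBUTES         0xffffff00u
#define S_REGULAR                  0x0u
#define S_CSTRING_LITERALS         0x2u
#define S_NON_LAZY_SYMBOL_POINTERS 0x6u
#define S_LAZY_SYMBOL_POINTERS     0x7u
#define S_SYMBOL_STUBS             0x8u
#define S_ATTR_PURE_INSTRUCTIONS   0x80000000u
#define S_ATTR_SOME_INSTRUCTIONS   0x00000400u

/* The section type lives in the low 8 bits. */
uint32_t section_type(uint32_t flags)
{
    return flags & SECTION_TYPE;
}

/* Attributes are individual bits in the upper 24 bits. */
int section_has_attribute(uint32_t flags, uint32_t attribute)
{
    return (flags & SECTION_ATTRIBUTES & attribute) != 0;
}
```

&lt;p&gt;Applied to &lt;code&gt;0x80000408&lt;/code&gt; from the &lt;code&gt;__stubs&lt;/code&gt; section, this gives a type of &lt;code&gt;S_SYMBOL_STUBS&lt;/code&gt; with both instruction attributes set, just as the comment in the listing says.&lt;/p&gt;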

&lt;p&gt;Next comes the &lt;code&gt;__DATA&lt;/code&gt; segment:&lt;/p&gt;

&lt;pre&gt;    ___DATAstart:
        dd 0x19         ; LC_SEGMENT_64
        dd ___DATAend - ___DATAstart    ; command size
        db '__DATA',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100001000  ; VM address
        dq 0x1000       ; VM size
        dq 0x1000       ; file offset
        dq 0x1000       ; file size
        dd 0x7          ; VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE
        dd 0x3          ; VM_PROT_READ | VM_PROT_WRITE
        dd 2            ; number of sections
        dd 0x0          ; flags
    ___DATAnlsymptrstart:
        db '__nl_symbol_ptr',0  ; section name (pad to 16 bytes)
        db '__DATA',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100001000 + ___nlsymptrstart - ___DATAload ; address
        dq ___nlsymptrend - ___nlsymptrstart    ; size
        dd ___nlsymptrstart ; offset
        dd 3            ; alignment as power of 2 (8)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x00000006   ; S_NON_LAZY_SYMBOL_POINTERS
        dd 2            ; reserved1 (index into indirect symbol table)
        dd 0            ; reserved2
        dd 0            ; reserved3
    ___DATAlasymptrstart:
        db '__la_symbol_ptr',0  ; section name (pad to 16 bytes)
        db '__DATA',0,0,0,0,0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100001000 + ___lasymptrstart - ___DATAload ; address
        dq ___lasymptrend - ___lasymptrstart    ; size
        dd ___lasymptrstart ; offset
        dd 3            ; alignment as power of 2 (8)
        dd 0            ; relocations data offset
        dd 0            ; number of relocations
        dd 0x00000007   ; S_LAZY_SYMBOL_POINTERS
        dd 4            ; reserved1 (index into indirect symbol table)
        dd 0            ; reserved2
        dd 0            ; reserved3
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___DATAend:
&lt;/pre&gt;

&lt;p&gt;There are only two sections here, since this program doesn't have any global or static data: the non-lazy and lazy symbol pointers.&lt;/p&gt;

&lt;p&gt;And then the last segment, &lt;code&gt;__LINKEDIT&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    ___LINKEDITstart:
        dd 0x19         ; LC_SEGMENT_64
        dd ___LINKEDITend - ___LINKEDITstart    ; command size
        db '__LINKEDIT',0,0,0,0,0,0 ; segment name (pad to 16 bytes)
        dq 0x100002000  ; VM address
        dq 0x1000       ; VM size
        dq 0x2000       ; file offset
        dq ___LINKEDITdataend - ___LINKEDITdatastart    ; file size
        dd 0x7          ; VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE
        dd 0x1          ; VM_PROT_READ
        dd 0            ; number of sections
        dd 0x0          ; flags
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___LINKEDITend:
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;__LINKEDIT&lt;/code&gt; segment contains a variety of data used by &lt;code&gt;dyld&lt;/code&gt;, such as the symbol table, the indirect symbol table, the rebase opcodes, the binding opcodes, the exports table, the function starts information, the data-in-code table, and some codesigning data.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Lots and Lots of Linker Data&lt;/b&gt;&lt;br&gt;The next several load commands deal with static and dynamic linking information:&lt;/p&gt;

&lt;pre&gt;    ___dyldinfostart:
        dd 0x80000022   ; LC_DYLD_INFO | LC_REQ_DYLD (also known as LC_DYLD_INFO_ONLY)
        dd ___dyldinfoend - ___dyldinfostart    ; command size
        dd ___rebasestart   ; rebase info offset
        dd ___rebaseend - ___rebasestart    ; rebase info size
        dd ___bindstart ; binding info offset
        dd ___bindend - ___bindstart    ; binding info size
        dd 0            ; weak binding info offset
        dd 0            ; weak binding info size
        dd ___lazystart ; lazy binding info offset
        dd ___lazyend - ___lazystart    ; lazy binding info size
        dd ___exportstart   ; export info offset
        dd ___exportend - ___exportstart    ; export info size
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___dyldinfoend:
    ___symtabinfostart:
        dd 0x2          ; LC_SYMTAB
        dd ___symtabinfoend - ___symtabinfostart    ; command size
        dd ___symtabstart   ; symbol table offset
        dd (___symtabend - ___symtabstart) &amp;gt;&amp;gt; 4 ; number of symbols
        dd ___strtabstart   ; string table offset
        dd ___strtabend - ___strtabstart    ; string table size
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___symtabinfoend:
    ___dysymtabinfostart:
        dd 0xb          ; LC_DYSYMTAB
        dd ___dysymtabinfoend - ___dysymtabinfostart    ; command size
        dd 0            ; local symbols index
        dd 8            ; number of local symbols
        dd 8            ; external symbols index
        dd 2            ; number of external symbols
        dd 10           ; undefined symbols index
        dd 3            ; number of undefined symbols
        dd 0            ; table of contents offset
        dd 0            ; table of contents entries
        dd 0            ; module table offset
        dd 0            ; module table entries
        dd 0            ; external references table offset
        dd 0            ; external references table entries
        dd ___indirsymstart ; indirect symbol table offset
        dd (___indirsymend - ___indirsymstart) &amp;gt;&amp;gt; 2 ; indirect symbol table entries
        dd 0            ; local relocation table offset
        dd 0            ; local relocation table entries
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___dysymtabinfoend:
    ___loaddylinkerstart:
        dd 0xe          ; LC_LOAD_DYLINKER
        dd ___loaddylinkerend - ___loaddylinkerstart    ; command size
        dd ___loaddylinkername - ___loaddylinkerstart   ; offset to name
    ___loaddylinkername:
        db '/usr/lib/dyld',0    ; name
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___loaddylinkerend:
    ___maincmdstart:
        dd 0x80000028   ; LC_MAIN | LC_REQ_DYLD
        dd ___maincmdend - ___maincmdstart  ; command size
        dq _main        ; offset of main from start of __TEXT
        dq 0            ; stack size
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___maincmdend:
    ___loadlibsystemstart:
        dd 0xc          ; LC_LOAD_DYLIB
        dd ___loadlibsystemend - ___loadlibsystemstart  ; command size
        dd ___loadlibsystemname - ___loadlibsystemstart ; offset to path
        dd 2            ; UNIX time stamp Wed Dec 31 19:00:02 1969
        dd 0x00a90300   ; current version (169.3.0)
        dd 0x00010000   ; compatibility version (1.0.0)
    ___loadlibsystemname:
        db '/usr/lib/libSystem.B.dylib',0   ; path
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___loadlibsystemend:
    ___fstartscmdstart:
        dd 0x26         ; LC_FUNCTION_STARTS
        dd ___fstartscmdend - ___fstartscmdstart    ; command size
        dd ___functionstartsstart   ; offset to function starts data (fun label name, isn't it?)
        dd ___functionstartsend - ___functionstartsstart    ; size of function starts data (even more fun name!)
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___fstartscmdend:
    ___datacodecmdstart:
        dd 0x29         ; LC_DATA_IN_CODE
        dd ___datacodecmdend - ___datacodecmdstart  ; command size
        dd ___datacodestart ; offset to data-in-code information
        dd ___datacodeend - ___datacodestart ; size of data-in-code information
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___datacodecmdend:
    ___dycodesigncmdstart:
        dd 0x2b         ; LC_DYLIB_CODE_SIGN_DRS
        dd ___dycodesigncmdend - ___dycodesigncmdstart  ; command size
        dd ___dylibcodesignaturesstart  ; offset to code signatures from dylibs
        dd ___dylibcodesignaturesend - ___dylibcodesignaturesstart  ; you get the idea, right?
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___dycodesigncmdend:
&lt;/pre&gt;

&lt;p&gt;To summarize, this long blather of data consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A list of dynamic linking info for the binary. This command, along with some others, is marked with &lt;code&gt;LC_REQ_DYLD&lt;/code&gt;, meaning that if the version of &lt;code&gt;dyld&lt;/code&gt; loading the binary doesn't understand the command, it must give up right then rather than continue without the information.&lt;/li&gt;
&lt;li&gt;The locations of the symbol and string tables. These are given as offsets from the beginning of the file, but it is understood that the data is contained within the &lt;code&gt;__LINKEDIT&lt;/code&gt; segment. At runtime, &lt;code&gt;dyld&lt;/code&gt; will perform the calculation &lt;code&gt;symtable_base_address = linkedit_base_address + (symtab_offset - linkedit_offset)&lt;/code&gt; to get the actual location in memory of the symbol table. The same calculation is repeated for the string table, as well as for the offsets given in the &lt;code&gt;LC_DYLD_INFO&lt;/code&gt; and &lt;code&gt;LC_DYSYMTAB&lt;/code&gt; commands.&lt;/li&gt;
&lt;li&gt;A set of dynamic symbol data for the binary, giving the offsets and counts within the symbol table for various types of symbols.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;LC_LOAD_DYLINKER&lt;/code&gt; command, which gives the hardcoded path of the dynamic linker to load the executable with. This is read by the kernel rather than by the dynamic linker itself: the kernel runs the specified program when the process is spawned. Don't get the idea that you can use this to subvert the loading process, however; the kernel won't let you pick just any dynamic linker.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LC_MAIN&lt;/code&gt;, a replacement for the older &lt;code&gt;LC_UNIXTHREAD&lt;/code&gt; command. It used to be that executables were initialized with a thread state specified within the binary itself, but recently, someone realized this was a waste of time and space with &lt;code&gt;dyld&lt;/code&gt; running early and the state being exactly the same in practically every executable. Instead, &lt;code&gt;LC_MAIN&lt;/code&gt; gives the file offset of the entry point (&lt;code&gt;main()&lt;/code&gt;) and &lt;code&gt;dyld&lt;/code&gt; calls it directly, which also replaces the old &lt;code&gt;crt1.o&lt;/code&gt; object that contained glue code to set up &lt;code&gt;main()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LC_LOAD_DYLIB&lt;/code&gt; is the "I link to this dynamic library for some of my undefined symbols" command. This binary only links to &lt;code&gt;libSystem.B.dylib&lt;/code&gt;, the OS X equivalent of &lt;code&gt;libc&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LC_FUNCTION_STARTS&lt;/code&gt; is a table of data in the &lt;code&gt;__LINKEDIT&lt;/code&gt; segment which gives the address of every function entry point in the executable. Among other things, this allows for functions to exist that have no entries in the symbol table.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LC_DATA_IN_CODE&lt;/code&gt; is similarly a table giving the locations of data bytes which are embedded within executable code. This is useful for any number of purposes, not the least of which is accurate disassembly.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LC_DYLIB_CODE_SIGN_DRS&lt;/code&gt;, finally, gives a list of designated requirements for each dynamic library linked with the executable. This allows the code signing machinery to determine the suitability of the executable without having to load every dynamic library it links to.&lt;/li&gt;
&lt;/ol&gt;
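
&lt;p&gt;The offset arithmetic from item 2 can be sketched in a few lines of Python. The segment values are the ones from the &lt;code&gt;__LINKEDIT&lt;/code&gt; load command above; the symbol table offset and the slide are made-up numbers for illustration, and &lt;code&gt;linkedit_address&lt;/code&gt; is a hypothetical helper, not anything &lt;code&gt;dyld&lt;/code&gt; actually exposes.&lt;/p&gt;

```python
# Values taken from the __LINKEDIT segment command in this article.
LINKEDIT_VMADDR = 0x100002000   # VM address of the segment
LINKEDIT_FILEOFF = 0x2000       # file offset of the segment


def linkedit_address(file_offset, slide=0):
    """Translate a file offset inside __LINKEDIT to a runtime address.

    slide is the displacement applied at load time (0 means the image
    was loaded at its preferred address).
    """
    linkedit_base = LINKEDIT_VMADDR + slide
    return linkedit_base + (file_offset - LINKEDIT_FILEOFF)


# Hypothetical example: a symbol table stored at file offset 0x2090.
symtab_offset = 0x2090
print(hex(linkedit_address(symtab_offset)))           # 0x100002090
print(hex(linkedit_address(symtab_offset, 0x5000)))   # 0x100007090
```

&lt;p&gt;The same helper would apply to every other offset that points into &lt;code&gt;__LINKEDIT&lt;/code&gt;, since they are all plain file offsets.&lt;/p&gt;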

&lt;p&gt;&lt;b&gt;A Few More!&lt;/b&gt;&lt;br&gt;Just when you thought we were done, there're three more load commands we haven't covered yet:&lt;/p&gt;

&lt;pre&gt;    ___uuidstart:
        dd 0x1b         ; LC_UUID
        dd ___uuidend - ___uuidstart    ; command size
        db 0xd3,0xec,0x58,0x28,0x02,0x26,0x36,0x29,0xab,0xc3,0x7d,0x6d,0xc9,0xf9,0x2d,0xda  ; D3EC5828-0226-3629-ABC3-7D6DC9F92DDA
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___uuidend:
    ___osverstart:
        dd 0x24         ; LC_VERSION_MIN_MACOSX
        dd ___osverend - ___osverstart  ; command size
        dd 0x000a0800   ; OS min version: 10.8
        dd 0x000a0800   ; Build SDK version: 10.8
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___osverend:
    ___sourceverstart:
        dd 0x2a         ; LC_SOURCE_VERSION
        dd ___sourceverend - ___sourceverstart  ; command size
        dq 0            ; Source version: 0.0.0.0.0
        align 8, db 0   ; pad with zero to 8-byte boundary
    ___sourceverend:
    ___loadcmdsend:
&lt;/pre&gt;

&lt;p&gt;These are the binary's UUID, the version of OS X it's meant for, the version of the SDK it was linked against, and the "source version". I can't find any clue what the "source version" actually is, and it's just a bunch of zeroes in the binaries I've looked at, so your guess is as good as mine.&lt;/p&gt;
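
&lt;p&gt;The version numbers in &lt;code&gt;LC_VERSION_MIN_MACOSX&lt;/code&gt; (and in &lt;code&gt;LC_LOAD_DYLIB&lt;/code&gt; earlier) pack X.Y.Z into 32 bits: 16 bits of major version, then 8 bits each of minor and patch. Here is a minimal decoding sketch; &lt;code&gt;decode_version&lt;/code&gt; is a name made up for this example.&lt;/p&gt;

```python
def decode_version(packed):
    """Split a packed xxxx.yy.zz version into (major, minor, patch)."""
    major = packed // 0x10000           # top 16 bits
    minor = (packed // 0x100) % 0x100   # next 8 bits
    patch = packed % 0x100              # low 8 bits
    return major, minor, patch


# The three packed versions that appear in the load commands above.
for value in (0x000a0800, 0x00a90300, 0x00010000):
    text = ".".join(str(part) for part in decode_version(value))
    print(hex(value), "is", text)
```

&lt;p&gt;Run against the values in the listings, this gives 10.8.0 for the OS and SDK versions, and 169.3.0 and 1.0.0 for the current and compatibility versions of &lt;code&gt;libSystem&lt;/code&gt;.&lt;/p&gt;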

&lt;p&gt;&lt;b&gt;Finally, Something Else!&lt;/b&gt;&lt;br&gt;The first thing we do now is pad out the file to the start of &lt;code&gt;main()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    ___TEXTload:
        times (0xf14-($-$$)) db 0   ; pad the __TEXT segment
&lt;/pre&gt;

&lt;p&gt;You might ask why I hardcoded the start address there instead of writing &lt;code&gt;_main-($-$$)&lt;/code&gt;. It certainly looks fragile. Well, it is. The problem is that &lt;code&gt;nasm&lt;/code&gt; doesn't provide a simple means to align data to the "end" of a segment, especially since we're not using its built-in sectioning support. It doesn't know where &lt;code&gt;_main&lt;/code&gt; is until the padding has been added! In this case, I just hardcode the file offset where &lt;code&gt;main()&lt;/code&gt; starts (which is the exact value of the &lt;code&gt;__TEXT,__text&lt;/code&gt; section's &lt;code&gt;offset&lt;/code&gt; field) and let it stand as a hack, rather than trying to figure out an elegant-but-complicated solution.&lt;/p&gt;

&lt;p&gt;Now we take the data in order; we don't even really have to do it in any particular order, since the labels we used in the load commands will relocate everything according to where we place it in the file, but there's no reason not to. The first thing is &lt;code&gt;__TEXT,__text&lt;/code&gt;, the executable code. Notice that we have to rewrite the original assembly code in &lt;code&gt;nasm&lt;/code&gt;'s syntax: &lt;code&gt;nasm&lt;/code&gt; uses the Intel syntax, rather than the AT&amp;amp;T syntax used by the GNU assembler. The major differences are that the operand order is reversed (destination first) and that register names carry no &lt;code&gt;%&lt;/code&gt; prefix. All the various directives are also stripped out, since we're doing their jobs by hand.&lt;/p&gt;

&lt;pre&gt;    ___codestart:
    _main:
        push    rbp
        mov     rbp, rsp
        xor     edi, edi
        call    _time
        lea     rdi, [rel L_str]
        mov     rsi, rax
        xor     al, al
        call    _printf
        xor     eax, eax
        pop     rbp
        ret
    ___codeend:
&lt;/pre&gt;

&lt;p&gt;We also don't have any size suffixes on the instructions, since &lt;code&gt;nasm&lt;/code&gt; can infer them from the operands. The &lt;code&gt;rel&lt;/code&gt; qualifier for the string load just tells &lt;code&gt;nasm&lt;/code&gt; to generate a &lt;code&gt;rip&lt;/code&gt;-relative access instead of an absolute position, which is necessary since we marked the executable as position-independent.&lt;/p&gt;

&lt;p&gt;Next we have the symbol stubs for &lt;code&gt;time()&lt;/code&gt; and &lt;code&gt;printf()&lt;/code&gt;, and the stub helper:&lt;/p&gt;

&lt;pre&gt;    ___stubstart:
    _printf:
        jmp     [rel _lazy_printf]
    _time:
        jmp     [rel _lazy_time]
    ___stubend:

    ___stubhelpstart:
    _stub_helper:
        lea     r11, [rel _nonlazy_dyld_stub_binder]
        push    r11
        jmp     [rel _nonlazy_dyld_stub_binder]
        nop
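    _stub_helper_printf:    ; initial target of the _lazy_printf pointer
        push    strict qword (_lazy_printf - ___lasymptrstart)
        jmp     _stub_helper
    _stub_helper_time:      ; initial target of the _lazy_time pointer
        push    strict qword (_lazy_time - ___lasymptrstart)
        jmp     _stub_helper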
    ___stubhelpend:
&lt;/pre&gt;

&lt;p&gt;The stubs themselves jump through the lazy symbol pointers in the &lt;code&gt;__DATA&lt;/code&gt; segment. Each pointer initially points right back into the bottom of &lt;code&gt;_stub_helper&lt;/code&gt;, which pushes the symbol's offset within the lazy symbol pointer section and calls into &lt;code&gt;dyld&lt;/code&gt; itself through a non-lazy symbol (which will be bound by &lt;code&gt;dyld&lt;/code&gt; when the executable is loaded). &lt;code&gt;dyld&lt;/code&gt; will bind the symbol and rewrite the lazy symbol pointer so that future calls to that stub go directly to the function. Notice that these are all direct, unconditional jumps, not subroutine calls. Also notice the use of the &lt;code&gt;strict qword&lt;/code&gt; directives, which stop &lt;code&gt;nasm&lt;/code&gt; from shrinking the pushed immediates to a shorter encoding.&lt;/p&gt;
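
&lt;p&gt;If the control flow is hard to follow in assembly, here is a toy model of it in Python. None of this is real &lt;code&gt;dyld&lt;/code&gt; API: the dictionaries and function names are stand-ins I made up for the stub, the lazy pointer, the helper, and the binder.&lt;/p&gt;

```python
import time

# Pretend exports of libSystem, and a dict standing in for __la_symbol_ptr.
LIBRARY = {"_time": lambda: int(time.time())}
lazy_pointers = {}
bind_count = 0   # how many times the binder has run


def dyld_stub_binder(name):
    """Look the symbol up and overwrite its lazy pointer with the result."""
    global bind_count
    bind_count += 1
    lazy_pointers[name] = LIBRARY[name]
    return lazy_pointers[name]


def make_helper(name):
    """Stand-in for the per-symbol tail of the stub helper."""
    def helper(*args):
        return dyld_stub_binder(name)(*args)
    return helper


def stub(name, *args):
    """Stand-in for a __stubs entry: jump through the lazy pointer."""
    return lazy_pointers[name](*args)


# Before the first call, the lazy pointer targets the helper.
lazy_pointers["_time"] = make_helper("_time")

stub("_time")       # first call goes through the helper and binds
stub("_time")       # second call goes straight to the real function
print(bind_count)   # 1
```

&lt;p&gt;The binder runs only once per symbol; after that, the stub costs a single indirect jump.&lt;/p&gt;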

&lt;p&gt;Next comes the C strings section, very short and simple since we only have one string:&lt;/p&gt;

&lt;pre&gt;    ___strsstart:
    L_str:
        db      "Hello, world #%ld!",10,0   ; newline as byte 10: nasm only expands \n inside backquoted strings
    ___strsend:
&lt;/pre&gt;

&lt;p&gt;And now the unwinding table. This is encoded with the "compact unwind encoding" defined by Apple (as far as I know).&lt;/p&gt;

&lt;pre&gt;    ___uwstart:
        dd 1            ; unwind info version
        dd _commonEncodings - ___uwstart    ; common encodings array offset
        dd 0            ; count of common encodings
        dd _personalities - ___uwstart  ; personality array offset
        dd 0            ; count of personalities
        dd _index - ___uwstart  ; first-level index offset
        dd 2            ; count of entries in first-level index
    _commonEncodings:
    _personalities:
    _index:
    __entry1_0:
        dd _main        ; function offset
        dd __entry2_0 - ___uwstart  ; offset to second-level entry
        dd _lsda - ___uwstart   ; offset to language-specific data array entry
    __entry1_1:
        dd ___codeend+1 ; function offset (end of table)
        dd 0            ; offset to second-level entry - zero means end of table
        dd _lsda - ___uwstart   ; offset to LSDA
    _lsda:
    _pages:
    __entry2_0:
        dd 3            ; UNWIND_SECOND_LEVEL_COMPRESSED
        dw ___entrypage0 - __entry2_0   ; offset to entry page
        dw 1            ; number of entries in entry page
        dw ___enc0 - __entry2_0 ; offset to encoding page
        dw 1            ; number of entries in encoding page
    ___entrypage0:
    ____entrypage0_0:
        dd (0 &amp;lt;&amp;lt; 24) | (0)  ; encoding index and function offset relative to first-level index offset
    ___enc0:
    ____enc0_0:
        dd 0x01000000   ; UNWIND_X86_64_MODE_RBP_FRAME | UNWIND_X86_64_REG_NONE
    ___uwend:
&lt;/pre&gt;

&lt;p&gt;And then the DWARF-encoded version of the same information. To save everyone some time, I'm not going to write this part out with all the comments, because it's complex and it just duplicates the unwinding info above in a much more verbose fashion.&lt;/p&gt;

&lt;pre&gt;    ___ehstart:
        db 0x14,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x01,0x7a,0x52,0x00,0x01,0x78,0x10,0x01
        db 0x10,0x0c,0x07,0x08,0x90,0x01,0x00,0x00,0x24,0x00,0x00,0x00,0x1c,0x00,0x00,0x00
        db 0x34,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x20,0x00,0x00,0x00,0x00,0x00,0x00,0x00
        db 0x00,0x41,0x0e,0x10,0x86,0x02,0x43,0x0d,0x06,0x00,0x00,0x00,0x00,0x00,0x00,0x00
    ___ehend:
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Data, data, data... well, sort of&lt;/b&gt;&lt;br&gt;That ends off the &lt;code&gt;__TEXT&lt;/code&gt; segment. Now we have the &lt;code&gt;__DATA&lt;/code&gt; segment, which contains the lazy and non-lazy symbol pointers:&lt;/p&gt;

&lt;pre&gt;    ___DATAload:

    ___nlsymptrstart:
    _nonlazy_dyld_stub_binder:
        dq 0x0000000000000000
    _nonlazy_table_start:
        dq 0x0000000000000000
    ___nlsymptrend:

    ___lasymptrstart:
    _lazy_printf:
        dq 0x100000000 + _stub_helper_printf
    _lazy_time:
        dq 0x100000000 + _stub_helper_time
    ___lasymptrend:
&lt;/pre&gt;

&lt;p&gt;In a real executable, &lt;code&gt;__DATA&lt;/code&gt; would usually also contain static data, space for globals, and some other stuff.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The link editor&lt;/b&gt;&lt;br&gt;&lt;code&gt;__LINKEDIT&lt;/code&gt; is a real pain, because it's arbitrarily structured and the data within it isn't always all that documented. I've done my best to represent what's in it comprehensibly, but I can't guarantee I've succeeded.&lt;/p&gt;

&lt;p&gt;We start with the rebasing opcodes, which &lt;code&gt;dyld&lt;/code&gt; uses when applying ASLR:&lt;/p&gt;

&lt;pre&gt;    ___rebasestart:
        db 0x10 | 0x01  ; REBASE_OPCODE_SET_TYPE_IMM | REBASE_TYPE_POINTER
        db 0x20 | 0x02  ; REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | indexOfSegment(__DATA) (2)
        db 0x10         ; uleb128_encode(_lazy_printf - ___DATAload)
        db 0x50 | 0x02  ; REBASE_OPCODE_DO_REBASE_IMM_TIMES | 2
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___rebaseend:
&lt;/pre&gt;

&lt;p&gt;This says, "using pointers, in the __DATA segment at offset 0x10, rebase 2 pointers based on the load address of that segment".&lt;/p&gt;
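
&lt;p&gt;To make that reading concrete, here is a small interpreter sketch for the four bytes above. It only knows the opcodes that appear in this listing (with the values shown there), treats every rebase as an 8-byte pointer, and uses function names of my own invention.&lt;/p&gt;

```python
# Opcode values as they appear in the listing above.
REBASE_OPCODE_DONE = 0x00
REBASE_OPCODE_SET_TYPE_IMM = 0x10
REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB = 0x20
REBASE_OPCODE_DO_REBASE_IMM_TIMES = 0x50
POINTER_SIZE = 8


def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, next position)."""
    value, scale = 0, 1
    while True:
        byte = data[pos]
        pos += 1
        value += (byte % 0x80) * scale   # low 7 bits carry data
        scale *= 0x80
        if byte // 0x80 == 0:            # high bit clear: last byte
            return value, pos


def rebase_locations(stream):
    """Return the (segment index, offset) pairs the stream asks to rebase."""
    locations = []
    segment, offset, pos = 0, 0, 0
    while pos != len(stream):
        opcode = stream[pos] // 0x10 * 0x10   # high nibble
        immediate = stream[pos] % 0x10        # low nibble
        pos += 1
        if opcode == REBASE_OPCODE_DONE:
            break
        elif opcode == REBASE_OPCODE_SET_TYPE_IMM:
            pass                              # only pointer rebases here
        elif opcode == REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB:
            segment = immediate
            offset, pos = read_uleb128(stream, pos)
        elif opcode == REBASE_OPCODE_DO_REBASE_IMM_TIMES:
            for _ in range(immediate):
                locations.append((segment, offset))
                offset += POINTER_SIZE
        else:
            raise ValueError("opcode not handled by this sketch: " + hex(opcode))
    return locations


stream = bytes([0x11, 0x22, 0x10, 0x52, 0, 0, 0, 0])
print([(seg, hex(off)) for seg, off in rebase_locations(stream)])
# [(2, '0x10'), (2, '0x18')]
```

&lt;p&gt;The two locations are exactly the &lt;code&gt;_lazy_printf&lt;/code&gt; and &lt;code&gt;_lazy_time&lt;/code&gt; slots, the only places in this binary that hold absolute addresses.&lt;/p&gt;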

&lt;p&gt;Next come the binding opcodes and lazy binding opcodes:&lt;/p&gt;

&lt;pre&gt;    ___bindstart:
        db 0x11         ; BIND_OPCODE_SET_DYLIB_ORDINAL_IMM | 1
        db 0x40         ; BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM | 0
        db 'dyld_stub_binder',0 ; immediate operand
        db 0x51         ; BIND_OPCODE_SET_TYPE_IMM | BIND_TYPE_POINTER
        db 0x72         ; BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | indexOfSegment(__DATA) (2)
        db 0x00         ; uleb128_encode(0)
        db 0x90         ; BIND_OPCODE_DO_BIND
        db 0x00         ; BIND_OPCODE_DONE
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___bindend:
    ___lazystart:
        db 0x72,0x10    ; BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | indexOfSegment(__DATA) (2), uleb128_encode(0x10)
        db 0x11         ; BIND_OPCODE_SET_DYLIB_ORDINAL_IMM | 1
        db 0x40,'_printf',0 ; BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM | 0, '_printf'
        db 0x90,0x00    ; BIND_OPCODE_DO_BIND, BIND_OPCODE_DONE
        db 0x72,0x18    ; BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | indexOfSegment(__DATA) (2), uleb128_encode(0x18)
        db 0x11         ; BIND_OPCODE_SET_DYLIB_ORDINAL_IMM | 1
        db 0x40,'_time',0   ; BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM | 0, '_time'
        db 0x90,0x00    ; BIND_OPCODE_DO_BIND, BIND_OPCODE_DONE
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___lazyend:
&lt;/pre&gt;

&lt;p&gt;These opcodes bind a non-lazy symbol named &lt;code&gt;dyld_stub_binder&lt;/code&gt; to offset 0 in the &lt;code&gt;__DATA&lt;/code&gt; segment as a pointer. For lazy symbols, they bind a symbol named &lt;code&gt;_printf&lt;/code&gt; to offset &lt;code&gt;0x10&lt;/code&gt; in the &lt;code&gt;__DATA&lt;/code&gt; segment and &lt;code&gt;_time&lt;/code&gt; to offset &lt;code&gt;0x18&lt;/code&gt;.&lt;/p&gt;
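
&lt;p&gt;The binding streams can be walked the same way. This is a sketch under the same assumptions as before: it handles only the opcodes used in this listing, the names are mine, and it simply steps over &lt;code&gt;BIND_OPCODE_DONE&lt;/code&gt; so that one loop can read the whole lazy stream, whereas the real lazy binder starts at a given offset and handles one entry at a time.&lt;/p&gt;

```python
# Opcode values as they appear in the listing above.
BIND_OPCODE_DONE = 0x00
BIND_OPCODE_SET_DYLIB_ORDINAL_IMM = 0x10
BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM = 0x40
BIND_OPCODE_SET_TYPE_IMM = 0x50
BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB = 0x70
BIND_OPCODE_DO_BIND = 0x90
POINTER_SIZE = 8


def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, next position)."""
    value, scale = 0, 1
    while True:
        byte = data[pos]
        pos += 1
        value += (byte % 0x80) * scale
        scale *= 0x80
        if byte // 0x80 == 0:
            return value, pos


def bindings(stream):
    """Return (symbol, dylib ordinal, segment, offset) for each DO_BIND."""
    found = []
    symbol, ordinal, segment, offset, pos = "", 0, 0, 0, 0
    while pos != len(stream):
        opcode = stream[pos] // 0x10 * 0x10
        immediate = stream[pos] % 0x10
        pos += 1
        if opcode == BIND_OPCODE_DONE:
            continue                      # step over entry terminators
        elif opcode == BIND_OPCODE_SET_DYLIB_ORDINAL_IMM:
            ordinal = immediate
        elif opcode == BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM:
            end = stream.index(0, pos)    # name is a C string
            symbol = stream[pos:end].decode("ascii")
            pos = end + 1
        elif opcode == BIND_OPCODE_SET_TYPE_IMM:
            pass                          # only pointer binds here
        elif opcode == BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB:
            segment = immediate
            offset, pos = read_uleb128(stream, pos)
        elif opcode == BIND_OPCODE_DO_BIND:
            found.append((symbol, ordinal, segment, offset))
            offset += POINTER_SIZE
        else:
            raise ValueError("opcode not handled by this sketch: " + hex(opcode))
    return found


non_lazy = bytes([0x11, 0x40]) + b"dyld_stub_binder\0" + bytes([0x51, 0x72, 0x00, 0x90, 0x00])
lazy = (bytes([0x72, 0x10, 0x11, 0x40]) + b"_printf\0" + bytes([0x90, 0x00])
        + bytes([0x72, 0x18, 0x11, 0x40]) + b"_time\0" + bytes([0x90, 0x00]))
print(bindings(non_lazy))   # [('dyld_stub_binder', 1, 2, 0)]
print(bindings(lazy))       # [('_printf', 1, 2, 16), ('_time', 1, 2, 24)]
```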

&lt;p&gt;And here's the export trie:&lt;/p&gt;

&lt;pre&gt;    ___exportstart:
    _exnode0:
        db 0x00         ; terminal size
        db 0x01         ; child count
        db '_',0        ; name
        db _exnode1 - ___exportstart    ; child node offset
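    _exnode1:
        db 0x00         ; terminal size
        db 0x02         ; child count
        db '_mh_execute_header',0   ; first child: edge label
        db _exnode3 - ___exportstart    ; first child: node offset
    _exnode2:           ; not a node of its own, just the second child edge of _exnode1
        db 'main',0     ; second child: edge label
        db _exnode4 - ___exportstart    ; second child: node offset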
    _exnode3:
        db 0x02         ; terminal size
        db 0x00         ; flags
        db 0x00         ; address - uleb128_encode(0)
        db 0x00         ; child count
    _exnode4:
        db 0x03         ; terminal size
        db 0x00         ; flags
        db 0x94,0x1e    ; address - uleb128_encode(0xf14)
        db 0x00         ; child count
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___exportend:
&lt;/pre&gt;

&lt;p&gt;This forms a trie, or prefix tree, for the two symbols exported by the executable, &lt;code&gt;__mh_execute_header&lt;/code&gt; and &lt;code&gt;_main&lt;/code&gt;.&lt;/p&gt;
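
&lt;p&gt;Here is a sketch of how such a trie can be walked to recover the exported names. The byte string is the listing above with the label arithmetic worked out by hand (the nodes land at offsets 5, 33, and 37); the function names are made up for this example, and the sketch assumes every terminal node holds just flags and an address, which is true here but not for every kind of export.&lt;/p&gt;

```python
def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, next position)."""
    value, scale = 0, 1
    while True:
        byte = data[pos]
        pos += 1
        value += (byte % 0x80) * scale
        scale *= 0x80
        if byte // 0x80 == 0:
            return value, pos


def walk(data, offset=0, prefix=""):
    """Yield (name, flags, address) for every terminal node in the trie."""
    terminal_size, pos = read_uleb128(data, offset)
    if terminal_size:
        flags, after_flags = read_uleb128(data, pos)
        address, _ = read_uleb128(data, after_flags)
        yield prefix, flags, address
    pos += terminal_size                  # skip the terminal payload
    child_count = data[pos]
    pos += 1
    for _ in range(child_count):
        end = data.index(0, pos)          # edge label is a C string
        edge = data[pos:end].decode("ascii")
        child_offset, pos = read_uleb128(data, end + 1)
        yield from walk(data, child_offset, prefix + edge)


trie = (
    bytes([0x00, 0x01]) + b"_\0" + bytes([5])
    + bytes([0x00, 0x02]) + b"_mh_execute_header\0" + bytes([33])
    + b"main\0" + bytes([37])
    + bytes([0x02, 0x00, 0x00, 0x00])
    + bytes([0x03, 0x00, 0x94, 0x1e, 0x00])
)
for name, flags, address in walk(trie):
    print(name, hex(address))
# __mh_execute_header 0x0
# _main 0xf14
```

&lt;p&gt;The shared leading underscore is stored once, on the edge out of the root, which is the whole point of using a prefix tree.&lt;/p&gt;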

&lt;p&gt;Here we have the compressed function starts table, represented as a set of ULEB128-encoded deltas to be added to the base address of the &lt;code&gt;__TEXT&lt;/code&gt; segment:&lt;/p&gt;

&lt;pre&gt;    ___functionstartsstart:
        db 0x94,0x1e    ; uleb128_encode(0xf14): delta from the start of __TEXT to _main
        db 0x00         ; a zero delta ends the list
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___functionstartsend:
&lt;/pre&gt;
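
&lt;p&gt;Since ULEB128 keeps turning up, here is a minimal encoder alongside a decoder for this table. It assumes the layout as I understand it: each delta is added to a running address that starts at the base of &lt;code&gt;__TEXT&lt;/code&gt;, and a zero delta ends the list. The function names are hypothetical.&lt;/p&gt;

```python
def uleb128_encode(value):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        low = value % 0x80
        value //= 0x80
        if value:
            out.append(low + 0x80)   # more bytes follow: set the high bit
        else:
            out.append(low)
            return bytes(out)


def function_starts(data, text_base):
    """Turn the delta list into absolute function addresses."""
    addresses = []
    address, pos = text_base, 0
    while pos != len(data):
        delta, scale = 0, 1
        while True:                  # inline ULEB128 decode
            byte = data[pos]
            pos += 1
            delta += (byte % 0x80) * scale
            scale *= 0x80
            if byte // 0x80 == 0:
                break
        if delta == 0:               # terminator (or padding)
            break
        address += delta
        addresses.append(address)
    return addresses


print(uleb128_encode(0xf14).hex())   # 941e
table = uleb128_encode(0xf14) + bytes(6)
print([hex(a) for a in function_starts(table, 0x100000000)])
# ['0x100000f14']
```

&lt;p&gt;With only one function in the executable there is only one delta, which is why the table is so short.&lt;/p&gt;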

&lt;p&gt;Here's the data-in-code table. Whoops, there isn't any in this executable; the load command is just added anyway:&lt;/p&gt;

&lt;pre&gt;    ___datacodestart:
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___datacodeend:
&lt;/pre&gt;

&lt;p&gt;How about some designated requirements for dylibs? I have no real idea what format this is in, I just interpreted it as best I could:&lt;/p&gt;

&lt;pre&gt;    ___dylibcodesignaturesstart:
        dd 1            ; count of code signatures (maybe?)
        dd 0            ; unknown
        dd 0x14         ; unknown
        db 0xfa,0xde,0x0c,0x00,0x00,0x00,0x00,0x28
        db 0x00,0x00,0x00,0x01,0x00,0x00,0x00,0x06
        db 0x00,0x00,0x00,0x02,0x00,0x00,0x00,0x0b
        db 0x6c,0x69,0x62,0x53,0x79,0x73,0x74,0x65
        db 0x6d,0x2e,0x42,0x00,0x00,0x00,0x00,0x03  ; code signature for libSystem.B.dylib
        dd 0            ; unknown
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___dylibcodesignaturesend:
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;A symbol table&lt;/b&gt;&lt;br&gt;The symbol table is where most of the interesting stuff that's left happens:&lt;/p&gt;

&lt;pre&gt;    ___symtabstart:
        dd L_srcdir - ___strtabstart    ; string table offset
        db 0x64         ; N_SO
        db 0x00         ; section 0
        dw 0x00         ; no desc
        dq 0            ; address 0
        dd L_srcfile - ___strtabstart   ; string table offset
        db 0x64         ; N_SO
        db 0x00         ; section 0
        dw 0x00         ; no desc
        dq 0            ; address 0
        dd L_objfile - ___strtabstart   ; string table offset
        db 0x66         ; N_OSO
        db 0x03         ; section 3
        dw 0x01         ; desc(?)
        dq 0x50b8c91f   ; st_mtime
        dd L_empty - ___strtabstart ; no string
        db 0x2e         ; N_BNSYM
        db 0x01         ; section 1
        dw 0x00         ; desc
        dq 0x100000000 + _main      ; start address
        dd L_main1 - ___strtabstart ; string table offset
        db 0x24         ; N_FUN
        db 0x01         ; section 1
        dw 0x00         ; desc
        dq 0x100000f14  ; start address
        dd L_empty - ___strtabstart ; no string
        db 0x24         ; N_FUN
        db 0x00         ; section 0
        dw 0x00         ; desc
        dq 0x20         ; address
        dd L_empty - ___strtabstart ; no string
        db 0x4e         ; N_ENSYM
        db 0x01         ; section 1
        dw 0x00         ; desc
        dq 0x20         ; address
    _sym_mh_execute_header:
        dd L_mhexechead - ___strtabstart    ; string table offset
        db 0x0f         ; N_SECT | N_EXT
        db 0x01         ; section 1
        dw 0x0010       ; REFERENCED_DYNAMICALLY
        dq 0x100000000 + __mh_execute_header    ; start address
    _sym_main:
        dd L_main2 - ___strtabstart ; string table offset
        db 0x0f         ; N_SECT | N_EXT
        db 0x01         ; section 1
        dw 0x0000       ; no extra flags
        dq 0x100000000 + _main  ; start address
    _sym_printf:
        dd L_printf - ___strtabstart    ; string table offset
        db 0x01         ; N_UNDF | N_EXT
        db 0x00         ; section 0 (NO_SECT)
        dw 0x0100       ; dynamic library 1
        dq 0            ; address
    _sym_time:
        dd L_time - ___strtabstart  ; string table offset
        db 0x01         ; N_UNDF | N_EXT
        db 0x00         ; section 0 (NO_SECT)
        dw 0x0100       ; dynamic library 1
        dq 0            ; address
    _sym_dyld_stub_binder:
        dd L_binder - ___strtabstart    ; string table offset
        db 0x01         ; N_UNDF | N_EXT
        db 0x00         ; section 0 (NO_SECT)
        dw 0x0100       ; dynamic library 1
        dq 0            ; address
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___symtabend:

    ___indirsymstart:
        dd (_sym_printf - ___symtabstart) &amp;gt;&amp;gt; 4  ; index into symbol table
        dd (_sym_time - ___symtabstart) &amp;gt;&amp;gt; 4    ; index into symbol table
        dd (_sym_dyld_stub_binder - ___symtabstart) &amp;gt;&amp;gt; 4    ; index into symbol table
        dd 0x40000000   ; INDIRECT_SYMBOL_ABS
        dd (_sym_printf - ___symtabstart) &amp;gt;&amp;gt; 4  ; index into symbol table
        dd (_sym_time - ___symtabstart) &amp;gt;&amp;gt; 4    ; index into symbol table
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___indirsymend:

    ___strtabstart:
    L_spc:
        db ' '
    L_empty:
        db 0
    L_srcdir:
        db '/Users/gwynne/',0
    L_srcfile:
        db 'test.c',0
    L_objfile:
        db '/var/folders/b8/qgjb841d71d55cf8jh1myb540000gn/T/test-KyuIba.o',0
    L_main1:
        db '_main',0
    L_mhexechead:
        db '__mh_execute_header',0
    L_main2:
        db '_main',0
    L_printf:
        db '_printf',0
    L_time:
        db '_time',0
    L_binder:
        db 'dyld_stub_binder',0
        align 8, db 0   ; pad with 0 to 8-byte boundary
    ___strtabend:

    ___LINKEDITdataend:
&lt;/pre&gt;

&lt;p&gt;Here you have the symbol table (including STABS entries), the indirect symbol table (which is nothing but a set of indexes into the symbol table which tell &lt;code&gt;dyld&lt;/code&gt; how to use the symbol stubs in the event that the binding opcodes aren't good enough - basically, legacy data), and the string table, which holds all the user-readable strings for the symbol table.&lt;/p&gt;
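
&lt;p&gt;Every symbol table entry has the same fixed 16-byte layout (a 4-byte string offset, 1-byte type, 1-byte section, 2-byte description, and 8-byte value), which is why the load commands can count symbols by shifting the table size right by 4. Here is a packing sketch; the helper name and the string table offsets are made up for illustration.&lt;/p&gt;

```python
N_UNDF, N_EXT, N_SECT = 0x00, 0x01, 0x0e


def nlist_64(n_strx, n_type, n_sect, n_desc, n_value):
    """Pack one entry: 4 + 1 + 1 + 2 + 8 = 16 bytes, little-endian."""
    return (
        n_strx.to_bytes(4, "little")
        + n_type.to_bytes(1, "little")
        + n_sect.to_bytes(1, "little")
        + n_desc.to_bytes(2, "little")
        + n_value.to_bytes(8, "little")
    )


entries = [
    # A defined, exported symbol in section 1, such as _main.
    nlist_64(1, N_SECT | N_EXT, 1, 0x0000, 0x100000f14),
    # An undefined symbol; the dylib ordinal sits in the high byte of n_desc.
    nlist_64(7, N_UNDF | N_EXT, 0, 0x0100, 0),
]
table = b"".join(entries)

print(len(entries[0]))       # 16
print(len(table) // 16)      # 2
print(0x0100 // 0x100)       # 1, the dylib ordinal
```

&lt;p&gt;Leaving out any one field shifts every later entry, so the index arithmetic in the indirect symbol table only works if each entry is exactly 16 bytes.&lt;/p&gt;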

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;That is one long mess of mostly raw hexadecimal bytes. And here's the punch line: As written here, it &lt;strong&gt;still&lt;/strong&gt; doesn't produce a working Mach-O binary!&lt;/p&gt;

&lt;p&gt;Why not? Because I didn't account for alignment requirements properly, and I ran out of time to fix the problem before the article had to go up. All the tables and structures here are correct, though, so hopefully, it's still instructional as to just how much goes into even the simplest binary, and how much work you should be very glad &lt;code&gt;ld&lt;/code&gt; and &lt;code&gt;dyld&lt;/code&gt; are doing for you!&lt;/p&gt;

&lt;p&gt;Thanks for reading, as always. I hope you enjoyed it!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;</description><author>Gwynne Raskind</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-11-30-lets-build-a-mach-o-executable.html</guid><pubDate>Fri, 30 Nov 2012 17:59:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-11-16: Let's Build objc_msgSend
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-11-16-lets-build-objc_msgsend.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-11-16: Let's Build objc_msgSend
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 11 16  14 27"
                  tags="fridayqna objectivec letsbuild"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-11-16: Let's Build objc_msgSend
&lt;/div&gt;
              &lt;p&gt;The &lt;code&gt;objc_msgSend&lt;/code&gt; function underlies everything we do in Objective-C. Gwynne Raskind, reader and occasional Friday Q&amp;amp;A guest contributor, suggested that I talk about how &lt;code&gt;objc_msgSend&lt;/code&gt; works on the inside. What better way to understand how something works than to build it from scratch? Let's build &lt;code&gt;objc_msgSend&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Tramapoline! Trampopoline!&lt;/b&gt;&lt;br&gt;Whenever you write an Objective-C message send:&lt;/p&gt;

&lt;pre&gt;    [obj message]
&lt;/pre&gt;

&lt;p&gt;The compiler generates a call to &lt;code&gt;objc_msgSend&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    objc_msgSend(obj, @selector(message));
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;objc_msgSend&lt;/code&gt; then takes care of dispatching the message.&lt;/p&gt;

&lt;p&gt;How does it do that? It looks up the appropriate function pointer, or &lt;code&gt;IMP&lt;/code&gt;, to invoke, then jumps to it. Any arguments passed to &lt;code&gt;objc_msgSend&lt;/code&gt; end up being arguments to the &lt;code&gt;IMP&lt;/code&gt; after the jump. The return value from the &lt;code&gt;IMP&lt;/code&gt; ends up as the return value seen by the caller.&lt;/p&gt;

&lt;p&gt;Because &lt;code&gt;objc_msgSend&lt;/code&gt; only takes control long enough to obtain the right function pointer and directly jump to it, it's sometimes referred to as a &lt;em&gt;trampoline&lt;/em&gt;. In general, any small piece of code that serves to redirect code somewhere else can be called a trampoline.&lt;/p&gt;

&lt;p&gt;It is this trampolining behavior that makes &lt;code&gt;objc_msgSend&lt;/code&gt; special. Because it simply looks up the right code and then jumps directly to it, it's relatively generic. It works with &lt;em&gt;any&lt;/em&gt; combination of parameters passed to it, because it just leaves them alone for the method &lt;code&gt;IMP&lt;/code&gt; to read. Return values are a bit trickier, but it turns out that every possible return type can be accounted for with just a couple of variants of &lt;code&gt;objc_msgSend&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, this trampoline behavior cannot be written in pure C. There is no way to write a C function that passes through generic parameters to another function. You can come close by using variable arguments, but variable arguments are passed differently from normal arguments and in a way that's slower, so it's not compatible with regular C parameters.&lt;/p&gt;

&lt;p&gt;If you &lt;em&gt;could&lt;/em&gt; write &lt;code&gt;objc_msgSend&lt;/code&gt; in C, the basic idea would look something like this:&lt;/p&gt;

&lt;pre&gt;    id objc_msgSend(id self, SEL _cmd, ...)
    {
        Class c = object_getClass(self);
        IMP imp = class_getMethodImplementation(c, _cmd);
        return imp(self, _cmd, ...);
    }
&lt;/pre&gt;

&lt;p&gt;This is actually a bit over-simplified. There's a method cache to make the whole lookup faster, so it's more like this:&lt;/p&gt;

&lt;pre&gt;    id objc_msgSend(id self, SEL _cmd, ...)
    {
        Class c = object_getClass(self);
        IMP imp = cache_lookup(c, _cmd);
        if(!imp)
            imp = class_getMethodImplementation(c, _cmd);
        return imp(self, _cmd, ...);
    }
&lt;/pre&gt;

&lt;p&gt;Except that, for speed, &lt;code&gt;cache_lookup&lt;/code&gt; is implemented inline.&lt;/p&gt;
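
&lt;p&gt;To see the look-up-then-call shape in code that actually compiles, here is a minimal sketch in plain C. It is a toy, not the runtime: every name in it (&lt;code&gt;toy_send&lt;/code&gt;, &lt;code&gt;toy_lookup&lt;/code&gt;, and so on) is made up for illustration, and because C can't forward arbitrary arguments, every "method" is assumed to take a single &lt;code&gt;int&lt;/code&gt; and return an &lt;code&gt;int&lt;/code&gt;. Selectors are compared by pointer.&lt;/p&gt;

```c
/* Toy model of message dispatch. All names are hypothetical; this is
   not the Objective-C runtime. Every "method" has one fixed signature. */
typedef int (*toy_imp)(int self_value);

struct toy_method {
    const char *sel;   /* compared by pointer, never by contents */
    toy_imp imp;
};

static const char SEL_DOUBLE[] = "double";
static const char SEL_NEGATE[] = "negate";

static int toy_double(int v) { return v * 2; }
static int toy_negate(int v) { return -v; }

/* The "class": a list of selector/function pairs ending in a null entry. */
static const struct toy_method toy_methods[] = {
    { SEL_DOUBLE, toy_double },
    { SEL_NEGATE, toy_negate },
    { 0, 0 }
};

/* Return the implementation for sel, or a null pointer if there is none. */
static toy_imp toy_lookup(const char *sel)
{
    int i;
    for (i = 0; toy_methods[i].sel != 0; i++) {
        if (toy_methods[i].sel == sel)
            return toy_methods[i].imp;
    }
    return 0;
}

/* Dispatch: find the function, then call it. Unknown selectors yield 0. */
int toy_send(int self_value, const char *sel)
{
    toy_imp imp = toy_lookup(sel);
    return imp ? imp(self_value) : 0;
}
```

&lt;p&gt;With this in place, &lt;code&gt;toy_send(21, SEL_DOUBLE)&lt;/code&gt; finds &lt;code&gt;toy_double&lt;/code&gt; and calls it. The fixed signature is the very limitation that keeps the real &lt;code&gt;objc_msgSend&lt;/code&gt; from being written this way.&lt;/p&gt;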

&lt;p&gt;&lt;b&gt;Assembly&lt;/b&gt;&lt;br&gt;In Apple's runtime, the whole function is implemented in assembly for maximum speed. &lt;code&gt;objc_msgSend&lt;/code&gt; runs for every single Objective-C message send, and the simplest action in an app can result in thousands or millions of messages.&lt;/p&gt;

&lt;p&gt;To simplify things a bit, my own implementation will do the bare minimum in assembly, with all of the smarts in a separate C function. The assembly itself will do the equivalent of:&lt;/p&gt;

&lt;pre&gt;    id objc_msgSend(id self, SEL _cmd, ...)
    {
        IMP imp = GetImplementation(self, _cmd);
        return imp(self, _cmd, ...);
    }
&lt;/pre&gt;

&lt;p&gt;Then &lt;code&gt;GetImplementation&lt;/code&gt; can do all of the work in a more understandable fashion.&lt;/p&gt;

&lt;p&gt;The assembly code needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save all potential parameters somewhere safe, so that &lt;code&gt;GetImplementation&lt;/code&gt; won't overwrite them.&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;GetImplementation&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Save the return value somewhere.&lt;/li&gt;
&lt;li&gt;Restore all of the parameter values.&lt;/li&gt;
&lt;li&gt;Jump to the &lt;code&gt;IMP&lt;/code&gt; returned from &lt;code&gt;GetImplementation&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let's get started!&lt;/p&gt;

&lt;p&gt;I'm going to use x86-64 assembly here, as it's the most convenient to work with on a Mac. The same principles would apply for i386 or ARM.&lt;/p&gt;

&lt;p&gt;This function goes into its own file, which I called &lt;code&gt;msgsend-asm.s&lt;/code&gt;. This file can be passed to the compiler as just another source file, and it will assemble it and link it into the rest of the program.&lt;/p&gt;

&lt;p&gt;The first thing to do is to actually declare the global symbol. For boring historical reasons, C functions get an extra leading underscore in their global symbol name:&lt;/p&gt;

&lt;pre&gt;    .globl _objc_msgSend
    _objc_msgSend:
&lt;/pre&gt;

&lt;p&gt;The compiler will happily link against the nearest available &lt;code&gt;objc_msgSend&lt;/code&gt;. Simply linking this into a test app is enough to get &lt;code&gt;[obj message]&lt;/code&gt; expressions going to our own code rather than Apple's runtime, which is terribly convenient when it comes to testing this code to make sure it actually works.&lt;/p&gt;

&lt;p&gt;Integer and pointer parameters are passed in registers &lt;code&gt;%rdi&lt;/code&gt;, &lt;code&gt;%rsi&lt;/code&gt;, &lt;code&gt;%rdx&lt;/code&gt;, &lt;code&gt;%rcx&lt;/code&gt;, &lt;code&gt;%r8&lt;/code&gt;, and &lt;code&gt;%r9&lt;/code&gt;, in that order. Any additional parameters beyond what would fit in there get passed on the stack. The first thing this function does is save those six registers onto the stack as well, so they can be restored later:&lt;/p&gt;

&lt;pre&gt;    pushq %rsi
    pushq %rdi
    pushq %rdx
    pushq %rcx
    pushq %r8
    pushq %r9
&lt;/pre&gt;

&lt;p&gt;In addition to these registers, the &lt;code&gt;%rax&lt;/code&gt; register acts as something of a hidden parameter. It's used for variable-argument calls, and in that case it stores the number of vector registers passed in, which is used by the called function to properly prepare the variable argument list. In case the target method is a variable-argument method, I save this register as well:&lt;/p&gt;

&lt;pre&gt;    pushq %rax
&lt;/pre&gt;

&lt;p&gt;For completeness, the &lt;code&gt;%xmm&lt;/code&gt; registers used to pass floating-point arguments really ought to be saved as well. However, if I can safely assume that &lt;code&gt;GetImplementation&lt;/code&gt; doesn't use any floating point, then I can ignore them, which I do simply to keep the code shorter.&lt;/p&gt;

&lt;p&gt;Next, I align the stack. Mac OS X requires that the stack be aligned to a 16-byte boundary when making function calls. The above code leaves us with an aligned stack anyway, but it's nice to have code to explicitly handle it so that you don't have to worry about making sure everything is lined up, or wondering why your app is crashing in &lt;code&gt;dyld&lt;/code&gt; functions. To align the stack, I save the existing stack pointer into &lt;code&gt;%r12&lt;/code&gt; after saving the original value of &lt;code&gt;%r12&lt;/code&gt; onto the stack. The choice of &lt;code&gt;%r12&lt;/code&gt; is somewhat arbitrary, and any callee-saved register would do. The important thing is that the value is guaranteed to survive across the call to &lt;code&gt;GetImplementation&lt;/code&gt;. Then I &lt;code&gt;and&lt;/code&gt; the stack pointer with &lt;code&gt;-0x10&lt;/code&gt;, which just clears the bottom four bits:&lt;/p&gt;

&lt;pre&gt;    pushq %r12
    mov %rsp, %r12
    andq $-0x10, %rsp
&lt;/pre&gt;

&lt;p&gt;Now the stack pointer is aligned. It's also safely past any of the saved registers from above, since the stack grows down, and this alignment procedure will only move it further down.&lt;/p&gt;
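
&lt;p&gt;For a feel of what the &lt;code&gt;andq&lt;/code&gt; does to the address, here is a small C sketch of the same rounding. It's written with a remainder rather than a mask, which gives the same answer for unsigned values; the function name &lt;code&gt;align_down_16&lt;/code&gt; is made up for this example.&lt;/p&gt;

```c
/* Round an address down to a 16-byte boundary. Subtracting the remainder
   modulo 16 has the same effect on an unsigned value as clearing its
   bottom four bits. The name align_down_16 is hypothetical. */
unsigned long align_down_16(unsigned long sp)
{
    return sp - (sp % 16);
}
```

&lt;p&gt;An address such as &lt;code&gt;0x100f&lt;/code&gt; comes out as &lt;code&gt;0x1000&lt;/code&gt;, while an already aligned address is left alone. The result is never larger than the input, which is why the saved registers stay safe.&lt;/p&gt;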

&lt;p&gt;It's finally time to call into &lt;code&gt;GetImplementation&lt;/code&gt;. It takes two parameters, &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt;. Calling conventions are for those two parameters to go into &lt;code&gt;%rdi&lt;/code&gt; and &lt;code&gt;%rsi&lt;/code&gt;, respectively. However, they were passed into &lt;code&gt;objc_msgSend&lt;/code&gt; like that, and haven't been moved, so nothing has to be done to get them into place. All that has to be done is actually make the call to &lt;code&gt;GetImplementation&lt;/code&gt;, which also gets a leading underscore:&lt;/p&gt;

&lt;pre&gt;    callq _GetImplementation
&lt;/pre&gt;

&lt;p&gt;Integer and pointer return values are returned in &lt;code&gt;%rax&lt;/code&gt;, so that's where the returned &lt;code&gt;IMP&lt;/code&gt; is found. Since &lt;code&gt;%rax&lt;/code&gt; has to be restored to its original state, the returned &lt;code&gt;IMP&lt;/code&gt; needs to be moved elsewhere. I arbitrarily chose to store it into &lt;code&gt;%r11&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    mov %rax, %r11
&lt;/pre&gt;

&lt;p&gt;Now it's time to start putting things back the way they were. The first item is to restore the stack pointer, which was stashed in &lt;code&gt;%r12&lt;/code&gt;, and restore the old value of &lt;code&gt;%r12&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    mov %r12, %rsp
    popq %r12
&lt;/pre&gt;

&lt;p&gt;Then pop all of the argument registers off the stack in the opposite order from when they were pushed:&lt;/p&gt;

&lt;pre&gt;    popq %rax
    popq %r9
    popq %r8
    popq %rcx
    popq %rdx
    popq %rdi
    popq %rsi
&lt;/pre&gt;

&lt;p&gt;Everything is now ready. The argument registers are restored to how they were before. All parameters intended for the target method are in the place where the target method will expect to find them. The &lt;code&gt;IMP&lt;/code&gt; itself is in &lt;code&gt;%r11&lt;/code&gt;, so all that has to be done is to jump there:&lt;/p&gt;

&lt;pre&gt;    jmp *%r11
&lt;/pre&gt;

&lt;p&gt;And that's it! There's nothing more to be done in the assembly code. The jump passes control to the method implementation. From the perspective of that code, it looks &lt;em&gt;exactly&lt;/em&gt; as if the message sender directly invoked the method. All of the indirection above just disappears. When the method returns, it will return directly to the caller of &lt;code&gt;objc_msgSend&lt;/code&gt; without any further intervention. Any return value from the method will be found in the correct place.&lt;/p&gt;

&lt;p&gt;There's a bit of subtlety when it comes to unusual return values. Large &lt;code&gt;struct&lt;/code&gt;s (anything too large to be returned in a register) are the most common example of this. On x86-64, large &lt;code&gt;struct&lt;/code&gt;s are returned by using a hidden first parameter. When you make a call like this:&lt;/p&gt;

&lt;pre&gt;    NSRect r = SomeFunc(a, b, c);
&lt;/pre&gt;

&lt;p&gt;The call gets translated to something more like this:&lt;/p&gt;

&lt;pre&gt;    NSRect r;
    SomeFunc(&amp;amp;r, a, b, c);
&lt;/pre&gt;

&lt;p&gt;The address of memory to use for the return value gets passed in &lt;code&gt;%rdi&lt;/code&gt;. Since &lt;code&gt;objc_msgSend&lt;/code&gt; expects &lt;code&gt;%rdi&lt;/code&gt; and &lt;code&gt;%rsi&lt;/code&gt; to contain &lt;code&gt;self&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt;, it won't work for messages that return large &lt;code&gt;struct&lt;/code&gt;s. This same basic problem exists on many different platforms. The runtime solves this problem by providing a separate &lt;code&gt;objc_msgSend_stret&lt;/code&gt; function used for &lt;code&gt;struct&lt;/code&gt; returns, which works like &lt;code&gt;objc_msgSend&lt;/code&gt;, but knows to find &lt;code&gt;self&lt;/code&gt; in &lt;code&gt;%rsi&lt;/code&gt; and &lt;code&gt;_cmd&lt;/code&gt; in &lt;code&gt;%rdx&lt;/code&gt;.&lt;/p&gt;
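
&lt;p&gt;The hidden-parameter rewrite can be spelled out by hand in C. This is only a sketch of the idea: &lt;code&gt;toy_rect&lt;/code&gt;, &lt;code&gt;make_rect&lt;/code&gt;, and &lt;code&gt;make_rect_lowered&lt;/code&gt; are hypothetical names, and a real compiler does this rewrite internally rather than generating a second function.&lt;/p&gt;

```c
/* A struct too large to come back in a register. Hypothetical type. */
struct toy_rect { double x, y, w, h; };

/* What the programmer writes: the struct is returned by value. */
struct toy_rect make_rect(double x, double y, double w, double h)
{
    struct toy_rect r = { x, y, w, h };
    return r;
}

/* Roughly what the call turns into: the caller passes the address of its
   own storage as an extra first argument, and the result is written there. */
void make_rect_lowered(struct toy_rect *out,
                       double x, double y, double w, double h)
{
    *out = make_rect(x, y, w, h);
}
```

&lt;p&gt;The caller owns the storage and passes its address as the extra first argument, which is what pushes every integer and pointer argument over by one register.&lt;/p&gt;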

&lt;p&gt;A similar problem arises on some platforms with messages that return floating point values. On those platforms, the runtime provides &lt;code&gt;objc_msgSend_fpret&lt;/code&gt; (and on x86-64, &lt;code&gt;objc_msgSend_fp2ret&lt;/code&gt; for extremely special cases).&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Method Lookup&lt;/b&gt;&lt;br&gt;Let's move on to the implementation of &lt;code&gt;GetImplementation&lt;/code&gt;. The above assembly trampoline means that this code can be written in C. Remember that in the real runtime, this code is all straight assembly, in order to get the best speed possible. Not only does this allow for fine control over the code, but it also eliminates the need to save and restore all of those registers like the code above does.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GetImplementation&lt;/code&gt; could simply call &lt;code&gt;class_getMethodImplementation&lt;/code&gt; and be done with it, foisting all of the work onto the Objective-C runtime. This is a bit boring, though. The real &lt;code&gt;objc_msgSend&lt;/code&gt; looks in the class's method cache first, for maximum speed. Since &lt;code&gt;GetImplementation&lt;/code&gt; is intended to mimic &lt;code&gt;objc_msgSend&lt;/code&gt;, it will do the same. Only if the cache doesn't contain an entry for the given selector will it fall back to querying the runtime.&lt;/p&gt;

&lt;p&gt;The first thing we need is some &lt;code&gt;struct&lt;/code&gt; definitions. The method cache is a private set of structures accessed through the class structure, so to get to it we need our own definitions. Note that, while private, these definitions are all available as part of Apple's open source release of the Objective-C runtime.&lt;/p&gt;

&lt;p&gt;First comes the definition for a single cache entry:&lt;/p&gt;

&lt;pre&gt;    typedef struct {
        SEL name;
        void *unused;
        IMP imp;
    } cache_entry;
&lt;/pre&gt;

&lt;p&gt;Pretty easy. Don't ask me about the &lt;code&gt;unused&lt;/code&gt; field, I don't know why that's there. Here's the definition for the cache as a whole:&lt;/p&gt;

&lt;pre&gt;    struct objc_cache {
        uintptr_t mask;
        uintptr_t occupied;        
        cache_entry *buckets[1];
    };
&lt;/pre&gt;

&lt;p&gt;The cache is implemented as a hash table. This table is built for speed and simplicity over all else, so it's a bit unusual. The table size is always a power of two. The table is indexed by selector, and the bucket index is computed by simply taking the selector's value, possibly shifting it to get rid of irrelevant low bits, and performing a bitwise &lt;em&gt;and&lt;/em&gt; with the appropriate mask. While we're at it, here are macros used to compute the bucket index for a particular selector and mask:&lt;/p&gt;

&lt;pre&gt;    #ifndef __LP64__
    # define CACHE_HASH(sel, mask) (((uintptr_t)(sel)&amp;gt;&amp;gt;2) &amp;amp; (mask))
    #else
    # define CACHE_HASH(sel, mask) (((unsigned int)((uintptr_t)(sel)&amp;gt;&amp;gt;0)) &amp;amp; (mask))
    #endif
&lt;/pre&gt;

&lt;p&gt;Finally, there's the structure for the class itself. This is what a &lt;code&gt;Class&lt;/code&gt; actually points to:&lt;/p&gt;

&lt;pre&gt;    struct class_t {
        struct class_t *isa;
        struct class_t *superclass;
        struct objc_cache *cache;
        IMP *vtable;
    };
&lt;/pre&gt;

&lt;p&gt;Let's get started with &lt;code&gt;GetImplementation&lt;/code&gt; now that the necessary structs are there:&lt;/p&gt;

&lt;pre&gt;    IMP GetImplementation(id self, SEL _cmd)
    {
&lt;/pre&gt;

&lt;p&gt;The first thing it does is get the object's class. The real &lt;code&gt;objc_msgSend&lt;/code&gt; does this with the equivalent of &lt;code&gt;self-&amp;gt;isa&lt;/code&gt;, but I'll be gentle and use the official API for that part:&lt;/p&gt;

&lt;pre&gt;        Class c = object_getClass(self);
&lt;/pre&gt;

&lt;p&gt;Since I want access to the guts, I'll immediately cast to a pointer to the &lt;code&gt;class_t&lt;/code&gt; &lt;code&gt;struct&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        struct class_t *classInternals = (struct class_t *)c;
&lt;/pre&gt;

&lt;p&gt;Now it's time to look up the &lt;code&gt;IMP&lt;/code&gt;. We'll start off with it set to &lt;code&gt;NULL&lt;/code&gt;. If we find an entry in the cache, we'll set it. If it's still &lt;code&gt;NULL&lt;/code&gt; after checking the cache, we'll fall back to the slow path:&lt;/p&gt;

&lt;pre&gt;        IMP imp = NULL;
&lt;/pre&gt;

&lt;p&gt;Next, grab a pointer to the cache:&lt;/p&gt;

&lt;pre&gt;        struct objc_cache *cache = classInternals-&amp;gt;cache;
&lt;/pre&gt;

&lt;p&gt;Compute the bucket index, and grab a pointer to the array of buckets:&lt;/p&gt;

&lt;pre&gt;        uintptr_t index = CACHE_HASH(_cmd, cache-&amp;gt;mask);
        cache_entry **buckets = cache-&amp;gt;buckets;
&lt;/pre&gt;

&lt;p&gt;Next, we search for a cache entry with the appropriate selector. The runtime uses linear probing, so it's just a matter of searching subsequent buckets, wrapping around at the end of the table, until either we find a match or find a &lt;code&gt;NULL&lt;/code&gt; entry:&lt;/p&gt;

&lt;pre&gt;        for(; buckets[index] != NULL; index = (index + 1) &amp;amp; cache-&amp;gt;mask)
        {
            if(buckets[index]-&amp;gt;name == _cmd)
            {
                imp = buckets[index]-&amp;gt;imp;
                break;
            }
        }
&lt;/pre&gt;

&lt;p&gt;If no entry was found, we fall back to the slow path and call into the runtime. In the real &lt;code&gt;objc_msgSend&lt;/code&gt;, all of the above code is written in assembly, and this is the point where it would drop out of assembly and call into the runtime itself. Once the cache has been tried and no entry was found, any hope for a fast message send is gone. The need to go fast becomes much less important at this point, partly because it's already doomed to be slow, and partly because this path should be taken extremely rarely. Because of that, it's acceptable to drop out of the assembly code and call into more maintainable C:&lt;/p&gt;

&lt;pre&gt;        if(imp == NULL)
            imp = class_getMethodImplementation(c, _cmd);
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;IMP&lt;/code&gt; has now been obtained, one way or another. If it was in the cache, it was retrieved from there, and otherwise it was looked up by the runtime. The &lt;code&gt;class_getMethodImplementation&lt;/code&gt; call will also populate the cache, so subsequent calls will go faster. All that's left is to return the &lt;code&gt;IMP&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        return imp;
    }
&lt;/pre&gt;
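
&lt;p&gt;The probing loop is easy to try out in isolation. The following is a stripped-down sketch rather than the runtime's data structure: the names are hypothetical, entries are stored by value with a null &lt;code&gt;name&lt;/code&gt; marking an empty bucket, the caller supplies the hash, and the index wraps with a remainder instead of the mask. For a power-of-two table size the remainder and the mask give the same index.&lt;/p&gt;

```c
/* One bucket of a toy cache. A null name marks an empty bucket.
   All names here are hypothetical. */
struct toy_entry {
    const char *name;   /* compared by pointer, like a selector */
    int value;
};

/* Start at the home bucket for the hash and walk forward, wrapping at the
   end of the table, until the name matches or an empty bucket is reached.
   Returns the stored value, or fallback if the name is not in the table. */
int toy_cache_find(const struct toy_entry *buckets, unsigned long size,
                   unsigned long hash, const char *name, int fallback)
{
    unsigned long index = hash % size;
    while (buckets[index].name != 0) {
        if (buckets[index].name == name)
            return buckets[index].value;
        index = (index + 1) % size;
    }
    return fallback;
}
```

&lt;p&gt;As with the loop in &lt;code&gt;GetImplementation&lt;/code&gt;, the table must always contain at least one empty bucket, or a search for a missing name would never terminate.&lt;/p&gt;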

&lt;p&gt;&lt;b&gt;Testing&lt;/b&gt;&lt;br&gt;To make sure this stuff actually works, I whipped up a quick test program:&lt;/p&gt;

&lt;pre&gt;    @interface Test : NSObject
    - (void)none;
    - (void)param: (int)x;
    - (void)params: (int)a : (int)b : (int)c : (int)d : (int)e : (int)f : (int)g;
    - (int)retval;
    @end

    @implementation Test

    - (id)init
    {
        fprintf(stderr, "in init method, self is %p\n", self);
        return self;
    }

    - (void)none
    {
        fprintf(stderr, "in none method\n");
    }

    - (void)param: (int)x
    {
        fprintf(stderr, "got parameter %d\n", x);
    }

    - (void)params: (int)a : (int)b : (int)c : (int)d : (int)e : (int)f : (int)g
    {
        fprintf(stderr, "got params %d %d %d %d %d %d %d\n", a, b, c, d, e, f, g);
    }

    - (int)retval
    {
        fprintf(stderr, "in retval method\n");
        return 42;
    }

    @end


    int main(int argc, char **argv)
    {
        for(int i = 0; i &amp;lt; 20; i++)
        {
            Test *t = [[Test alloc] init];
            [t none];
            [t param: 9999];
            [t params: 1 : 2 : 3 : 4 : 5 : 6 : 7];
            fprintf(stderr, "retval gave us %d\n", [t retval]);

            NSMutableArray *a = [[NSMutableArray alloc] init];
            [a addObject: @1];
            [a addObject: @{ @"foo" : @"bar" }];
            [a addObject: @("blah")];
            a[0] = @2;
            NSLog(@"%@", a);
        }
    }
&lt;/pre&gt;

&lt;p&gt;I also added some debug logs to &lt;code&gt;GetImplementation&lt;/code&gt; to make sure it actually got called, in case I screwed up the build and ended up calling the runtime's implementation by mistake. Everything worked, and even the literals and subscripting called the replacement implementation.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;At its core, &lt;code&gt;objc_msgSend&lt;/code&gt; is relatively simple. The way that it's used requires the use of assembly code, however, which makes it more difficult to understand than it really needs to be. Additionally, the extreme performance demands and requisite optimizations mean that it's pretty dense and tricky assembly. However, by building a simple assembly trampoline and then reimplementing the logic in C, we can see just how it works, and there really isn't all that much to it.&lt;/p&gt;

&lt;p&gt;This should be obvious, but never ship your own &lt;code&gt;objc_msgSend&lt;/code&gt; in your own app. You'll break stuff and you'll be sorry. Do this for educational purposes only.&lt;/p&gt;

&lt;p&gt;That's it for today's hallucinatory, assembly-soaked article. Come back next time for more fun, games, and hacking. As I've said roughly one thousand times by now, but can't help reminding you, Friday Q&amp;amp;A is driven by reader suggestions. If you have a topic that you'd like to see me write about, please &lt;a href="mailto:mike@mikeash.com"&gt;send it in&lt;/a&gt;!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-11-16-lets-build-objc_msgsend.html</guid><pubDate>Fri, 16 Nov 2012 14:27:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-11-09: dyld: Dynamic Linking On OS X
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-11-09-dyld-dynamic-linking-on-os-x.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-11-09: dyld: Dynamic Linking On OS X
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 11 09  15 51"
                  tags="dyld assembly macho fridayqna guest link linking"
            author="Gwynne Raskind"
            authorlink="http://blog.darkrainfall.org/"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-11-09: dyld: Dynamic Linking On OS X
&lt;/div&gt;
              &lt;p&gt;In the course of a recent job interview, I had an opportunity to study some of the internals of &lt;code&gt;dyld&lt;/code&gt;, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;b&gt;&lt;strong&gt;WARNING&lt;/strong&gt;&lt;/b&gt;&lt;br&gt;&lt;em&gt;Because the precise details of how &lt;code&gt;dyld&lt;/code&gt; works are quite complicated and change frequently, and because I don't yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you're curious about the particulars, I strongly recommend &lt;code&gt;dyld&lt;/code&gt;'s source code, which is publicly available at &lt;a href="http://opensource.apple.com"&gt;http://opensource.apple.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Static linking&lt;/b&gt;&lt;br&gt;So, let's start by talking about static linking, generally referred to simply as 'linking'. This is the step that typically happens after compiling, where the object files (the machine language the compiler churned out from your source code) are 'linked' together into a single binary file.&lt;/p&gt;

&lt;p&gt;Why does static linking matter to dynamic linking? Because the static linker, &lt;code&gt;ld&lt;/code&gt; (and &lt;code&gt;ld64&lt;/code&gt;) is responsible for transforming symbol references in your source code into indirect symbol lookups for &lt;code&gt;dyld&lt;/code&gt; to use later. Here's a very simple example:&lt;/p&gt;

&lt;pre&gt;    // This is the actual full declaration of main() on OS X. The "apple"
    //  parameter is the path to the executable, i.e. _NSGetProgname().
    int main(int argc, char **argv, char **envp, char **apple)
    {
        puts("Hello, world!\n");
        return 0;
    }
&lt;/pre&gt;

&lt;p&gt;The (optimized) assembly for this, as generated by &lt;code&gt;clang -S test.c -o test.s -Os&lt;/code&gt; and stripped of a bit of debug info, is:&lt;/p&gt;

&lt;pre&gt;            .section        __TEXT,__text,regular,pure_instructions
            .globl  _main
    _main:                                  ## @main
            pushq   %rbp
            movq    %rsp, %rbp
            leaq    L_str(%rip), %rdi
            callq   _puts
            xorl    %eax, %eax
            popq    %rbp
            ret
            .section        __TEXT,__cstring,cstring_literals
    L_str:                                  ## @str
            .asciz  "Hello, world!"
&lt;/pre&gt;

&lt;p&gt;Seems straightforward enough. Let's compile it into an object file and dump the fully compiled version (&lt;code&gt;clang -c test.c -o test.o -Os&lt;/code&gt;, &lt;code&gt;otool -tv test.o&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;    _main:
    0000000000000000        pushq   %rbp
    0000000000000001        movq    %rsp,%rbp
    0000000000000004        leaq    0x00000000(%rip),%rdi
    000000000000000b        callq   0x00000010
    0000000000000010        xorl    %eax,%eax
    0000000000000012        popq    %rbp
    0000000000000013        ret
&lt;/pre&gt;

&lt;p&gt;Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the &lt;code&gt;leaq&lt;/code&gt; instruction, the result is a load from the current value of &lt;code&gt;rip&lt;/code&gt;. The &lt;code&gt;callq&lt;/code&gt; instruction is a "signed offset" jump, which means that the offset of 0 calls the very next instruction in the code (address &lt;code&gt;0x10&lt;/code&gt; in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (&lt;code&gt;otool -r test.o&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;    Relocation information (__TEXT,__text) 2 entries
    address  pcrel length extern type    scattered symbolnum/value
    0000000c 1     2      1      2       0         4
    00000007 1     2      1      1       0         0
&lt;/pre&gt;

&lt;p&gt;The first entry says, "At offset &lt;code&gt;0xc&lt;/code&gt; in the &lt;code&gt;__TEXT,__text&lt;/code&gt; section, there is an unscattered, external, PC-relative &lt;code&gt;X86_64_RELOC_BRANCH&lt;/code&gt; reference of length 'long word' to the symbol at index 4 in the symbol table." A peek at the symbol table (&lt;code&gt;nm -ap&lt;/code&gt;) gives us:&lt;/p&gt;

&lt;pre&gt;    0000000000000014 s L_str
    0000000000000048 s EH_frame0
    0000000000000000 T _main
    0000000000000060 S _main.eh
                     U _puts
&lt;/pre&gt;

&lt;p&gt;The symbol at index 4 (the fifth entry) is &lt;code&gt;_puts&lt;/code&gt;. Similarly, the symbol at index 0 is &lt;code&gt;L_str&lt;/code&gt;, which will be relocated at offset &lt;code&gt;0x7&lt;/code&gt; of the object file (three bytes into the &lt;code&gt;leaq&lt;/code&gt; instruction). Finally, let's look at the result of linking this object into an executable (&lt;code&gt;clang test.c -o test -Os&lt;/code&gt;, &lt;code&gt;otool -tv test&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;    _main:
    0000000100000f36        pushq   %rbp
    0000000100000f37        movq    %rsp,%rbp
    0000000100000f3a        leaq    0x00000029(%rip),%rdi
    0000000100000f41        callq   0x100000f4a
    0000000100000f46        xorl    %eax,%eax
    0000000100000f48        popq    %rbp
    0000000100000f49        ret
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;ld&lt;/code&gt; has:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Located the &lt;code&gt;__TEXT&lt;/code&gt; segment at the standard executable load address for &lt;code&gt;x86_64&lt;/code&gt;, &lt;code&gt;0x0000000100000000&lt;/code&gt;, and the &lt;code&gt;__TEXT,__text&lt;/code&gt; section at &lt;code&gt;0xf36&lt;/code&gt; after that. The first &lt;code&gt;0xf35&lt;/code&gt; (actually, &lt;code&gt;0xa0f&lt;/code&gt;, since the larger offset doesn't account for the file's Mach-O header) bytes of &lt;code&gt;__TEXT&lt;/code&gt; are zeroed out. This aligns the &lt;code&gt;__TEXT&lt;/code&gt; segment flush up against the &lt;code&gt;__DATA&lt;/code&gt; segment. I don't know exactly why this is done, though I assume it has something to do with cache efficiency.&lt;/li&gt;
&lt;li&gt;Replaced &lt;code&gt;0&lt;/code&gt; with the actual offset from the end of the &lt;code&gt;leaq&lt;/code&gt; instruction to the &lt;code&gt;L_str&lt;/code&gt; symbol, which in this case is &lt;code&gt;0x29&lt;/code&gt;. The resulting address is &lt;code&gt;0x100000f6a&lt;/code&gt; (the address of the next instruction, &lt;code&gt;0x100000f41&lt;/code&gt;, plus &lt;code&gt;0x29&lt;/code&gt;), which a peek at the load commands (&lt;code&gt;otool -l test&lt;/code&gt;) tells us is the exact beginning of the &lt;code&gt;__TEXT,__cstring&lt;/code&gt; section.&lt;/li&gt;
&lt;li&gt;Replaced &lt;code&gt;0&lt;/code&gt; with the address of the symbol &lt;em&gt;stub&lt;/em&gt; for &lt;code&gt;puts()&lt;/code&gt;, which comes immediately after &lt;code&gt;main&lt;/code&gt;. Another peek at the load commands puts this in the &lt;code&gt;__TEXT,__stubs&lt;/code&gt; section, which we'll look at in detail later.&lt;/li&gt;
&lt;/ol&gt;
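
&lt;p&gt;The address arithmetic behind those PC-relative operands can be checked with a few lines of C. This is a sketch with a made-up helper name, &lt;code&gt;rip_relative_target&lt;/code&gt;; it assumes the instruction lengths implied by the listings above, such as five bytes for &lt;code&gt;callq&lt;/code&gt;.&lt;/p&gt;

```c
/* Compute the target of a PC-relative operand. The displacement is added
   to the address of the instruction that follows the current one, that is,
   the current instruction's address plus its length in bytes.
   The helper name is hypothetical. */
unsigned long long rip_relative_target(unsigned long long insn_addr,
                                       unsigned int insn_len,
                                       long long disp)
{
    return insn_addr + insn_len + (unsigned long long)disp;
}
```

&lt;p&gt;For the unlinked object file, &lt;code&gt;rip_relative_target(0xb, 5, 0)&lt;/code&gt; gives &lt;code&gt;0x10&lt;/code&gt;, the "very next instruction" that &lt;code&gt;otool&lt;/code&gt; displayed for the &lt;code&gt;callq&lt;/code&gt; before relocation.&lt;/p&gt;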

&lt;p&gt;Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity's sake I won't go into it here.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;What does &lt;code&gt;dyld&lt;/code&gt; do, anyway?&lt;/b&gt;&lt;br&gt;&lt;code&gt;dyld&lt;/code&gt; is actually responsible for quite a bit of work, all told. It (in roughly this order):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bootstraps itself based on the very simple raw stack set up for the process by the kernel.&lt;/li&gt;
&lt;li&gt;Recursively and cachingly loads all dependent dynamic libraries the executable links to into the process' memory space, including any necessary perusal of search paths from both the environment and the executable's "runpaths".&lt;/li&gt;
&lt;li&gt;Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.&lt;/li&gt;
&lt;li&gt;Runs static initializers for the executable.&lt;/li&gt;
&lt;li&gt;Sets up the parameters to the executable's &lt;code&gt;main&lt;/code&gt; function and calls it.&lt;/li&gt;
&lt;li&gt;During the process' execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the &lt;code&gt;dl*()&lt;/code&gt; API), and provides hooks for &lt;code&gt;gdb&lt;/code&gt; and other debuggers to get critical information.&lt;/li&gt;
&lt;li&gt;Runs static terminator routines after &lt;code&gt;main&lt;/code&gt; returns.&lt;/li&gt;
&lt;li&gt;In some scenarios, makes the required call to &lt;code&gt;libSystem&lt;/code&gt;'s &lt;code&gt;_exit&lt;/code&gt; routine once &lt;code&gt;main&lt;/code&gt; returns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll examine each step roughly in order.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Bootstrap&lt;/b&gt;&lt;br&gt;&lt;code&gt;dyld&lt;/code&gt; is the very first code run in a new process. In particular, a symbol by the very descriptive name of &lt;code&gt;__dyld_start&lt;/code&gt; is called. This happens due to a bit of magic in the kernel which notices the &lt;code&gt;LC_LOAD_DYLINKER&lt;/code&gt; load command in the main executable and uses the given dynamic linker's entry symbol as the process' initial instruction pointer. &lt;code&gt;__dyld_start&lt;/code&gt; performs the following pseudocode (the actual implementation is a compact bit of assembly code):&lt;/p&gt;

&lt;pre&gt;    noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
    {
        stack push 0 // debugger end of frames marker
        stack align 16 // SSE align stack
        uint64_t slide = __dyld_start - __dyld_start_static;
        void *glue = NULL;
        void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &amp;amp;glue);
        if (glue)
            push glue // pretend the return address is a glue routine in dyld
        else
            stack restore // undo stack stuff we did before
        goto *entry(argc, argv, envp, apple); // never returns
    }
&lt;/pre&gt;

&lt;p&gt;In retrospect, I'm not sure that pseudocode is any more sensible than the assembly would have been, but let's walk through it quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Push a 0 onto the stack, and align the stack to SSE requirements.&lt;/li&gt;
&lt;li&gt;Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of &lt;code&gt;__dyld_start&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;dyld&lt;/code&gt;'s actual bootstrap routine, which sets up some minimal state for &lt;code&gt;dyld&lt;/code&gt; itself (such as pulling in certain functions from &lt;code&gt;libSystem&lt;/code&gt; without actually linking to it and setting up Mach messaging) and then runs &lt;code&gt;dyld&lt;/code&gt;'s real &lt;code&gt;main&lt;/code&gt; routine, which does loading, linking, and initializers.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;dyld&lt;/code&gt; detected that the main executable uses the &lt;code&gt;LC_MAIN&lt;/code&gt; load command to set up its entry point, it returns the address of a glue routine which is responsible for calling &lt;code&gt;_exit&lt;/code&gt; when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it's the routine's return address; the &lt;code&gt;ret&lt;/code&gt; instruction at the end of that function will jump to that glue code.&lt;/li&gt;
&lt;li&gt;If, on the other hand, &lt;code&gt;dyld&lt;/code&gt; detected the executable using the older &lt;code&gt;LC_UNIXTHREAD&lt;/code&gt; load command, it simply restores the stack to its original state and jumps to that entry point, which will be the &lt;code&gt;start&lt;/code&gt; routine from crt1.o, the C runtime. The C runtime basically redoes all the work that &lt;code&gt;__dyld_start&lt;/code&gt; just did, minus the actual &lt;code&gt;dyld&lt;/code&gt; startup, which is one of the reasons it was replaced with the &lt;code&gt;LC_MAIN&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;Jump to the entry point.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;b&gt;Loading&lt;/b&gt;&lt;br&gt;Each time &lt;code&gt;dyld&lt;/code&gt; has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, "linking" sure has a lot of different uses, doesn't it?)&lt;/p&gt;

&lt;p&gt;Locating the correct binary on disk is &lt;em&gt;usually&lt;/em&gt; fairly simple. The &lt;code&gt;LC_LOAD_DYLIB&lt;/code&gt; command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path contains a special marker that tells &lt;code&gt;dyld&lt;/code&gt; to look somewhere else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@executable_path&lt;/code&gt; - Up to OS X 10.3, this was the only marker &lt;code&gt;dyld&lt;/code&gt; supported, and it had rather limited utility. &lt;code&gt;dyld&lt;/code&gt; will replace this marker with the full path to the main executable.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@loader_path&lt;/code&gt; - Added in 10.4, this marker is replaced with the full path to the binary which loaded the binary that is currently being loaded. This is not always the main executable, and primarily enabled frameworks to themselves embed frameworks without resorting to the "umbrella framework" mechanism, which Apple never made entirely public and actively discouraged the use of.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@rpath&lt;/code&gt; - When this marker was added in 10.5, there was much rejoicing. This marker is replaced in sequence with each "run path" embedded in the binary's loading binaries (recursively), enabling frameworks and dynamic libraries to finally be built only once and be used for both system-wide installation and embedding without changes to their install names, and allowing applications to provide alternate locations for a given library, or even override the location specified for a deeply embedded library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Linking&lt;/b&gt;&lt;br&gt;Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.&lt;/p&gt;

&lt;p&gt;At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It's not complicated; a lazy symbol's binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.&lt;/p&gt;

&lt;p&gt;Conceptually, binding a symbol is simple. In practice, it's rather interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Look up, in the binding information of the &lt;code&gt;__LINKEDIT&lt;/code&gt; segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for &lt;code&gt;_puts&lt;/code&gt; was at &lt;code&gt;0xf4a&lt;/code&gt; (plus some, I'm shortening for simplicity's sake!). If we were to disassemble the machine code at that address, we would get:&lt;/p&gt;

&lt;pre&gt;    Contents of (__TEXT,__stubs) section
    0000000100000f4a        jmp     *0x000000c0(%rip)
    Contents of (__TEXT,__stub_helper) section
    0000000100000f50        leaq    0x000000b1(%rip),%r11
    0000000100000f57        pushq   %r11
    0000000100000f59        jmp     *0x000000a1(%rip)
    0000000100000f5f        nop
    0000000100000f60        pushq   $0x00000000
    0000000100000f65        jmp     0x100000f50
&lt;/pre&gt;

&lt;p&gt;Wow, a nice simple jump instruction! Unfortunately, it's not &lt;em&gt;quite&lt;/em&gt; as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is...&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Look up, also in the binding information, the address of the symbol pointer for &lt;code&gt;puts&lt;/code&gt; in the &lt;code&gt;__DATA,__nl_symbol_ptr&lt;/code&gt; section. If this is a lazy symbol, look it up in the &lt;code&gt;__DATA,__la_symbol_ptr&lt;/code&gt; section instead. In our example executable, these sections look simply like this (using a hybrid of &lt;code&gt;otool&lt;/code&gt;'s output):&lt;/p&gt;

&lt;pre&gt;    Contents of (__DATA,__nl_symbol_ptr) section
    0000000100001000        dq      0x0000000000000000
    0000000100001008        dq      0x0000000000000000
    Contents of (__DATA,__la_symbol_ptr) section
    0000000100001010        dq      0x0000000100000f60
&lt;/pre&gt;

&lt;p&gt;In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;Update the address of the symbol pointer in the appropriate &lt;code&gt;__DATA&lt;/code&gt; section to the real address of the symbol in the loaded library. You're done!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So what, you may be asking, are all this crazy indirection and all these extra sections all about?&lt;/p&gt;

&lt;p&gt;Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can't put writable data in the &lt;code&gt;__TEXT&lt;/code&gt; segment, which is executable code. This means you can't update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Second, you can't put executable code in the &lt;code&gt;__DATA&lt;/code&gt; segment, which is writable data! So you can't just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.&lt;/p&gt;

&lt;p&gt;All this is true of lazily-bound symbols as well, but with a few caveats. &lt;code&gt;dyld&lt;/code&gt; does &lt;em&gt;not&lt;/em&gt; immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn't a simple 0, but rather points to the "stub helper". The stub helper is a bit of code embedded in the &lt;code&gt;__TEXT,__stub_helper&lt;/code&gt; section (really? who'd've guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for &lt;code&gt;dyld&lt;/code&gt;'s internal symbol binder. It doesn't show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to &lt;code&gt;dyld&lt;/code&gt;. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.&lt;/p&gt;
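
&lt;p&gt;To make the pointer-patching idea concrete, here is a minimal sketch in plain C. It is only an analogy, not what &lt;code&gt;dyld&lt;/code&gt; actually executes: a function pointer stands in for the lazy symbol pointer, and every name in it (&lt;code&gt;real_length&lt;/code&gt;, &lt;code&gt;stub_helper&lt;/code&gt;, &lt;code&gt;lazy_pointer&lt;/code&gt;, &lt;code&gt;call_via_stub&lt;/code&gt;) is made up for illustration.&lt;/p&gt;

```c
#include <string.h>

/* Hypothetical stand-in for a function that lives in a loaded library. */
static int real_length(const char *s) { return (int)strlen(s); }

static int resolve_count = 0;           /* how many times the "binder" ran */
static int stub_helper(const char *s);  /* forward declaration */

/* Plays the role of the lazy symbol pointer: it starts out aimed at the helper. */
static int (*lazy_pointer)(const char *) = stub_helper;

/* Plays the role of the stub helper plus the binder it jumps to. */
static int stub_helper(const char *s) {
    resolve_count++;
    lazy_pointer = real_length;   /* "bind": overwrite the symbol pointer */
    return lazy_pointer(s);       /* then carry on into the real function */
}

/* Plays the role of the stub: always an indirect jump through the pointer. */
static int call_via_stub(const char *s) { return lazy_pointer(s); }
```

&lt;p&gt;The first call to &lt;code&gt;call_via_stub&lt;/code&gt; goes through the helper and patches the pointer; every later call lands directly in &lt;code&gt;real_length&lt;/code&gt;, so &lt;code&gt;resolve_count&lt;/code&gt; never goes above 1.&lt;/p&gt;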

&lt;p&gt;&lt;b&gt;Static initializers, static terminators, and runtime services&lt;/b&gt;&lt;br&gt;Most of the interesting stuff has already happened at this point. &lt;code&gt;dyld&lt;/code&gt; will run any static initializers in the executable (most often constructors for global C++ objects and &lt;code&gt;+load&lt;/code&gt; methods for Objective-C classes, though there are also &lt;code&gt;__attribute__((constructor))&lt;/code&gt; functions for plain C). A list of initializers is stored in a separate &lt;code&gt;__DATA,__mod_init_func&lt;/code&gt; section in the binary, and is simply a set of addresses into the &lt;code&gt;__TEXT,__text&lt;/code&gt; section which &lt;code&gt;dyld&lt;/code&gt; calls in order of appearance. Initializer functions are passed the same arguments as &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the process exits, &lt;code&gt;dyld&lt;/code&gt; will also run static terminators, which mostly means static destructors for C++ objects and &lt;code&gt;__attribute__((destructor))&lt;/code&gt; functions. These are handled just like static initializers, except that they're stored in &lt;code&gt;__DATA,__mod_term_func&lt;/code&gt; and take no parameters. Static terminators run in the same context as an &lt;code&gt;atexit()&lt;/code&gt; function.&lt;/p&gt;
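
&lt;p&gt;For the plain C case, a small sketch of what such functions look like, assuming a GCC- or Clang-compatible compiler. The function names are arbitrary; only the two attributes matter.&lt;/p&gt;

```c
#include <stdio.h>

static int init_ran = 0;

/* Runs before main(); the compiler records this function's address in the
   binary's list of initializers. */
__attribute__((constructor))
static void on_load(void) {
    init_ran = 1;
}

/* Runs when the process exits normally, after main() returns or exit() is called. */
__attribute__((destructor))
static void on_unload(void) {
    fputs("terminator ran\n", stderr);
}

/* Lets other code check whether the initializer has already executed. */
int initializer_has_run(void) { return init_ran; }
```

&lt;p&gt;By the time &lt;code&gt;main&lt;/code&gt; starts, &lt;code&gt;initializer_has_run()&lt;/code&gt; already returns 1, and the message from &lt;code&gt;on_unload&lt;/code&gt; shows up on standard error as the process exits.&lt;/p&gt;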

&lt;p&gt;Finally, &lt;code&gt;dyld&lt;/code&gt; provides runtime services to binaries it has loaded. The &lt;code&gt;dl*()&lt;/code&gt; APIs are the preferred interface to &lt;code&gt;dyld&lt;/code&gt;'s services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dlopen&lt;/code&gt; - Performs the load stage of loading a dynamic library, can optionally partially or completely perform the bind stage.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dlsym&lt;/code&gt; - Look up a symbol in a dynamic library (or the entire process). At its simplest, this is no more than a "name to address" lookup.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dladdr&lt;/code&gt; - The inverse of &lt;code&gt;dlsym&lt;/code&gt;, transforming an address into a set of symbol information.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dlclose&lt;/code&gt; - Unloads a dynamic library from the process, if no other handles to it are in use. Unloading invalidates all the symbols provided by the dynamic library and can be something of a touchy operation, particularly in an Objective-C environment.&lt;/li&gt;
&lt;/ul&gt;
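
&lt;p&gt;A minimal sketch of these calls working together, using only the POSIX &lt;code&gt;dlopen&lt;/code&gt;, &lt;code&gt;dlsym&lt;/code&gt;, and &lt;code&gt;dlclose&lt;/code&gt; functions. The wrapper name &lt;code&gt;call_strlen_dynamically&lt;/code&gt; is made up, and looking up &lt;code&gt;strlen&lt;/code&gt; is just a convenient choice of a symbol that is sure to be loaded already.&lt;/p&gt;

```c
#include <dlfcn.h>
#include <stddef.h>

/* Finds strlen by name at runtime and calls it on `text`.
   Returns -1 if the handle or the symbol could not be obtained. */
long call_strlen_dynamically(const char *text) {
    /* A NULL path asks for a handle covering the main program and the
       libraries already loaded with it. */
    void *handle = dlopen(NULL, RTLD_LAZY);
    if (handle == NULL)
        return -1;

    /* dlsym is the "name to address" lookup; the cast turns the returned
       address back into a callable function pointer. */
    size_t (*fn)(const char *) = (size_t (*)(const char *))dlsym(handle, "strlen");
    long result = (fn != NULL) ? (long)fn(text) : -1;

    dlclose(handle);
    return result;
}
```

&lt;p&gt;Calling &lt;code&gt;call_strlen_dynamically("dyld")&lt;/code&gt; should give 4. Real code would also consult &lt;code&gt;dlerror()&lt;/code&gt; to find out why a lookup failed.&lt;/p&gt;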

&lt;p&gt;&lt;b&gt;What's missing&lt;/b&gt;&lt;br&gt;While I've gone over quite a bit, I've also left out a &lt;em&gt;lot&lt;/em&gt; of information in this article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two-level namespaces, which prevent trivial symbol collisions in dynamic libraries&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;dyld&lt;/code&gt; shared cache, which maintains a systemwide map of already-loaded dynamic libraries for fast binding&lt;/li&gt;
&lt;li&gt;Rebasing&lt;/li&gt;
&lt;li&gt;Code signing&lt;/li&gt;
&lt;li&gt;Dynamic library linking&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dyld&lt;/code&gt;'s expansive set of environment variables&lt;/li&gt;
&lt;li&gt;"Restricted" binaries (particularly &lt;code&gt;setuid&lt;/code&gt; binaries)&lt;/li&gt;
&lt;li&gt;Most of the kernel's interaction with &lt;code&gt;dyld&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Compression and encryption in Mach-O binaries&lt;/li&gt;
&lt;li&gt;How &lt;code&gt;dyld&lt;/code&gt; itself is built&lt;/li&gt;
&lt;li&gt;Symbol interposing&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dyld&lt;/code&gt;'s operation on i386 and ARM, which is conceptually the same, but both architectures differ significantly in the details&lt;/li&gt;
&lt;li&gt;Details of the Mach-O binary format&lt;/li&gt;
&lt;li&gt;How "fat" binaries are handled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've left these out for two reasons: One, I was a bit behind when writing this article and just didn't have time to put it all in, and two, there really isn't space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and &lt;code&gt;dyld&lt;/code&gt; are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn't seem too interested in updating the documentation):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.apple.com/library/mac/#documentation/developertools/conceptual/MachOTopics/0-Introduction/introduction.html"&gt;Apple's Mach-O documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://developer.apple.com/library/mac/#documentation/developertools/conceptual/MachORuntime/Reference/reference.html"&gt;Apple's Mach-O reference&lt;/a&gt;&lt;br&gt;
&lt;a href="file:///usr/include/mach-o/loader.h"&gt;The Mach-O "loader" header, a very good reference (also look at other files in the &lt;code&gt;mach-o/&lt;/code&gt; directory)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html"&gt;Apple's dyld Reference&lt;/a&gt;&lt;br&gt;
&lt;a href="http://developer.apple.com/library/Mac/#documentation/Darwin/Reference/ManPages/man3/dlopen.3.html"&gt;The dlopen(3) manpage&lt;/a&gt;&lt;br&gt;
&lt;a href="http://developer.apple.com/library/mac/#releasenotes/DeveloperTools/RN-dyld/_index.html"&gt;dyld's Release Notes&lt;/a&gt;&lt;br&gt;
&lt;a href="http://opensource.apple.com/source/dyld/dyld-210.2.3/"&gt;dyld's source code as of 10.8.2&lt;/a&gt;&lt;br&gt;
&lt;a href="http://opensource.apple.com/source/xnu/xnu-2050.18.24/"&gt;Kernel source code as of 10.8.2 (look at &lt;code&gt;bsd/kern/kern_exec.c&lt;/code&gt; and &lt;code&gt;bsd/kern/mach_loader.c&lt;/code&gt; in particular)&lt;/a&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;&lt;code&gt;dyld&lt;/code&gt; is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and &lt;code&gt;dyld&lt;/code&gt; has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of &lt;code&gt;dyld&lt;/code&gt;, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you'll have a better idea of where to look for the problem. Then again, maybe not; &lt;code&gt;ld&lt;/code&gt; can be pretty obstructive.&lt;/p&gt;

&lt;p&gt;That's all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;</description><author>Gwynne Raskind</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-11-09-dyld-dynamic-linking-on-os-x.html</guid><pubDate>Fri, 09 Nov 2012 15:51:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-11-02: Building the FFT
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-11-02-building-the-fft.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-11-02: Building the FFT
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 11 02  14 29"
                  tags="fridayqa audio fft"
            author="Chris Liscio"
            authorlink="http://supermegaultragroovy.com"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-11-02: Building the FFT
&lt;/div&gt;
              &lt;p&gt;In the &lt;a href="friday-qa-2012-10-26-fourier-transforms-and-ffts.html"&gt;last post in this mini-series&lt;/a&gt;, Mike gave an overview of the Fourier Transform and then showed you how to use Apple's implementation of the Fast Fourier Transform (FFT).&lt;/p&gt;

&lt;p&gt;In this installment, I'll show you how we get from point A to point B. Specifically, I'll talk a bit about the magic behind the Fast Fourier Transform.&lt;script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;A bit of math (British localization: Some maths)&lt;/b&gt;&lt;br&gt;Fourier Transforms have some interesting mathematical properties. Most importantly, Fourier Transforms are a &lt;em&gt;linear&lt;/em&gt; operation. That is, if we use &lt;script type="math/tex"&gt;\mathcal{F}(h(t))&lt;/script&gt; to denote the Fourier Transform of &lt;script type="math/tex"&gt;h(t)&lt;/script&gt;:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
\mathcal{F}(h_a(t) + h_b(t)) &amp;= \mathcal{F}(h_a(t)) + \mathcal{F}(h_b(t)) \text{; and} \\
\mathcal{F}(K \cdot h(t)) &amp;= K \cdot \mathcal{F}(h(t))
\end{align}
&lt;/script&gt;

&lt;p&gt;So whether you scale or add two signals in the time or frequency domain is up to you. That's pretty handy when you're working on signal processing code. But I digress.&lt;/p&gt;

&lt;p&gt;In addition to the above, there are some more interesting properties relating the time and frequency domain representations of a function. We'll use &lt;script type="math/tex"&gt;h(t) \Leftrightarrow H(f)&lt;/script&gt; to relate the time and frequency domain representations.&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
h(C \cdot t) &amp;\Leftrightarrow \frac{1}{|C|} \cdot H(\frac{f}{C}) &amp;&amp; \text{scaling time}\\
\frac{1}{|C|} \cdot h(\frac{t}{C}) &amp;\Leftrightarrow H(C \cdot f) &amp;&amp; \text{scaling frequency}\\
h(t-t_0) &amp;\Leftrightarrow H(f) \cdot e^{2 \pi i f \cdot t_0} &amp;&amp; \text{shifting time} \\
h(t) \cdot e^{-2 \pi i t \cdot f_0} &amp;\Leftrightarrow H(f - f_0) &amp;&amp; \text{shifting frequency}
\end{align}
&lt;/script&gt;

&lt;p&gt;Don't focus too much on the individual relations themselves. The main point that I'm trying to get across is that we can manipulate the Fourier Transform and the signal in meaningful ways, and relate those changes between domains.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The Discrete Fourier Transform&lt;/b&gt;&lt;br&gt;Before we continue, I'd like to make a clarification of sorts. The mathematical properties above are using terminology specific to Fourier Transforms of continuous functions defined over infinite time and frequency.&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
H(f) = \int_{t=-\infty}^{\infty} h(t) e^{2\pi ift} dt
\end{align}
&lt;/script&gt;

&lt;p&gt;Obviously we're working in a digital world, and we don't have the luxury of continuous signals to work with. In software, we're dealing with sampled data, which is where the Discrete Fourier Transform comes in.&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
H_k = \sum_{j=0}^{N-1} h_j e^{2\pi ijk/N}
\end{align}
&lt;/script&gt;

&lt;p&gt;This is basically what Mike already gave you in code form in his &lt;a href="friday-qa-2012-10-26-fourier-transforms-and-ffts.html"&gt;last post on the topic&lt;/a&gt;, and I encourage you to take another look at his code to understand how it relates to this equation.&lt;/p&gt;

&lt;p&gt;What's important is that you understand that the Discrete and Continuous Fourier Transforms are closely related, and have almost exactly the same mathematical properties as described above. For what we're discussing, you can safely ignore the "almost" part, and look to &lt;a href="http://en.wikipedia.org/wiki/Discrete_Fourier_transform"&gt;Wikipedia's definition&lt;/a&gt; for more discussion.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The Danielson-Lanczos lemma&lt;/b&gt;&lt;br&gt;Using a combination of the mathematical properties of the Fourier Transform above, Danielson and Lanczos discovered that you can rewrite a Discrete Fourier Transform of length N as a sum of two Discrete Fourier Transforms of length N/2: one from the even-numbered and one from the odd-numbered points of input.&lt;/p&gt;

&lt;p&gt;This is their proof:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
H_k &amp;= \sum_{j=0}^{N-1} e^{2\pi ijk/N} h_j \\
&amp;= \sum_{j=0}^{N/2-1} e^{2\pi ik(2j)/N} h_{2j} +  \sum_{j=0}^{N/2-1} e^{2\pi ik(2j+1)/N} h_{2j+1} \\
&amp;= \sum_{j=0}^{N/2-1} e^{2\pi ikj/(N/2)} h_{2j} + W^k \sum_{j=0}^{N/2-1} e^{2\pi ikj/(N/2)} h_{2j+1} \\
&amp;= H_k^e + W^k H_k^o
\end{align}
&lt;/script&gt;

&lt;p&gt;Here, &lt;script type="math/tex"&gt;W^k = e^{2\pi i k/N}&lt;/script&gt;, and &lt;script type="math/tex"&gt;H_k^{e}&lt;/script&gt; and &lt;script type="math/tex"&gt;H_k^o&lt;/script&gt; are the even and odd terms of &lt;script type="math/tex"&gt;H_k&lt;/script&gt;, respectively.&lt;/p&gt;

&lt;p&gt;Again, it's not important to totally understand the above, or how they got from A to B. This is where I wave my hands and defer to the fact that the mathematical properties of the Discrete Fourier Transform have been used in the derivation.&lt;/p&gt;
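
&lt;p&gt;For anyone who does want the missing step between the second and third lines of the proof, it is mostly exponent bookkeeping. A short worked version, using the same symbols as above:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
e^{2\pi ik(2j)/N} &amp;= e^{2\pi ikj/(N/2)} \\
e^{2\pi ik(2j+1)/N} &amp;= e^{2\pi ik/N} \cdot e^{2\pi ikj/(N/2)} = W^k \cdot e^{2\pi ikj/(N/2)} \\
H_{k+N/2}^e &amp;= H_k^e, \quad H_{k+N/2}^o = H_k^o, \quad W^{k+N/2} = e^{\pi i} \cdot W^k = -W^k
\end{align}
&lt;/script&gt;

&lt;p&gt;The last line is why the two half-length transforms only need to be computed for &lt;script type="math/tex"&gt;k &amp;lt; N/2&lt;/script&gt;: the upper half of the output reuses them with a flipped sign, &lt;script type="math/tex"&gt;H_{k+N/2} = H_k^e - W^k H_k^o&lt;/script&gt;.&lt;/p&gt;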

&lt;p&gt;&lt;b&gt;The Fast Fourier Transform&lt;/b&gt;&lt;br&gt;The discovery by Danielson and Lanczos, combined with the fact that it can be applied recursively, is the basis of the Fast Fourier Transform. Breaking the problem down one more step, we will end up with a combination of &lt;script type="math/tex"&gt;H_k^{ee}&lt;/script&gt;, &lt;script type="math/tex"&gt;H_k^{eo}&lt;/script&gt;, &lt;script type="math/tex"&gt;H_k^{oe}&lt;/script&gt;, and &lt;script type="math/tex"&gt;H_k^{oo}&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;If we stick with power-of-two inputs to the Fourier Transform, we will guarantee that the problem continues to decompose until we reach a Fourier Transform of a single element. And, guess what? That's just a copy of the input value:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\begin{align}
H_k^{eeoeoe \dots eoe} = h_n &amp;&amp; \text{for some value of } n
\end{align}
&lt;/script&gt;

&lt;p&gt;There is a way to derive the value of &lt;script type="math/tex"&gt;n&lt;/script&gt;, but I'm choosing to wave my hands again.&lt;/p&gt;

&lt;p&gt;Instead, I'm going to let recursion do the work for us. I vote for this option, to keep this post from exploding out of control. Also, get your hands on a copy of &lt;a href="http://www.nr.com"&gt;Numerical Recipes&lt;/a&gt; to really go deep.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;(Fairly) Straightforward Implementation&lt;/b&gt;&lt;br&gt;I put together a compact implementation that demonstrates how this all works. It is based on the first optimization, because I think it's a good balance of readability with just a hint of cleverness.  (I started out based on &lt;a href="http://en.literateprograms.org/Cooley-Tukey_FFT_algorithm_%28C%29"&gt;this resource&lt;/a&gt;, but massaged the implementation for clarity, and to closely match my math above.)&lt;/p&gt;

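&lt;pre&gt;    #include &amp;lt;complex.h&amp;gt;
    #include &amp;lt;math.h&amp;gt;
    #include &amp;lt;stdlib.h&amp;gt;

    static complex double *FFT_recurse( complex double *x, int N, int skip ) {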
        complex double *X = (complex double*)malloc( sizeof(complex double) * N );
        complex double *O, *E;

        // We've hit the scalar case, and copy the input to the output.
        if ( N == 1 ) {
            X[0] = x[0];
            return X;
        }

        E = FFT_recurse( x, N/2, skip * 2 );
        O = FFT_recurse( x + skip, N/2, skip * 2 );

        for ( int k = 0; k &amp;lt; N / 2; k++ ) {
            O[k] = ( cexp( 2.0 * I * M_PI * k / N ) * O[k] );
        }

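        // While E[k] and O[k] are of length N/2, and X[k] is of length N, E[k] and
        // O[k] are periodic in k with length N/2. See p.609 of Numerical Recipes
        // in C (3rd Ed, 2007). [CL]
        // The twiddle factor flips sign in the upper half, W^(k+N/2) = -W^k, so
        // the second output subtracts the (already twiddled) odd term.
        for ( int k = 0; k &amp;lt; N / 2; k++ ) {
            X[k] = E[k] + O[k];
            X[k + N/2] = E[k] - O[k];
        }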

        free( O );
        free( E );

        return X;
    }

    complex double *FFT( complex double *x, int N ) {
        return FFT_recurse( x, N, 1 );
    }
&lt;/pre&gt;

&lt;p&gt;It's really not that complicated, but the improvement in performance is immense. I put together some driver code and &lt;a href="https://github.com/liscio/fft"&gt;tossed it all up on my github account&lt;/a&gt;. Some simple timings revealed that a straightforward "math definition" of the algorithm took approximately 12.1s, and the FFT implementation above took a mere 0.1s. More than a 100x speed increase, and that's with a whole bunch of &lt;code&gt;malloc()&lt;/code&gt;s and &lt;code&gt;free()&lt;/code&gt;s strewn about!&lt;/p&gt;

&lt;p&gt;&lt;b&gt;In closing&lt;/b&gt;&lt;br&gt;I hope that this explanation was somewhat helpful in demystifying the Fast Fourier Transform and how it works. It's one of many examples of algorithms that exploit mathematics in order to gain an order-of-magnitude speedup.&lt;/p&gt;

&lt;p&gt;Oh, and in case Mike didn't make it clear, you should &lt;em&gt;never implement this yourself&lt;/em&gt;. Use the vDSP routines in Accelerate.framework!&lt;/p&gt;

&lt;p&gt;Apple's performance team continues to push the limits of their FFT implementation year after year on all platforms. A combination of mathematicians, physicists, engineers, scientists, and assembly language wizards is working hard to ensure that Accelerate.framework is always running as fast and as power-efficiently as possible.&lt;/p&gt;

&lt;p&gt;I'm of the opinion that Apple's performance team is largely responsible for what makes the "cool stuff" on the iPhone possible. Think about live audio effects in GarageBand, video effects in iMovie, the processing in iPhoto, and so forth. All of that stuff depends on Accelerate.framework in some capacity.&lt;/p&gt;

&lt;p&gt;The next time you visit WWDC, make some time to stop by their lab with any performance questions you may have that relate to Accelerate.framework. They're really nice folks, and astoundingly smart, too!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Chris Liscio</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-11-02-building-the-fft.html</guid><pubDate>Fri, 02 Nov 2012 14:29:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-10-26: Fourier Transforms and FFTs
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-10-26-fourier-transforms-and-ffts.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-10-26: Fourier Transforms and FFTs
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 10 26  13 25"
                  tags="fridayqa audio coreaudio"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-10-26: Fourier Transforms and FFTs
&lt;/div&gt;
              &lt;p&gt;&lt;a href="friday-qa-2012-10-12-obtaining-and-interpreting-audio-data.html"&gt;Last time around&lt;/a&gt; I discussed the basics of audio data, how to get it, and how to understand it. Today, I'm going to go into some detail about one of the fundamental tools for more complex audio analysis, the Fourier transform, and the FFT algorithm that makes it practical to use on computers.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Theory&lt;/b&gt;&lt;br&gt;Humans largely experience audio in the frequency domain. If you plot the waveform of a guitar chord, it's a messy thing, yet we experience it as a combination of individual notes, or frequencies. Analyzing audio by frequencies is tremendously useful, not least because it matches up much better with how we experience sound. Computers generally consider audio to be a collection of individual samples, while people consider it to be a collection of individual frequencies. Going from individual samples to frequencies is, essentially, the Fourier transform.&lt;/p&gt;

&lt;p&gt;Imagine that you have an audio waveform that you think has a 200Hz frequency component in it. However, you're not sure, and you don't know how loud it is. How can you check?&lt;/p&gt;

&lt;p&gt;First, let's slow the audio down by a factor of 1000 so this 200Hz frequency is on a more human timescale. With that factor of 1000 slowdown, a 200Hz frequency is a waveform that goes up, down, and back to the starting point every five seconds. The frequency is now far too low to hear, but easy to see, and convenient to analyze. This is a bit like how computers deal with audio, since they're fast and can devote a lot of attention to every individual audio sample.&lt;/p&gt;

&lt;p&gt;The mystery audio has a complex waveform which goes up and down in what seems like a nearly random fashion. But you suspect that this 200Hz pattern, a consistent up and down every five seconds, is embedded in the rest of the action.&lt;/p&gt;

&lt;p&gt;You could check for this by just tapping out a beat once every five seconds, and checking where the waveform is at each beat. If it's consistently up, even after you've done this for a couple of minutes, then there's clearly some consistent once-every-five-seconds signal in there.&lt;/p&gt;

&lt;p&gt;There's a problem with this, though. The 200Hz frequency crosses zero on its way between the hills and valleys of the wave. What if your beat happens to line up with those zero crossings? You'll get nothing, even though the signal is present. You'd have to repeat this simple analysis several times to make sure you didn't end up out of phase with the signal.&lt;/p&gt;

&lt;p&gt;It would be better to have a single analysis that works with any phase, and has no chance of consistently hitting zero crossings. You want something that lines up with this once-every-five-seconds period, but still operates constantly. In short, you want a circle.&lt;/p&gt;

&lt;p&gt;Get the waveform going so you can see it. It's moving up, down, back up, etc. Now, imagine you stand up and start turning in place. Make one complete turn once every five seconds.&lt;/p&gt;

&lt;p&gt;Now start walking. Adjust your speed to match the waveform at any particular moment. If the wave is high, walk forward quickly. If it's near zero, walk slowly. If it's at zero, stop. When it's below zero, walk backwards. And keep turning at a rate of one complete turn every five seconds.&lt;/p&gt;

&lt;p&gt;If you keep this up for a couple of minutes, you'll end up with a solid analysis of the 200Hz signal, or lack thereof, in the audio you're analyzing. Because you're turning at a rate that exactly matches the frequency you're looking for, any audio at that frequency will consistently move you in a single direction. As you turn, the 200Hz wave will crest at exactly the same point around the turn each time, and you'll begin to wander away from where you started.&lt;/p&gt;

&lt;p&gt;All the parts of the sound that aren't at 200Hz will move you around too, of course. But because they don't match the speed at which you turn, they end up getting spread out all along the circle, and the net result is to bring you back to the center. Only the 200Hz component will consistently move you away. How far away you get tells you how strong that component is. The direction you move in tells you where the wave started, or its phase.&lt;/p&gt;

&lt;p&gt;Do this over and over again, and you can characterize the whole sound. For example, you might do this at 20Hz, 40Hz, 60Hz, 80Hz, 100Hz, etc., all the way up to 20,000Hz or so. At the end, you'll have completely mapped out the strength and phase of all of the frequency components of the sound. A sound can be thought of as a sum of various sine waves, or pure tones. Any sound, or indeed any signal at all, can be represented as a collection of sine waves with the right frequency and phase. By mapping out the frequencies in the sound, you build up the fundamental component sine waves it contains.&lt;/p&gt;

&lt;p&gt;This map is the Fourier transform of the sound. You spin the sound around in a circle at different frequencies and come up with all of its components, and get the Fourier transform. You've now gone from a collection of individual samples, which tell you very little, to a comprehensive map of the fundamental components of the audio, which tells you a lot. The Fourier transform is, essentially, what your ear does with incoming sound, and we fundamentally hear sound by frequencies, not samples, so it's a much better match for human perception.&lt;/p&gt;

&lt;p&gt;You can also take the Fourier transform and reverse it to obtain the original audio. If you modify the transformed frequency components, that modification shows up in the original audio. This can be used to reduce or increase the volume of certain frequency components, or perform other interesting changes on the original audio.&lt;/p&gt;

&lt;p&gt;The Fast Fourier Transform, or FFT, is pretty much what the name implies. It's an algorithm for quickly computing the Fourier transform of a sequence of values. While FFT just refers to the algorithm, and Fourier transform is the actual result that it produces, FFT is often used pretty much interchangeably for both.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Basic Implementation&lt;/b&gt;&lt;br&gt;Let's build a basic Fourier transform. The fundamental operation is the rotation procedure I described above, so let's build a function to perform that:&lt;/p&gt;

&lt;pre&gt;    COMPLEX Rotate(float *buf, int samples, float hz, float rate)
    {
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;COMPLEX&lt;/code&gt; is a data type defined in Apple's &lt;code&gt;vDSP&lt;/code&gt; library that simply has two components, &lt;code&gt;real&lt;/code&gt; and &lt;code&gt;imag&lt;/code&gt;. The output of a Fourier transform is actually a sequence of complex numbers, which correspond to the two-dimensional result of the rotation procedure. The Fourier transform is actually defined in terms of &lt;a href="http://en.wikipedia.org/wiki/Exponentiation#Complex_exponents_with_positive_real_bases"&gt;complex number exponents&lt;/a&gt;, which conceptually maps to rotation in two dimensions.&lt;/p&gt;

&lt;p&gt;The goal is to add all of the various impulses together, so we want a &lt;code&gt;COMPLEX&lt;/code&gt; variable to serve as an accumulator:&lt;/p&gt;

&lt;pre&gt;        COMPLEX result = { 0, 0 };
&lt;/pre&gt;

&lt;p&gt;Then loop over all of the samples:&lt;/p&gt;

&lt;pre&gt;        for(int i = 0; i &amp;lt; samples; i++)
        {
&lt;/pre&gt;

&lt;p&gt;We want to rotate to make a complete circle once every &lt;code&gt;rate / hz&lt;/code&gt; samples. The following line computes the appropriate angle (in radians) for the current sample:&lt;/p&gt;

&lt;pre&gt;            float angle = i * hz * 2 * M_PI / rate;
&lt;/pre&gt;

&lt;p&gt;Now we use that angle to compute the step to take. This is simply the value of &lt;code&gt;buf[i]&lt;/code&gt;, rotated by &lt;code&gt;angle&lt;/code&gt;, which is a simple bit of trigonometry:&lt;/p&gt;

&lt;pre&gt;            float real = buf[i] * cos(angle);
            float imag = buf[i] * sin(angle);
&lt;/pre&gt;

&lt;p&gt;We then add that step into the accumulator, and continue with the loop:&lt;/p&gt;

&lt;pre&gt;            result.real += real;
            result.imag += imag;
        }
&lt;/pre&gt;

&lt;p&gt;Once the loop is done, just return the accumulated result:&lt;/p&gt;

&lt;pre&gt;        return result;
    }
&lt;/pre&gt;

&lt;p&gt;Note that the results of this function won't exactly match a reference FFT function's output. The Fourier transform is conventionally defined with a negative angle, which flips the sign of the imaginary component, and vDSP's real FFT scales its output by a factor of two. However, this still does fine in illustrating the principle and getting usable data out.&lt;/p&gt;
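
&lt;p&gt;To see the rotation procedure pick out a frequency, here's a minimal sketch that runs it on a synthetic tone. It's portable C rather than vDSP code: &lt;code&gt;SketchComplex&lt;/code&gt;, &lt;code&gt;RotateSketch&lt;/code&gt;, &lt;code&gt;FillSine&lt;/code&gt;, and &lt;code&gt;Magnitude&lt;/code&gt; are names made up for this example, and the accumulator uses &lt;code&gt;double&lt;/code&gt; to keep rounding error down.&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;math.h&amp;gt;

    /* Hypothetical stand-in for vDSP's COMPLEX, so this builds anywhere. */
    typedef struct { float real, imag; } SketchComplex;

    static const double kTwoPi = 6.283185307179586;

    /* Same procedure as Rotate, accumulating in double to limit rounding error. */
    SketchComplex RotateSketch(const float *buf, int samples, float hz, float rate)
    {
        double real = 0, imag = 0;
        for(int i = 0; i &amp;lt; samples; i++)
        {
            double angle = i * (double)hz * kTwoPi / rate;
            real += buf[i] * cos(angle);
            imag += buf[i] * sin(angle);
        }
        SketchComplex result = { (float)real, (float)imag };
        return result;
    }

    /* Fill buf with a full-amplitude sine wave at the given frequency. */
    void FillSine(float *buf, int samples, float hz, float rate)
    {
        for(int i = 0; i &amp;lt; samples; i++)
            buf[i] = (float)sin(i * (double)hz * kTwoPi / rate);
    }

    /* Distance from the starting point at the end of the walk. */
    float Magnitude(SketchComplex c)
    {
        return sqrtf(c.real * c.real + c.imag * c.imag);
    }
&lt;/pre&gt;

&lt;p&gt;Fill a 1000-sample buffer with a 200Hz sine at a sample rate of 1000, and &lt;code&gt;RotateSketch&lt;/code&gt; at 200Hz ends up about 500 units from the origin, which is &lt;code&gt;samples / 2&lt;/code&gt;. Rotating at 300Hz instead brings you back to essentially zero, just as the walking analogy predicts.&lt;/p&gt;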

&lt;p&gt;To perform a Fourier transform, we then just call &lt;code&gt;Rotate&lt;/code&gt; repeatedly with different frequencies. Which frequencies? The lowest frequency the Fourier transform can extract (aside from zero) is equal to the sample rate divided by the number of samples, which is the frequency where one complete waveform can fit within the samples given. The transform produces &lt;code&gt;samples / 2&lt;/code&gt; bins of output, where each bin is a complex number. (Technically, it produces &lt;code&gt;samples&lt;/code&gt; bins of output, but with real-valued input, the output is symmetric and half of it can be neglected.) The first bin (bin &lt;code&gt;0&lt;/code&gt;) corresponds to a frequency of 0Hz, also called the DC offset. The second bin corresponds to &lt;code&gt;rate / samples&lt;/code&gt;Hz. The third bin corresponds to &lt;code&gt;2 * rate / samples&lt;/code&gt;Hz, and so on like that, with each bin's frequency equal to &lt;code&gt;i * rate / samples&lt;/code&gt;.&lt;/p&gt;
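
&lt;p&gt;The mapping between bins and frequencies is simple enough to wrap in a pair of helpers. This is just a sketch of the arithmetic above; &lt;code&gt;BinFrequency&lt;/code&gt; and &lt;code&gt;BinForFrequency&lt;/code&gt; are hypothetical names, and rounding to the nearest bin is my own choice.&lt;/p&gt;

&lt;pre&gt;    /* Frequency, in Hz, that output bin `bin` corresponds to. */
    float BinFrequency(int bin, int samples, float rate)
    {
        return bin * rate / samples;
    }

    /* Nearest bin for a frequency in Hz (assumes hz is not negative). */
    int BinForFrequency(float hz, int samples, float rate)
    {
        return (int)(hz * samples / rate + 0.5f);
    }
&lt;/pre&gt;

&lt;p&gt;With 1024 samples at 44.1kHz, each bin is about 43Hz wide, so a 200Hz tone lands nearest to bin 5.&lt;/p&gt;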

&lt;p&gt;Given that, here is a quick routine to compute the Fourier transform of a collection of samples:&lt;/p&gt;

&lt;pre&gt;    int samples = ...;
    float *buf = ...;
    float rate = ...; // sample rate of buf, in Hz

    COMPLEX result[samples / 2];
    for(int i = 0; i &amp;lt; samples / 2; i++)
        result[i] = Rotate(buf, samples, i * rate / samples, rate);
&lt;/pre&gt;

&lt;p&gt;This is great code to play with and makes it a lot easier to understand what's happening when you generate a Fourier transform. However, it's impractical for any real use, as it's &lt;em&gt;far&lt;/em&gt; too slow. For speed, we want an FFT implementation, and Apple has a good one.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;vDSP FFT&lt;/b&gt;&lt;br&gt;Apple's vDSP provides optimized digital signal processing routines, including a full suite of FFT functions. For those wondering, "DSP" stands for digital signal processing, and the "v" stands for vector, indicating that it will use your CPU's vector unit whenever possible for best speed.&lt;/p&gt;

&lt;p&gt;The FFT functions need some initial setup done which can be reused across multiple calls. To reduce overhead, you perform the setup separately, then use the resulting data as many times as you want before destroying it with &lt;code&gt;vDSP_destroy_fftsetup&lt;/code&gt;. The &lt;code&gt;vDSP_create_fftsetup&lt;/code&gt; function handles the setup. It takes the maximum amount of data you want to work with, and a radix. The amount of data is specified as the base-2 logarithm of the length, so to specify that you want to work with &lt;code&gt;1024&lt;/code&gt; samples at a time, pass in &lt;code&gt;10&lt;/code&gt;. For the radix, pass &lt;code&gt;kFFTRadix2&lt;/code&gt;. vDSP also supports radix values of &lt;code&gt;3&lt;/code&gt; and &lt;code&gt;5&lt;/code&gt;, which allows it to work with data whose length is a multiple of &lt;code&gt;3&lt;/code&gt; or &lt;code&gt;5&lt;/code&gt;, but this is not generally useful. Powers of two are convenient in programming, so we'll stick with that. Here is the code to set up the vDSP FFT for &lt;code&gt;1024&lt;/code&gt; samples:&lt;/p&gt;

&lt;pre&gt;    int bufferFrames = 1024;
    int bufferlog2 = round(log2(bufferFrames));
    FFTSetup fftSetup = vDSP_create_fftsetup(bufferlog2, kFFTRadix2);
&lt;/pre&gt;

&lt;p&gt;Next, the input data has to be transformed into the arrangement that vDSP expects. The output is provided as two distinct arrays, one for the real parts and one for the imaginary parts. To keep things simple (for vDSP, not for us), the input is also split into two arrays, and the FFT performed in place. Here are the two arrays, plus a little structure to hold them:&lt;/p&gt;

&lt;pre&gt;    float outReal[bufferFrames / 2];
    float outImaginary[bufferFrames / 2];
    COMPLEX_SPLIT out = { .realp = outReal, .imagp = outImaginary };
&lt;/pre&gt;

&lt;p&gt;The input buffer needs to be transformed so that all of the even-numbered elements go into &lt;code&gt;outReal&lt;/code&gt;, and all of the odd-numbered elements go into &lt;code&gt;outImaginary&lt;/code&gt;. vDSP provides a convenient function to do this for us:&lt;/p&gt;

&lt;pre&gt;    vDSP_ctoz((COMPLEX *)data, 2, &amp;amp;out, 1, bufferFrames / 2);
&lt;/pre&gt;

&lt;p&gt;This function is intended to take an array of complex numbers as the input (thus the cast to &lt;code&gt;COMPLEX *&lt;/code&gt;) and produce a split array of the output. By squinting and pretending that the input data is actually complex numbers, this accomplishes the even/odd split that we wanted.&lt;/p&gt;

&lt;p&gt;Performing the FFT itself is actually pretty simple. Just call &lt;code&gt;vDSP_fft_zrip&lt;/code&gt;, give it the &lt;code&gt;fftSetup&lt;/code&gt; object, the array to work on, the stride (how many elements to move forward for each iteration, usually &lt;code&gt;1&lt;/code&gt;), the data length (as a power of two), and specify that you want a forward FFT:&lt;/p&gt;

&lt;pre&gt;    vDSP_fft_zrip(fftSetup, &amp;amp;out, 1, bufferlog2, FFT_FORWARD);
&lt;/pre&gt;

&lt;p&gt;The same function can do both forward (from standard PCM audio to frequencies) and inverse (from frequencies back to raw audio) transforms, which is why this call has to specify &lt;code&gt;FFT_FORWARD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At this point, &lt;code&gt;outReal&lt;/code&gt; and &lt;code&gt;outImaginary&lt;/code&gt; contain the real and imaginary components of the FFT output. Their magnitude (&lt;code&gt;sqrt(real * real + imag * imag)&lt;/code&gt;) tells you how much energy is in a particular frequency bin. Note that the magnitude can be quite high even when the input is restricted to a range of [-1, 1], since very pure tones will deposit all of their energy into a single bin.&lt;/p&gt;
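
&lt;p&gt;As a rough sketch, here is that magnitude calculation applied to every bin with a plain loop. &lt;code&gt;BinMagnitudes&lt;/code&gt; is a made-up name for this example. vDSP's &lt;code&gt;vDSP_zvabs&lt;/code&gt; function should be able to do the same job in one call, but the loop shows what's going on:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;math.h&amp;gt;

    /* Compute the magnitude of each bin from separate real and imaginary
       arrays. `bins` is the number of entries in each array, which would
       be bufferFrames / 2 for the arrays above. */
    void BinMagnitudes(const float *real, const float *imag, float *magnitudes, int bins)
    {
        for(int i = 0; i &amp;lt; bins; i++)
            magnitudes[i] = sqrtf(real[i] * real[i] + imag[i] * imag[i]);
    }
&lt;/pre&gt;

&lt;p&gt;Passing &lt;code&gt;outReal&lt;/code&gt;, &lt;code&gt;outImaginary&lt;/code&gt;, and a &lt;code&gt;bufferFrames / 2&lt;/code&gt; element output array gives one magnitude per bin. The value at index &lt;code&gt;0&lt;/code&gt; needs separate treatment because of the way vDSP packs that entry.&lt;/p&gt;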

&lt;p&gt;Analyzing the magnitudes can make for interesting visualizations. Since human hearing works similarly to the FFT, the result corresponds much more closely with how we hear audio than a simple waveform generated from the PCM data.&lt;/p&gt;

&lt;p&gt;The output has one odd feature. Index &lt;code&gt;0&lt;/code&gt; in the output would normally contain the DC offset, which is always a pure real value with a zero imaginary component. Index &lt;code&gt;bufferFrames / 2&lt;/code&gt; contains the Nyquist frequency, which &lt;em&gt;also&lt;/em&gt;  is a pure real value with zero imaginary component. To save a bit of space, vDSP squashes these two together at index &lt;code&gt;0&lt;/code&gt;. &lt;code&gt;outReal[0]&lt;/code&gt; contains the DC offset, and &lt;code&gt;outImaginary[0]&lt;/code&gt; contains the Nyquist component.&lt;/p&gt;
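
&lt;p&gt;If the packed layout gets in the way, it can be expanded into ordinary arrays with one extra slot. This is only a sketch of the layout described above, and &lt;code&gt;UnpackBins&lt;/code&gt; is a hypothetical helper, not a vDSP function:&lt;/p&gt;

&lt;pre&gt;    /* Expand packed FFT output into half + 1 ordinary bins. The packed
       arrays hold `half` entries each; the output arrays need half + 1. */
    void UnpackBins(const float *packedReal, const float *packedImag,
                    float *real, float *imag, int half)
    {
        /* Bin 0 (DC) and bin `half` (Nyquist) are both purely real. */
        real[0] = packedReal[0];
        imag[0] = 0;
        real[half] = packedImag[0];
        imag[half] = 0;

        /* Everything in between is stored normally. */
        for(int i = 1; i &amp;lt; half; i++)
        {
            real[i] = packedReal[i];
            imag[i] = packedImag[i];
        }
    }
&lt;/pre&gt;

&lt;p&gt;For the arrays above, &lt;code&gt;half&lt;/code&gt; would be &lt;code&gt;bufferFrames / 2&lt;/code&gt;.&lt;/p&gt;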

&lt;p&gt;The FFT output can be altered to, for example, reduce or increase the strength of certain frequencies. You can then transform the result back into raw audio data, which will reflect the alterations. The alterations are just a matter of twiddling around with &lt;code&gt;outReal&lt;/code&gt; and &lt;code&gt;outImaginary&lt;/code&gt;. For example, if you go through and set the first quarter of each array to &lt;code&gt;0&lt;/code&gt;, you'll remove all of the low frequency components from the sound.&lt;/p&gt;
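
&lt;p&gt;Here's a minimal sketch of that kind of alteration. &lt;code&gt;RemoveLowBins&lt;/code&gt; is a made-up name, and it assumes the packed layout described above, so it leaves the imaginary slot at index &lt;code&gt;0&lt;/code&gt; alone:&lt;/p&gt;

&lt;pre&gt;    /* Zero out the lowest `count` bins of packed FFT output in place.
       At index 0 only the DC offset in real[0] is cleared, since imag[0]
       holds the Nyquist component, which is the highest frequency. */
    void RemoveLowBins(float *real, float *imag, int count)
    {
        if(count &amp;lt;= 0)
            return;

        real[0] = 0;
        for(int i = 1; i &amp;lt; count; i++)
        {
            real[i] = 0;
            imag[i] = 0;
        }
    }
&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;RemoveLowBins(outReal, outImaginary, bufferFrames / 8)&lt;/code&gt; clears the first quarter of each array.&lt;/p&gt;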

&lt;p&gt;To reverse the transform, just call &lt;code&gt;vDSP_fft_zrip&lt;/code&gt; again with &lt;code&gt;FFT_INVERSE&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    vDSP_fft_zrip(fftSetup, &amp;amp;out, 1, bufferlog2, FFT_INVERSE);
&lt;/pre&gt;

&lt;p&gt;Then re-interleave the data using &lt;code&gt;vDSP_ztoc&lt;/code&gt;, which just does the opposite of the &lt;code&gt;vDSP_ctoz&lt;/code&gt; call used previously:&lt;/p&gt;

&lt;pre&gt;    float outData[bufferFrames];
    vDSP_ztoc(&amp;amp;out, 1, (COMPLEX *)outData, 2, bufferFrames / 2);
&lt;/pre&gt;

&lt;p&gt;The result of this inverse transformation comes out multiplied by a factor of &lt;code&gt;bufferFrames * 2&lt;/code&gt;, because vDSP leaves its transforms unnormalized for speed. This needs to be removed before treating the result as audio data, unless you like frightening everybody in the restaurant with the sudden blast of sound from your MacBook (ask me how I know this).&lt;/p&gt;

&lt;p&gt;A simple &lt;code&gt;for&lt;/code&gt; loop would do for this, but the &lt;code&gt;cblas_sscal&lt;/code&gt; function will do it faster and nearly as easily. This function is found in &lt;code&gt;Accelerate.framework&lt;/code&gt;, the umbrella framework which also contains vDSP and a bunch of other nifty functionality that I highly recommend you check out. This code simply multiplies every element in the output data by &lt;code&gt;1 / (bufferFrames * 2)&lt;/code&gt; to renormalize it:&lt;/p&gt;

&lt;pre&gt;    cblas_sscal(bufferFrames, 1.0 / (bufferFrames * 2), outData, 1);
&lt;/pre&gt;

&lt;p&gt;At this point, &lt;code&gt;outData&lt;/code&gt; contains the audio that was originally in &lt;code&gt;data&lt;/code&gt;, except with whatever modifications were made to the FFT data in between.&lt;/p&gt;
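
&lt;p&gt;For comparison, the simple &lt;code&gt;for&lt;/code&gt; loop version of the renormalization looks something like this. &lt;code&gt;ScaleBuffer&lt;/code&gt; is a hypothetical name; it should produce the same values as the &lt;code&gt;cblas_sscal&lt;/code&gt; call, just without the vectorization:&lt;/p&gt;

&lt;pre&gt;    /* Multiply every element of `data` by `scale` in place. */
    void ScaleBuffer(float *data, int count, float scale)
    {
        for(int i = 0; i &amp;lt; count; i++)
            data[i] *= scale;
    }
&lt;/pre&gt;

&lt;p&gt;The equivalent call would be &lt;code&gt;ScaleBuffer(outData, bufferFrames, 1.0f / (bufferFrames * 2))&lt;/code&gt;.&lt;/p&gt;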

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Fourier transforms are a deep and complicated subject, and this article barely scratches the surface. The interpretation and modification of the resulting data can get about as complex as you care to go. Just performing and understanding the basics of an FFT can be tough, though, and I hope I've jumped you over that initial hurdle today.&lt;/p&gt;

&lt;p&gt;It's time for everybody to go home now. But before you go, please don't forget to visit the suggestion box. Friday Q&amp;amp;A is driven by reader suggestions, in case you haven't already picked that up, so please &lt;a href="mailto:mike@mikeash.com"&gt;let me know&lt;/a&gt; if you have an idea for a topic that you'd like to see covered in a future article.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-10-26-fourier-transforms-and-ffts.html</guid><pubDate>Fri, 26 Oct 2012 13:25:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-10-12: Obtaining and Interpreting Audio Data
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-10-12-obtaining-and-interpreting-audio-data.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-10-12: Obtaining and Interpreting Audio Data
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 10 12  13 20"
                  tags="fridayqa audio coreaudio"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-10-12: Obtaining and Interpreting Audio Data
&lt;/div&gt;
              &lt;p&gt;Continuing the multimedia data processing trend, today I'm switching over to audio. A reader known as Derek suggested a discussion on how to obtain and interpret audio data, which is what I'll cover today.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Theory&lt;/b&gt;&lt;br&gt;Before we even get to the question of how computers represent sound, we first need to have an idea of just what sound &lt;em&gt;is&lt;/em&gt;, physically. Ultimately, sound is variation in pressure in a medium, usually air, over time.&lt;/p&gt;

&lt;p&gt;That variation can be represented as a function of pressure over time. However, the variations are small. If the function represents absolute pressure, then the variations made by the sound are nearly indistinguishable. If you graphed the function, you'd just see a flat line.&lt;/p&gt;

&lt;p&gt;Better, then, to represent the sound as a function of pressure over time relative to the normal, undisturbed pressure. The range of the function will vary a lot depending on many different factors, so while we're at it, let's just normalize it to be between -1 and 1. Do this, and you get a nice waveform, as can be seen in just about any audio editor.&lt;/p&gt;

&lt;p&gt;Computers can't represent arbitrary continuous functions, so the function has to be discretized somehow. Rather than try to represent every point on the function, the computer simply samples it occasionally. How frequently it gathers those samples is called the &lt;em&gt;sample rate&lt;/em&gt;. The &lt;a href="http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem"&gt;Nyquist sampling theorem&lt;/a&gt; says that a signal sampled at a given rate can represent frequencies up to one half that rate. For example, sampling sound at a rate of 44100 samples/second, or 44.1kHz, allows representing sounds with frequencies up to 22.05kHz, near the top end of the range of human hearing. Because of this, the 44.1kHz sample rate is probably the most common audio sample rate out there.&lt;/p&gt;

&lt;p&gt;The value of each sample also has to be discretized, since computers can't represent arbitrary real numbers with infinite precision. Typically, each sample is stored as a signed, 16-bit integer, where the 16-bit integer range of [-32768, 32767] is mapped onto the conceptual sample range of [-1, 1]. For more fidelity, 24-bit values can be used, or for more compact storage, 8-bit values can be used.&lt;/p&gt;

&lt;p&gt;Floating point is also a common format. This allows direct use of the normalized [-1, 1] range, and good precision. Modern CPUs are fast with floating point, so performance is good, and the data is convenient to work with in code.&lt;/p&gt;
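
&lt;p&gt;Converting between the integer and floating-point representations is a one-liner per sample. Here's a minimal sketch; &lt;code&gt;Int16ToFloat&lt;/code&gt; is a made-up name, and dividing by 32768 is one common convention that I'm assuming here (some code divides by 32767 instead):&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stdint.h&amp;gt;

    /* Convert signed 16-bit samples to floats in the range [-1, 1). */
    void Int16ToFloat(const int16_t *in, float *out, int count)
    {
        for(int i = 0; i &amp;lt; count; i++)
            out[i] = in[i] / 32768.0f;
    }
&lt;/pre&gt;

&lt;p&gt;With this convention, -32768 maps to exactly -1, and the largest positive value, 32767, lands just short of 1.&lt;/p&gt;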

&lt;p&gt;This system of representing audio using a series of samples taken at intervals is called pulse-code modulation, or PCM. "PCM" is often used to refer to this representation of "raw" audio, in contrast to various encodings like MP3 or AAC.&lt;/p&gt;

&lt;p&gt;Probably the most common digital audio representation out there is 44.1kHz, 16-bit audio. This is what CDs hold, and most digital music files. 44.1kHz is enough to represent sounds up to the maximum frequency most people can hear, and 16 bits is generally granular enough to not produce audible noise or distortion, at least for most situations.&lt;/p&gt;

&lt;p&gt;On Apple devices, 32-bit floats are the most common in-memory format for audio, since they can faithfully represent the full range of 24-bit integers, and are convenient to work with.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;A Quick Note on Floating Point&lt;/b&gt;&lt;br&gt;Using a floating-point number to represent only values between -1 and 1 may sound wasteful. After all, a 32-bit &lt;code&gt;float&lt;/code&gt; can represent values between -3.4&amp;times;10&lt;sup&gt;38&lt;/sup&gt; and 3.4&amp;times;10&lt;sup&gt;38&lt;/sup&gt;. Audio uses only a tiny fraction of that range.&lt;/p&gt;

&lt;p&gt;It turns out, however, that restricting the range to [-1, 1] only wastes one bit out of the 32 available. In effect, it's being used as a 31-bit number stored in 32 bits of memory, which doesn't sound so bad at all. This is because of how floating point numbers are stored.&lt;/p&gt;

&lt;p&gt;The short version is that &lt;code&gt;float&lt;/code&gt;s are represented in the form m&amp;times;2&lt;sup&gt;e&lt;/sup&gt;, where &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;e&lt;/code&gt; are stored in the number, along with one bit to indicate the sign. For a 32-bit &lt;code&gt;float&lt;/code&gt;, the exponent (&lt;code&gt;e&lt;/code&gt;) of a normal value can range from -126 to 127. Values with negative exponents represent the range between -1 and 1. That range restriction simply cuts the exponent's range in half, which equates to giving up only a single bit of its value.&lt;/p&gt;
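
&lt;p&gt;You can check this by pulling the exponent out of a &lt;code&gt;float&lt;/code&gt; directly. This sketch assumes the usual IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits), and &lt;code&gt;FloatExponent&lt;/code&gt; is a name invented for the example:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stdint.h&amp;gt;
    #include &amp;lt;string.h&amp;gt;

    /* Return the exponent e of a float, assuming IEEE 754 single precision.
       Only meaningful for normal values, not zero, denormals, infinity, or NaN. */
    int FloatExponent(float x)
    {
        uint32_t bits;
        memcpy(&amp;amp;bits, &amp;amp;x, sizeof(bits));

        /* Shift away the mantissa, mask off the sign, and remove the bias. */
        return (int)((bits &amp;gt;&amp;gt; 23) &amp;amp; 0xFF) - 127;
    }
&lt;/pre&gt;

&lt;p&gt;Any nonzero sample strictly between -1 and 1 comes back with a negative exponent: 0.5 gives -1 and -0.25 gives -2, while 1.0 gives 0.&lt;/p&gt;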

&lt;p&gt;For more details on the floating-point representation, see &lt;a href="friday-qa-2011-01-04-practical-floating-point.html"&gt;my previous article on floating point arithmetic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Stereo&lt;/b&gt;&lt;br&gt;You may have noticed that most people have two ears. Because of this, sound recorded as two separate streams sounds nicer to most people than a single stream. Conceptually, this audio can be thought of as two separate functions of pressure over time.&lt;/p&gt;

&lt;p&gt;To represent stereo sound in data, those two functions have to be represented simultaneously. The most common way is to simply interleave the two channels, so that the first value in a buffer would be the left channel, the second value the right channel, then left again, etc. In memory, it would look like:&lt;/p&gt;

&lt;pre&gt;    LRLRLRLRLRLRLRLRLR
&lt;/pre&gt;

&lt;p&gt;It's also possible to simply use two completely different buffers, which just looks like:&lt;/p&gt;

&lt;pre&gt;    buffer 1: LLLLLLLLLL
    buffer 2: RRRRRRRRRR
&lt;/pre&gt;

&lt;p&gt;Deinterleaved data like this can be more convenient to work with, but the interleaved representation is more commonly used simply because it keeps everything in one place.&lt;/p&gt;
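
&lt;p&gt;Going from one layout to the other is just a matter of shuffling values around. As a rough sketch, with &lt;code&gt;DeinterleaveStereo&lt;/code&gt; being a made-up helper name:&lt;/p&gt;

&lt;pre&gt;    /* Split interleaved stereo (LRLRLR...) into separate left and right
       buffers. `frames` is the number of left/right pairs, so `interleaved`
       holds frames * 2 values. */
    void DeinterleaveStereo(const float *interleaved, float *left, float *right, int frames)
    {
        for(int i = 0; i &amp;lt; frames; i++)
        {
            left[i] = interleaved[i * 2];
            right[i] = interleaved[i * 2 + 1];
        }
    }
&lt;/pre&gt;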

&lt;p&gt;&lt;b&gt;Obtaining Audio Data&lt;/b&gt;&lt;br&gt;There is no audio equivalent to Cocoa's &lt;code&gt;NSImage&lt;/code&gt; class. The &lt;code&gt;NSSound&lt;/code&gt; class may seem promising, but it's &lt;em&gt;extremely&lt;/em&gt; limited and provides no way to extract the underlying audio data.&lt;/p&gt;

&lt;p&gt;Instead, I'll drop down to Core Audio, which excels at this kind of thing. It's not the nicest API in the world, but it's entirely capable. I believe the newer AV Foundation APIs are also capable of extracting raw audio data, but for a basic task like this, Core Audio does just fine.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/ExtendedAudioFileServicesReference/Reference/reference.html"&gt;Extended Audio File Services API&lt;/a&gt; does exactly this. Given a file, it produces the raw audio data contained within.&lt;/p&gt;

&lt;p&gt;The first thing to do is to create an &lt;code&gt;ExtAudioFileRef&lt;/code&gt; pointing at the file we're interested in. The &lt;code&gt;ExtAudioFileOpenURL&lt;/code&gt; function takes a &lt;code&gt;CFURLRef&lt;/code&gt; and creates an audio file object for it. It returns an error value, with the audio file object returned by reference in one of the parameters:&lt;/p&gt;

&lt;pre&gt;    NSURL *urlToFile = ...;
    ExtAudioFileRef af = NULL;
    OSStatus err = ExtAudioFileOpenURL((__bridge CFURLRef)urlToFile, &amp;amp;af);
    if(err != noErr)
    {
        // Handle the error here
    }
&lt;/pre&gt;

&lt;p&gt;It's important to check errors! Code like this is really easy to mess up, and if you write code that ignores errors, it can add hours to your debugging for no good reason. Always check the error from any function that returns them.&lt;/p&gt;

&lt;p&gt;The next step is to tell the audio file object what kind of in-memory format we want for the audio. This is done using an &lt;code&gt;AudioStreamBasicDescription&lt;/code&gt;, or ASBD. This is a structure which contains fields for the sample rate, number of channels, etc. Here is an ASBD that asks for 44.1kHz, mono, PCM audio using &lt;code&gt;float&lt;/code&gt; to store the samples:&lt;/p&gt;

&lt;pre&gt;    AudioStreamBasicDescription clientASBD = {
        .mSampleRate = 44100,
        .mFormatID = kAudioFormatLinearPCM,
        .mFormatFlags = kAudioFormatFlagsNativeFloatPacked,
        .mBitsPerChannel = sizeof(float) * CHAR_BIT,
        .mChannelsPerFrame = 1,
        .mFramesPerPacket = 1,
        .mBytesPerFrame = sizeof(float),
        .mBytesPerPacket = sizeof(float)
    };
&lt;/pre&gt;

&lt;p&gt;This structure contains what appears to be redundant information, in the form of the channels and bytes per frame/packet. These fields exist for the benefit of non-PCM formats, and because the ASBD &lt;code&gt;struct&lt;/code&gt; is used in many different situations. To understand the meaning of these fields, here are some quick definitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sample:&lt;/strong&gt; a single number representing the value of one audio channel at one point in time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frame:&lt;/strong&gt; a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Packet:&lt;/strong&gt; a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many audio formats use packets that are considerably longer than a single frame. MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all. For the ASBD's purpose, the packet size is equal to the frame size.&lt;/p&gt;

&lt;p&gt;Note that the above definitions are a bit loose, and people often use these terms in different ways. However, these definitions are how Core Audio uses the words, which is what counts when we're writing code for that API!&lt;/p&gt;

&lt;p&gt;With the structure filled out, the code uses it to tell the audio file object what kind of data we want:&lt;/p&gt;

&lt;pre&gt;    err = ExtAudioFileSetProperty(af, kExtAudioFileProperty_ClientDataFormat, sizeof(clientASBD), &amp;amp;clientASBD);
    if(err != noErr)
    {
        // Handle the error
    }
&lt;/pre&gt;

&lt;p&gt;Once again, error handling is important. It's easy to build an ASBD that Core Audio doesn't like, and catching that error early will save a lot of pain.&lt;/p&gt;

&lt;p&gt;You won't always want to specify every detail of the audio format like this. For many purposes, you'll want to use the sample rate of the data in the file rather than specifying one and having Core Audio resample the audio if necessary. Likewise, you'll often want to use the number of channels present in the file rather than simply requesting or forcing a certain number of channels. This can be done simply by calling &lt;code&gt;ExtAudioFileGetProperty&lt;/code&gt; and getting the &lt;code&gt;kExtAudioFileProperty_FileDataFormat&lt;/code&gt; property. This will return an ASBD describing the file's format, and the file's sample rate and number of channels can easily be extracted from that.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reading&lt;/b&gt;&lt;br&gt;With the audio file object set up, it's time to start reading data from it.&lt;/p&gt;

&lt;p&gt;Since audio data is a stream, and audio files can be huge, it's common to read from them only a piece at a time and move on, rather than trying to read the entire thing into memory at once. Audio data can be pretty big, especially once decoded into memory. A typical 1MB MP3 will decode to about 10MB in memory using the audio format specified in this code, and twice that for stereo.&lt;/p&gt;

&lt;p&gt;To that end, we'll define a fixed-size buffer to read audio data into:&lt;/p&gt;

&lt;pre&gt;    int bufferFrames = 4096;
    float data[bufferFrames];
&lt;/pre&gt;

&lt;p&gt;For multi-channel audio, the array size needs to be multiplied by the number of channels. We're just reading mono, though, so the array size is equal to the number of frames.&lt;/p&gt;
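
&lt;p&gt;The memory figures above are easy to estimate. Here's a minimal sketch of the arithmetic, with &lt;code&gt;DecodedByteCount&lt;/code&gt; being a hypothetical helper:&lt;/p&gt;

&lt;pre&gt;    #include &amp;lt;stddef.h&amp;gt;

    /* Bytes needed to hold decoded float PCM audio of the given duration. */
    size_t DecodedByteCount(double seconds, double sampleRate, int channels)
    {
        size_t frames = (size_t)(seconds * sampleRate);
        return frames * channels * sizeof(float);
    }
&lt;/pre&gt;

&lt;p&gt;Assuming a 128kbit/s MP3, which is my guess at a typical bitrate, a 1MB file holds about 62.5 seconds of audio. At 44.1kHz mono with 4-byte &lt;code&gt;float&lt;/code&gt; samples, that decodes to 11,025,000 bytes, which is in line with the rough 10MB figure.&lt;/p&gt;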

&lt;p&gt;The function to actually read an audio file takes an &lt;code&gt;AudioBufferList&lt;/code&gt;, which is just a structure containing an array of &lt;code&gt;AudioBuffer&lt;/code&gt; structures. This code constructs an &lt;code&gt;AudioBuffer&lt;/code&gt; for the above array, and an &lt;code&gt;AudioBufferList&lt;/code&gt; containing a single entry for that buffer:&lt;/p&gt;

&lt;pre&gt;    AudioBuffer buffer = {
        .mNumberChannels = 1,
        .mDataByteSize = sizeof(data),
        .mData = data
    };

    AudioBufferList bufferList;
    bufferList.mNumberBuffers = 1;
    bufferList.mBuffers[0] = buffer;
&lt;/pre&gt;

&lt;p&gt;With that in place, it's time to actually read the data. The read function takes an odd shortcut, where it takes a pointer to a number of frames to read, and then sets that variable to the number of frames actually read. It returns an error just like the rest of these functions:&lt;/p&gt;

&lt;pre&gt;    UInt32 ioFrames = bufferFrames; /* Request reading 4096 frames. */
    err = ExtAudioFileRead(af, &amp;amp;ioFrames, &amp;amp;bufferList);
    if(err != noErr)
    {
        // Handle the error here
    }
&lt;/pre&gt;

&lt;p&gt;At this point, &lt;code&gt;ioFrames&lt;/code&gt; contains the number of frames actually read. The data itself can be found in &lt;code&gt;data[0]&lt;/code&gt;, &lt;code&gt;data[1]&lt;/code&gt;, ..., &lt;code&gt;data[ioFrames - 1]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To read the entire file, simply run the above in a loop until &lt;code&gt;ioFrames&lt;/code&gt; comes out &lt;code&gt;0&lt;/code&gt;, signaling the end of the file.&lt;/p&gt;

&lt;p&gt;Once you're done, don't forget to clean up the audio file object:&lt;/p&gt;

&lt;pre&gt;    if(af)
    {
        err = ExtAudioFileDispose(af);
        if(err != noErr)
        {
            // How do you handle an error from a dispose function?
        }
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Interpreting the Data&lt;/b&gt;&lt;br&gt;With the audio data in a buffer, the program can read elements out of the buffer to get audio samples. However, interpreting that data can get tricky. Unlike images, where a single pixel has meaning, a single audio sample has essentially no meaning on its own. Audio is fundamentally a result of change over time, so looking at a single sample of, say, &lt;code&gt;0.5&lt;/code&gt; doesn't tell us anything. Contrast this with an image, where a single pixel with an RGB value of &lt;code&gt;(255, 0, 0)&lt;/code&gt; tells us that this pixel is red, even if it doesn't tell us anything about the rest of the image.&lt;/p&gt;

&lt;p&gt;Interpreting audio data can get extremely difficult, but we can at least cover some basics here.&lt;/p&gt;

&lt;p&gt;The sample values are simply deviations from the center. If you wanted to render a visual representation of the waveform, for example, you could simply transform the sample values to vertical offsets from the centerline of your waveform view. &lt;a href="http://supermegaultragroovy.com/2009/10/06/drawing-waveforms/"&gt;There can be a lot more to it&lt;/a&gt;, but for the basic idea, you can simply generate &lt;code&gt;(x, y)&lt;/code&gt; values like so:&lt;/p&gt;

&lt;pre&gt;    x = sampleNumber;
    y = data[sampleNumber] * viewHeight / 2 + viewHeight / 2;
&lt;/pre&gt;

&lt;p&gt;It can be interesting and useful to figure out how loud a particular piece of audio is. Loudness is typically measured in decibels, which is a logarithmic scale. It's calculated using the base-10 logarithm of the ratio of the audio's power to a reference power level. The pure logarithm produces a value in &lt;em&gt;bels&lt;/em&gt;, and to obtain decibels, simply multiply that number by 10. The power level of a piece of audio is proportional to the square of the amplitude, and the amplitude is exactly what the individual audio samples describe.&lt;/p&gt;

&lt;p&gt;To compute the total power level of a piece of audio, simply average the squares of all the samples:&lt;/p&gt;

&lt;pre&gt;    float accumulator = 0;
    for(int i = 0; i &amp;lt; frames; i++)
        accumulator += data[i] * data[i];
    float power = accumulator / frames;
&lt;/pre&gt;

&lt;p&gt;For computer audio, the reference power level is typically &lt;code&gt;1.0&lt;/code&gt;, which is the loudest possible. To compute decibels, take the base-10 logarithm of the computed power divided by the reference power, then multiply the result by 10. Dividing by &lt;code&gt;1.0&lt;/code&gt; does nothing, so it's omitted from the calculation:&lt;/p&gt;

&lt;pre&gt;    float decibels = 10 * log10f(power);
&lt;/pre&gt;

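&lt;p&gt;Mathematically astute readers will notice that squaring the amplitude when calculating the power is equivalent to simply calculating &lt;code&gt;20 * log10f(averageAmplitude)&lt;/code&gt;, which can simplify things slightly. For the two to match exactly, the average has to be a root-mean-square average, that is, the square root of &lt;code&gt;power&lt;/code&gt;. A plain average of the amplitudes gives a similar but not identical figure, and if you go that route, don't forget to take the absolute value of the amplitudes when calculating the average, because otherwise the samples are likely to cancel each other out.&lt;/p&gt;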

&lt;p&gt;The resulting decibel value is &lt;em&gt;negative&lt;/em&gt;, which may be confusing if you're used to seeing decibels written as a positive value. It's important to understand that decibels are a &lt;em&gt;relative&lt;/em&gt; measure, and need some sort of reference power or amplitude to be meaningful. When audible sound levels are expressed in decibels, the reference level is a standard number which is roughly the quietest sound that a normal human can hear. If a sound is described as 10dB, that means that it's 10 times more powerful than the quietest perceptible sound. 20dB means that it's 100 times more powerful, etc.&lt;/p&gt;

&lt;p&gt;For computer audio, the maximum possible output power is typically used as the reference level, because there is no meaningful minimum level to compare with. An audio file filled with zeros has zero output power, and attempting to compute a ratio with that would divide by zero. The result is that, when talking about computer audio, 0dB is the maximum level possible, and negative values are quieter. If audio is described as being at a level of -30dB, that means that it's 1000 times less powerful than this reference maximum value.&lt;/p&gt;

&lt;p&gt;To adjust the volume of audio, simply multiply each sample by a fixed gain. The gain can be calculated from a desired change in decibels by reversing the formula above. For example, to achieve a volume increase of 30dB, multiply the power by &lt;code&gt;1000&lt;/code&gt;, which is equivalent to a gain of &lt;code&gt;sqrt(1000)&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    float decibelsAdjust = ...;
    float gain = pow(10, decibelsAdjust / 20);
    for(int i = 0; i &amp;lt; frames; i++)
        data[i] *= gain;
&lt;/pre&gt;

&lt;p&gt;Note that this works equally well for negative values, which cause a decrease in volume by the specified number of decibels.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Writing Audio Out&lt;/b&gt;&lt;br&gt;You've done a volume adjustment or maybe some other manipulation to the audio data, and you want to get audio back out of your program. The &lt;code&gt;ExtAudioFile&lt;/code&gt; APIs make this easy.&lt;/p&gt;

&lt;p&gt;The first thing to do is to create a new audio file and a new audio file object all at once. This is done by calling &lt;code&gt;ExtAudioFileCreateWithURL&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;        ExtAudioFileRef outAF = NULL;
        err = ExtAudioFileCreateWithURL((__bridge CFURLRef)outURL, kAudioFileCAFType, &amp;amp;clientASBD, NULL, kAudioFileFlags_EraseFile, &amp;amp;outAF);
        if(err != noErr)
            // You should know what to do here by now
&lt;/pre&gt;

&lt;p&gt;This function takes several more parameters than &lt;code&gt;ExtAudioFileOpenURL&lt;/code&gt;. The first parameter is a URL to the file to create. The second one is the file type to create. &lt;code&gt;ExtAudioFile&lt;/code&gt; &lt;a href="https://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/AudioFileConvertRef/Reference/reference.html#//apple_ref/c/tdef/AudioFileTypeID"&gt;supports many different formats&lt;/a&gt;. I chose CAF, as it's a simple format that stores raw PCM samples with a minimum of fuss, and was designed specifically to be nice to use with Core Audio.&lt;/p&gt;

&lt;p&gt;The third parameter is the format that will be used for audio within the file. By default, it's also used as the in-memory audio format. For convenience, I'm using the same format as the in-memory format specified earlier. In more realistic situations, you'd likely want to specify an in-file format that's more compact or useful (e.g. 16-bit integer), then use the &lt;code&gt;kExtAudioFileProperty_ClientDataFormat&lt;/code&gt; property to specify the in-memory format, just as we did when reading.&lt;/p&gt;

&lt;p&gt;The fourth parameter is an audio channel layout struct, which is optional and only needed for more advanced uses. For this, I leave it &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fifth parameter tells the system what to do if the file already exists. I want it to erase any existing file and make a new one, so I pass &lt;code&gt;kAudioFileFlags_EraseFile&lt;/code&gt;. The last parameter is a pointer to where the newly created audio file object will be stored.&lt;/p&gt;

&lt;p&gt;Writing audio to this file is almost exactly the same as reading it. You need a buffer of data, an &lt;code&gt;AudioBuffer&lt;/code&gt; struct, and an &lt;code&gt;AudioBufferList&lt;/code&gt;. Then simply call &lt;code&gt;ExtAudioFileWrite&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    err = ExtAudioFileWrite(outAF, ioFrames, &amp;amp;bufferList);
    if(err != noErr)
        // Guess what
&lt;/pre&gt;

&lt;p&gt;If you stick this into the same loop as the audio reading code from above, you'll get a program that creates a new CAF audio file using the contents from an existing audio file. If you modify the contents of the &lt;code&gt;data&lt;/code&gt; buffer before writing, the new audio file will contain your modified contents.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;The world of audio can be strange and mysterious, but the basics are pretty simple. The &lt;code&gt;ExtAudioFile&lt;/code&gt; APIs make it easy, relatively speaking, to get at the audio data of a file. That audio data is represented as a series of samples in memory between -1 and 1. If you want to make changes and build a new file, &lt;code&gt;ExtAudioFile&lt;/code&gt; makes it easy to write the new audio back out to disk.&lt;/p&gt;

&lt;p&gt;That's it for today. You probably know by now that Friday Q&amp;amp;A is driven by reader submissions, so as always, please &lt;a href="mailto:mike@mikeash.com"&gt;send in your ideas for topics&lt;/a&gt; if you have a subject you'd like to see covered here.&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-10-12-obtaining-and-interpreting-audio-data.html</guid><pubDate>Fri, 12 Oct 2012 13:20:00 GMT</pubDate></item><item><title>Friday Q&amp;amp;A 2012-09-28: Optimizing Flood Fill
</title><link>http://www.mikeash.com/pyblog/friday-qa-2012-09-28-optimizing-flood-fill.html</link><description>&lt;html&gt;
              &lt;head&gt;
              &lt;title&gt;Friday Q&amp;amp;A 2012-09-28: Optimizing Flood Fill
&lt;/title&gt;
              &lt;blogattrs
                  postdate="2012 09 28  13 17"
                  tags="fridayqa cocoa image"
                  /&gt;
              &lt;/head&gt;
              &lt;body&gt;
              &lt;!-- enable-comments --&gt;
              &lt;div class="blogtitle"&gt;Friday Q&amp;amp;A 2012-09-28: Optimizing Flood Fill
&lt;/div&gt;
              &lt;p&gt;&lt;a href="friday-qa-2012-09-14-implementing-a-flood-fill.html"&gt;Last time&lt;/a&gt;, I presented the implementation of a basic flood fill. That implementation works fine, but is not terribly fast. Today, I'm going to walk through various optimizations that can be done to make it faster, using techniques which I hope will be useful in many different situations. Ultimately, starting from the original code, I was able to achieve over an order of magnitude speedup while still keeping the code relatively straightforward and clear.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Measurement&lt;/b&gt;&lt;br&gt;A prerequisite for any optimization work is the ability to measure the code in question. In this case, we're optimizing for speed, so we have to measure how fast the code is. If you can't measure it, it's nearly impossible to make it faster, because you have only a vague idea how much difference any given change makes, or whether it's even an improvement at all. Lots of seemingly-obvious optimizations can end up making code slower, so it's important to have real data.&lt;/p&gt;

&lt;p&gt;If at all possible, it's best to have a consistent scenario that you can run in an automated fashion. If you can simply push a button and get a number out that says how fast the code is, that's ideal. Anything that requires manual setup will add a lot of work. Anything that requires manual &lt;em&gt;inputs&lt;/em&gt; to the process is begging for trouble: if your inputs vary from one run to the next, you can't tell what changes are due to your code and what changes are due to your changing inputs! This can be important for UI-heavy code where the user's specific actions can heavily influence how much code runs and when it runs. Whenever possible, isolate the code in question away from the user and pass it pre-baked input.&lt;/p&gt;

&lt;p&gt;For the flood fill, we need an image to fill. I wrote a quick function that creates an &lt;code&gt;NSBitmapImageRep&lt;/code&gt; using &lt;code&gt;NSGraphicsContext&lt;/code&gt; to draw into it, building a square covering most of the image, with some diagonal lines to ensure that the flood fill has something interesting to work on. Here's the code for that:&lt;/p&gt;

&lt;pre&gt;    NSBitmapImageRep *TestImageRep(int width, int height)
    {
        NSBitmapImageRep *rep = [[NSBitmapImageRep alloc]
                                 initWithBitmapDataPlanes: NULL
                                 pixelsWide: width
                                 pixelsHigh: height
                                 bitsPerSample: 8
                                 samplesPerPixel: 4
                                 hasAlpha: YES
                                 isPlanar: NO
                                 colorSpaceName: NSCalibratedRGBColorSpace
                                 bytesPerRow: width * 4
                                 bitsPerPixel: 32];

        NSGraphicsContext *ctx = [NSGraphicsContext graphicsContextWithBitmapImageRep: rep];

        [NSGraphicsContext saveGraphicsState];
        [NSGraphicsContext setCurrentContext: ctx];

        NSRect r;

        [[NSColor whiteColor] setFill];
        r = NSMakeRect(0, 0, width, height);
        NSRectFill(r);

        [[NSColor blackColor] setFill];
        r = NSMakeRect(width / 8, height / 8, width * 3 / 4, height * 3 / 4);
        NSRectFill(r);

        [[NSColor greenColor] setStroke];
        for(int i = 0; i &amp;lt; 4; i++)
        {
            NSPoint p1 = NSMakePoint(width * (i + 1) / 5, height / 5);
            NSPoint p2 = NSMakePoint(width / 5, height * (i + 1) / 5);
            [NSBezierPath strokeLineFromPoint: p1 toPoint: p2];
        }

        [ctx flushGraphics];
        [NSGraphicsContext restoreGraphicsState];

        return rep;
    }
&lt;/pre&gt;

&lt;p&gt;This function can then be used to create the image that each flood fill test will operate on, to make each test totally consistent. The dimensions are left up to the caller here, and I decided to test on images of &lt;code&gt;10000&lt;/code&gt; pixels square, which is hefty enough for the operation to take a substantial amount of time, but still fast enough to make for a decent testing cycle.&lt;/p&gt;

&lt;p&gt;For the measurement itself, I wrote a function that takes a block, runs it several times, and measures how fast it runs. It's not generally a good idea to only run the code once, because transient events can strongly influence how long a particular run takes. For example, a background process might suddenly allocate a bunch of memory, causing your process's memory to be swapped out to disk, then back in again, making it run much slower than usual. Running several times helps even out these transients.&lt;/p&gt;

&lt;p&gt;The next question is, how do you distill the results from all of the runs into a single number you can compare? The most obvious way is to simply take an average, but this isn't necessarily appropriate. For computation-heavy code like this, there are a lot of random events that could pop up to make it run slower than usual, but essentially nothing could happen to make it run &lt;em&gt;faster&lt;/em&gt; than usual.&lt;/p&gt;

&lt;p&gt;Given that, the obvious choice is to use the fastest run out of all the trials. However, having an ingrained distrust of outliers, I opted for the second-fastest run instead. It should provide essentially the same result.&lt;/p&gt;

&lt;p&gt;Here's the function I wrote to do the test. It takes a name and a block and runs the block ten times. It measures each invocation's running time, and tracks all of the running times. Once all of the runs are complete, it grabs the second-best time out of the array and returns it to the caller:&lt;/p&gt;

&lt;pre&gt;    NSArray *Time(NSString *name, void (^block)(void))
    {
        NSMutableArray *times = [NSMutableArray array];
        int trials = 10;
        for(int i = 0; i &amp;lt; trials; i++)
        {
            @autoreleasepool
            {
                NSLog(@"Running trial %d of %d", i + 1, trials);
                NSTimeInterval start = [[NSProcessInfo processInfo] systemUptime];
                block();
                NSTimeInterval end = [[NSProcessInfo processInfo] systemUptime];
                NSLog(@"Done, total elapsed time is %f seconds", end - start);
                [times addObject: @(end - start)];
            }
        }
        NSArray *sortedTimes = [times sortedArrayUsingSelector: @selector(compare:)];
        NSTimeInterval secondBest = [sortedTimes[1] doubleValue];
        NSLog(@"%@: Second best time is %f, times are %@", name, secondBest, sortedTimes);

        return @[name, @(secondBest)];
    }
&lt;/pre&gt;

&lt;p&gt;It also returns the name to the caller, purely to make it easier on the caller when aggregating the results.&lt;/p&gt;

&lt;p&gt;With the testing infrastructure in place, we can measure the performance of the original &lt;code&gt;FloodFill&lt;/code&gt; function. The result on my computer is roughly 26.2 seconds. Tolerable, for such a gigantic image, but not great.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Setting a Lower Bound&lt;/b&gt;&lt;br&gt;I decided to write a simple fill routine to set a bound on the speed of the flood fill function. It simply scans &lt;em&gt;every&lt;/em&gt; pixel of the image and fills the pixel if it's within the threshold from the starting pixel. It will fill areas that are not contiguous with the starting pixel, so it's not useful as a flood fill. However, it's fast, and since it touches every pixel in the image, it gives a pretty decent lower bound. Since the flood fill area in the test image covers most of it, it's unlikely that any flood fill could ever surpass the speed of this dumb whole-image scan:&lt;/p&gt;

&lt;pre&gt;    void CheckAll(struct Pixel *image, int width, int height, int startx, int starty, struct Pixel fillValue, int threshold)
    {
        struct Pixel startPixel = PIXEL(startx, starty);

        for(int i = 0; i &amp;lt; width * height; i++)
        {
            int diff = PixelDiff(startPixel, image[i]);
            if(diff &amp;lt;= threshold)
                image[i] = fillValue;
        }
    }
&lt;/pre&gt;

&lt;p&gt;Running this through the tester gives a run time of 0.5 seconds. Clearly, there's substantial room for improvement over the 26.2 second running time of the original flood fill.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Profiling&lt;/b&gt;&lt;br&gt;Measuring the running time of the function is important to optimization, but to actually perform useful optimizations we need more specific data. We need to profile the code to see where the time is being spent. Although the much-beloved &lt;a href="friday-qa-2009-02-06-profiling-with-shark.html"&gt;Shark&lt;/a&gt; is dead, Instruments is a decent substitute. In particular, the Time Profiler instrument provides far and away the best view of just where code is spending its time.&lt;/p&gt;

&lt;p&gt;The Time Profiler instrument produces a call tree. Expanding it out, it shows each function under the "Symbol Name" column, and the amount and percentage of time spent in that function under the "Running Time" column.&lt;/p&gt;

&lt;p&gt;To get more detail, you can double-click any function, and it will display that function's code and a percentage of time spent on each line. This is extremely useful when you have a slow function that's complex enough that you don't know exactly which part is slow.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Pixels Visited Bitmap&lt;/b&gt;&lt;br&gt;Profiling the original flood fill function, the main source of slowness is immediately obvious. A huge amount of time is spent checking for membership in &lt;code&gt;NSIndexSet&lt;/code&gt; and adding new indexes to the set. There are two &lt;code&gt;NSIndexSet&lt;/code&gt;s in play here, but it seems fairly clear that the worst offender is the &lt;code&gt;pixelsSeen&lt;/code&gt; set, which tracks the pixels that have already been seen and shouldn't be checked again. Over a quarter of the function's running time is spent in &lt;code&gt;containsIndex:&lt;/code&gt;, which is only sent to &lt;code&gt;pixelsSeen&lt;/code&gt;. The first order of business is to replace &lt;code&gt;pixelsSeen&lt;/code&gt; with something faster.&lt;/p&gt;

&lt;p&gt;This variable basically stores a single bit of information per pixel, namely whether that pixel has been visited or not. A natural way to store this information is with a bitmap. Essentially, make a new image with one bit per pixel that corresponds exactly to the image being filled. To mark a pixel as visited, set the bit to &lt;code&gt;1&lt;/code&gt;. To test a pixel, just grab the corresponding bit.&lt;/p&gt;

&lt;p&gt;C doesn't offer bit-addressed arrays, but no matter. We can use an array of bytes, and do some bit twiddling to get at the individual bits. For a given index, the byte to access is just &lt;code&gt;index / 8&lt;/code&gt;, and the bit is just &lt;code&gt;index % 8&lt;/code&gt;. I used the &lt;code&gt;CHAR_BIT&lt;/code&gt; macro to be clean, even though the odds of running this code on a system where &lt;code&gt;CHAR_BIT != 8&lt;/code&gt; are only marginally better than the odds of winning the lottery while being simultaneously struck by lightning and being eaten by a shark.&lt;/p&gt;

&lt;p&gt;With the byte and bit index, a bit of bitshifting and a bitwise &lt;code&gt;or&lt;/code&gt; allows setting a bit within the bitmap:&lt;/p&gt;

&lt;pre&gt;    static inline void AddToBitmap(int x, int y, uint8_t *bitmap, int width)
    {
        int index = x + y * width;
        int bytes = index / CHAR_BIT;
        int bits = index % CHAR_BIT;
        uint8_t bit = 1 &amp;lt;&amp;lt; bits;
        bitmap[bytes] |= bit;
    }
&lt;/pre&gt;

&lt;p&gt;Similarly, a bitwise &lt;code&gt;and&lt;/code&gt; allows checking whether the bit is set:&lt;/p&gt;

&lt;pre&gt;    static inline BOOL CheckBitmap(int x, int y, uint8_t *bitmap, int width)
    {
        int index = x + y * width;
        int bytes = index / CHAR_BIT;
        int bits = index % CHAR_BIT;
        uint8_t bit = 1 &amp;lt;&amp;lt; bits;
        return bitmap[bytes] &amp;amp; bit ? YES : NO;
    }
&lt;/pre&gt;

&lt;p&gt;The flood fill function must then allocate the bitmap when it starts up. It has to allocate a chunk of memory whose size is equal to the number of pixels in the image divided by eight, rounded up. This bit of code handles that:&lt;/p&gt;

&lt;pre&gt;    int length = width * height;
    int bytes = (length + CHAR_BIT - 1) / CHAR_BIT;
    uint8_t *pixelsSeen = calloc(1, bytes);
&lt;/pre&gt;

&lt;p&gt;The uses of &lt;code&gt;pixelsSeen&lt;/code&gt; then get replaced with calls to &lt;code&gt;AddToBitmap&lt;/code&gt; and &lt;code&gt;CheckBitmap&lt;/code&gt;. Finally, we free the bitmap at the end of the function:&lt;/p&gt;

&lt;pre&gt;    free(pixelsSeen);
&lt;/pre&gt;

&lt;p&gt;Running the new version, the running time for the flood fill is down to 12.6 seconds. This one change made things over twice as fast!&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Pixel Queue&lt;/b&gt;&lt;br&gt;Profiling the new code shows that &lt;code&gt;NSIndexSet&lt;/code&gt; is once again taking up most of the time. The &lt;code&gt;pixelsToExamine&lt;/code&gt; variable is the only &lt;code&gt;NSIndexSet&lt;/code&gt; remaining, so it clearly must be the culprit. &lt;code&gt;NSIndexSet&lt;/code&gt; is fast enough to be acceptable here, but it's clearly not optimized for this particular use. This is only to be expected, since this is definitely not the use it was built for.&lt;/p&gt;

&lt;p&gt;I'll replace the &lt;code&gt;NSIndexSet&lt;/code&gt; with a simple stack of coordinates. This will be a big array and an index indicating the top of the array. Adding a coordinate will mean simply setting the element at the top of the array and incrementing the index. Retrieving the next coordinate to examine is simply a matter of decrementing the index and grabbing the value at the top of the array. Allocating enough memory to hold a coordinate for every pixel is impractical, so instead I'll dynamically increase the size of the array when required.&lt;/p&gt;

&lt;p&gt;First, I make a variable to hold the current array length. This lets the code know when it hits the end of the array and needs to reallocate:&lt;/p&gt;

&lt;pre&gt;    int pixelsToExamineLength = 128;
&lt;/pre&gt;

&lt;p&gt;I chose &lt;code&gt;128&lt;/code&gt; simply because it's a decent middle size, and the number sounded nice. The exact number is probably not too important, as long as it's not &lt;code&gt;0&lt;/code&gt; and not so large that it consumes an excessive amount of memory for no reason.&lt;/p&gt;

&lt;p&gt;Next, I make a simple &lt;code&gt;Coordinate&lt;/code&gt; structure, a variable to point to the array of coordinates, and allocate memory for it:&lt;/p&gt;

&lt;pre&gt;    struct Coordinate { int x, y; };
    struct Coordinate *pixelsToExamine = malloc(sizeof(*pixelsToExamine) * pixelsToExamineLength);
&lt;/pre&gt;

&lt;p&gt;Then, there's a variable to hold the index of the current top of the stack:&lt;/p&gt;

&lt;pre&gt;    int pixelsToExamineIndex = 0;
&lt;/pre&gt;

&lt;p&gt;Finally, the starting pixel is added to the stack:&lt;/p&gt;

&lt;pre&gt;    pixelsToExamine[pixelsToExamineIndex++] = (struct Coordinate){ startx, starty };
&lt;/pre&gt;

&lt;p&gt;The top of the loop now examines and pulls from the array:&lt;/p&gt;

&lt;pre&gt;    while(pixelsToExamineIndex &amp;gt; 0)
    {
        struct Coordinate coordinate = pixelsToExamine[--pixelsToExamineIndex];

        int x = coordinate.x;
        int y = coordinate.y;
&lt;/pre&gt;

&lt;p&gt;When adding a new coordinate to the array, it first checks the index to see if the array is full and needs to be reallocated. If it's full, it just doubles the length and calls &lt;code&gt;realloc&lt;/code&gt; to allocate more memory:&lt;/p&gt;

&lt;pre&gt;    if(pixelsToExamineIndex &amp;gt;= pixelsToExamineLength)
    {
        pixelsToExamineLength *= 2;
        pixelsToExamine = realloc(pixelsToExamine, sizeof(*pixelsToExamine) * pixelsToExamineLength);
    }
&lt;/pre&gt;

&lt;p&gt;Once enough memory is available, it simply stores the desired coordinate into the top of the array and increments the index:&lt;/p&gt;

&lt;pre&gt;    pixelsToExamine[pixelsToExamineIndex++] = (struct Coordinate){ nextX, nextY };
&lt;/pre&gt;

&lt;p&gt;Don't forget to deallocate the memory at the end of the function:&lt;/p&gt;

&lt;pre&gt;    free(pixelsToExamine);
&lt;/pre&gt;

&lt;p&gt;How did we do? Running this new flood fill function takes about 4.8 seconds, once again more than doubling the performance of the code. We're doing very well compared to the original time of 26.2 seconds, but more performance can be extracted.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Linear Memory Access&lt;/b&gt;&lt;br&gt;Profiling this latest version now shows basically all of the run time is spent within the flood fill function itself. This is only to be expected, as it hardly makes any external calls at this point. Aside from very occasionally allocating or deallocating memory, it spends all of its time working internally. At this point, it's time to take advantage of Time Profiler's ability to examine the code within a function and see what the remaining bottlenecks are.&lt;/p&gt;

&lt;p&gt;Double-clicking on the flood fill function in Instruments pulls up the source code and shows the time spent on each line. The line taking up the most time by far is the body of &lt;code&gt;ComponentDiff&lt;/code&gt;, which is just:&lt;/p&gt;

&lt;pre&gt;    return MAX(a, b) - MIN(a, b);
&lt;/pre&gt;

&lt;p&gt;Clearly there isn't much that can be done to speed this up. This is already as fast as it can get. The only other way to speed this up would be to simply call it less, but &lt;code&gt;ComponentDiff&lt;/code&gt; has to run at least three times on every pixel examined (once per component), and the current code only runs it three times per pixel. This code is now as fast as it can possibly go, right?&lt;/p&gt;

&lt;p&gt;Right?&lt;/p&gt;

&lt;p&gt;Obviously the answer is "no", otherwise you'd be at the end of the article now. However, the way forward from here is deeply non-obvious and took me a long time to figure out.&lt;/p&gt;

&lt;p&gt;I had some false starts trying to make the &lt;code&gt;ComponentDiff&lt;/code&gt; function faster. Was there a way to make &lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt; faster? Could the subtraction be vectorized, so that all three components run at once?&lt;/p&gt;

&lt;p&gt;It turned out that this was all completely wrong-headed, and the answer lies in a completely different direction.&lt;/p&gt;

&lt;p&gt;RAM, despite being officially "random access", isn't truly random access these days. Modern computer memory is a complex hierarchical system which is optimized for common access patterns. Truly random access is quite slow compared to linearly reading or writing a long, contiguous chunk of memory.&lt;/p&gt;

&lt;p&gt;When it comes to manipulating an image, this means that you always want to write code that iterates over &lt;code&gt;x&lt;/code&gt; in the inner loop, like so:&lt;/p&gt;

&lt;pre&gt;    for(int y = 0; y &amp;lt; height; y++)
        for(int x = 0; x &amp;lt; width; x++)
            // Do something with pixel (x, y)
&lt;/pre&gt;

&lt;p&gt;Since images are normally laid out in memory by rows, this code iterates over memory in a completely linear fashion, which is really fast. Writing the loops in a more conventional order ends up being substantially slower:&lt;/p&gt;

&lt;pre&gt;    for(int x = 0; x &amp;lt; width; x++)
        for(int y = 0; y &amp;lt; height; y++)
            // Do something with pixel (x, y)
&lt;/pre&gt;

&lt;p&gt;This code jumps around a lot. For a 32-bit image that's 1024 pixels wide, this code will read the first pixel, then read a second pixel that's 4kB away in memory. The third pixel will be another 4kB down, etc. This scattered memory access is really slow.&lt;/p&gt;

&lt;p&gt;Flood fill is not quite the simple systematic iteration of the above for loops, but it's ultimately similar. It iterates over a potentially large number of contiguous pixels, and the order in which it iterates will affect just how quickly the computer's memory is able to return pixels.&lt;/p&gt;

&lt;p&gt;The massive amount of time spent in &lt;code&gt;ComponentDiff&lt;/code&gt; is suspicious. It's a pretty simple operation, and there's a lot of other code running, so why is this one operation taking up half of the total run time? With the performance characteristics of modern RAM in mind, maybe the calculations themselves aren't taking up all of this time. Instead, it might just be that the calculations are spending a lot of time waiting for the data to be loaded from RAM.&lt;/p&gt;

&lt;p&gt;Let's look at the access patterns of the flood fill function to see if this is really it, and whether it can be improved. The latest flood fill function uses a stack to store which pixels to examine. A stack operates in a last-in, first-out manner, where things come off the stack in the opposite order that they were put in. The last thing placed on the stack is the first thing to be retrieved. When a pixel adds adjacent pixels to the stack, those adjacent pixels will be the next ones to be examined. This contiguous access is what we want.&lt;/p&gt;

&lt;p&gt;The big question is then: does it access rows before columns, or columns before rows? Here's the code that enqueues the new pixels to examine:&lt;/p&gt;

&lt;pre&gt;    int nextXs[4] = { x + 1, x - 1, x, x };
    int nextYs[4] = { y, y, y + 1, y - 1 };
    for(int i = 0; i &amp;lt; 4; i++)
    {
        int nextX = nextXs[i];
        int nextY = nextYs[i];
        // enqueue...
&lt;/pre&gt;

&lt;p&gt;Remember, these pixels are popped off the stack in last-in, first-out order. The first pixel examined will be &lt;code&gt;(x, y - 1)&lt;/code&gt;, which is the pixel above the current pixel. That pixel will do its own thing and enqueue the one above &lt;em&gt;it&lt;/em&gt;, and so on until the entire column is filled above. After the flood fill runs in that direction, it'll run downward, then finally go left and lastly right. This is pretty much the opposite of what we need for good performance.&lt;/p&gt;

&lt;p&gt;This is really easy to fix, though: just reverse the arrays!&lt;/p&gt;

&lt;pre&gt;    int nextXs[4] = { x, x, x - 1, x + 1 };
    int nextYs[4] = { y - 1, y + 1, y, y };
&lt;/pre&gt;
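
&lt;p&gt;If the last-in, first-out ordering is hard to picture, this little sketch pushes four neighbors onto a tiny stack and pops them back off, recording the order in which they come out. It's a toy for illustration only, with a fixed-size stack and a hypothetical &lt;code&gt;PopOrder&lt;/code&gt; function, not a piece of the flood fill itself:&lt;/p&gt;

```c
struct Coordinate { int x, y; };

// Hypothetical helper: pushes the four neighbors in array order,
// then pops them all, writing them to out in the order they come off.
void PopOrder(const int nextXs[4], const int nextYs[4], struct Coordinate out[4])
{
    struct Coordinate stack[4];
    int top = 0;

    for(int i = 0; i != 4; i++)
        stack[top++] = (struct Coordinate){ nextXs[i], nextYs[i] };

    // The last coordinate pushed is the first one popped.
    for(int i = 0; i != 4; i++)
        out[i] = stack[--top];
}
```

&lt;p&gt;Starting from &lt;code&gt;(10, 10)&lt;/code&gt;, the original arrays pop &lt;code&gt;(10, 9)&lt;/code&gt; first, which is a vertical neighbor. The reversed arrays pop &lt;code&gt;(11, 10)&lt;/code&gt; first, then &lt;code&gt;(9, 10)&lt;/code&gt;, so the fill runs along the row before it moves to another one.&lt;/p&gt;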

&lt;p&gt;With only this change, the flood fill function drops from 4.8 seconds to 2.9 seconds. This really shows the value of good memory access patterns!&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Linear Memory Access, Part Two&lt;/b&gt;&lt;br&gt;Although the last optimization almost looked impossible, it's still worth running the profiler again to see if the change exposed any new bottlenecks. &lt;code&gt;ComponentDiff&lt;/code&gt; is under 10% in the new profile, showing just how much of a difference the memory access pattern makes to memory-intensive code like this.&lt;/p&gt;

&lt;p&gt;There are two noticeable hotspots with the latest flood fill function. One is the &lt;code&gt;pixelsSeen&lt;/code&gt; bitmap, which manifests in a lot of time spent in &lt;code&gt;AddToBitmap&lt;/code&gt; and &lt;code&gt;CheckBitmap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The access pattern for &lt;code&gt;pixelsSeen&lt;/code&gt; is different from the access pattern for the image's pixels itself. Because it gets checked and set when enqueueing rather than when dequeueing, all four of a pixel's neighbors get checked, and potentially set, immediately. The pixels on the same row are fine, but the pixels above and below are far away in memory. It's exactly the sort of non-local memory access that the previous optimization tries to avoid in the image, although on a much smaller scale, since the pixels in &lt;code&gt;pixelsSeen&lt;/code&gt; are only one bit instead of 32 bits.&lt;/p&gt;

&lt;p&gt;Still, it appears to be worth attacking. My strategy is to change the &lt;code&gt;Coordinate&lt;/code&gt; &lt;code&gt;struct&lt;/code&gt; into a &lt;code&gt;Command&lt;/code&gt; &lt;code&gt;struct&lt;/code&gt; which will hold not only an &lt;code&gt;(x, y)&lt;/code&gt; pair, but also a command to execute on that pair. There will be two commands. The &lt;code&gt;EXAMINE&lt;/code&gt; command will do the same thing that the flood fill algorithm currently does: test the pixel's color and, if it matches, fill it and enqueue new pixels. The &lt;code&gt;ENQUEUE_VERTICALS&lt;/code&gt; command will point at the original pixel, and will enqueue the pixels above and below as part of a second pass.&lt;/p&gt;

&lt;p&gt;This accesses the &lt;code&gt;pixelsSeen&lt;/code&gt; bitmap much more nicely. It will simply scan left and right of the current pixel. After the stack gets popped back down to this pixel, only then will it scan above and below. By this time, those are likely to be filled, so it will just be a quick check and then done. The next command will likely be the command to scan above and below a horizontally adjacent pixel, making for nice linear accesses once again.&lt;/p&gt;

&lt;p&gt;To actually implement this, we first need to define the commands, which I simply wrote as an &lt;code&gt;enum&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;    enum {
        EXAMINE,
        ENQUEUE_VERTICALS
    };
&lt;/pre&gt;

&lt;p&gt;Next, &lt;code&gt;Coordinate&lt;/code&gt; and &lt;code&gt;pixelsToExamine&lt;/code&gt; need to change to incorporate a new field to hold the command:&lt;/p&gt;

&lt;pre&gt;    struct Command { int command; int x, y; };
    struct Command *pixelsToExamine = malloc(sizeof(*pixelsToExamine) * pixelsToExamineLength);
&lt;/pre&gt;

&lt;p&gt;I'm now going to be enqueueing pixels in three places, and enqueueing two different commands. To cut back on duplicated code, I wrote a macro to handle enqueueing a command:&lt;/p&gt;

&lt;pre&gt;    #define ENQUEUE(...) do { \
            if(pixelsToExamineIndex &amp;gt;= pixelsToExamineLength) \
            { \
                pixelsToExamineLength *= 2; \
                pixelsToExamine = realloc(pixelsToExamine, sizeof(*pixelsToExamine) * pixelsToExamineLength); \
            } \
            pixelsToExamine[pixelsToExamineIndex++] = (__VA_ARGS__); \
        } while(0)
&lt;/pre&gt;

&lt;p&gt;Note the use of &lt;code&gt;...&lt;/code&gt;, which allows the macro to accept a C99 compound literal, like:&lt;/p&gt;

&lt;pre&gt;    ENQUEUE((struct Command){ command, x, y });
&lt;/pre&gt;

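&lt;p&gt;The C preprocessor isn't wise to the ways of compound literals: the braces don't protect the commas inside them, so it sees that call as three separate arguments. Declaring the macro with &lt;code&gt;...&lt;/code&gt; lets it accept all three, and &lt;code&gt;__VA_ARGS__&lt;/code&gt; puts them back together, commas included.&lt;/p&gt;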
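
&lt;p&gt;The same effect can be seen in isolation with a small sketch. The names &lt;code&gt;STORE&lt;/code&gt;, &lt;code&gt;struct Point&lt;/code&gt;, and &lt;code&gt;MakePoint&lt;/code&gt; are invented for this example and are not part of the flood fill code:&lt;/p&gt;

```c
#include <assert.h>

struct Point { int x, y; };

/* A one-parameter macro would not accept a compound literal:
 *
 *     #define STORE_ONE(value) (slot = (value))
 *     STORE_ONE((struct Point){ 1, 2 });   // error: too many macro arguments
 *
 * The comma inside the braces is not protected by parentheses, so the
 * preprocessor splits the literal into "(struct Point){ 1" and "2 }". */

/* The variadic version takes any number of arguments and hands them back,
   commas included, through __VA_ARGS__. It assigns to a local variable
   named slot, which the caller must declare. */
#define STORE(...) (slot = (__VA_ARGS__))

struct Point MakePoint(void)
{
    struct Point slot = { 0, 0 };
    STORE((struct Point){ 1, 2 });
    assert(slot.x == 1 && slot.y == 2);
    return slot;
}
```

&lt;p&gt;Wrapping the whole literal in an extra pair of parentheses at the call site would also satisfy a one-parameter macro, since parentheses do protect commas, but the variadic form keeps every call site cleaner.&lt;/p&gt;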

&lt;p&gt;The most common scenario will be enqueueing a pixel to examine, so I wrote a macro for that as well. This macro checks the pixel's coordinates to make sure they're within the image's bounds, and also checks the &lt;code&gt;pixelsSeen&lt;/code&gt; bitmap, then uses &lt;code&gt;ENQUEUE&lt;/code&gt; to add a new &lt;code&gt;EXAMINE&lt;/code&gt; command if everything is good:&lt;/p&gt;

&lt;pre&gt;    #define ENQUEUE_PIXEL(x, y) do { \
            if((x) &amp;gt;= 0 &amp;amp;&amp;amp; (y) &amp;gt;= 0 &amp;amp;&amp;amp; (x) &amp;lt; width &amp;amp;&amp;amp; (y) &amp;lt; height) \
                if(!CheckBitmap(x, y, pixelsSeen, width)) \
                    ENQUEUE((struct Command){ EXAMINE, (x), (y) }); \
        } while(0)
&lt;/pre&gt;
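
&lt;p&gt;One thing the &lt;code&gt;ENQUEUE&lt;/code&gt; macro does not do is check the result of &lt;code&gt;realloc&lt;/code&gt;. For code where that matters, a function-based push along the following lines is one possible shape. This is only a sketch: &lt;code&gt;struct CommandStack&lt;/code&gt; and &lt;code&gt;PushCommand&lt;/code&gt; are hypothetical names, and whether a function call is as fast as the macro here would need to be measured:&lt;/p&gt;

```c
#include <assert.h>
#include <stdlib.h>

struct Command { int command; int x, y; };

/* Hypothetical wrapper that bundles the array, its used count, and its
   allocated capacity. */
struct CommandStack {
    struct Command *items;
    int count;
    int capacity;
};

/* Pushes one command, doubling the capacity when the array is full.
   Returns 1 on success, or 0 if allocation fails, in which case the
   stack is left unchanged. */
int PushCommand(struct CommandStack *stack, struct Command command)
{
    assert(stack != NULL);
    if(stack->count >= stack->capacity)
    {
        int newCapacity = stack->capacity > 0 ? stack->capacity * 2 : 128;
        /* realloc(NULL, size) behaves like malloc(size), so an empty
           stack needs no special case. */
        struct Command *grown = realloc(stack->items, sizeof(*grown) * (size_t)newCapacity);
        if(grown == NULL)
            return 0;
        stack->items = grown;
        stack->capacity = newCapacity;
    }
    stack->items[stack->count++] = command;
    return 1;
}
```

&lt;p&gt;Starting from an empty stack with a null &lt;code&gt;items&lt;/code&gt; pointer, the first push allocates room for 128 commands, matching the initial length used in the article's code.&lt;/p&gt;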

&lt;p&gt;Now we have to change the rest of the function to use these commands.&lt;/p&gt;

&lt;p&gt;The first change is when enqueueing the starting pixel. This now becomes:&lt;/p&gt;

&lt;pre&gt;    ENQUEUE((struct Command){ EXAMINE, startx, starty });
&lt;/pre&gt;

&lt;p&gt;I'm not using &lt;code&gt;ENQUEUE_PIXEL&lt;/code&gt; because I don't want all of the extra checking it does for this one, although it wouldn't really hurt.&lt;/p&gt;

&lt;p&gt;The loop starts off pretty much the same, popping the command off the top of the stack:&lt;/p&gt;

&lt;pre&gt;    while(pixelsToExamineIndex &amp;gt; 0)
    {
        struct Command command = pixelsToExamine[--pixelsToExamineIndex];
        int x = command.x;
        int y = command.y;
&lt;/pre&gt;

&lt;p&gt;At this point, it examines the actual command and branches. If it's &lt;code&gt;EXAMINE&lt;/code&gt;, then the code is much like before. It adds the pixel to &lt;code&gt;pixelsSeen&lt;/code&gt;, checks &lt;code&gt;PixelDiff&lt;/code&gt; against &lt;code&gt;threshold&lt;/code&gt;, and fills:&lt;/p&gt;

&lt;pre&gt;        if(command.command == EXAMINE)
        {
            AddToBitmap(x, y, pixelsSeen, width);

            int diff = PixelDiff(startPixel, PIXEL(x, y));
            if(diff &amp;lt;= threshold)
            {
                PIXEL(x, y) = fillValue;
&lt;/pre&gt;

&lt;p&gt;At this point, the old code would enqueue the four adjacent pixels. Instead, we'll enqueue the two horizontally adjacent pixels, and one &lt;code&gt;ENQUEUE_VERTICALS&lt;/code&gt; command for the current pixel. We do them backwards, since the stack is last-in, first-out:&lt;/p&gt;

&lt;pre&gt;                ENQUEUE((struct Command){ ENQUEUE_VERTICALS, x, y });
                ENQUEUE_PIXEL(x - 1, y);
                ENQUEUE_PIXEL(x + 1, y);
            }
        }
&lt;/pre&gt;
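
&lt;p&gt;The effect of pushing in reverse order can be checked with a tiny stand-alone stack. This sketch pushes the same three commands for one pixel and then pops them. The fixed-size stack and the function name &lt;code&gt;PopOrderFor&lt;/code&gt; are illustrative only, and the bounds and bitmap checks are left out:&lt;/p&gt;

```c
#include <assert.h>

enum { EXAMINE, ENQUEUE_VERTICALS };
struct Command { int command; int x, y; };

/* Pushes the three commands generated for pixel (x, y) in the same order
   as the flood fill code, then pops them into out[]. Returns the number
   of commands popped. */
int PopOrderFor(int x, int y, struct Command out[3])
{
    struct Command stack[3];
    int top = 0;

    stack[top++] = (struct Command){ ENQUEUE_VERTICALS, x, y };
    stack[top++] = (struct Command){ EXAMINE, x - 1, y };
    stack[top++] = (struct Command){ EXAMINE, x + 1, y };

    int count = 0;
    while(top > 0)
        out[count++] = stack[--top];   /* last pushed comes out first */

    assert(count == 3);
    return count;
}
```

&lt;p&gt;The commands come back out as &lt;code&gt;EXAMINE&lt;/code&gt; for the right neighbor, &lt;code&gt;EXAMINE&lt;/code&gt; for the left neighbor, and finally &lt;code&gt;ENQUEUE_VERTICALS&lt;/code&gt; for the pixel itself. In the real function each &lt;code&gt;EXAMINE&lt;/code&gt; can push further commands on top, so the vertical pass for a pixel only runs once the horizontal work above it on the stack has drained.&lt;/p&gt;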

&lt;p&gt;The code for an &lt;code&gt;ENQUEUE_VERTICALS&lt;/code&gt; command is simple: just enqueue the pixels above and below:&lt;/p&gt;

&lt;pre&gt;        else if(command.command == ENQUEUE_VERTICALS)
        {
            ENQUEUE_PIXEL(x, y - 1);
            ENQUEUE_PIXEL(x, y + 1);
        }
    }
&lt;/pre&gt;

&lt;p&gt;How does this optimization do? The previous version took 2.9 seconds, and this version takes only 1.9 seconds. Not too bad for the &lt;em&gt;second&lt;/em&gt; revision after what appeared to be an impasse.&lt;/p&gt;

&lt;p&gt;At this point, we've started to hit diminishing returns. The final flood fill function is about 14 times faster than the original, plenty to brag about.&lt;/p&gt;

&lt;p&gt;For those having trouble keeping track of all the changes made to the function throughout this article, here's a full listing of the final version of it:&lt;/p&gt;

&lt;pre&gt;    void FloodFill(struct Pixel *image, int width, int height, int startx, int starty, struct Pixel fillValue, int threshold)
    {
    #define PIXEL_TO_INDEX(x, y) ((x) + (y) * width)
    #define PIXEL(x, y) image[PIXEL_TO_INDEX(x, y)]

        enum {
            EXAMINE,
            ENQUEUE_VERTICALS
        };

        int length = width * height;
        int bytes = (length + CHAR_BIT - 1) / CHAR_BIT;
        uint8_t *pixelsSeen = calloc(1, bytes);

        int pixelsToExamineLength = 128;
        struct Command { int command; int x, y; };
        struct Command *pixelsToExamine = malloc(sizeof(*pixelsToExamine) * pixelsToExamineLength);
        int pixelsToExamineIndex = 0;
    #define ENQUEUE(...) do { \
            if(pixelsToExamineIndex &amp;gt;= pixelsToExamineLength) \
            { \
                pixelsToExamineLength *= 2; \
                pixelsToExamine = realloc(pixelsToExamine, sizeof(*pixelsToExamine) * pixelsToExamineLength); \
            } \
            pixelsToExamine[pixelsToExamineIndex++] = (__VA_ARGS__); \
        } while(0)

    #define ENQUEUE_PIXEL(x, y) do { \
            if((x) &amp;gt;= 0 &amp;amp;&amp;amp; (y) &amp;gt;= 0 &amp;amp;&amp;amp; (x) &amp;lt; width &amp;amp;&amp;amp; (y) &amp;lt; height) \
                if(!CheckBitmap(x, y, pixelsSeen, width)) \
                    ENQUEUE((struct Command){ EXAMINE, (x), (y) }); \
        } while(0)

        ENQUEUE((struct Command){ EXAMINE, startx, starty });
        struct Pixel startPixel = PIXEL(startx, starty);

        while(pixelsToExamineIndex &amp;gt; 0)
        {
            struct Command command = pixelsToExamine[--pixelsToExamineIndex];
            int x = command.x;
            int y = command.y;

            if(command.command == EXAMINE)
            {
                AddToBitmap(x, y, pixelsSeen, width);

                int diff = PixelDiff(startPixel, PIXEL(x, y));
                if(diff &amp;lt;= threshold)
                {
                    PIXEL(x, y) = fillValue;

                    ENQUEUE((struct Command){ ENQUEUE_VERTICALS, x, y });
                    ENQUEUE_PIXEL(x - 1, y);
                    ENQUEUE_PIXEL(x + 1, y);
                }
            }
            else if(command.command == ENQUEUE_VERTICALS)
            {
                ENQUEUE_PIXEL(x, y - 1);
                ENQUEUE_PIXEL(x, y + 1);
            }
        }

        free(pixelsSeen);
        free(pixelsToExamine);
    #undef ENQUEUE
    #undef ENQUEUE_PIXEL
    }
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;Effective optimization requires good measurements and careful thought. Appropriate data structures and algorithms are key, and should always be the first place you look when you need to optimize a routine. In this case, &lt;code&gt;NSIndexSet&lt;/code&gt; ended up being a pretty slow choice for pixel membership tests and queueing, and more appropriate, specialized solutions were much faster. When operating on large amounts of data, pay close attention to your memory access patterns. Linear memory access will be far faster than scattered access.&lt;/p&gt;

&lt;p&gt;Micro-optimizations should always be the choice of last resort. We could have spent ages trying to apply micro-optimizations to &lt;code&gt;ComponentDiff&lt;/code&gt; and never made any sort of measurable difference, while simply rearranging the order in which pixels were visited sped up the function by over 60%.&lt;/p&gt;

&lt;p&gt;Finally, be sure to only optimize when necessary, and what's necessary. Don't waste time optimizing code that doesn't take up a significant amount of time in the first place. Measure first, identify bottlenecks, and only then see if they might be worth improving.&lt;/p&gt;

&lt;p&gt;This flood fill function can probably be optimized further, although the easiest speed improvements are long gone by now. Profiling reveals the various &lt;code&gt;ENQUEUE&lt;/code&gt; operations to be the bottleneck in the latest version. It should be possible to improve the speed of the queue by packing values more efficiently, being smarter about the checks that are performed, etc. An algorithmic overhaul to work in terms of scanline segments rather than individual pixels is probably the best hope for gaining more speed. Most flood fill operations will have rows of many contiguous pixels, and representing the entire row as a &lt;code&gt;(startx, endx, y)&lt;/code&gt; tuple could make for a major speed win. This is known as a &lt;a href="http://will.thimbleby.net/scanline-flood-fill/"&gt;scanline flood fill&lt;/a&gt;.&lt;/p&gt;
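
&lt;p&gt;To give a feel for the span-based approach, here is a simplified sketch, not a drop-in replacement for the function above. It assumes plain &lt;code&gt;int&lt;/code&gt; pixels and an exact color match instead of &lt;code&gt;struct Pixel&lt;/code&gt; and a threshold, and the names &lt;code&gt;struct Span&lt;/code&gt; and &lt;code&gt;ScanlineFill&lt;/code&gt; are invented for the example. Because a filled pixel no longer matches the target color, it gets by without a &lt;code&gt;pixelsSeen&lt;/code&gt; bitmap. A threshold-based version could not rely on that:&lt;/p&gt;

```c
#include <assert.h>
#include <stdlib.h>

/* One candidate run of pixels on row y, from startx to endx inclusive. */
struct Span { int startx, endx, y; };

/* Simplified scanline fill: int pixels, exact color match, 4-connectivity.
   Returns the number of pixels filled, or -1 if memory runs out. */
int ScanlineFill(int *image, int width, int height, int startx, int starty, int fillValue)
{
    assert(startx >= 0 && starty >= 0 && startx < width && starty < height);

    int target = image[startx + starty * width];
    if(target == fillValue)
        return 0;   /* nothing to do, and filled pixels must stop matching */

    int capacity = 64, count = 0, filled = 0;
    struct Span *spans = malloc(sizeof(*spans) * capacity);
    if(spans == NULL)
        return -1;
    spans[count++] = (struct Span){ startx, startx, starty };

    while(count > 0)
    {
        struct Span span = spans[--count];
        int *row = image + span.y * width;

        for(int x = span.startx; x <= span.endx; x++)
        {
            if(row[x] != target)
                continue;

            /* Grow the run as far as it goes in both directions. */
            int left = x, right = x;
            while(left > 0 && row[left - 1] == target) left--;
            while(right < width - 1 && row[right + 1] == target) right++;

            /* Fill the whole run with one linear pass. */
            for(int i = left; i <= right; i++)
                row[i] = fillValue;
            filled += right - left + 1;

            /* Queue the rows above and below as candidate spans. */
            for(int dy = -1; dy <= 1; dy += 2)
            {
                int ny = span.y + dy;
                if(ny < 0 || ny >= height)
                    continue;
                if(count >= capacity)
                {
                    capacity *= 2;
                    struct Span *grown = realloc(spans, sizeof(*spans) * capacity);
                    if(grown == NULL) { free(spans); return -1; }
                    spans = grown;
                }
                spans[count++] = (struct Span){ left, right, ny };
            }

            x = right;   /* resume scanning just past this run */
        }
    }

    free(spans);
    return filled;
}
```

&lt;p&gt;Each stack entry here stands for a whole run rather than a single pixel, so a wide, flat region generates a handful of pushes per row instead of several per pixel. Whether that actually beats the command-based version on real images is something only the profiler can say.&lt;/p&gt;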

&lt;p&gt;One thing that stood out to me when looking at this last version of the flood fill was that &lt;code&gt;AddToBitmap&lt;/code&gt; is only called after a pixel is dequeued. This seems wasteful, as the same pixel could accumulate multiple entries in the command queue, which will then be wasted. This can be avoided by calling &lt;code&gt;AddToBitmap&lt;/code&gt; when enqueueing a new pixel instead of when dequeueing. However, in practice, this resulted in no measurable change in performance. I didn't look too hard to figure out why, but this is a good lesson in why measurement is essential for this kind of work.&lt;/p&gt;
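
&lt;p&gt;For reference, marking at enqueue time might look roughly like the sketch below. The two bitmap helpers are stand-ins written for this example that only mirror the call shape used above, so the article's own &lt;code&gt;AddToBitmap&lt;/code&gt; and &lt;code&gt;CheckBitmap&lt;/code&gt; may differ, and &lt;code&gt;ClaimPixel&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in helpers, assuming a row-major bitmap with one bit per pixel. */
int CheckBitmap(int x, int y, const uint8_t *bitmap, int width)
{
    int bit = x + y * width;
    return (bitmap[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
}

void AddToBitmap(int x, int y, uint8_t *bitmap, int width)
{
    int bit = x + y * width;
    bitmap[bit / CHAR_BIT] |= (uint8_t)(1u << (bit % CHAR_BIT));
}

/* Returns 1 if the caller should enqueue (x, y), or 0 if the pixel is out
   of bounds or has already been claimed. Marking the pixel here, rather
   than when it is dequeued, means it can be enqueued at most once. */
int ClaimPixel(int x, int y, uint8_t *pixelsSeen, int width, int height)
{
    assert(width > 0 && height > 0);
    if(x < 0 || y < 0 || x >= width || y >= height)
        return 0;
    if(CheckBitmap(x, y, pixelsSeen, width))
        return 0;
    AddToBitmap(x, y, pixelsSeen, width);
    return 1;
}
```

&lt;p&gt;With this in place, a second attempt to claim the same pixel returns 0, so no duplicate command is pushed. The cost is an extra bitmap write on every enqueue, which may be part of why the change did not show up in the measurements.&lt;/p&gt;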

&lt;p&gt;That's it for today. Come back next time for the next wacky adventure. Friday Q&amp;amp;A is driven by reader suggestions as always, so if you have any suggestions for topics you'd like to see, please &lt;a href="mailto:mike@mikeash.com"&gt;send them in&lt;/a&gt;!&lt;/p&gt;

              &lt;/body&gt;
              &lt;/html&gt;
</description><author>Mike Ash</author><guid isPermaLink="true">http://www.mikeash.com/?page=pyblog/friday-qa-2012-09-28-optimizing-flood-fill.html</guid><pubDate>Fri, 28 Sep 2012 13:17:00 GMT</pubDate></item></channel></rss>
