mikeash.com: just this guy, you know?

Posted at 2012-03-16 14:55 | RSS feed (Full text feed) | Blog Index
Next article: Introducing PLWeakCompatibility
Previous article: Friday Q&A 2012-03-09: Let's Build NSMutableArray
Tags: fridayqna letsbuild objectivec
Friday Q&A 2012-03-16: Let's Build NSMutableDictionary
by Mike Ash  

Last time on Friday Q&A, we discussed how to implement NSMutableArray. Today, I'll repeat the same exercise with NSMutableDictionary and build an implementation of it from scratch.

Concepts
Like NSMutableArray, NSMutableDictionary is a class cluster. Any subclass must implement these primitive methods:

    - (NSUInteger)count;
    - (id)objectForKey:(id)aKey;
    - (NSEnumerator *)keyEnumerator;
    - (void)removeObjectForKey:(id)aKey;
    - (void)setObject:(id)anObject forKey:(id)aKey;

Most of these should be pretty obvious. One which may not be is -keyEnumerator. Some of the NSDictionary APIs implemented on top of these methods need to be able to examine all of the dictionary's keys. For example objectsForKeys:notFoundMarker: would be implemented by iterating over keyEnumerator, using objectForKey: to get the corresponding object, and seeing if it matched the one passed in. Requiring an NSArray or NSSet would be limiting, so instead it simply requires returning some object which can be enumerated to find all of the keys.

There are several ways to implement a key-value mapping that this interface represents. The most natural for this is probably a hash table, since the only thing we can really know about the keys is that they implement NSCopying, hash, and isEqual:. A binary tree is another common way to implement key-value mappings, but with no comparison method available, it doesn't work too well here. It would be possible to implement the binary tree based on the hash value, but implementing a hash table is easier and more likely to match the actual implementation of NSMutableDictionary.

For those of you who've forgotten your college data structures course, let's talk about just what a hash table is before we go off and implement one.

First, we need a hash function. A hash function is something that takes an object and produces an integer where, for two equal values x and y, hash(x) = hash(y). In other words, when two objects have the same value, they have the same hash. If the two objects have different values, you prefer that they have different hashes, but this is not a requirement. A good hash function will try to make that happen if it's at all possible, because it makes hash tables run faster. However, return 0; is a perfectly valid hash function, just slow. In Cocoa, the hash method on NSObject provides this hash function. Subclasses which have more complicated equality semantics will also override hash to match.

With the hash function in hand, the next step is to construct a table which uses the hash function to splat keys into it in a random-looking fashion. The table can just be a C array. We choose an index into the table based on the hash function, usually by taking the modulus of the hash with the table length. That index of the table is then used to store the key/value pair.

A major question at this point is how to handle collisions. A collision can happen for two different objects with the same hash, or just for two different objects which happen to be assigned to the same index in the table but have different hashes. There are a lot of different ways to handle this problem, but a common and easy one is to have each table entry actually hold the head of a linked list rather than a single key-value pair.

Thus, to look up a key in the table, we use the hash function to find the index, then search the linked list for that key. To set a new key-value pair, we just add the new pair to the linked list at that index.

An interesting aspect of this approach is that resizing the table becomes somewhat optional and arbitrary. A small table can still hold a large number of objects, because they'll all end up in the table's linked lists. However, performance suffers greatly when the linked lists become long. For optimal performance, you want to grow the table as the number of key-value pairs grows, but it's not strictly necessary.

In order to make the implementation of the dictionary easier, I split the implementation into two separate classes. MAFixedMutableDictionary implements the above approach using a fixed-size table, whose size is specified at initialization time. MAMutableDictionary is then a small wrapper around MAFixedMutableDictionary which creates a new, larger one and copies the key-value pairs across each time the size exceeds a certain threshold.

Code
Like before, the code that we'll discuss today is available on GitHub:

https://github.com/mikeash/MACollections

Implementation
The first thing we'll implement is the linked list node, which I call _MAMutableDictionaryBucket. This is a really simple class whose task is to hold a key, a value, and a pointer to the next bucket. Here is the entire implementation:

    @interface _MAMutableDictionaryBucket : NSObject
    @property (nonatomic, copy) id key;
    @property (nonatomic, retain) id obj;
    @property (nonatomic, retain) _MAMutableDictionaryBucket *next;
    @end

    @implementation _MAMutableDictionaryBucket

    - (void)dealloc
    {
        [_key release];
        [_obj release];
        [_next release];
        [super dealloc];
    }

    @end

Note that I'm taking advantage of the new default @synthesize stuff which is why there is no visible implementation of those three @property declarations. Also note that the key property is declared as copy, which trivially implements the NSMutableDictionary semantic of copying its keys.

I also have a second helper class. This one is an NSEnumerator subclass which wraps a block that acts as the enumerator. I use this to make it easier to implement the keyEnumerator method. Here's the code for this class:

    @interface _MABlockEnumerator : NSEnumerator
    {
        id (^_block)(void);
    }
    - (id)initWithBlock: (id (^)(void))block;
    @end

    @implementation _MABlockEnumerator

    - (id)initWithBlock: (id (^)(void))block
    {
        if((self = [self init]))
            _block = [block copy];
        return self;
    }

    - (void)dealloc
    {
        [_block release];
        [super dealloc];
    }

    - (id)nextObject
    {
        return _block();
    }

    @end

That's it for helper classes. Next, let's look at the instance variables for MAFixedMutableDictionary:

    @implementation MAFixedMutableDictionary {
        NSUInteger _count;
        NSUInteger _size;
        _MAMutableDictionaryBucket **_array;
    }

This should all be pretty straightforward. _count stores the number of objects currently in the table. _size stores the size of the table, and _array actually is the table, implemented as an array of pointers to buckets, acting as the linked list heads. The initializer is likewise simple, setting _size, allocating _array, and leaving _count at 0:

    - (id)initWithSize: (NSUInteger)size
    {
        if((self = [super init]))
        {
            _size = size;
            _array = calloc(size, sizeof(*_array));
        }
        return self;
    }

dealloc is just the opposite, with the minor addition of needing to iterate through the table and free all of the buckets it contains. Since each bucket manages its next pointer's memory, there's no need to iterate through the list and manually free all of the nodes, as you may have done in some far-off data structures class. Instead, this happens automatically as a consequence of the standard memory management implemented in _MAMutableDictionaryBucket:

    - (void)dealloc
    {
        for(NSUInteger i = 0; i < _size; i++)
            [_array[i] release];
        free(_array);

        [super dealloc];
    }

The count method is really simple, since it just needs to return the instance variable:

    - (NSUInteger)count
    {
        return _count;
    }

Lookup
Now the code starts to get more complex. Next up is objectForKey:. The implementation starts off by calculating the index of the appropriate bucket by taking the hash of the key and using modulus to get it within the size of the table:

    - (id)objectForKey: (id)key
    {
        NSUInteger bucketIndex = [key hash] % _size;

Next, we retrieve the bucket at that index:

        _MAMutableDictionaryBucket *bucket = _array[bucketIndex];

Now, we want to loop over all every element in the linked list that begins with bucket and search for key. We loop while bucket is non-nil, and return the appropriate object if we find a matching key:

        while(bucket)
        {
            if([[bucket key] isEqual: key])
                return [bucket obj];

Otherwise, move to the next bucket and keep looking:

            bucket = [bucket next];
        }

Finally, if no matching bucket was found, return nil:

        return nil;
    }

Enumeration
The strategy for keyEnumerator is to iterate over the table. For each linked list we find in the table, we iterate over its nodes, then resume iterating the table. We'll use _MABlockEnumerator to implement this enumeration strategy with a block.

This block needs to be able to keep track of the current table index it's examining as well as the current list node. Since these need to persist between calls to the block and need to be modified in the block, they're implemented as __block local variables outside the block:

    - (NSEnumerator *)keyEnumerator
    {
        __block NSUInteger index = -1;
        __block _MAMutableDictionaryBucket *bucket = nil;

These are initialized with values that make the enumerator start looking at the very beginning of the table. With those in place, we can implement the enumerator block:

        NSEnumerator *e = [[_MABlockEnumerator alloc] initWithBlock: ^{

The bucket variable holds the bucket that was current on the last call to the block. Thus, the first thing the block does is move to the next bucket:

            bucket = [bucket next];

If we've fallen off the end of the list, then we need to move through the table and find the next list. We loop through the table searching for an index where _array does not contain nil. If we run off the end of the table while searching, then there are no more elements to enumerate, so we return nil:

            while(!bucket)
            {
                index++;
                if(index >= _size)
                    return (id)nil;
                bucket = _array[index];
            }

If we avoid returning nil, then bucket now contains the next key-value pair to enumerate. We simply return the key from that bucket:

            return [bucket key];

With the enumeration block complete, all that remains is to return it:

        }];
        return [e autorelease];
    }

Insertion and Removal
The next method to implement is removeObjectForKey:. This works much like objectForKey:, except that once it finds the appropriate bucket, it removes that bucket from the list rather than simply returning its object. Thus, it starts out the same, by finding the appropriate table index and the bucket contained there:

    - (void)removeObjectForKey: (id)key
    {
        NSUInteger bucketIndex = [key hash] % _size;
        _MAMutableDictionaryBucket *previousBucket = nil;
        _MAMutableDictionaryBucket *bucket = _array[bucketIndex];

However, because it needs to remove the bucket it finds from the list, the loop is a little more complex. The previousBucket variable here will track the bucket before the current one. That bucket's next property needs to be re-pointed when removing the current bucket, and there's no easy way to move backwards in the list, so we keep track of it separately. We initialize it with nil which we'll use to handle the special case where the bucket to remove is at the head of the list.

We now loop through the list and check for a matching key:

        while(bucket)
        {
            if([[bucket key] isEqual: key])
            {

Once a matching bucket is found, there are two cases to handle. The first is the special case where the bucket is at the very beginning of the list. For this case, we need to set _array[bucketIndex] to point to the next bucket, thus removing the bucket we found from the list, and add the appropriate retain and release calls to manage the memory:

                if(previousBucket == nil)
                {
                    _MAMutableDictionaryBucket *nextBucket = [[bucket next] retain];
                    [_array[bucketIndex] release];
                    _array[bucketIndex] = nextBucket;
                }

Otherwise, we just set previousBucket's next property to the current bucket's next, which cuts the current bucket out of the list. All memory management in this case is handled automatically by the bucket's @property code:

                else
                {
                    [previousBucket setNext: [bucket next]];
                }

In both cases, once the appropriate bucket is removed from the linked list, we simply decrement _count to take into account the fact that one less entry is present, then return from the method:

                _count--;
                return;
            }

If no bucket is found, we just keep searching. We advance previousBucket to bucket and then advance bucket to the next bucket:

            previousBucket = bucket;
            bucket = [bucket next];
        }
    }

And that's it. In the event that no such key exists, the loop just falls off the end of the linked list and the method returns, having done nothing.

Finally, we come to setObject:forKey:. The first thing we'll do is set up a new bucket for the new key-value pair:

    - (void)setObject: (id)obj forKey: (id)key
    {
        _MAMutableDictionaryBucket *newBucket = [[_MAMutableDictionaryBucket alloc] init];
        [newBucket setKey: key];
        [newBucket setObj: obj];

Next, since this method is supposed to replace any existing object that exists in the dictionary for key, we'll remove the previous object, if any:

        [self removeObjectForKey: key];

Finally, we insert the new bucket into the array by computing index, setting the next property to the current bucket at that index, and then setting the array to contain the new bucket:

        NSUInteger bucketIndex = [key hash] % _size;
        [newBucket setNext: _array[bucketIndex]];
        [_array[bucketIndex] release];
        _array[bucketIndex] = newBucket;

All that's left now is to increment _count and return:

        _count++;
    }

There's a subtle trick in the order of the above code. It's possible that either obj or key are only being kept alive by a strong reference from this dictionary instance. For example, one might write something like:

    [dict setObject: [dict objectForKey: key] forKey: key];

Although this particular example is pointless, it's legal, and you can run into equivalent situations that are much less obvious. If the first thing we did was removeObjectForKey:, as might seem natural, we risk invalidating either obj or key and causing the rest of the code to crash. Instead, the first thing we do is create newBucket and set its key and obj properties. This retains key and obj and ensures that they stay alive even after removeObjectForKey: runs.

Resizing
With MAFixedMutableDictionary complete, we now just implement MAMutableDictionary on top of it. This is simply a wrapper which keeps track of the current inner MAFixedMutableDictionary, as well as its table size:

    @implementation MAMutableDictionary {
        NSUInteger _size;
        MAFixedMutableDictionary *_fixedDict;
    }

This dictionary will create a new, larger dictionary whenever the current one gets too full. How full is too full? In hash table terms, the ratio of the number of hash table entries to the size of the hash table is called the load factor. As we discussed before, MAFixedMutableDictionary will keep working no matter how full it gets, it simply becomes slower and slower. Just when to resize the dictionary is not entirely clear. It's a classic time/space tradeoff, where resizing the table at a lower load factor keeps the table faster at the cost of more wasted space, and waiting for a higher load factor makes the table slower but wastes less space. In theory, a good hash function should keep the table fast up to about a 0.7 load factor, so that's where we'll resize it. Here are a pair of constants that define the load factor where the table will resize:

    static const NSUInteger kMaxLoadFactorNumerator = 7;
    static const NSUInteger kMaxLoadFactorDenominator = 10;

On to the code. The first method is the initializer. We implement initWithCapacity:, as it's a funnel method that other NSMutableDictionary methods will call through. We use the capacity as the initial size for the underlying fixed dictionary, with 4 as an arbitrary minimum:

    - (id)initWithCapacity: (NSUInteger)capacity
    {
        capacity = MAX(capacity, 4);
        if((self = [super init]))
        {
            _size = capacity;
            _fixedDict = [[MAFixedMutableDictionary alloc] initWithSize: _size];
        }
        return self;
    }

For dealloc, all it needs to do is release the underlying fixed dictionary:

    - (void)dealloc
    {
        [_fixedDict release];
        [super dealloc];
    }

Similarly, most of the primitive methods are simply wrappers around the _fixedDict:

    - (NSUInteger)count
    {
        return [_fixedDict count];
    }

    - (id)objectForKey: (id)key
    {
        return [_fixedDict objectForKey: key];
    }

    - (NSEnumerator *)keyEnumerator
    {
        return [_fixedDict keyEnumerator];
    }

    - (void)removeObjectForKey: (id)key
    {
        [_fixedDict removeObjectForKey: key];
    }

The only really interesting method is setObject:forKey:. It, too, calls through to _fixedDict first:

    - (void)setObject: (id)obj forKey:(id)key
    {
        [_fixedDict setObject: obj forKey: key];

However, this is also where it will reallocate the underlying storage if necessary. The first thing it does is see if the current load factor has exceeded the maximum allowed:

        if(kMaxLoadFactorDenominator * [_fixedDict count] / _size > kMaxLoadFactorNumerator)
        {

If it does, the first thing it does is create a new MAFixedMutableDictionary with a larger size. This size is determined by doubling the previous size, although many other strategies could be used instead:

            NSUInteger newSize = _size * 2;
            MAFixedMutableDictionary *newDict = [[MAFixedMutableDictionary alloc] initWithSize: newSize];

With the new fixed-size dictionary created, the next thing to do is to copy all of the key-value pairs across from the old one:

            for(id key in _fixedDict)
                [newDict setObject: [_fixedDict objectForKey: key] forKey: key];

With everything copied over, all that remains is to release the old dictionary and reassign the instance variables:

            [_fixedDict release];
            _size = newSize;
            _fixedDict = newDict;
        }
    }

That's it! With this simple wrapper, we now have a resizing hash table implementation of NSMutableDictionary.

In a real implementation, we would probably want to do a similar resizing operation in the opposite direction in removeObjectForKey:, to prevent too much space from being wasted if the dictionary is emptied. However, for this example implementation, I left it out, as it's even more optional than increasing the size as more data is added.

Sets
I won't be showing an implementation of NSMutableSet, but from this discussion it should be pretty clear how we could go about creating one. Conceptually, a set is basically just a dictionary with no objects, only keys. The above code could easily be used to implement NSMutableSet by simply treating each set member as a key, and having the object be a placeholder. Better would be to remove the objects from the code altogether, and do the same basic operations on the key alone, thus wasting a little less space.

Performance
If the hash function is well behaved, and keys are very likely to be assigned to different table indexes, the dictionary will be fast, assuming that hash itself is fast. Examining, adding, or deleting a single entry in the table gives \(\mathrm{O(1)}\) behavior for these operations in that case.

In the worst case, every key has the same table index, as would happen if the hash function were implemented as return 0;. In this case, the hash table degenerates into a single linked list. All operations on this linked list are \(\mathrm{O(n)}\), giving similar performance to an array.

In most real-world cases, the hash function will behave well enough to achieve \(\mathrm{O(1)}\) performance on most data. When writing Cocoa code, we can usually assume that this is the case for our NSMutableDictionary objects.

There is an interesting class of denial of service attacks based on this. Even with a well behaved hash function, it's often easy to find distinct values with the same hash or table index. By feeding specially crafted data to a program which assumes its hash tables are always fast, it's possible to cause that program to spend a great deal of time fiddling with its hash tables and even overload the server where the program is running. For more information on these, see the paper Denial of Service via Algorithmic Complexity Attacks.

Implications
This implementation can clarify some weird behaviors that may occur with NSMutableDictionary and NSMutableSet when they're misused.

Apple strongly cautions against putting mutable objects in sets or using them as dictionary keys, and NSMutableDictionary copies its keys in an attempt to avoid that. Now that we have some simple code to look at, it should become pretty clear as to why this is a problem. If a key's value changes, its hash is also likely to change. When the hash changes, the table index is likely to change as well, but the _MAMutableDictionaryBucket object stays where it was.

The net result is that the key-value pair becomes invisible to objectForKey: and similar. However, it will still show up in count and keyEnumerator. If you use keyEnumerator, you may encounter a bizarre situation where it returns a key which the dictionary then claims it doesn't contain! Furthermore, if you use setObject:forKey:, you'll likely end up with two copies of the key in the dictionary. When enumerating over the dictionary, the key will show up twice, but both will appear to have the same object, with the other object inaccessible. When the dictionary resizes, one of the entries will disappear, and which one "wins" will be essentially random.

Similar things happen if you have a bad implementation of hash, where two equal objects have different hash values. This can easily happen if you subclass NSObject, reimplement isEqual:, but don't reimplement hash. In cases like this, looking up a key-value pair will appear to succeed or fail essentially at random, depending on whether you get lucky and hit the same table index or not. Every time the table resizes, entries will get shuffled around and previous successes may start failing, and vice versa. You may once again end up with duplicate entries in the table, and they may or may not disappear as the table resizes.

Conclusion
We've now seen examples of how to implement all three main Cocoa collection classes, NSMutableArray, NSMutableDictionary, and NSMutableSet (as a minor variant of NSMutableDictionary). While the real implementations are undoubtedly different from what we see here, these examples come close enough to have a good understanding of how this stuff works on the inside.

That's it for today. Come back next time for another exciting journey to the dark interior of the programming soul. Until then, since Friday Q&A is driven by reader suggestions for topics, please send in your ideas!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

Enjoyed reading these as usual. One editorial note:

However, return 0; is a perfectly valid hash function, just slow. In Cocoa, the hash method on NSObject provides this hash function.


My initial reading of this was that NSObject's hash function was return 0;. Maybe it'd be clearer if you mentioned something about the object's address?
I believe there is a risk with this line:

[previousBucket setNext: [bucket next]];

if the setter is written in certain legitimate ways, eg as shown in http://cocoadev.com/wiki/AccessorMethods:


   if (newValue != _var) {
       [_var release];
       _var = [newValue retain];
   }


If the setter releases its _next variable before retaining the input parameter, then since [bucket next] has not retained it, it will be a dangling pointer.

In this case you're probably safe since the setter is synthesized and Apple's code will be appropriately defensive, but still it's worth either pointing out the danger or defending against it with your own retain/release, or implementing your own setter with defined safe semantics.
I think you're right that this is a potential danger, although not a realistic one when using the default @synthesize implementation. I never quite realized that problem with the setter pattern you show, but it makes me glad that I've always written mine to do the retain first:

[newValue retain];
[_var release];
_var = newValue;

It also makes me glad that ARC saves us from having to worry about stuff like that.
Can you update this code to be ARC-complaint ?

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
Formatting: <i> <b> <blockquote> <code>.
NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.
Code syntax highlighting thanks to Pygments.
Hosted at DigitalOcean.