mikeash.com: just this guy, you know?

Posted at 2010-06-18 14:48 | RSS feed (Full text feed) | Blog Index
Next article: Blocks and GCD in Russian
Previous article: Friday Q&A 2010-05-28: Leopard Collection Classes
Tags: cocoa fridayqna hash isequal objectivec
Friday Q&A 2010-06-18: Implementing Equality and Hashing
by Mike Ash  

Welcome back to a late edition of Friday Q&A. WWDC pushed the schedule back one week, but it's finally time for another one. This week, I'm going to discuss the implementation of equality and hashing in Cocoa, a topic suggested by Steven Degutis.

Equality
Object equality is a fundamental concept that gets used all over the place. In Cocoa, it's implemented with the isEqual: method. Something as simple as [array indexOfObject:] will use it, so it's important that your objects support it.

It's so important that Cocoa actually gives us a default implementation of it on NSObject. The default implementation just compares pointers. In other words, an object is only equal to itself, and is never equal to another object. The implementation is functionally identical to:

    - (BOOL)isEqual: (id)other
    {
        return self == other;
    }
While oversimplified in many cases, this is actually good enough for a lot of objects. For example, an NSView is never considered equal to another NSView, only to itself. For NSView, and many other classes which behave that way, the default implementation is enough. That's good news, because it means that if your class has that same equality semantic, you don't have to do anything, and get the correct behavior for free.

Implementing Custom Equality
Sometimes you need a deeper implementation of equality. It's common for objects, typically what you might refer to as a "value object", to be distinct from another object but be logically equal to it. For example:

    // use mutable strings because that guarantees distinct objects
    NSMutableString *s1 = [NSMutableString stringWithString: @"Hello, world"];
    NSMutableString *s2 = [NSMutableString stringWithFormat: @"%@, %@", @"Hello", @"world"];
    BOOL equal = [s1 isEqual: s2]; // gives you YES!
Of course NSMutableString implements this for you in this case. But what if you have a custom object that you want to be able to do the same thing?
    MyClass *c1 = ...;
    MyClass *c2 = ...;
    BOOL equal = [c1 isEqual: c2];
In this case you need to implement your own version of isEqual:.

Testing for equality is fairly straightforward most of the time. Gather up the relevant properties of your class, and test them all for equality. If any of them are not equal, then return NO. Otherwise, return YES.

One subtle point with this is that the class of your object is an important property to test as well. It's perfectly valid to test a MyClass for equality with an NSString, but that comparison should never return YES (unless MyClass is a subclass of NSString, of course).

A somewhat less subtle point is to ensure that you only test properties that are actually important to equality. Things like caches that do not influence your object's externally-visible value should not be tested.

Let's say your class looks like this:

    @interface MyClass : NSObject
    {
        int _length;
        char *_data;
        NSString *_name;
        NSMutableDictionary *_cache;
    }
Your equality implementation would then look like this:
    - (BOOL)isEqual: (id)other
    {
        return ([other isKindOfClass: [MyClass class]] &&
                [other length] == _length &&
                memcmp([other data], _data, _length) == 0 &&
                [[other name] isEqual: _name])
                // note: no comparison of _cache
    }
Hashing
Hash tables are a commonly used data structure which are used to implement, among other things, NSDictionary and NSSet. They allow fast lookups of objects no matter how many objects you put in the container.

If you're familiar with how hash tables work, you may want to skip the next paragraph or two.

A hash table is basically a big array with special indexing. Objects are placed into an array with an index that corresponds to their hash. The hash is essentially a pseudorandom number generated from the object's properties. The idea is to make the index random enough to make it unlikely for two objects to have the same hash, but have it be fully reproducible. When an object is inserted, the hash is used to determine where it goes. When an object is looked up, its hash is used to determine where to look.

In more formal terms, the hash of an object is defined such that two objects have an identical hash if they are equal. Note that the reverse is not true, and can't be: two objects can have an identical hash and not be equal. You want to try to avoid this as much as possible, because when two unequal objects have the same hash (called a collision) then the hash table has to take special measures to handle this, which is slow. However, it's provably impossible to avoid it completely.

In Cocoa, hashing is implemented with the hash method, which has this signature:

    - (NSUInteger)hash;
As with equality, NSObject gives you a default implementation that just uses your object's identity. Roughly speaking, it does this:
    - (NSUInteger)hash
    {
        return (NSUInteger)self;
    }
The actual value may differ, but the essential point is that it's based on the actual pointer value of self. And just as with equality, if object identity equality is all you need, then the default implementation will do fine for you.

Implementing Custom Hashing
Because of the semantics of hash, if you override isEqual: then you must override hash. If you don't, then you risk having two objects which are equal but which don't have the same hash. If you use these objects in a dictionary, set, or something else which uses a hash table, then hilarity will ensue.

Because the definition of the object's hash follows equality so closely, the implementation of hash likewise closely follows the implementation of isEqual:.

An exception to this is that there's no need to include your object's class in the definition of hash. That's basically a safeguard in isEqual: to ensure the rest of the check makes sense when used with a different object. Your hash is likely to be very different from the hash of a different class simply by virtue of hashing different properties and using different math to combine them.

Generating Property Hashes
Testing properties for equality is usually straightforward, but hashing them isn't always. How you hash a property depends on what kind of object it is.

For a numeric property, the hash can simply be the numeric value.

For an object property, you can send the object the hash method, and use what it returns.

For data-like properties, you'll want to use some sort of hash algorithm to generate the hash. You can use CRC32, or even something totally overkill like MD5. Another approach, somewhat less speedy but easy to use, is to wrap the data in an NSData and ask it for its hash, essentially offloading the work onto Cocoa. In the above example, you could compute the hash of _data like so:

    [[NSData dataWithBytes: _data length: _length] hash]
Combining Property Hashes
So you know how to generate a hash for each property, but how do you put them together?

The easiest way is to simply add them together, or use the bitwise xor property. However, this can hurt your hash's uniqueness, because these operations are symmetric, meaning that the separation between different properties gets lost. As an example, consider an object which contains a first and last name, with the following hash implementation:

    - (NSUInteger)hash
    {
        return [_firstName hash] ^ [_lastName hash];
    }
Now imagine you have two objects, one for "George Frederick" and one for "Frederick George". They will hash to the same value even though they're clearly not equal. And, although hash collisions can't be avoided completely, we should try to make them harder to obtain than this!

How to best combine hashes is a complicated subject without any single answer. However, any asymmetric way of combining the values is a good start. I like to use a bitwise rotation in addition to the xor to combine them:

    #define NSUINT_BIT (CHAR_BIT * sizeof(NSUInteger))
    #define NSUINTROTATE(val, howmuch) ((((NSUInteger)val) << howmuch) | (((NSUInteger)val) >> (NSUINT_BIT - howmuch)))
    
    - (NSUInteger)hash
    {
        return NSUINTROTATE([_firstName hash], NSUINT_BIT / 2) ^ [_lastName hash];
    }
Custom Hash Example
Now we can take all of the above and use it to produce a hash method for the example class. It follows the basic form of the equality method, and uses the above techniques to obtain and combine the hashes of the individual properties:
    - (NSUInteger)hash
    {
        NSUInteger dataHash = [[NSData dataWithBytes: _data length: _length] hash];
        return NSUINTROTATE(dataHash, NSUINT_BIT / 2) ^ [_name hash];
    }
If you have more properties, you can add more rotation and more xor operators, and it'll work out just the same. You'll want to adjust the amount of rotation for each property to make each one different.

A Note on Subclassing
You have to be careful when subclassing a class which implements custom equality and hashing. In particular, your subclass should not expose any new properties which equality is dependent upon. If it does, then it must not compare equal with any instances of the superclass.

To see why, consider a subclass of the first/last name class which includes a birthday, and includes that as part of its equality computation. It can't include it when comparing equality with an instance of the superclass, though, so its equality method would look like this:

    - (BOOL)isEqual: (id)other
    {
        // if the superclass doesn't like it then we're not equal
        if(![super isEqual: other])
            return NO;
        
        // if it's not an instance of the subclass, then trust the superclass
        // it's equal there, so we consider it equal here
        if(![other isKindOfClass: [MySubClass class]])
            return YES;
        
        // it's an instance of the subclass, the superclass properties are equal
        // so check the added subclass property
        return [[other birthday] isEqual: _birthday];
    }
Now you have an instance of the superclass for "John Smith", which I'll call A, and an instance of the subclass for "John Smith" with a birthday of 5/31/1982, which I'll call B. Because of the definition of equality above, A equals B, and B also equals itself, which is expected.

Now consider an instance of the subclass for "John Smith" with a birthday of 6/7/1994, which I'll call C. C is not equal to B, which is what we expect. C is equal to A, also expected. But now there's a problem. A equals both B and C, but B and C do not equal each other! This breaks the standard transitivity of the equality operator, and leads to extremely unexpected results.

In general this should not be a big problem. If your subclass adds properties which influence object equality, that's probably an indication of a design problem in your hierarchy anyway. Rather than working around it with weird implementations of isEqual:, consider redesigning your class hierarchy.

A Note on Dictionaries
If you want to use your object as a key in an NSDictionary, you need to implement hashing and equality, but you also need to implement -copyWithZone:. Techniques for doing that are beyond the scope of today's post, but you should be aware that you need to go a little bit further in that case.

Conclusion
Cocoa provides default implementations of equality and hashing which work for many objects, but if you want your objects to be considered equal even when they're distinct objects in memory, you have to do a bit of extra work. Fortunately, it's not difficult to do, and once you implement them, your class will work seamlessly with many Cocoa collection classes.

That's it for this week. Check back in two more weeks for another edition. Until then, keep sending me your requests for topics. Friday Q&A is driven by user submissions, so if you have an idea for a topic to cover here, please send it in.

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

A related pitfall: it's easy to assume that the basic Foundation types will have good, general -hash implementations, but in fact they don't. For the collection types, -hash is equal to -count.

There are good reasons for this, but it's a potentially major pitfall. Concretely, in a recent case that involved uniquing several million short arrays, I reduced runtime from over twelve hours to eleven seconds by noticing this and implementing custom hashing and comparison.

Another fascinatin' anecdote: prior to Snow Leopard, the -hash value of boolean NSNumbers did not match that of integer NSNumbers with value 0 and 1, but -isEqual: claimed they were equal.
Ahruman: this is some fascinating info, and this combined with a growing collection of other observations about Cocoa on Mike's blog is really starting to erode my confidence in the robustness and stability of the Cocoa frameworks... :/
Ahruman: A fine point. And if you need a better hashing algorithm but don't feel like overriding what Cocoa does universally, you may benefit from custom callbacks in an NSMapTable or NSHashTable as detailed in my previous post:

http://www.mikeash.com/pyblog/friday-qa-2010-05-28-leopard-collection-classes.html

Steven Degutis: In general, my trust of a system is inversely proportional to my knowledge of a system. Some key exceptions are GCD and libc, but even those aren't as solid as I'd like.
Nice article, and one that I wish I'd had a few years ago. Another useful article is at http://www.mulle-kybernetik.com/artikel/Optimization/opti-7.html (although slightly dated). I implemented -hash and -isEqual: improperly once upon a time, and it led to random crashes under very hard-to-reproduce circumstances. These are definitely worth getting right!

Prior to Snow Leopard, CFURL was also terrible in a hashing collection because it would end up copying and manipulating strings each time the hash function was called. It looks like it's computed once and stored inline, now, which should be much better for performance.
Thanks for choosing a great topic, and a great article! I followed the advice at http://stackoverflow.com/questions/254281/ to implement my hash methods, using a prime number (31) to multiply each component of the hash before addition. What are your thoughts on that?

Also, adding boolean variables to a hash is needed quite often. The article suggests (var) ? 1231 : 1237 for those.

Good article as always Mike, though you didn't answer one question that's always bugged me. The Cocoa docs on -[NSObject hash] says this:

http://developer.apple.com/mac/library/documentation/cocoa/reference/foundation/Protocols/NSObject_Protocol/Reference/NSObject.html#//apple_ref/occ/intfm/NSObject/hash

"If a mutable object is added to a collection that uses hash values to determine the object's position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object's internal state information or you must make sure the object's internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)"

Well, umm. If the hash method cannot rely on any of the object's internal state that _may_ change throughout it's lifetime, this means that writing -hash for mutable objects becomes far more intricate. (I know why the restriction's there; presumably it's one of the reasons why NSDictionary copies its keys rather than simply retaining them. Dunno about NSMapTable though.)

I took a very braindead approach for RMModelObject's -hash method (sum the pointers for all of the object's ivars) because I felt it was the only safe way to keep within the Cocoa spec that'd still generate somewhat useful hash values.

Thoughts?

P.S. Your commenting doesn't support unicode? It doesn't like the e-acute in my name :).
Andre, yes - using mutable objects as dictionary keys is the path to madness (if the object can mutate while active as a dict key). That's one reason why people commonly use NSStrings or some other key that reasonably uniquely identifies an object as dictionary keys.
Frederik: That is probably a good technique. Overall it can be hard to measure the effectiveness of this stuff. I tend to go with a good balance between ease of coding and whatever looks like it will do a good job of mixing up the bits.

André Pang: The part that says "or you must make sure the object's internal state information does not change while the object is in the collection" is key. It's usually much easier, and much better, to simply not mutate your mutable keys while they're being used as keys, than to try to figure out how to make their hash not depend on their mutable value. Mutating a key while it's in a collection is probably another code smell.

Regarding Unicode, looks like I broke it when I added comment previews. Should be fixed now.
Mike, what about mutable objects in an NSSet?
For holding mutable, mutated objects, use an NSArray. Performance will suffer, but that's probably true of any collection holding changing objects.
You can use mutable objects in an NSSet as long as you implement the hash method such that the value it returns is not based on the mutable properties. An example of this would be if you had mutable objects that have an immutable primary key such as a GUID or sone other identifier. You can even just use the pointer value if instance identity is sufficient. Obviously this is something that you have to decide whether or not is appropriate for your your design; it would allow you to insert another "equivalent" object into the set which may or may not be "correct".
Great point on NSSet: using mutable objects in an NSSet is probably the most likely thing that one could do without realizing (at least for NSDictionary, the key is copied, so even mutable objects will be made immutable). At least I had never thought of that issue.
it's provably impossible to avoid it completely


Not sure what you mean by "impossible". If the keys are known in advance, a perfect hash avoids all collisions:

http://en.wikipedia.org/wiki/Perfect_hash_function
Actually it's "completely" that applies here. You can avoid hash collisions for special cases, but you can't avoid them in all cases all of the time. Your hash table either has to be very special-purpose (as in the case of using a perfect hash because you know all of the keys in advance) or it needs to be able to handle collisions.
Trevor: The hash function only returns a NSUInteger, that means there are at max 64 bits of unique numbers there. A NSString can have more possible combinations than that (think of a hash table of 100 character strings). There MUST be some collisions if you have more than 2^64 distinct values stored in a hash table.
Note that it is impossible to have a hash table with 2^64 distinct values, because you'd run out of address space long beforehand. It's not even possible in theory, as long as your hash value integer is the same size as your pointers.

The problem is not when you actually have more than 2^64 live values, but when you have more than 2^64 possible values. Since it's impractical to alter your hash function in the middle of running your program, it's possible to find two objects with the same hash and have to deal with the collision, even when you have far fewer than 2^64 live values.
In the case one of the properties is allowed to be nil, this method is incomplete: when you have two objects of class A with a property p that is nil on both, [x.p isEqual:y.p] will return false. I also wrote about this here: http://chris.eidhof.nl/post/38065014421/implementing-value-objects-in-objective-c
For several years I've had my doubts about how others have explained or justified non-transitivity of equality with respect to sub-classes. The first paragraph of "A note on subclassing" has finally put that to rest. Thanks.
In the paragraph A Note on Subclassing you wrote at the end Rather than working around it with weird implementations of isEqual:, consider redesigning your class hierarchy.

What if, I want to keep the similar hierarchy as in this case which is not bad at all then how to deal with this transitivity problem?
Sandeep,

In the root class (or as an extension of NSObject) you could add the following method:


- (BOOL)isMostSpecializedEqual:(NSObject *)anObject
{
    if (anObject == self)
        return YES;

    NSObject *parent = nil;
    NSObject *child = nil;

    if ([self isKindOfClass:anObject.class])
    {
        parent = anObject;
        child = self;
    }
    else if ([anObject isKindOfClass:self.class])
    {
        parent = self;
        child = anObject;
    }
    else
        return NO;

    return [child isEqual:parent];
}


then implement -isEqual:


- (BOOL)isEqual:(id)anObject
{
    if (anObject == self)
        return YES;
    else if ([self isKindOfClass:anObject.class])
        // perform memberwise-equality checks
    else
        return [self SR_isMostSpecializedEqual:anObject]
}


This way of isEqual: of the most specialized class in the hierarchy will be consistently used.
So if at some point one of the subclasses will introduce a state change (e.g. additional property)
it's own isEqual: will consistently judge.

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
Formatting: <i> <b> <blockquote> <code>.
NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.
Code syntax highlighting thanks to Pygments.
Hosted at DigitalOcean.