mikeash.com: just this guy, you know?

Posted at 2009-10-09 16:03
Tags: defensive fridayqna
Friday Q&A 2009-10-09: Defensive Programming
by Mike Ash  

It's that time of the week again. This week I'm going to discuss defensive programming, a topic suggested by Ed Wynne.

Evolution of the Programmer
If you're like most programmers, your first enemy when learning to program was the compiler. You'd type in a perfectly good program, but the thing would spit out these cryptic errors at you. You'd go poring over the code looking for that missing semicolon or misspelled variable.

Eventually, you tamed the compiler. Your programs maybe didn't compile right away, but writing syntactically acceptable code became routine. Your goal then moved on to writing programs that behaved correctly. You put in 2+2 and it gives you 5, oops. Once you found the broken code and it gave you 4, you were overjoyed.

A lot of programmers stop here. There are plenty of refinements to be had here: algorithms, data structures, design patterns. All are different ways of approaching the task of building a program that behaves correctly. Writing correctly behaving programs becomes routine, even when they don't always behave properly right away, and that's as far as it goes.

But there's one more level, one which many programmers aren't even really aware of. At this level, the goal is to write programs which fail gracefully. This takes even more thought and care than making programs which behave correctly, and this is what I want to address today.

What Happens If It Fails?
This question is the key to writing programs which fail gracefully. Let's take a fairly common line that you might find in any average Cocoa program:

    NSData *data = [NSData dataWithContentsOfFile: path];
Cocoa makes this so easy that we can be lulled into a false sense of security. How much can go wrong from one line of code?

Actually, quite a lot:

  1. The file is unreadable: Maybe the file doesn't exist at that path, maybe it exists but the permissions don't allow you to read it, etc.
  2. The file is truncated or empty: Maybe the file got trashed by another program. Or by yours!
  3. The file contains an unexpected data format: Especially possible for files outside your app bundle.
  4. The file contains unexpected data in the format you want: Ditto.
  5. The file contains an enormous amount of data: 2TB drives can be had for well under $200 now, and a single file could be enormous. What happens if the file you're pointing at is 2TB long?
  6. The file takes an unreasonable amount of time to read: Network filesystems like AFP are extremely common these days, and networks can be slow.
How many of these failure modes does a typical Cocoa program actually handle in any sort of explicit fashion? Generally zero. Depending on the context, these failures could lead to a freeze, a crash, weird behavior, or nothing going wrong at all.

Even extremely mundane code can "fail". For example:

    int x = y + z;
What happens if the sum of y + z is greater than INT_MAX, or less than INT_MIN? Signed overflow is undefined behavior in C; in practice the result usually wraps around, which, while not a "failure" in the sense of a freeze or crash, is almost certainly not what the code expects.

Ways to Fail
There are a lot of ways that a program can respond to a failure. Ranked from worst to best:

  1. Corrupt/delete user data
  2. Crash/freeze
  3. Fail silently
  4. Display an error
  5. Work around the failure
It should be obvious why #1 is the worst. No matter how much your program crashes or fails to do its job, the worst it can do is be useless. But if you destroy your user's data, then your program can actually acquire negative value. When he discovers the culprit, the user will wish he had never tried your program, and he will tell all of his friends about it.

Everything after #1 is acceptable to some degree. Working around the failure isn't always possible; what if the user is opening a file and the file isn't readable? Ideally, displaying an error is the worst that would ever happen. In reality, it's not practical to trap every failure so that you can display an error message.

Working Around Failures
How and whether this is possible will depend entirely on what you're doing, so I can't say much beyond generalities.

If there are multiple ways to accomplish the same task, then you can write code to try each way in sequence. It's pointless to do this if all the different ways funnel through the same mechanism in the end; there's no reason to fall back to open/read if your NSData file reader fails, for example. For a case where it's worthwhile, imagine saving a file with some user-provided metadata. You use that metadata to synthesize a useful filename. However, the synthesized filename may contain characters which are illegal on your target filesystem, and there's no way to know in advance which characters are allowed. Thus, if the file write fails, try it again with a simplified filename.

Another example is when connecting to a network server. It's common for a server to advertise multiple addresses through a normal DNS entry or through ZeroConf. If your first attempt fails, try the other addresses before giving up. It's incredibly common, especially in a LAN environment with mixed IPv4/IPv6 addresses and ZeroConf advertisements, for half of a server's addresses to produce failures of some kind, and for the other half to work fine.

Occasionally it can be useful to simply retry the same operation more than once. Networking is a prime example of this. I can't count how many times I've loaded up Twitterrific on my iPhone and had it tell me that it couldn't connect to Twitter, only to have it work perfectly fine when I told it to try again. It would be great if it would try several times on its own before giving up.

Above all, make sure you test these fallback and retry paths! Many errors are rare, and it's not uncommon to have an error path which has never been executed. This can be fine, and even common, where your handler is just a log statement, but it's a very bad policy if you're actually doing real work there. If at all possible, set up unit tests to expose the error handling path. Even if you can't, be sure to at least manually test it after writing it to make sure that it works the way you want. An error handler which misbehaves is worse than one which simply logs the relevant information and gives up.

Displaying Errors
There's nothing complex here, it's just a bunch of annoying grunt work. Detect every useful error you can think of, and make it display an alert of some kind. Not much fun, and not much thought needed. The trick is that you can only do this for errors you anticipate, so you're limited.

Providing Diagnostics
When you're unable to work around the failure or display an error, you've gone beyond the realm of helping the user, but that doesn't mean that there's nothing else to do. Once you reach the point of crashing, freezing, or failing without an error message, you should consider how easy your code will be to debug.

As illustration, consider these two scenarios.

  1. Your application crashes in a dealloc method called from NSPopAutoreleasePool called from the main event loop. No messages are logged.
  2. Your application crashes in abort() after logging:
    Warning: couldn't read file /some/path: error: Error Domain=NSCocoaErrorDomain Code=260 UserInfo=0x100605690 "The file "path" couldn't be opened because there is no such file." Underlying Error=(Error Domain=NSPOSIXErrorDomain Code=2 "The operation couldn't be completed. No such file or directory")
    assertion failure in -[SomeClass someMethod] line 42: fileData != nil, aborting
Your response to #1 is likely that nameless dread that we get when seeing a really difficult bug. Your response to #2 is, "Oh, I guess I should put a more intelligent handler in someMethod."

There are two big tricks to making your app be more like #2.

First, always check for errors. I'll say it again: always check for errors. You don't have to handle them intelligently, but at least log them when you get one that's unexpected, and consider aborting, depending on the circumstances.

I'll make an exception to this for calls which do adequate logging on their own, such as malloc. There's no point in trying to recover from a malloc failure on OS X, because by the time you detect the failure and try to recover, your process is likely to already be doomed. There's no need to do your own logging, because malloc itself does a good job of that. And finally there's no real need to even explicitly abort, because any malloc failure is virtually guaranteed to result in an instantaneous crash with a good stack trace.

Some of you may have heard of Steinbach's Rule, which goes: "Never test for an error condition you don't know how to handle." This rule is tongue in cheek, but is partially correct. You should always check for errors, but if you don't know how to handle them, then don't try. It's much better to have a program which produces clear logs and an obvious crash when something really unexpected happens than to have a program run a bunch of poorly thought out and poorly tested code to try to handle the error "properly" when you don't have a clear idea of what that actually means.

The second trick is to be liberal with asserts. The trick with asserts is that they should be used for conditions that you know must be true, but which are somehow in doubt. Don't use an assert for something that could legitimately fail unless there's absolutely no way to continue execution afterwards. For example, this is not a good way to go:

    int fd = open(...);
    assert(fd >= 0);
You can easily have open fail, and you'll want better handling than just asserting and blowing up. This is the sort of thing you should be able to get back to the user in the form of a real error message somehow, even if it's not a very useful one. If the error message can't be useful, and the failure isn't considered "normal", consider logging more thorough information right at the site of failure so that the console logs will at least be informative to you.

On the other hand, this is a good way to use asserts:

    void DoSomethingWithFileDescriptor(int fd)
    {
        assert(fd >= 0);
        ...
    }
Many failures are not due to external events, like files being unavailable, but are simply an internal clash of assumptions or outright bugs. By sprinkling asserts around on the conditions you know to be true, you ensure that your program will fail in an informative fashion when it turns out that your assumptions have been violated somehow.

Conclusion
While these techniques are all useful for defensive programming, overall it's largely a matter of attitude. You need to get used to asking the question, "What if it fails?" It's easy to get fixated on simply making sure that the code works. After all, that's hard enough as it is. But as you're writing the code, take the time to ask, "What if it fails?" The result will be a more robust program that behaves better, crashes less, and is easier to debug.

That's it for this week. Come back in seven days for another exciting edition of Friday Q&A. As always, Friday Q&A is driven by your ideas, so send them in! The more ideas I get, the better this series can be, so don't be shy.

Did you enjoy this article? I'm selling a whole book full of them. It's available for iBooks and Kindle, plus a direct download in PDF and ePub format. It's also available in paper for the old-fashioned. Click here for more information.

Comments:

Wade Tregaskis at 2009-10-10 07:25:50:
Too few people promote the use of actual error checking. Having spent many years maintaining other people's code, I'm about ready to beat to death with their own severed limbs the next person I catch writing ignorant code. It's stupid, it's unprofessional, and in a world with more wide-spread defect tracking, it'd get you fired.

But your advice is contrary. You say you should check for errors, only don't bother if you don't actually have a way to handle it. What? 99% of the errors that could possibly happen, you don't know how to handle, and the remaining 1% are usually such soft errors as "my prefs file might not open because this might be my first run", etc.

Myself, I *always* check for errors. Math errors, NULL pointers, anything. Even from malloc. You are not unjustified in saying malloc failure is very problematic, because indeed many many common frameworks and libraries just crash in such situations (*cough*CoreFoundation*cough*). But it's not a black or white thing. Some percentage of your program's mallocs are done in your own code. If you can simply not crash for that fraction of the time, then why accept any less.

And delving further into this specific example, "out of memory" is a funny thing. I can ask malloc for a gigabyte of memory, and it'll fail, "out of memory". Yet I can then allocate a thousand objects, keep calling functions and extending my stack, etc. A lot of apps can actually recover reasonably often from "out of memory" errors for this reason; detection of the error leads to failure in that code, which then unwinds all the way back, releasing whatever stuff you've allocated to that point, etc. Putting you back into a place where you've got at least some free memory.

And maybe you never anticipated asking for that much memory, because you shouldn't be - you might have an overflow or some other bug that causes you to ask malloc for the wrong thing. Having your app gracefully fail and log an error message stating exactly what it just tried to do (quoting the requested malloc size) will instantly lead you to the root problem. Otherwise, you'd probably assume your app just uses too much memory, and mistakenly fire up Object Alloc instead of fixing the real bug.

Getting back to the larger picture... it doesn't hurt anything to do consistent, exhaustive error checking. And by error checking I don't mean asserts - why the heck would you deliberately add further potential crash points to your app?!?!? - I mean catching the error, logging it, and failing gracefully. The resulting logging will help you fix problems much faster than just crashing would - and heck, that's assuming you've got a bug which is nice enough to merely cause a crash. And for end users, it's infinitely better - okay, so I clicked a button and nothing happened. That's a bug. I'll probably complain to the developer about that, if it actually prevents me getting what I want done. But I'm sure going to complain in a lot nicer tone than for a bug where clicking a button makes it crash, losing an hour's work.

There is admittedly some argument worth broaching w.r.t. self-confidence within your app after an ambiguous or unqualified error, but then that tends to be handled incidentally by things you should already be doing anyway, for orthogonal reasons: i.e. automatically making backups when saving documents, saving files atomically, etc. And users should have Time Machine backups and so forth. And so much stuff is on the network these days (meaning as simple as emailed around, not necessarily some fancy-pants hipster cloud hype) that it's far better to risk data corruption - which could be recoverable after all, anyway - vs certain data "corruption" in the sense of just crashing and losing everything.

As an addendum, I'll concede that the practical issue with not crashing is that you don't get as much feedback. Users are less enraged when you don't crash - though that's the point, after all - and so are less likely to actually report problems. The real issue in my mind is that there's just no infrastructure to automatically send back soft error reports. I'd be perfectly happy to have the apps I use silently send back (anonymous) error reports (after I okay it the first time). Or throw up a CrashReporter-like dialog saying "hey, yeah, you just clicked a button and it failed, don't worry, I noticed, please click one more button so I can tell my evidently flawed creator about it". Or run an app which simply monitors the console log and spools the output from each app back to its respective developer. Things like that'd be a good project for someone to work on. ;)

mikeash at 2009-10-10 14:28:54:
Why do you say my advice is contrary? Where do I say that you shouldn't bother checking for errors if you don't have a way to handle it? The only place this idea exists is within a quote, which I immediately say is only half right. You might want to try going back and reading things again. What I actually say is that you should always check for errors, but you should not always try to handle them. In other words, you shouldn't try to recover from all error conditions. Sometimes an error just means you should log and abort. Trying to handle an error that you don't know how to handle is worse than not checking for it at all.

Your comment about detecting and recovering from malloc failures with extremely large requests is insightful. Extremely large allocations are a case where it can fail without necessarily taking down your entire program or putting it into a bad state.

Regarding asserts, you deliberately add further potential crash points to your app because it's better than the alternative. An assert should only be for a constraint which, if broken, will lead to a crash or data corruption. Asserting early and causing a deliberate crash with concrete information about the reason is better than crashing later. You seem to think that data corruption is better than a crash. This is, simply put, insane. You think backups eliminate the need to worry about data integrity. What if your user doesn't discover the corruption until all of his pre-corruption backups are gone? If you honestly think that it's better to corrupt data than to crash, I'd appreciate it if you could post a list of applications you've worked on so that I can be sure to never, ever, ever run any of them on my system.

Jeff Johnson at 2009-10-10 15:44:05:
Wade, are you saying you check for and 'gracefully' handle nil return for every [NSMutableArray array], for example?

Data corruption is the *worst* thing you can do as a programmer. (Well, you could kill people, but I'm assuming consumer Mac apps here and not air traffic controls systems.) If an app crashes without data corruption, you can simply relaunch and be back to work. Unsaved data will be lost, but that's why you implement some kind of autosave. ;-)

Don at 2009-10-12 19:33:44:
Great advice and I agree wholeheartedly. Practically, though, I fall more often than I'd like into "if you don't know how to handle them, then don't try". Logs are fine after the fact, but constructing a graceful response in advance seems to depend on the application, the specific interaction, and the application domain. For a practicum, for example, how would you specifically deal with the issues brought up by your NSData allocation example?

mikeash at 2009-10-13 03:24:52:
How to deal with the NSData issues depends on exactly why you're reading the file.

If you're reading it because it's an application resource, I would just ensure that it will provide some kind of vaguely sensible log statement if something goes wrong and leave it at that. If your app bundle is hosed then you have big problems anyway.

If you're reading it because the user told you to read it, first, use the NSData methods that return an NSError, and present that error to the user upon failure. That takes care of problems where it can't be read. Next, validate the format extensively. You should be doing this anyway, of course, to ensure that your app can't be used as a security hole and such. Take advantage of system-provided APIs if you're dealing with a common format like XML or an image format. This takes care of the unexpected data problems. For enormous files, it can be good to have a sanity check on file sizes, but be careful not to make it too small. The user will not be pleased if he legitimately has a 10GB file that your application refuses to work on despite the fact that it's a 64-bit app and he has 15GB of RAM. For taking a long time to read, you could provide a progress indicator that pops up after a couple of seconds. However, showing a SPOD is acceptable even if it's not particularly good, because the user should understand that it's doing what he just told it to do. Users are probably trained to understand that opening a file can take a while.

The real torture case is opening arbitrary files that the user didn't explicitly tell you to open. This could be because you're indexing something, or it's some kind of external plugin data, or similar. Much of the previous advice holds, but you'll want to be even more paranoid about it. In particular, be sure that all IO of this nature is not performed on the main thread, or in any other way which could end up blocking user interaction. Assume that the IO could take forever, and make sure that your program tolerates that. Have sanity checks on anything you can think of. If you can tolerate not reading some files, then err on the side of caution. For example, if you're scanning for images, it may be better to skip over any image file bigger than, say, 200MB, even if it's potentially legitimate. Strategies will vary depending on exactly what you're doing, so you'll have to look at each situation individually, but that's the general idea.

Steven Degutis at 2009-10-13 15:54:37:
(Jeff: For what it's worth, killing people could roughly be considered "data-loss", so you'd still be right... :)

Michael Long at 2009-10-17 19:57:51:
Also regarding asserts, adding them is an additional safeguard to help protect against other developers who may at some point be working with or maintaining your code.

You know that a specific function expects a valid file descriptor, but someone else just added to the team might not be as knowledgeable. Or it may be a simple bug in his code. Or yours.

Worse, due to various circumstances, what if the function fails silently due to a bad parameter? I had a developer come to me once, excited that his "improvements" to a sorting algorithm had decreased sort times by 300%.

Which, I found, is relatively easy to do if you never actually sort anything...

mikeash at 2009-10-18 00:39:07:
I agree with what you say, with the provision that "other developers" includes yourself, six months into the future.

Some people will use assert macros which are only enabled for debug builds, and which get compiled away in release builds. I think that even Wade would find it hard to argue against those. However, I think it's better to leave them in for release builds as well, for the reasons I discussed previously.

