mikeash.com: Friday Q&A 2009-07-17: Format Strings Tips and Tricks

Posted at 2009-07-17 15:21
Tags: c fridayqna

Friday Q&A 2009-07-17: Format Strings Tips and Tricks

by Mike Ash

Greetings and welcome back to Friday Q&A. This week I'm going to discuss some tips and tricks for using printf-style format strings in C, as suggested by Kevin Avila.

Introduction
Almost everyone doing C or Objective-C programming uses format strings. In C, they're used by the printf family of functions. In Cocoa, NSLog and NSString both use them. They're a powerful way to build strings, but many people only know the basics. This week I'll delve into some hidden corners so you can take full advantage of the power they offer. Note that if you don't know the basics already, this article isn't going to make a lot of sense to you, so read a good printf tutorial before continuing.

Finding the Documentation
Hopefully all my readers know this, but just in case: if you type man printf at your shell prompt, you will get a bunch of confusing stuff that does not appear relevant to C programming. That's because you're actually reading the documentation for the shell command printf, not the C function. To see documentation on the C function, you need to type man 3 printf. The Cocoa documentation also contains information on format strings, but since the only significant difference in Cocoa format strings is the addition of the %@ specifier for printing the -description of objects, I like to just use the printf documentation.

Varargs and Type Promotion
Format strings are always used with a function (or method) that takes variable arguments. This is important for several reasons.

First, the more obvious reason is that C doesn't provide any mechanism for the called function to know how many or what type of variable arguments it got. This means that your format string must exactly match the arguments you provide. Any mismatch could lead to bad output or a crash.

The less obvious reason is that C promotes types in values that get passed as variable arguments. In short, anything smaller than an int gets promoted to int, and float gets promoted to double. So when you pass in a char, you'll use a format specifier for int to print it, and likewise with passing a float and using a double specifier.
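
As a small illustration of those promotions, here is a sketch that formats a char and a float through snprintf. The function name format_promoted is made up for this example; the point is only that the char is read with an int specifier and the float with a double specifier.

```c
#include <stdio.h>

/* Hypothetical helper for illustration. Both c and f are promoted when
   they pass through the "..." part of snprintf: c arrives as an int and
   f arrives as a double. */
int format_promoted(char *buf, size_t size, char c, float f)
{
    /* %d and %c both read an int (the promoted char).
       %f reads a double (the promoted float). */
    return snprintf(buf, size, "%d %c %.2f", c, c, f);
}
```

On an ASCII system, calling format_promoted(buf, sizeof buf, 'A', 1.5f) should leave "65 A 1.50" in buf.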

Types of Unknown Size
Frequently when programming in C or Cocoa you'll use a typedef whose definition is not guaranteed. Examples of this are size_t, socklen_t, NSInteger, and CGFloat.

For size_t it's easy: printf actually has a length modifier for size_t. Use z with one of the standard integer specifiers, for example %zu.

For CGFloat it's also easy: because float gets promoted to double, the same %f specifier will work with either. No need to change anything.
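
Here is a minimal sketch of both easy cases in plain C. CGFloat isn't available outside Apple's headers, so this example assumes a float and a double stand in for its two possible definitions; format_sizes is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper for illustration. strlen returns a size_t, which is
   printed with %zu. The float and the double are both printed with %f,
   because the float is promoted to double on the way in. */
int format_sizes(char *buf, size_t size, const char *s, float f, double d)
{
    size_t len = strlen(s);
    return snprintf(buf, size, "%zu %.1f %.1f", len, f, d);
}
```

Calling format_sizes(buf, sizeof buf, "hello", 2.5f, 2.5) should produce "5 2.5 2.5".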

For socklen_t and NSInteger you need to get a little cleverer. You can't use %d because they might be bigger than an int. You can't use %ld or %lld because they might be smaller than those, and type promotion doesn't carry over. They could even be bigger than those. What you'll want to do here is explicitly cast your variable to a type you know will be large enough to hold it, and then use the specifier for that type. For example:

    printf("%jd", (intmax_t)myNSInteger);
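
The same idea as a self-contained sketch. The typedef my_integer_t is hypothetical and merely stands in for a type like NSInteger whose underlying definition you don't want to depend on; the cast to intmax_t is what makes %jd correct no matter how the typedef is defined.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical typedef standing in for NSInteger, socklen_t, or any other
   integer type whose real definition may vary between platforms. */
typedef long my_integer_t;

int format_my_integer(char *buf, size_t size, my_integer_t value)
{
    /* intmax_t can hold any signed integer value, and %jd is its specifier. */
    return snprintf(buf, size, "%jd", (intmax_t)value);
}
```

For an unsigned type, the equivalent would be a cast to uintmax_t with %ju.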

Strings of Limited Length
The %s specifier will print a C string. This is tremendously handy. However sometimes you want to print a sequence of characters that isn't necessarily a C string. For this, you can use the . (that's a period) modifier to specify a length. For example, here is a convenient way to turn a FourCharCode into an NSString:

    uint32_t valSwapped = CFSwapInt32HostToBig(fcc); // FCCs are stored backwards on Intel
    NSString *str = [NSString stringWithFormat:@"%.4s", &valSwapped];

The .4 tells NSString that the string is only four characters long, which keeps it from running off the end.

Sometimes you don't know the length ahead of time. This used to happen a lot with Pascal strings, but they're getting pretty rare these days. For this, you can use * as your length, and then it will read the length as a separate argument. (Note that this separate argument must be of type int, so beware types of unknown size!)

Here's an example of that:

    printf("%.*s", length, charbuffer);

And here's how you can use that to print a Pascal string, in case you ever run into one:

    printf("%.*s", pstring[0], pstring + 1);
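
Putting the two together, here is a sketch with explicit casts, since the * argument has to be an int. The helper names format_counted and format_pascal are invented for this example.

```c
#include <stdio.h>

/* Formats at most `length` bytes from a buffer that need not be
   NUL-terminated. The precision argument must be an int, so the size_t
   is cast explicitly. */
int format_counted(char *out, size_t size, const char *bytes, size_t length)
{
    return snprintf(out, size, "%.*s", (int)length, bytes);
}

/* Formats a Pascal string: the first byte holds the length and the
   characters follow it. */
int format_pascal(char *out, size_t size, const unsigned char *pstring)
{
    return snprintf(out, size, "%.*s", (int)pstring[0],
                    (const char *)(pstring + 1));
}
```

Given the five bytes 'h', 'e', 'l', 'l', 'o' with no terminator and a length of 3, format_counted should produce "hel" without reading past the third byte.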

Printing Pointers
Printing pointers is a handy thing to do but many people don't know how to do it right. You often see code like this:

    printf("0x%x", pointer);

This is wrong! Not only is the output ugly (you don't get leading zeroes) but it's not guaranteed to work at all, because you're passing a pointer but specifying an int.

The correct way is easy: just use the %p specifier. You get nice hexadecimal output and the type always matches.
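
A minimal sketch of that, with one extra bit of care: strictly speaking %p expects a void *, so this example casts the pointer before passing it. The name format_pointer is made up for illustration, and the exact text produced is up to the implementation.

```c
#include <stdio.h>

/* Hypothetical helper for illustration. %p expects a void *, so the
   pointer is converted explicitly before it goes through the varargs. */
int format_pointer(char *out, size_t size, const int *pointer)
{
    return snprintf(out, size, "%p", (void *)pointer);
}
```

On many systems this yields something like a hexadecimal address with a 0x prefix, but that format is not guaranteed by the standard.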

Beware of NULL
This one is so commonly ignored that gcc and clang actually have a workaround just for this, but it's still interesting to know. NULL can legally just be a #define to 0, like so:

    #define NULL 0

If you then try to pass NULL as a pointer argument to a vararg function like NSLog, your code is no longer conformant, because you're really passing an int! For example, this is, strictly speaking, wrong:

    printf("%p", NULL);

(Note that the same goes for nil.)

This is easy to fix: if you ever need to do this sort of thing, you can just cast the NULL to a pointer type like so:

    printf("%p", (void *)NULL);

Note that this problem is most commonly encountered in functions which need a NULL-terminated list of arguments, like -[NSArray arrayWithObjects:] or execl. Yes, that means all of the code out there which looks like this is, strictly speaking, wrong:

    [NSArray arrayWithObjects:a, b, c, nil];

How do we get away with it? The compiler helps. As I mentioned before, gcc and clang have a workaround for this. They #define NULL to be a magic symbol which has either pointer or integer type depending on the context in which it's used, so the correct pointer value is passed into the function.
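
To see why the type matters, here is a sketch of a function in the style of execl that takes a NULL-terminated argument list. The function total_length is hypothetical; it reads each variable argument as a char * until it finds a null pointer, so the caller has to pass a real pointer as the terminator.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sums the lengths of a list of C strings.
   The list must end with a null pointer of type char *. */
size_t total_length(const char *first, ...)
{
    size_t total = 0;
    va_list args;

    va_start(args, first);
    /* Each variable argument is fetched as a char *. If the caller passed
       a plain integer 0 as the terminator, this read would not match the
       argument's type. */
    for (const char *s = first; s != NULL; s = va_arg(args, char *)) {
        total += strlen(s);
    }
    va_end(args);
    return total;
}
```

The strictly conforming call is total_length("ab", "cde", (char *)NULL), with the cast on the terminator.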

Always Constant Format Strings
I see far too much code which does this:

    NSLog(someString);

This works most of the time, but what if someString contains the character sequence %@, or another format specifier? Then you probably crash.

It gets worse. What if you do this with printf or similar instead, and someString comes from a source outside your control, like off the internet? Then horrible things can occur.

One of the format specifiers supported by printf (but not Cocoa) is the %n specifier. This is very different from the other specifiers, in that it actually gives you a value back instead of taking one from you. It wants an int * argument, and will write the number of characters written so far into that argument. For example:

    printf("%d%n%d", a, &howmany, b);

After this executes, howmany will contain the width of the first integer being printed.
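
Here is the same thing as a self-contained sketch, using a constant format string. The name width_of_first is invented for this example, and it assumes a C library that honors %n, which some platforms restrict or disable for security reasons.

```c
#include <stdio.h>

/* Hypothetical helper for illustration. Formats a followed by b and
   returns how many characters a took up, as reported by %n. Assumes the
   C library supports %n with a constant format string. */
int width_of_first(char *out, size_t size, int a, int b)
{
    int howmany = 0;
    snprintf(out, size, "%d%n%d", a, &howmany, b);
    return howmany;
}
```

With a = 123 and b = 45, the output is "12345" and the returned width should be 3.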

If an attacker has control over the format string, then they can use the %n specifier to write an arbitrary value to a location in memory! This can then be used to take over your program. This attack is not theoretical.

In general, you should not pass anything other than a constant string as a format string. Every so often it is useful to build a format string dynamically first, but think hard before you do this whether you can accomplish your goal without that, and if you do it, then take extra care to ensure that your string will always be valid.
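
When all you want is to output a string you don't control, the usual fix is to make the format string a constant "%s" and pass the string as data. A minimal sketch, with format_untrusted as a hypothetical name:

```c
#include <stdio.h>

/* Hypothetical helper for illustration. The format string is a constant,
   so any % sequences inside `untrusted` are copied as plain characters
   and never interpreted as specifiers. */
int format_untrusted(char *out, size_t size, const char *untrusted)
{
    return snprintf(out, size, "%s", untrusted);
}
```

The same pattern applies to the functions that print directly: printf("%s", someString) in C, and NSLog(@"%@", someString) in Cocoa.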

Random Access Arguments
Typical format string usage is straight through start to finish. The first specifier uses the first argument, the second specifier uses the second argument, etc. However this is not mandatory! You can actually have any specifier use any argument. This is done by adding n$ to the format specifier, where n is the argument number to print. Arguments count from 1. For example, this prints the two arguments in reverse order:

    printf("a = %2$d  b = %1$d", b, a);

You can even reuse the same argument more than once. This can be handy when writing out a long string and you need to use the same variable string, for example a name, multiple times.

    printf("%1$s could not be accessed, error %2$d. Try rebooting %1$s.", name, err);

Note that if you do this, you must not skip any arguments. For example, this is invalid:

    printf("a = %2$d", b, a);

The reason for this is revealed in the fact that C does not tell the called function about the arguments. It has to retrieve all type information and argument counts from the format string itself. Here you're giving it incomplete information. It knows there are two arguments, but it has no idea of the type of the first argument. This means that it cannot know how to access the second argument, so the result of making this call is undefined.
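
Here is a sketch of both uses that follows the rules: every specifier is numbered, and no argument is skipped. Numbered arguments come from POSIX rather than the C standard, so this assumes a POSIX-style C library; the helper names are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: the arguments are passed as (b, a), and the
   numbered specifiers pick them out in the opposite order. */
int format_swapped(char *out, size_t size, int a, int b)
{
    return snprintf(out, size, "a = %2$d  b = %1$d", b, a);
}

/* Hypothetical helper: argument 1 is used twice. Every specifier in the
   string is numbered, because numbered and unnumbered specifiers cannot
   be mixed in one format string. */
int format_error(char *out, size_t size, const char *name, int err)
{
    return snprintf(out, size,
                    "%1$s failed with error %2$d. Retry %1$s.", name, err);
}
```

For example, format_error(buf, sizeof buf, "disk0", 5) should produce "disk0 failed with error 5. Retry disk0."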

Conclusion
That wraps up this week's Friday Q&A. There's a lot more to what format strings can do than what I discussed today. Read the man page and take a look at how you can control precision, padding, output formats, and more.

Friday Q&A will be going on hiatus for at least one week and probably two due to various things which are going to keep me busy in that time.

In the meantime, keep those suggestions coming in. The more topics I have to choose from, the better topics you'll be able to read, so send them in!


Comments:

Padraig Brady at 2009-07-17 23:03:20:

Good info thanks! I find the man pages on linux at least a bit hard to parse, so I created a printf crib sheet (http://www.pixelbeat.org/programming/gcc/format_specs.html). I also wrote some notes for printing variable sized ints since they're a common source of confusion: http://www.pixelbeat.org/programming/gcc/int_types/

Dave at 2009-07-17 23:17:38:

Very useful, thank you! The only comment I have is about using random access arguments. You have:

printf("a = %$2d b = %$1d", b, a);

When in reality it should be:

printf("a = %2$d b = %1$d", b, a);

The dollar sign should be after the digit.

Cheers,

Dave

mikeash at 2009-07-17 23:22:11:

How embarrassing! Thanks for letting me know. I've fixed up the article.

Peter Hosey at 2009-07-17 23:40:17:

For ... NSInteger you need to get a little cleverer. You can't use %d because [it] might be bigger than an int. You can't use %ld or %lld because [it] might be smaller than those, and type promotion doesn't carry over. [It] could even be bigger than those.

Are you talking about a hypothetical future version of Mac OS X? As of Mac OS X, it's always the same size as a long on all currently-supported architectures. I believe this goes for the iPhone as well.

NULL can legally just be a #define to 0...

Well, it can be. In C++ and Objective-C++, it may be. But in C and Objective-C, it isn't. See the definitions of __DARWIN_NULL in <sys/_types.h>, Nil and nil in <objc/objc.h>, and NULL everywhere that matters.

This is mainly valid as a portability concern: Some other operating system may be more free-wheeling in its headers' definition of NULL, and *then*, it's worth being careful with how you use NULL.

You can actually have any specifier use any argument. This is done by adding $n to the format specifier, where n is the argument number to print.

That's an extension; it's not part of the C99 standard. Moreover, it doesn't even work with printf; it's only available in Core Foundation. (And you forgot a \n in your format string.)

Peter Hosey at 2009-07-17 23:41:22:

Oh, now I see the problem with the dollar sign feature. Thanks, Dave. Still an extension, though.

Peter Hosey at 2009-07-17 23:42:04:

Yargh, proofreading fail.

As of Mac OS X, ...

I mean Mac OS X 10.5.7.

foobaz at 2009-07-17 23:52:53:

To print a size_t, you can also use %zu, like this:

NSLog(@"lod == %zu", layer.levelsOfDetail);

Jean-Daniel Dupas at 2009-07-18 00:08:14:

@Peter Hosey : NSInteger is defined as a 32 bit integer on 32 bit arch, and 64 bits integer on 64 bit arch. So no, it does not always have the same size on the current OS X version.

Yes, that means all of the code out there which looks like this is, strictly speaking, wrong:
[NSArray arrayWithObjects:a, b, c, nil];

It depends what you mean by "strictly". The C99 standard defines NULL as a pointer, so this code is correct as long as you use a C99-compliant compiler.

Jose Vazquez at 2009-07-18 00:17:09:

Thank you! Interesting, to find out that after so many years of using it there were still so many useful tricks to learn. I have managed to bend printf to my will, but now I have a deeper understanding of it. The Type Promotion stuff in particular just dispelled a lot of magic.

Peter Hosey at 2009-07-18 01:15:26:

Jean-Daniel Dupas:

... no, it does not always have the same size on the current OS X version.

That's not what I said. I said it always has the same size as long on current Mac OS X.

mikeash at 2009-07-18 01:19:15:

Peter Hosey: "Always the same size" is completely irrelevant. Nothing in the standard says that you can pass an integer of one type and retrieve it using a different integer type of the same size, even if this works on most implementations.

You are correct that using %ld will correctly print an NSInteger on all current Cocoa architectures. And two years ago, pointers were always 32-bit on all current Cocoa architectures. Four years ago, integers were always big-endian on all current Cocoa architectures. If you write your code to depend on today's assumptions, your code will break tomorrow.

Numbered argument specifiers are not part of the C standard but they are part of the POSIX standard, so unless you need your code to be portable to non-POSIX platforms you can depend on them to exist. See http://www.opengroup.org/onlinepubs/000095399/functions/printf.html

However my example which mixes numbered specifiers and non-numbered specifiers is not supported at all. It's an all-or-nothing thing.

Jean-Daniel Dupas: I don't believe you're correct that C99 defines NULL as a pointer. The C99 standard is available here: http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf

The relevant passages are this:

An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant.

And:

The macro NULL is defined in <stddef.h> (and other headers) as a null pointer constant; see 7.17.

These both appear on page 47. Thus NULL can be correctly defined as 0, (void *)0, (3 - 3), (void *)(42/43), etc. No statement is made about it being required to be a pointer type as far as I can see.

Peter Hosey at 2009-07-18 01:35:08:

So you are talking about hypothetical future OS versions, then. As you say, these are good to keep in mind.

mikeash at 2009-07-18 02:08:19:

Yes, on current OS versions you can do all sorts of things. You can use %lx to print pointers, you can use %ld to print size_t and NSInteger and many other types, and in general you can just look up what the sizes are and use a specifier which mostly matches. Doesn't mean it's a good idea.

Jean-Daniel Dupas at 2009-07-18 03:47:52:

Doo. I read again the spec and you're right. Nothing tells that sizeof(null pointer constant) == sizeof(void *)

But I managed to find an interesting sentence in POSIX though:

3.244 Null Pointer

The value that is obtained by converting the number 0 into a pointer; for example, (void *) 0. The C language guarantees that this value does not match that of any legitimate pointer, so it is used by many functions that return pointers to indicate an error.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

mikeash at 2009-07-18 03:54:06:

Again, size is not what matters, only type. There is no guarantee that the integer 0 will be received correctly if the other side is expecting a pointer, even if they are both the same size. Mixing and matching will work on most systems but could result in bad data or even a crash on some.

Unless POSIX also defines NULL to be a "null pointer" then I'm afraid that definition isn't relevant to the question. All this definition means is that NULL is not necessarily a null pointer.

Vincent Gable at 2009-07-18 07:08:33:

Thanks mikeash, that's excellent information!

Something that's been very helpful to me when printf-debugging is using macros to print variables without ever having to mess with format strings. It turns out that 90% of the time I can just say `LOG_ID(name)` and the "name = Vincent" information is all I needed.
Details here:
http://vgable.com/blog/2008/08/05/simpler-logging-2/

Also, Dave Dribin created an excellent DDToNSString() function that can automagically convert a C-type into an NSString:
http://www.dribin.org/dave/blog/archives/2008/09/22/convert_to_nsstring/

I've been using a modified DDToNSString() in a LOG_EXPR() macro that (mostly) Just Works no matter what type it's given. Once I've worked out a few more kinks, and understand the esoteric build settings it needs, I'll write something up on it.

Peter Bierman at 2009-07-18 08:36:44:

Good tips! A suggestion: mention the "solution" for the non-constant format strings vulnerability where you wrote about it. printf("%s", someString);

Jordy/Jediknil at 2009-07-18 15:41:06:

Huh. I really thought nil was (id)(NULL) and Nil was (Class)(NULL). That would have made everything a little more convenient.

Well, except that a Class is a valid id. And that we use NULL all over the place (NSError **, anyone?).

mikeash at 2009-07-18 19:04:20:

Conceptually nil and Nil are those types, but practically they don't have to be. In the absence of a real Objective-C language spec it's hard to say exactly what they can or can't be, but we can probably consider them to be equivalent to NULL.

However, it's not a problem for things like NSError **. The fact that NULL (and nil and Nil) can be an integer 0 is only a problem when using varargs. For explicitly typed parameters, the 0 will be converted to the null pointer.

Damien Sorresso at 2009-07-19 03:27:42:

Tiny nitpick.

Format strings are always used with a function (or method) that takes variable arguments. This is important for several reasons.

This sentence makes it seem like all functions which take variable arguments also take format strings. I know that's not what you meant, but a better wording would be "Functions or methods which take format strings always take a variable number of arguments."

That brings up another good Friday Q&A idea. Maybe you should cover implementing a function that uses variadic arguments, some of the pitfalls in doing so, etc. (Maybe even touch on variadic arguments in preprocessor macros.)

Also, Jean-Daniel, NULL and a null pointer are different. A null pointer is a pointer which has been assigned the value NULL. NULL itself is just 0.

Other fun Mac OS X-specific format string tidbits...
* NSString and CFString may be constructed with a format string, and you can specify "%@" to print the description of a Cocoa or CF object, respectively.
* The syslog(3) API allows you to specify "%m" to print the current errno. This does not require a corresponding argument in the argument list, so use with care.

mikeash at 2009-07-19 09:27:04:

I must disagree about the wording. A google search for "is always used with" reveals that the phrase is normally used in the way I did. In other words, "X is always used with Y" means that X implies Y, but not that Y implies X.

Thanks for the article idea, I'll put it on my list.

Damien Sorresso at 2009-07-20 11:39:31:

Upon a re-read, you're right. I just misinterpreted the sentence the first time.

Gwynne Raskind at 2009-07-20 16:01:44:

I beg to differ that format strings are always used with variadic arguments. strftime(3) is the most obvious example.

mikeash at 2009-07-20 18:53:14:

Although strftime's strings look superficially like printf's, they are actually completely different and as such do not count. You can argue about what is or is not a "format string", but in this article I was only discussing the ones used by printf and similar functions, and those must always be used with variable arguments (whether in the form of a ... argument or a va_list).

Adam Rosenfield at 2009-07-25 23:33:47:

The exact output of the %p specifier is implementation-defined; some implementations prepend a 0x to the output, some don't, some use uppercase, and some use lowercase. If you want a guaranteed output format, you should do something like:

printf("0x%08x\n", (uint32_t)ptr);

And use llx instead of x for 64-bit systems. Whenever you're printing out pointer values, 99.99% of the time you're debugging something, so you know the size of pointers on your platform. Hence, it's ok to be lazy and ditch the pointer-to-integer cast entirely.

I also strongly recommend always compiling with the -Wformat warning option (enabled with -Wall) with GCC -- it'll help you catch a lot of easy-to-miss errors often due to typos such as too many arguments, not enough arguments, mismatched format specifiers and arguments, etc.

GCC also has a nifty `format' function attribute which you can use to tag any functions you write that are wrappers around printf/scanf (such as a custom logging function), and it can then check the arguments you pass to that -- see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007bformat_007d-function-attribute-2291 for more info.

Peter N Lewis at 2009-07-27 17:21:22:

For example, here is a convenient way to turn a FourCharCode into an NSString:

    uint32_t valSwapped = CFSwapInt32HostToBig(fcc); // FCCs are stored backwards on Intel
    NSString *str = [NSString stringWithFormat:@"%.4s", &valSwapped];

While %.Ns is a clever trick, this won't actually work in general, because OSTypes are defined to be in the MacRoman character set, whereas stringWithFormat uses the system encoding, which may be different.

Instead you need to use something like:

NSData* data = [NSData dataWithBytes:&valSwapped length:sizeof(valSwapped)];
return [[[NSString alloc]initWithData:data encoding:NSMacOSRomanStringEncoding] autorelease];
