mikeash.com: Friday Q&A 2011-01-04: Practical Floating Point

Posted at 2011-01-14 16:33 | RSS feed (Full text feed) | Blog Index
Next article: Complete Friday Q&A Now Available
Previous article: Rogue Amoeba Is Hiring Again
Tags: float fridayqna

Friday Q&A 2011-01-04: Practical Floating Point

by Mike Ash

Welcome to the first Friday Q&A of the new year. For the first post of 0x7DB, I decided to write about practical floating point, a topic suggested by Phil Holland.

First, I want to discuss what I will cover. I do not intend a deep theoretical discussion of floating point calculations, nor even the sorts of things you'd need to know when doing heavy numeric or scientific calculations with them. The classic What Every Computer Scientist Should Know About Floating-Point Arithmetic covers that ground well.

What I intend to cover is how to approach floating-point arithmetic in a practical, pragmatic sense when writing your everyday Mac or iOS applications. Where to use floating point, where not to use it, what it's good for, and useful tricks.

Myths
Before getting into the meat, I want to discuss two myths which are fairly pervasive in the programming community:

Floating point calculations are slow
Floating point calculations are inaccurate

Floating point was slow in the dark ages before a floating-point unit was a standard part of any CPU you'd be programming on. These days, it's pretty fast. On heftier processors (like on the Mac), using floating point can actually make your code go faster, due to freeing up integer units for other tasks. Even on iOS devices, the floating point support is entirely reasonable and there's no need to shy away without extensive profiling.

Floating point accuracy is harder to characterize than simply saying it's inaccurate. Many floating point calculations produce exact results. Most others produce results which are as close to the exact answer as is possible to represent. Accuracy must be properly understood to use floating point properly, but it's not always bad and not always a problem.

Floating Point Representation
While I don't want to get into the exact binary representation of floating point numbers, it is useful to understand the basics of how they are represented. Those interested in the lowest-level details can read about the IEEE-754 spec.

Note that there is nothing in C which requires the floating point types to use IEEE-754 semantics. However, that is what is used on all Apple platforms, and what you're likely to find anywhere else, so everything I discuss here assumes IEEE-754.

You are probably familiar with scientific notation. To put a number in scientific notation, you normalize the number by multiplying or dividing by 10 until the number is in the range [1, 10), and then you multiply it by a power of 10 to get it back where you want it:

42 = 4.2 × 10¹
998.75 = 9.9875 × 10²
0.125 = 1.25 × 10^-1

By using binary and powers of 2, this concept can be changed to something that's more friendly to computers:

42 = 101010₂ = 1.01010₂ × 2⁵
998.75 = 1111100110.11₂ = 1.11110011011₂ × 2⁹
0.125 = 0.001₂ = 1.0 × 2^-3

The 2 never changes, which means that these numbers can be represented as simple pairs:

(1.01010, 5)
(1.11110011011, 9)
(1.0, -3)

The first component of these pairs is called the mantissa, and the second component is the exponent.

You'll notice that the leading digit on all three is 1. In fact, the leading digit will always be 1, except for representing zero, which is a special case. Since the leading digit is always 1, it's not necessary to store it. The pairs can then be reduced to:

(01010, 5)
(11110011011, 9)
(0, -3)

This is how floating point numbers are represented. The mantissa is given a section of storage. The exponent is given another section. One more bit is used to indicate the sign.

There are some special cases. Zero is one of those, as is infinity, and various others. But the basics are these simple pairs.

Observations
Knowing the representation of these numbers, there are some useful observations that can be made about their properties.

Any integer whose binary representation fits within the mantissa can be precisely represented with no error. For a double, this means that any integer up to 2⁵³, or about 18 quadrillion, can be represented exactly. In a float, integers up to 2²⁴, or a bit under 16.8 million can be represented exactly.

Numbers much larger than this can be represented as well, but with less precision. Only even numbers can be represented when immediately past the above limits. As the numbers grow further, only multiples of 4 can be represented, then multiples of 8, then 16, etc.

Fractions can be represented if and only if they can be expressed as a sum of powers of two. For example, ³/₄ = ¹/₂ + ¹/₄ = 1.1 × 2^-1 = (1, -1). However, a seemingly simple number such as ¹/₁₀ cannot be precisely represented in floating point. The best you can do is a close approximation: (10011001100110011..., -4).

To put it differently: every floating point number can be precisely written out as a finite decimal. However, many finite decimals cannot be exactly represented as a floating point number. This is why you should never use floating point to represent currency.

Literals
When writing floating point constants in code, it's important to be mindful of the semantic difference between integer constants and floating point constants. For example, the following trap is common:

    double halfpi = 1/2 * M_PI;

The value of halfpi is not the expected approximation, but rather zero. This is because both 1 and 2 are integers. The integer division 1/2 produces 0, and 0 * M_PI also produces zero.

To fix this, it is necessary to simply place a decimal point on the literals to make them into floats. In a case like this, only one of the numbers needs it, because the other number will be converted to floating point automatically, but it's more clear to just do it with both:

    double halfpi = 1.0/2.0 * M_PI;

It's best to get into the habit of using .0 at the end of any integer constant used in a floating point expression to avoid unhappy mistakes like this.

Accuracy
There are various accuracy requirements placed on arithmetic operations on floating point numbers. In particular, the four basic operations of addition, subtraction, multiplication, and division, are required to produce exactly the correct result if the correct result is representable. If the correct result is not representable, then they must produce the closest possible floating point number to the correct result.

Combine this with the fact that a large range of integers are exactly representable. This means that, as long as the operands and result are within that range, addition, subtraction, and multiplication of integers in floating-point numbers will be exact. Division with an integral result will also be exact. In general, you can place integers in floating point numbers and, as long as you know the range of the numbers to be within what's required, you can count on full accuracy and no unpredictability.

This is how Cocoa can use CGFloat for graphics coordinates. At first glance it might seem like a bad idea. Pixels are discrete units, floating point is continuous and inaccurate. However, any operation that works on whole pixels will produce exact results and no inaccuracy. Using floating point gives Cocoa additional flexibility to produce good approximate results when not working on whole pixels.

Comparison
It's commonly said that a C programmer should never use == to compare floating point numbers. There's even a gcc warning specifically to catch this: -Wfloat-equal. The reason given for this is that floating point inaccuracy means two numbers which should be exactly equal may in fact differ slightly due to rounding errors or other such computational inexactness.

While this is often true and a good rule of thumb to follow, as you can see it is not always the case. It is perfectly reasonable to use == on floating point values as long as you know that the values are completely accurate. For example, if you're working purely on well-behaved integers, == presents no problem:

    double x = 1.0 + 2.0 * 3.0;
    double y = (29.0 - 1.0) / 7.0 + 3.0;
    if(x == y)
        // guaranteed to be true

However, operations on numbers which cannot be exactly represented will not fare so well:

    double one1 = 0.1 * 10.0;
    double one2 = 1.0 / 3.0 * 3.0;
    double one3 = 4.0 * atan(1.0) / M_PI;
    if(one1 == 1.0 || one2 == 1.0 || one3 == 1.0)
        // no guarantee any of these will be true

For cases where the calculation is expected to be inexact, it's best to compare for "equality" by checking to see if the two numbers are close together. A simple way to do this is to see if the difference is less than some delta. Deciding on the value for that delta can be tricky, and is often guesswork. Here's a function to perform this comparison:

    BOOL FloatAlmostEqual(double x, double y, double delta)
    {
        return fabs(x - y) <= delta;
    }

This can be used on less precise values:

    double one = 0.1 * 10.0;
    if(FloatAlmostEqual(one, 1.0, 0.0000001))
        // this will be true

There are also more advanced ways to compare floating point values, although in practice they are generally not necessary.

Special Numbers
There are a few kinds of special floating point numbers that are useful to understand.

The first is zero. IEEE-754 actually has two zeroes: positive and negative. While they are largely the same, and even compare as equal using ==, they do behave slightly differently. For example, 1.0 * 0.0 produces positive zero, but 1.0 * -0.0 produces negative zero. The concept of negative zero makes sense when considering floating point values as approximations to some theoretical exact number. Positive zero represents not only the precise quantity of zero, but a small range of extremely small positive numbers that are very close to zero. Likewise, negative zero represents zero and a small range of negative numbers very close to zero.

For the most part, negative zero has few practical consequences and can be ignored. For cases where it needs to be detected, it can be checked using signbit:

    BOOL IsNegativeZero(double x)
    {
        return x == 0.0 && signbit(x);
    }

Next is infinity. Like zero, there are two kinds of infinity: positive and negative. These are produced when a calculation overflows the largest representable number. They are also produced when dividing a non-zero number by zero. Although mathematically such an operation is undefined, not infinite, it again makes sense when thinking of a floating point zero as an approximate, not exact, zero.

Infinity can be written in code by writing the INFINITY macro. It can be detected with isinf(x). For the most part, floating point infinities behave the way you would expect them to. Adding or subtracting a finite number produces infinity. Multiplying or dividing by a positive number produces infinity, and a negative number switches the sign. Dividing a finite number by infinity produces zero.

Finally, there is Not a Number, or NaN. This represents the mathematical concept of "undefined", or at least a result which can't be represented as a real number. NaN is produced by operations such as taking the square root of -1, calculating 0.0/0.0, or INFINITY - INFINITY.

NaNs have several unusual behaviors. Perhaps the most surprising is that NaN is not equal to anything, not even itself. The expression x == x will be false if x is a NaN. NaNs also propagate through calculations. Any floating point operation where one operand is a NaN will produce NaN as the result. This means that code can do a single check for NaN at the end of a long calculation, rather than having to check after each operation that could potentially produce one.

NaN can be written in code with the NAN macro, and can be detected using isnan. They can also be detected using x != x, but this is not recommended as some compilers get a little too clever while optimizing and will make that expression always be false.

Math Functions
There are a ton of useful math functions in the math.h header. Each function comes in two variants. The plain function, for example sin, takes a double and returns a double with the result. Functions which end in f, for example sinf, do the same except they operate on float. This makes them a bit faster when your values are all float. There are a few categories of functions worth mentioning:

Trigonometric functions: sin, cos, tan, and others are all provided. These are, of course, useful for all kinds of geometric calculations.
Exponential functions: exp calculates powers of the mathematical constant e, and log calculates natural logarithms. Other functions are available to calculate powers of two and logarithms in other bases.
Powers: the pow function will calculate arbitrary exponents. The sqrt function is specifically optimized to take square roots.
Integer conversion: various functions to get a nearby integer from a floating point number, such as ceil, floor, trunc, round, and rint.
Specialized floating point functions: many functions which provide better performance or accuracy, or additional capabilities by taking advantage of the nature of floating point, such as fma (performs a multiply and an add), log1p (calculates the function log(1 + x)), and hypot.

There are also a bunch of useful constants defined in this header, such as M_E (the mathematical constant e) and M_PI (π).

Further Reading
Mac OS X ships with some good documentation on floating point numbers. man float discusses their general representation and behavior. man math discusses the various functions in math.h. Most of those functions also have their own man page which goes into more detail.

Finally, the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic is a somewhat difficult but extremely useful read to anyone who really wants to understand just how all of this stuff works and what consequences it has.

Conclusion
Floating point arithmetic can be strange, but if you understand the basics of how it works, it's nothing to be afraid of and can be extremely useful. Floating point can be great for physics, graphics, and even just internal bookkeeping. It's important to always be mindful of accuracy and other limits, but within those limits there's much that it's good for.

That's it for this time. Come back again in two weeks for another exciting edition. If you get bored while waiting, why not send me a topic suggestion?

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

J Osborne at 2011-01-14 18:33:26:

"Note that there is nothing in C which requires the floating point types to use IEEE-754 semantics. However, that is what is used on all Apple platforms, and what you're likely to find anywhere else, so everything I discuss here assumes IEEE-754."

Largely true, but some exceptions:

"long double" isn't always a IEEE-754 float on Apple platforms. The G5 uses a head-tail format for long double. The x86 might use the 80-bit Intel stuff which is close but not the same (to Intel's credit IEEE-754 was still being created when Intel made the first FPU for the 8086, and the 80-bit format was strongly influenced by what IEEE was doing, but doesn't match what IEEE-754 eventually became).

The vector FP formats are not always IEEE-754, and while they are frequently "close" the special values (denormals, NaNs, INFs, and -0) are all suspect (as in if you rely on them go read the SSE-X spec, or whatever vector instruction set you are using to make sure they behave the way you expect).

"On heftier processors (like on the Mac), using floating point can actually make your code go faster, due to freeing up integer units for other tasks."

A lot of CPUs have faster FP multiply and/or divide then integer multiply/divide, esp for the "worst case". Some x86's have a worst time integer multiply in the 20 cycle range while the FP multiply is well under 10 cycles. I seem to recall that divide is worse, but don't remember specific numbers.

Ross Carter at 2011-01-14 18:50:09:

This is exactly what I have been wanting to read for years. "What Every Computer Scientist Should Know..." is over my head. Now I understand! Thanks!

Chris at 2011-01-14 18:54:06:

Minor typo: "Trigonemtric" should be "Trigonometric".

mikeash at 2011-01-14 19:18:12:

Thanks for the typo patrol, fixed now.

Steven Wart at 2011-01-14 21:57:35:

very nice. if only we could convince the investment banks to stop using floating point numbers for their currency calculations :)

Pierre Lebeaupin at 2011-01-15 14:25:17:

Nice. A few complements:
Back in the day, the Mac would use a 80-bit format for SANE (their software floating-point library), then also the 96-bit format of the 68k FPU; I don't remember whether they were mapped to the native Pascal floating-point types, though. And just for your amusement: the TI-89 (which I've been known to tinker with from time to time) uses a decimal-based floating-point format, probably to have results more accurate in appearance (e.g. finite decimals can always be represented accurately, as long as they don't go over the available mantissa, which is represented, yes, in Binary-Coded Decimal)

Floating-point is fast these days (at least for this audience), with one exception: on the ARMv6 iOS devices (or all iOS devices, if you have an ARMv6-only executable, as opposed to an ARMv6/ARMv7 fat one) if you've kept the default of compiling for thumb, as thumb on ARMv6 does not have access to the floating-point instructions, so every single addition, subtraction, multiplication, etc. has to be done with a system call. If you're using floating-point calculations in any non-trivial fashion on iOS, make sure you disable thumb at least for the ARMv6 half (Jeff Lamarche explains how to do so here: http://iphonedevelopment.blogspot.com/2010/07/thumb.html ). Also, intensive usage of doubles is not recommended on iOS (not that fast and the precision is usually pointless).

Every time you intend to do a floating-point equality comparison, think long and hard about what it is you actually want to do. Normally floating-point variables are purely data, and not used for control. Of course equality will work if the floating-point calculations only ever involve integers, but if you were sure that would always happen, you would be using integers and not floats, wouldn't you? If you want to check that your calculations do give the expected result, then you should figure out using numerical analysis the maximum error that can be introduced by your algorithm, then check that the result is within that error range. But in most cases this is not equality you want to check, but whether two values are close enough not to matter making a difference; how close enough (1%? 0.1%) depends on the purpose. For instance, I worked on a video application on iPad, and like the built-in Video app, we wanted to allow the user to switch between letterboxed and pan and scan. But if the aspect ratios of the screen and the video are close enough, the difference isn't going to matter, so might as well lock to letterboxed and hide the button; I decided this would be the case if and only if the aspect ratios differed by less than 1%.

Unit accuracy may give some surprises, especially when converting to float. For instance, I recently wanted to have a control that would represent a circle rotating around the center with a period of two seconds. So (among other things) I need to take the tv_sec of gettimeofday and divide it by 2, but how to do so? If I convert to float first, it's going to be too big for unit accuracy to be preserved (these days the number of seconds since the epoch is more that one billion, much more than 2^24) so it's not going to behave as expected. The solution is to first do the entirely integer operation of getting the remainder of division by 2, convert that to float, then divide that by 2.0; the integer part is gone, but it wasn't going to matter anyway after multiplying by 2.0*M_PI and getting the sinf and cosf.

J Osborne: the formats used for the vector architectures I know about (AltiVec, SSE, NEON) do match IEEE 754, it's the operations that usually take some liberties or do not implement the full IEEE 754 semantics (e.g. NEON, the vector architecture on ARM, not only flushes denormals to zero on input and output by default, but contrary to AltiVec cannot even be configured to preserve them).

Ross Carter: You should also be reading "What Every Computer Scientist Should Know...". You WILL write problematic floating-point code otherwise.

Mike at 2011-01-17 14:24:36:

Small correction: 998.75 = 9.9875 × 10^2

mikeash at 2011-01-17 15:03:21:

How embarrassing. Fixed, thanks!

Simon at 2011-01-18 16:46:38:

Some modern languages start (finally) to offer rational data types, which are absolutely precise. (perl does it with bigrat, lisp does it too, python is starting to do it).

Since all and every numbers (excepted square roots, pi and e) used in a program are rational numbers, it should really become a standard asap.

Do you know a good and simple library to do rational representation in C (or in Obj-C)?

mikeash at 2011-01-18 17:36:41:

Square roots and pi tend to feature quite a lot in real code, especially when doing graphics, so I wouldn't say "all". However, I certainly agree with the utility of rational numbers.

For "good", GMP provides rational number support. I don't think it qualifies as "simple", but you probably can't have both at the moment....

James Bell at 2011-01-19 03:19:27:

You said that INFINITY/INFINITY results in zero. (The last sentence of the first paragraph about the INFINITY macro.) It's actually NaN. Infinity divided by infinity is indeterminate... but I guess you hadn't introduced that concept yet. :)

James Bell at 2011-01-19 03:22:58:

Did you maybe mean to say dividing a finite number by infinity is zero?

mikeash at 2011-01-20 00:38:16:

Yes, I guess I didn't word that very well, but I meant that x/INFINITY = 0 for !isinf(x)

Mike G. at 2011-01-20 19:42:11:

Of course, the correct way to get halfpi is
double halfpi = M_PI_2;
:)

Sean McBride at 2011-01-26 22:36:11:

Great article Mike. Two additional thoughts:

Might have been interesting to mention casting. What happens casting float/double to an integer type? Particularly with nan & inf. :)

CGFloat. It's either float or double. That can make it annoying to deal with, especially with compiler 64 to 32 truncation warnings. To use sin() or sinf()? To use the 'f' suffix with constants or not?

(BTW I'm not actually asking...)

Christopher Pickslay at 2011-02-24 07:01:40:

Another great post, Mike. You mention "Using floating point gives Cocoa additional flexibility to produce good approximate results when not working on whole pixels." Does this mean that Cocoa stores the actual float values, but rounds when drawing?

I'm working on a project where I have to lay out a grid of buttons. The size and number of buttons changed several times, and I got sick of recalculating the buffer between them, so I started doing this:

// 5 buttons, 46 pixels tall, in a 256 pixel height box
static const CGFloat kVerticalGap = (256 - (5 * 46))/6.0;

At a glance, the buttons appear to be perfectly spaced, even though that buffer evaluates to 4 1/3. Now that I look closely at the grid, it does appear that the second and fifth gaps are actually 5 pixels.

mikeash at 2011-02-24 18:24:29:

Cocoa does indeed store the actual float values, but drawing gets complicated.

Primitives (like NSBezierPath) are fully antialiased (unless you turn that off). If you draw them on fractional pixels, you get fuzzy lines. This is a common problem for Cocoa developers, because drawing a line on integer coordinates actually puts the line halfway between two pixels, and you get a two-pixel-wide gray line instead of a single pixel black line like you want.

I'm not sure what happens with images. Could go either way.... Likewise with views.

In any case, a lot of Apple's code has smarts to draw on pixel boundaries even if you have fractional coordinates. Presumably that's what's happening with your buttons. If they were really being drawn at fractional coordinates, they would probably look terrible. Instead, I imagine their coordinates are getting rounded off for display.

I would still recommend rounding the coordinates yourself, rather than relying on Cocoa to do it for you, but it'll usually work at least.

Joshua at 2011-05-27 00:22:23:

Fantastic post, thank you. I'm and absolute beginner when it comes to programming (sure, I can do a lot of 'things, but I know far to little in depth). Other then the old 'careful for with floats', would you have some other topics you could point at about pitfalls I will encounter? Perhaps a bit to broad a question, so let me rephrase: What's the last one simple thing that you did wrong and made you retrace your code for an hour?

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Code syntax highlighting thanks to Pygments.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
	Formatting: `<i> <b> <blockquote> <code>`.
	NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.