Next article: Complete Friday Q&A Now Available

Previous article: Rogue Amoeba Is Hiring Again

Tags: float fridayqna

Welcome to the first Friday Q&A of the new year. For the first post of `0x7DB`

, I decided to write about practical floating point, a topic suggested by Phil Holland.

First, I want to discuss what I will cover. I do *not* intend a deep theoretical discussion of floating point calculations, nor even the sorts of things you'd need to know when doing heavy numeric or scientific calculations with them. The classic What Every Computer Scientist Should Know About Floating-Point Arithmetic covers that ground well.

What I intend to cover is how to approach floating-point arithmetic in a practical, pragmatic sense when writing your everyday Mac or iOS applications. Where to use floating point, where not to use it, what it's good for, and useful tricks.

**Myths**

Before getting into the meat, I want to discuss two myths which are fairly pervasive in the programming community:

- Floating point calculations are slow
- Floating point calculations are inaccurate

*faster*, due to freeing up integer units for other tasks. Even on iOS devices, the floating point support is entirely reasonable and there's no need to shy away without extensive profiling.

Floating point accuracy is harder to characterize than simply saying it's inaccurate. Many floating point calculations produce exact results. Most others produce results which are as close to the exact answer as is possible to represent. Accuracy must be properly understood to use floating point properly, but it's not always bad and not always a problem.

**Floating Point Representation**

While I don't want to get into the exact binary representation of floating point numbers, it is useful to understand the basics of how they are represented. Those interested in the lowest-level details can read about the IEEE-754 spec.

Note that there is nothing in C which *requires* the floating point types to use IEEE-754 semantics. However, that is what is used on all Apple platforms, and what you're likely to find anywhere else, so everything I discuss here assumes IEEE-754.

You are probably familiar with scientific notation. To put a number in scientific notation, you normalize the number by multiplying or dividing by 10 until the number is in the range `[1, 10)`

, and then you multiply it by a power of 10 to get it back where you want it:

- 42 = 4.2 × 10
^{1} - 998.75 = 9.9875 × 10
^{2} - 0.125 = 1.25 × 10
^{-1}

- 42 = 101010
_{2}= 1.01010_{2}× 2^{5} - 998.75 = 1111100110.11
_{2}= 1.11110011011_{2}× 2^{9} - 0.125 = 0.001
_{2}= 1.0 × 2^{-3}

- (1.01010, 5)
- (1.11110011011, 9)
- (1.0, -3)

*mantissa*, and the second component is the

*exponent*.

You'll notice that the leading digit on all three is 1. In fact, the leading digit will *always* be 1, except for representing zero, which is a special case. Since the leading digit is always 1, it's not necessary to store it. The pairs can then be reduced to:

- (01010, 5)
- (11110011011, 9)
- (0, -3)

There are some special cases. Zero is one of those, as is infinity, and various others. But the basics are these simple pairs.

**Observations**

Knowing the representation of these numbers, there are some useful observations that can be made about their properties.

Any integer whose binary representation fits within the mantissa can be precisely represented with no error. For a `double`

, this means that any integer up to 2^{53}, or about 18 quadrillion, can be represented exactly. In a `float`

, integers up to 2^{24}, or a bit under 16.8 million can be represented exactly.

Numbers much larger than this can be represented as well, but with less precision. Only even numbers can be represented when immediately past the above limits. As the numbers grow further, only multiples of 4 can be represented, then multiples of 8, then 16, etc.

Fractions can be represented *if and only if* they can be expressed as a sum of powers of two. For example, ^{3}/_{4} = ^{1}/_{2} + ^{1}/_{4} = 1.1 × 2^{-1} = (1, -1). However, a seemingly simple number such as ^{1}/_{10} cannot be precisely represented in floating point. The best you can do is a close approximation: (10011001100110011..., -4).

To put it differently: every floating point number can be precisely written out as a finite decimal. However, many finite decimals cannot be exactly represented as a floating point number. This is why you should never use floating point to represent currency.

**Literals**

When writing floating point constants in code, it's important to be mindful of the semantic difference between integer constants and floating point constants. For example, the following trap is common:

```
double halfpi = 1/2 * M_PI;
```

`halfpi`

is not the expected approximation, but rather zero. This is because both `1`

and `2`

are integers. The integer division `1/2`

produces `0`

, and `0 * M_PI`

also produces zero.
To fix this, it is necessary to simply place a decimal point on the literals to make them into floats. In a case like this, only one of the numbers needs it, because the other number will be converted to floating point automatically, but it's more clear to just do it with both:

```
double halfpi = 1.0/2.0 * M_PI;
```

`.0`

at the end of any integer constant used in a floating point expression to avoid unhappy mistakes like this.
**Accuracy**

There are various accuracy requirements placed on arithmetic operations on floating point numbers. In particular, the four basic operations of addition, subtraction, multiplication, and division, are required to produce exactly the correct result if the correct result is representable. If the correct result is not representable, then they must produce the closest possible floating point number to the correct result.

Combine this with the fact that a large range of integers are exactly representable. This means that, as long as the operands and result are within that range, addition, subtraction, and multiplication of integers in floating-point numbers *will be exact*. Division with an integral result will also be exact. In general, you can place integers in floating point numbers and, as long as you know the range of the numbers to be within what's required, you can count on full accuracy and no unpredictability.

This is how Cocoa can use `CGFloat`

for graphics coordinates. At first glance it might seem like a bad idea. Pixels are discrete units, floating point is continuous and inaccurate. However, any operation that works on whole pixels will produce exact results and no inaccuracy. Using floating point gives Cocoa additional flexibility to produce good approximate results when not working on whole pixels.

**Comparison**

It's commonly said that a C programmer should never use `==`

to compare floating point numbers. There's even a gcc warning specifically to catch this: `-Wfloat-equal`

. The reason given for this is that floating point inaccuracy means two numbers which *should* be exactly equal may in fact differ slightly due to rounding errors or other such computational inexactness.

While this is often true and a good rule of thumb to follow, as you can see it is not always the case. It *is* perfectly reasonable to use `==`

on floating point values *as long as you know that the values are completely accurate*. For example, if you're working purely on well-behaved integers, `==`

presents no problem:

```
double x = 1.0 + 2.0 * 3.0;
double y = (29.0 - 1.0) / 7.0 + 3.0;
if(x == y)
// guaranteed to be true
```

```
double one1 = 0.1 * 10.0;
double one2 = 1.0 / 3.0 * 3.0;
double one3 = 4.0 * atan(1.0) / M_PI;
if(one1 == 1.0 || one2 == 1.0 || one3 == 1.0)
// no guarantee any of these will be true
```

```
BOOL FloatAlmostEqual(double x, double y, double delta)
{
return fabs(x - y) <= delta;
}
```

```
double one = 0.1 * 10.0;
if(FloatAlmostEqual(one, 1.0, 0.0000001))
// this will be true
```

**Special Numbers**

There are a few kinds of special floating point numbers that are useful to understand.

The first is zero. IEEE-754 actually has two zeroes: positive and negative. While they are largely the same, and even compare as equal using `==`

, they do behave slightly differently. For example, `1.0 * 0.0`

produces positive zero, but `1.0 * -0.0`

produces negative zero. The concept of negative zero makes sense when considering floating point values as *approximations* to some theoretical exact number. Positive zero represents not only the precise quantity of zero, but a small range of extremely small positive numbers that are very close to zero. Likewise, negative zero represents zero and a small range of negative numbers very close to zero.

For the most part, negative zero has few practical consequences and can be ignored. For cases where it needs to be detected, it can be checked using `signbit`

:

```
BOOL IsNegativeZero(double x)
{
return x == 0.0 && signbit(x);
}
```

Infinity can be written in code by writing the `INFINITY`

macro. It can be detected with `isinf(x)`

. For the most part, floating point infinities behave the way you would expect them to. Adding or subtracting a finite number produces infinity. Multiplying or dividing by a positive number produces infinity, and a negative number switches the sign. Dividing a finite number by infinity produces zero.

Finally, there is Not a Number, or NaN. This represents the mathematical concept of "undefined", or at least a result which can't be represented as a real number. NaN is produced by operations such as taking the square root of -1, calculating `0.0/0.0`

, or `INFINITY - INFINITY`

.

NaNs have several unusual behaviors. Perhaps the most surprising is that NaN is not equal to anything, *not even itself*. The expression `x == x`

will be false if x is a NaN. NaNs also propagate through calculations. Any floating point operation where one operand is a NaN will produce NaN as the result. This means that code can do a single check for NaN at the end of a long calculation, rather than having to check after each operation that could potentially produce one.

NaN can be written in code with the `NAN`

macro, and can be detected using `isnan`

. They can also be detected using `x != x`

, but this is not recommended as some compilers get a little too clever while optimizing and will make that expression always be false.

**Math Functions**

There are a ton of useful math functions in the `math.h`

header. Each function comes in two variants. The plain function, for example `sin`

, takes a `double`

and returns a `double`

with the result. Functions which end in `f`

, for example `sinf`

, do the same except they operate on `float`

. This makes them a bit faster when your values are all `float`

. There are a few categories of functions worth mentioning:

**Trigonometric functions:**`sin`

,`cos`

,`tan`

, and others are all provided. These are, of course, useful for all kinds of geometric calculations.**Exponential functions:**`exp`

calculates powers of the mathematical constant`e`

, and`log`

calculates natural logarithms. Other functions are available to calculate powers of two and logarithms in other bases.**Powers:**the`pow`

function will calculate arbitrary exponents. The`sqrt`

function is specifically optimized to take square roots.**Integer conversion:**various functions to get a nearby integer from a floating point number, such as`ceil`

,`floor`

,`trunc`

,`round`

, and`rint`

.**Specialized floating point functions:**many functions which provide better performance or accuracy, or additional capabilities by taking advantage of the nature of floating point, such as`fma`

(performs a multiply and an add),`log1p`

(calculates the function`log(1 + x)`

), and`hypot`

.

`M_E`

(the mathematical constant `e`

) and `M_PI`

(π).
**Further Reading**

Mac OS X ships with some good documentation on floating point numbers. `man float`

discusses their general representation and behavior. `man math`

discusses the various functions in `math.h`

. Most of those functions also have their own man page which goes into more detail.

Finally, the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic is a somewhat difficult but extremely useful read to anyone who really wants to understand just how all of this stuff works and what consequences it has.

**Conclusion**

Floating point arithmetic can be strange, but if you understand the basics of how it works, it's nothing to be afraid of and can be extremely useful. Floating point can be great for physics, graphics, and even just internal bookkeeping. It's important to always be mindful of accuracy and other limits, but within those limits there's much that it's good for.

That's it for this time. Come back again in two weeks for another exciting edition. If you get bored while waiting, why not send me a topic suggestion?

Comments:

Back in the day, the Mac would use a 80-bit format for SANE (their software floating-point library), then also the 96-bit format of the 68k FPU; I don't remember whether they were mapped to the native Pascal floating-point types, though. And just for your amusement: the TI-89 (which I've been known to tinker with from time to time) uses a decimal-based floating-point format, probably to have results more accurate in appearance (e.g. finite decimals can always be represented accurately, as long as they don't go over the available mantissa, which is represented, yes, in Binary-Coded Decimal)

Floating-point is fast these days (at least for this audience), with one exception: on the ARMv6 iOS devices (or all iOS devices, if you have an ARMv6-only executable, as opposed to an ARMv6/ARMv7 fat one) if you've kept the default of compiling for thumb, as thumb on ARMv6 does not have access to the floating-point instructions, so every single addition, subtraction, multiplication, etc. has to be done with a system call. If you're using floating-point calculations in any non-trivial fashion on iOS, make sure you disable thumb at least for the ARMv6 half (Jeff Lamarche explains how to do so here: http://iphonedevelopment.blogspot.com/2010/07/thumb.html ). Also, intensive usage of doubles is not recommended on iOS (not that fast and the precision is usually pointless).

Every time you intend to do a floating-point equality comparison, think long and hard about what it is you actually want to do. Normally floating-point variables are purely data, and not used for control. Of course equality will work if the floating-point calculations only ever involve integers, but if you were sure that would always happen, you would be using integers and not floats, wouldn't you? If you want to check that your calculations do give the expected result, then you should figure out using numerical analysis the maximum error that can be introduced by your algorithm, then check that the result is within that error range. But in most cases this is not equality you want to check, but whether two values are close enough not to matter making a difference; how close enough (1%? 0.1%) depends on the purpose. For instance, I worked on a video application on iPad, and like the built-in Video app, we wanted to allow the user to switch between letterboxed and pan and scan. But if the aspect ratios of the screen and the video are close enough, the difference isn't going to matter, so might as well lock to letterboxed and hide the button; I decided this would be the case if and only if the aspect ratios differed by less than 1%.

Unit accuracy may give some surprises, especially when converting to float. For instance, I recently wanted to have a control that would represent a circle rotating around the center with a period of two seconds. So (among other things) I need to take the tv_sec of gettimeofday and divide it by 2, but how to do so? If I convert to float first, it's going to be too big for unit accuracy to be preserved (these days the number of seconds since the epoch is more that one billion, much more than 2^24) so it's not going to behave as expected. The solution is to first do the entirely integer operation of getting the remainder of division by 2, convert that to float, then divide that by 2.0; the integer part is gone, but it wasn't going to matter anyway after multiplying by 2.0*M_PI and getting the sinf and cosf.

J Osborne: the formats used for the vector architectures I know about (AltiVec, SSE, NEON) do match IEEE 754, it's the operations that usually take some liberties or do not implement the full IEEE 754 semantics (e.g. NEON, the vector architecture on ARM, not only flushes denormals to zero on input and output by default, but contrary to AltiVec cannot even be configured to preserve them).

Ross Carter: You should also be reading "What Every Computer Scientist Should Know...". You WILL write problematic floating-point code otherwise.

Since all and every numbers (excepted square roots, pi and e) used in a program are rational numbers, it should really become a standard asap.

Do you know a good and simple library to do rational representation in C (or in Obj-C)?

For "good", GMP provides rational number support. I don't think it qualifies as "simple", but you probably can't have both at the moment....

`x/INFINITY = 0`

for `!isinf(x)`

double halfpi = M_PI_2;

:)

Might have been interesting to mention casting. What happens casting float/double to an integer type? Particularly with nan & inf. :)

CGFloat. It's either float or double. That can make it annoying to deal with, especially with compiler 64 to 32 truncation warnings. To use sin() or sinf()? To use the 'f' suffix with constants or not?

(BTW I'm not actually asking...)

I'm working on a project where I have to lay out a grid of buttons. The size and number of buttons changed several times, and I got sick of recalculating the buffer between them, so I started doing this:

// 5 buttons, 46 pixels tall, in a 256 pixel height box

static const CGFloat kVerticalGap = (256 - (5 * 46))/6.0;

At a glance, the buttons appear to be perfectly spaced, even though that buffer evaluates to 4 1/3. Now that I look closely at the grid, it does appear that the second and fifth gaps are actually 5 pixels.

Primitives (like NSBezierPath) are fully antialiased (unless you turn that off). If you draw them on fractional pixels, you get fuzzy lines. This is a common problem for Cocoa developers, because drawing a line on integer coordinates actually puts the line halfway between two pixels, and you get a two-pixel-wide gray line instead of a single pixel black line like you want.

I'm not sure what happens with images. Could go either way.... Likewise with views.

In any case, a lot of Apple's code has smarts to draw on pixel boundaries even if you have fractional coordinates. Presumably that's what's happening with your buttons. If they were really being drawn at fractional coordinates, they would probably look terrible. Instead, I imagine their coordinates are getting rounded off for display.

I would still recommend rounding the coordinates yourself, rather than relying on Cocoa to do it for you, but it'll usually work at least.

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

J Osborneat 2011-01-14 18:33:26:Largely true, but some exceptions:

"long double" isn't always a IEEE-754 float on Apple platforms. The G5 uses a head-tail format for long double. The x86 might use the 80-bit Intel stuff which is close but not the same (to Intel's credit IEEE-754 was still being created when Intel made the first FPU for the 8086, and the 80-bit format was strongly influenced by what IEEE was doing, but doesn't match what IEEE-754 eventually became).

The vector FP formats are not always IEEE-754, and while they are frequently "close" the special values (denormals, NaNs, INFs, and -0) are all suspect (as in if you rely on them go read the SSE-X spec, or whatever vector instruction set you are using to make sure they behave the way you expect).

A lot of CPUs have faster FP multiply and/or divide then integer multiply/divide, esp for the "worst case". Some x86's have a worst time integer multiply in the 20 cycle range while the FP multiply is well under 10 cycles. I seem to recall that divide is worse, but don't remember specific numbers.