mikeash.com: just this guy, you know?

Posted at 2011-03-18 16:19 | RSS feed (Full text feed) | Blog Index
Next article: Friday Q&A 2011-04-01: Signal Handling
Previous article: See Me Speak at Voices That Matter in Seattle
Tags: cryptography friday qna random
Friday Q&A 2011-03-18: Random Numbers
by Mike Ash  

A mere fourteen days have passed since the last time, but here is yet another Friday Q&A. Luis de la Rosa contributed this week's topic: proper generation and use of random numbers.

Introduction
When talking about random numbers in computer programs, what we usually mean is pseudorandom numbers. Computers have a hard time being truly random. Give the same program the same input and you'll get the same output. A pseudorandom number is one which is difficult to predict if you don't know the initial input, but which is still intrinsically deterministic.

Not all computer random numbers are pseudorandom. There are sources of truly random numbers available to computers. Sometimes it's specialized hardware, like reading thermal noise in an electrical circuit. Sometimes it's unpredictable external events, like hard disk head movement or keystroke timing. In this article, I will cover both types.

When generating random numbers, there are a few different aspects to consider. An obvious one is speed. Some methods for generating random numbers are much faster than others. Somewhat less obvious is simplicity. A solution which requires a huge amount of code isn't going to see much use.

Often highly important, but somewhat harder to evaluate, is the quality of random numbers. Some generators have highly predictable output. Some have highly unpredictable output, and some are somewhere in between.

Quality
There is a classic joke where a random number generator is implemented like so:

    int RandomNumber(void)
    {
        return 3;
    }
While silly, it's also interesting to consider. Theoretically, is this function truly wrong? After all, if you were to roll a die, it's possible, although increasingly unlikely, to roll a 3 every time. In practice, of course, such a thing is entirely useless.

However, it's a good starting point to consider the notion of random number quality. On one end, you have constants, like the above 3. On the other end, you generate a sequence of numbers which even an extremely clever adversary with enormous resources at his disposal could not predict.

There's a huge range in between. Sometimes your adversary isn't some sinister organization, but just a guy playing cards. Sometimes you need random numbers only good enough to fool a casual observer. Sometimes you need random numbers which are reasonably unpredictable but which wouldn't be impossible to reverse engineer.

These needs can be broken down into a few categories:

  1. A need for the basic appearance of unpredictability but doesn't need to stand up to scrutiny. An example might be a simulation of a flickering light, or a randomly bouncing ball.
  2. Difficult for a human observer to predict even if he's paying attention and making the attempt. An example for this might be a casual card game.
  3. Cryptographic quality randomness, where the sequence cannot be predicted even by a motivated adversary with enormous resources. This is what you want when practicing cryptography.
APIs
There are a wealth of random number APIs available on OS X. I will present a few commonly useful ones here.

rand()
Perhaps the oldest of these APIs is the standard library call rand(). At this point it's mostly a historical curiosity and there's little reason to use it, but I'll discuss it for tradition's sake if nothing else.

The quality of the random numbers produced by rand() is poor. The sequences are highly predictable and not very useful. This function can be useful for situations which need only the lowest quality of random numbers, but even then you might as well use something a bit better.

The numbers returned by rand() are based on a seed value which is stored as a global. To ensure that you don't get an identical sequence every time you run your program, it's necessary to initialize that seed value with something less predictable. This is a common theme with any pseudorandom number generator. Some initial source of true randomness must be provided in order to have any unpredictability.

The traditional way to seed rand() is to initialize the seed using the current time:

    srand(time(0));
This ensures that different invocations of the program have different seeds... provided that they are started at least one second apart!

Another way to initialize the seed is with sranddev(). This pulls from the OS's pool of true randomness to initialize the seed, which I'll discuss in a bit.

Relying on global state is generally considered to be bad. There's a companion function, rand_r(), which allows you to provide your own state storage.

However, as I mentioned above, there's little reason to use any of these.

random()
This is an improved version of rand(). It's considerably less predictable, and would generally be suitable for casual card games and other situations where you want your randomness to stand up to some scrutiny by the user. As with rand(), it needs to be initialized to produce an unpredictable sequence, and the best way to do that here is using srandomdev().

Unfortunately, there is no variant of this function which doesn't use global state, which makes it somewhat dangerous to use. Another downside is that it's slightly slower than rand(), although this is unlikely to matter in most applications.

jrand48()
This is a decent random number generator which also uses user-supplied state. Its state is an array of 3 unsigned shorts. To initialize its state, dump random data into that array. Like random(), it's good enough to stand up under casual scrutiny, but no more.

/dev/random
We finally arrive at the big gun of randomness. This is not a function, but rather a device file provided by the OS. You can open and read from it like any other file. The data read from it is good, cryptographic-quality randomness. Because it's implemented as a file and any read from it will have to go into the kernel, this is considerably slower to use than random number generation functions. Whether that matters or not will, of course, depend on the individual situation.

It is implemented by taking advantage of external (and possibly internal) sources of random data, such as keyboard input timings. This random data is then mixed together using the Yarrow algorithm which then produces the actual output data. Due to this, it is extremely difficult for an adversary to predict the output.

Since it's just a file, there are numerous ways to obtain data from it. A convenient way to obtain a fixed-size chunk of data is to use NSFileHandle:

    NSData *randomChunk = [[NSFileHandle fileHandleForReadingAtPath: @"/dev/random"] readDataOfLength: 20];
This is handy for quickly generating unique identifiers or encryption keys. If you need a lot of random data then it would be a good idea to open the file once and then read data on demand, but for simple one-offs, the above can be nice.

Note that opening that file is not guaranteed to succeed, so if you use that approach, be prepared for the data to end up nil. Depending on how important the quality of randomness is, you might fall back to a weaker random number generator, or you might simply abort the operation.

For those coming from Linux, a word of caution. On Linux, there are two random devices: /dev/random and /dev/urandom. /dev/random is a blocking interface that only returns as much random data as the system possesses at any particular time. Once the system runs out of randomness, no more can be fetched until more is generated (by listening to the user bang on the keyboard, or however the OS gathers true randomness). /dev/urandom is a non-blocking interface that always returns the requested amount of data, by using a pseudorandom generator even when true randomness is exhausted.

OS X has both of these as well, however they both act like /dev/urandom does on Linux. If you need a source of nothing but true randomness, I'm afraid you're out of luck. The good news is that the /dev/random pseudorandom generator should be good enough for pretty much any purpose.

Range Limiting
These functions return numbers in different ranges. random() returns numbers in the range of [0, 231-1]. /dev/random returns an arbitrary number of bytes, which means it can be used to fill the full range of any data type.

Frequently, you want random numbers in a smaller range. For example, you may want to choose a random element in an array, and for this you would need a random number in the range [0, [array count] - 1]. To do this, you need a way to cut down the range that comes out of the random number generator.

The simplest and perhaps oldest way to accomplish this is to use the mod operator, like so:

    int randomIndex = random() % [array count];
This works and is often good enough. However, it's flawed, because in most cases it doesn't provide a completely uniform distribution of random indices, even if random() itself is completely uniformly distributed. The problem comes when the size of the range of random numbers is not evenly divisible by [array count]. In this case, there are 231 possible random numbers, which means that if [array count] is not a power of two, the result will not be uniformly distributed.

To illustrate, let's take a somewhat absurd case and imagine that [array count] is 1431655765. (This value is 2/3 of 231, which is why I chose it specifically.) The random number generator then works out to:

    random() % 1431655765
There are now two cases to consider.

If random() returns a number in the range [0, 1431655764], that same number is used as the random index. If random() returns a number in the range [1431655765, 231 - 1], then the random index is equal to the number returned minus 1431655765. The result will lie in the range [0, 715827883], which is roughly the first half of the array.

This is a major problem! Our "random" index is twice as likely to fall within the first half of the array as the second half. As the desired range shrinks, this problem likewise shrinks, but it still exists for any desired range which can't evenly divide the original range.

The solution to this is to use as much as possible of the original range, and discard the number and try a new one if it falls outside what's usable.

The largest possible range is [0, M - 1], where M is the largest number less than 231 which is divisible by the target maximum. This number can be found with a bit of simple integer arithmetic. Once M is known, an uniform distribution of random numbers can be generated:

    int count = [array count];
    
    unsigned two31 = 1U << 31;
    unsigned maxUsable = (two31 / count) * count;
    
    int randomIndex = -1;
    while(randomIndex == -1)
    {
        unsigned num = random();
        if(num < maxUsable)
            randomIndex = num % count;
    }
Of course, better would be to write a function that handles the grunt work for us:
    int RandomUnder(int topPlusOne)
    {
        unsigned two31 = 1U << 31;
        unsigned maxUsable = (two31 / topPlusOne) * topPlusOne;
        
        while(1)
        {
            unsigned num = random();
            if(num < maxUsable)
                return num % topPlusOne;
        }
    }
This can then be used to easily generate our random index, or anything else:
    int randomIndex = RandomUnder([array count]);
The use of a loop which only terminates when a random number meets a certain condition might make you nervous. However, don't fret. In the worst case, slightly over 50% of the random numbers will terminate the loop. The chances of the loop going over 10 iterations is therefore less than 0.1%, and the chances of the loop going over 20 iterations is less than one in a million. In more common cases, the number of cases which cause the loop to repeat will be a small fraction of the overall range, and so even single repeats will be rare.

This returns a number in the range [0, X - 1] for some arbitrary X. This is usually good, but sometimes you want a range which starts at a different value. This is easy to accomplish. First, generate a zero-based random number with the appropriate maximum, then just shift it up:

    int RandomInRange(int bottom, int top)
    {
        int rangeSize = top - bottom;
        int zeroBased = RandomUnder(rangeSize);
        return zeroBased + bottom;
    }
Random Floats
Sometimes you need a random floating-point number rather than a random integer. To create a floating-point number with an uniform distribution, it's a simple bit of arithmetic from an uniform distribution of integers. For example, to generate a random number in the range [0.0, 1.0], just divide by the maximum integer value, being sure to carry out the calculation in floating point:
    double random01 = random() / (double)0x7fffffff;
Many other useful distributions of random floating-point numbers exist, but they are beyond the scope of this article. Uniform distributions are, in my experience, the most commonly useful to most programmers.

Conclusion
That wraps things up for today. I hope that you now know a little more about generating random numbers than you did before. For many common cases, the random() function is good enough. The /dev/random file is great for cases where you need something really robust.

Come back in two weeks for another frabulous Friday Q&A. As always, Friday Q&A is driven by reader suggestions, so in the meantime please keep sending me your suggestions for topics to cover.

Did you enjoy this article? I'm selling a whole book full of them. It's available for iBooks and Kindle, plus a direct download in PDF and ePub format. It's also available in paper for the old-fashioned. Click here for more information.

Comments:

leeg at 2011-03-18 17:31:00:
On iOS, because the sandbox stops you from interacting with /dev/random, you have to use SecRandomCopyBytes(). It's just an API to access the random device: http://developer.apple.com/library/ios/#DOCUMENTATION/Security/Reference/RandomizationReference/Reference/reference.html

mikeash at 2011-03-18 18:05:01:
While I find it extremely dumb that /dev/random is sandboxed out, SecRandomCopyBytes() actually looks easier to use.

wdn at 2011-03-18 18:15:06:
Nice writeup as always Mike. I've learned to always use arc4random() over rand() or random(). Is it a better quality generator or have significant benefits/disadvantages other than its wider range?

mikeash at 2011-03-18 18:28:37:
arc4random is supposed to be a cryptographic-quality generator (provided that it's seeded well). I left it out simply because it uses global state, and thus seems like it would be unsafe to use in a multithreaded program. (Being entirely data-driven, it wouldn't crash, but I can't discount the possibility that two threads hitting it at the same time could create predictability that an attacker could exploit.)

For cryptographic use, you can just use /dev/random or SecRandomCopyBytes instead, unless you really need speed. For non-cryptographic use, random() is good enough.

David Magda at 2011-03-18 23:00:10:
There's also the arc4random(3) family of functions. According to the man page, it returns numbers in the range of 0 to 2^32-1, so twice that of rand(3) and random(3). You also don't have to seed arc4random(3).

A quick trick to remove modulo bias: http://stackoverflow.com/questions/648739/

More recent releases of the BSDs have a arc4random_uniform(3), but it doesn't seem to have been back ported to Mac OS 10.6 yet. I've opened bug radr://9157098 for it.

ssp at 2011-03-18 23:40:35:
Just in case you didn't have the Playstation thing in mind, you can find a high-profile version of your 'classic joke' here:

http://media.ccc.de/browse/congress/2010/27c3-4087-en-console_hacking_2010.html

Gwynne at 2011-03-19 00:14:08:
While it's not built into OS X, I'd also mention the Mersenne Twister PRNG due to relatively widespread usage (notably in the PHP language): http://en.wikipedia.org/wiki/Mersenne_twister . It's not cryptographic quality, but it's a decent and relatively fast open-source generator, which can be trivially modified not to use global state.

Jay at 2011-03-19 02:02:50:
Browsing the code for srandomdev() I noticed the following code runs if a read from /dev/random fails:

struct timeval tv;
unsigned long junk;

gettimeofday(&tv, NULL);
srandom((getpid() << 16) ^ tv.tv_sec ^ tv.tv_usec ^ junk);

[http://www.opensource.apple.com/source/Libc/Libc-594.9.4/stdlib/FreeBSD/random.c]

A legitimate use for an uninitialized variable!

mikeash at 2011-03-19 12:19:49:
On the subject of randomness and uninitialized data, everyone who isn't already familiar should read about the infamous Debian OpenSSL vulnerability:

http://digitaloffense.net/tools/debian-openssl/

kyllikki at 2011-03-19 18:02:26:
An inexpensive source of true random numbers is http://www.entropykey.co.uk/ they do not list Mac os x directly but they provide source code.

Mine works fine on Linux so a port should not be too much trouble.

Ash at 2011-03-19 20:35:15:
Dont forget about drand48 http://pubs.opengroup.org/onlinepubs/007908799/xsh/drand48.html and friends, which are a pretty common unix way of generating random numbers. They are guaranteed to be uniform. But they are pseudo-random, so if you need more true randomness you might need to find a more random solution, but they

Matt Gallagher at 2011-03-20 01:17:37:
If you're using thousands of random numbers -- particularly floating point random numbers -- in 3 or more dimensions (as might happen when generating 3D graphics or running Monte Carlo simulations), random() can start looking less random than desired and /dev/random is too slow. In these situations, a Mersenne Twister pseudo-random generator is a good option. BSD-licensed C++ implementations here: http://www.bedaux.net/mtrand/

Jeffrey Dutky at 2011-03-20 01:25:59:
rather than loop when the random number is outside the desired range, why not translate and scale the number, then add a second small random number to the result? This would only ever call the random number generator twice (in the case where the first number was greater than MaxUsable) thus conserving the system stock of randomness, and providing much more predictable performance.

mikeash at 2011-03-20 02:05:40:
I don't think that such translation-and-scale operations can be achieved without bias when the original range and the desired range are relatively prime, although I could be wrong.

CharlieH at 2011-03-20 16:17:11:
Thanks for this, Mike, but what about speed for these methods?

mikeash at 2011-03-20 17:36:41:
They are listed in order of speed, with /dev/random being far and away the slowest (at least for smaller quantities of data) due to the need to make a system call and the complexity of the code in the kernel. As usual, though, profile and only optimize when needed.

Nathan de Vries at 2011-03-21 12:05:11:
Mike's dismissal of non-threadsafe functions immediately made me question my preference for arc4random — we both assumed that it wasn't thread safe since it uses global state, and the man page doesn't really shed any light on the matter.

Turns out that back in 2004 a patch was made to the FreeBSD libc project to make arc4random re-entrant. This patch has also made it into the version of libc that Apple ships, meaning you can use arc4random in multi-threaded environments.

See here for details:

http://www.freebsd.org/cgi/query-pr.cgi?pr=63287
http://www.opensource.apple.com/source/Libc/Libc-594.9.4/gen/FreeBSD/arc4random.c

mikeash at 2011-03-21 15:11:41:
What he said. It's too bad this isn't explicitly documented, but given the source code, arc4random looks like a great option for high-quality randomness on the Mac.

Since arc4random says that it initializes itself from /dev/random, and iOS doesn't allow direct access to /dev/random, I wonder if it uses SecRandomCopyBytes or if it just fails. I'm hoping the former.

bret at 2011-03-22 16:44:54:
The posted method for generating floating-point results in zeros in the low bits of the mantissa more often than necessary. Especially since it shows using double-precision and not single, which will *always* end up with some zeros, but even single-precision floats can end up with zeroed bits in the mantissa for very small values of the random integer.

If the application requires maximum precision, another way to generate uniform floating-point numbers is to fill the mantissa with random bits (i.e. 1.nnnnnnnn...), and then choose the exponent so that -1 occurs 50% of the time, -2 occurs 25% of the time, -3 occurs 1/8th of the time, etc.

Jean-Daniel at 2011-03-25 16:53:21:
Being entirely data-driven, it wouldn't crash, but I can't discount the possibility that two threads hitting it at the same time could create predictability that an attacker could exploit.


Woo, that would be the first time I see a multithreading bug improving predictability ;-)

For those who want a way to generate random numbers using a lot of known algorithms, you can have a look at libc++ ( http://libcxx.llvm.org/ ), which will probably be part of OS X in the future.

mikeash at 2011-03-29 01:07:15:
I agree that increasing predictability looks unlikely, but I can see it happening. If the generator uses a typical read-update-write cycle, then two threads hitting things just right could result in identical data being returned by both threads, which would be extremely bad. As noted above, this is entirely academic, as the function is thread safe.

tesw at 2011-04-01 16:54:03:
In your RandomUnder function, maxUsable should be unsigned instead of int, because it could be equal to two31, which would be negative as an int, resulting in infinite loops.

And nope I didn't catch that by reading the code, I got hit by the bug using the function verbatim.

mikeash at 2011-04-01 17:30:17:
Good catch, thanks. It's fixed in the article now. How embarrassing.

Sean M at 2011-04-30 04:18:12:
FWIW, the clang static analyzer complains when it sees rand() and random(), see:

http://llvm.org/svn/llvm-project/cfe/trunk/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp

Michael at 2011-08-26 00:48:08:
I have the need to generate a random number from 0 - ((2^32)-1). Of course random() only generates 0 - ((2^31)-1). Currently I am generating the number by calling random() twice and using the least significant 16 bits of each combined to create the 32 bit value. I would use arc4random but given the code is used in a non-cryptographic research I need reproducibility. Is there a better way to handle this? Is this method creating a similar non-uniform distribution as shown above?


u_int32_t randomNumber = (random() << 16) & 0xFFFF0000;
randomNumber |= random() & 0x0000FFFF;

mikeash at 2011-08-27 02:20:31:
The jrand48 function will do what you need.


Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Name:
Web site:
Comment:
Formatting: <i> <b> <blockquote> <code>. URLs are automatically hyperlinked.
Code syntax highlighting thanks to Pygments.
Hosted at DigitalOcean.