My own favourite random()-in-the-wild bug is one that I've come across many, many times:
/* Generate a random number 0...5 inclusive */
int r = random() % 6;
The problem is that this results in a bias towards certain values, because the number of values random() can return is probably not a whole multiple of the number you're modding by. It's easier to see if you imagine RAND_MAX were 10. Then the results 0, 1, 2, 3 and 4 each have two opportunities to be selected for r (e.g. both 1 and 7 map to 1), but 5 has only one. So 5 is half as likely as any other number. Terrible!
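You can see the bias directly by tallying the residues over the whole toy range; a minimal sketch (the "RAND_MAX == 10" generator is just the example above, not a real implementation):

    #include <cstdio>

    int main() {
        // Toy generator: every value 0..10 (RAND_MAX == 10) is equally likely.
        int counts[6] = {0};
        for (int v = 0; v <= 10; v++)
            counts[v % 6]++;              // what "random() % 6" yields for each v
        for (int r = 0; r < 6; r++)
            std::printf("r = %d occurs %d times out of 11\n", r, counts[r]);
    }

Running it prints two hits each for 0..4 and a single hit for 5.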
Using a float/double random() doesn't always help either, because floating-point multiplication has non-uniform errors. Instead, what you really need is:
int random_n(int top) {
    if (top <= 0) {
        return -1;
    }
    /* Largest multiple of top not exceeding RAND_MAX; rejecting any
       draw at or above it removes the modulo bias. */
    long limit = RAND_MAX - (RAND_MAX % top);
    while (1) {
        long r = random();
        if (r < limit) {
            return (int)(r % top);
        }
    }
}
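For example, rolling the unbiased 0..5 die from the snippet at the top:

    int die = random_n(6);   /* uniform over 0..5, or -1 on bad input */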
Although it's better to replace random() itself with something that uses real entropy.
On my machine RAND_MAX is 2^31-1. So getting the probability wrong in the way you describe means a relative error of one in two billion. That's really quite small for non-critical applications.
It's not so simple. The larger the modulus, the larger the error, up to a limit. If the modulus is 2^30 + 2^29, for example, then values of r in the range 0 ... 2^29 - 1 will be produced twice as often as values in the range 2^29 ... 2^30 + 2^29 - 1. That's a much more significant error than one in two billion (it's close to 1 in 3, roughly nine orders of magnitude more significant).
Subtleties like this are hard to catch in code review, and even in testing ... the real lesson may be that the very interface of random(void) is just broken. It really should be random(int), taking an upper bound and handling all of this for you, as many other languages/libraries do.
No, it does matter. If RAND_MAX is really large, the relative error in "incorrect" probabilities is really small. Unless you have a critical need for it to be precisely correct, the error is basically negligible.
This sort of code is one of the motivations behind improving the random number generation facilities available in C++. If you're using C++ or you're using C and have the option to link with C++, I strongly recommend looking at the random number library introduced in C++11 [1]. In addition to letting you specify a statistical distribution (including a proper uniform distribution for both integral and floating-point types), it lets you choose between various PRNG engines with various trade-offs. It also provides a way to source hardware entropy with which to seed an engine, and it's all pretty easy to use.
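A minimal sketch of what that looks like (the engine choice and bounds here are just for illustration):

    #include <random>
    #include <iostream>

    int main() {
        std::random_device rd;                        // entropy source for seeding
        std::mt19937 engine(rd());                    // Mersenne Twister engine
        std::uniform_int_distribution<int> die(0, 5); // properly uniform 0..5
        for (int i = 0; i < 10; i++)
            std::cout << die(engine) << '\n';
    }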
It's nice to make it clear what distribution you're going to get from a PRNG, or what the performance characteristics are going to be. But for cryptographic randomness (the subtext of 'tedunangst's post), what you don't want are a bunch of knobs and dials. What you generally need is a guarantee that your RNG interface is feeding you data from the system secure random number generator --- either urandom or CryptGenRandom.
I'm also not sure what "hardware" has to do with that guarantee.
One example I can think of where a standard library got this right is Golang: crypto/rand just reads from the system's secure random number generator.
FWIW, the C++ random number library is very much not meant for cryptography. This is obvious when you see the weak guarantees given by std::random_device: on Windows, libstdc++ will actually use the Mersenne Twister with a hardcoded seed, and this is standard-compliant! Additionally, none of the engines included in the library is cryptographic at all.
The Boost random library actually does the right thing, and states that boost::random_device must not be implemented where a proper source of random bits is not available. The standard tries to be all things to all people, and std::random_device ends up being useless for any serious application.
As a mathematical random number generation library, the C++ library is much superior to its Go counterpart, math/rand.
> what you don't want are a bunch of knobs and dials. What you generally need is a guarantee that your RNG interface is feeding you data from the system secure random number generator
The two are not necessarily incompatible. The distributions in the C++ standard library are not married to the standard engines. You can quite straightforwardly write an engine that uses urandom internally and which is compatible with the standard distributions; see the last couple dozen lines of libc++'s random.cpp.
Now, whether there are instances where using a statistical adapter with cryptographically secure random numbers is appropriate is another question, and one which I am not qualified to answer. However, such adapters as std::uniform_int_distribution and std::independent_bits_engine, if it can be verified that they are implemented correctly (and that is admittedly a big "if"), would probably still be useful.
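Roughly what such an engine looks like (a minimal sketch, assuming a POSIX /dev/urandom and omitting error handling; this is not libc++'s actual code):

    #include <cstdio>
    #include <cstdint>
    #include <random>

    // An engine only has to satisfy the bit-generator requirements:
    // result_type, min(), max(), and operator().
    class urandom_engine {
        std::FILE* f_;
    public:
        using result_type = std::uint32_t;
        urandom_engine() : f_(std::fopen("/dev/urandom", "rb")) {}
        ~urandom_engine() { if (f_) std::fclose(f_); }
        static constexpr result_type min() { return 0; }
        static constexpr result_type max() { return UINT32_MAX; }
        result_type operator()() {
            result_type r = 0;
            std::fread(&r, sizeof r, 1, f_);   // unchecked read, for brevity
            return r;
        }
    };

    int main() {
        urandom_engine eng;
        std::uniform_int_distribution<int> die(0, 5);
        std::printf("%d\n", die(eng));
    }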
You are correct; and of course, any hardware-level behaviour is likely to be implementation defined. I should note that MSVC [1] guarantees "non-deterministic and cryptographically secure" behaviour; and that libc++ [2] uses the "cryptographically secure" rand_s [3] on Windows, NaCl on other platforms where NaCl is available, and /dev/urandom where it isn't.
I would've appreciated it if you'd annotated each snippet with its source, to make it easier for us to find the program it came from. One interesting question to ask next would be: what do these programs do with their random numbers? And so, does the quality of the stream, or its non-repeatability, matter at all?
Since he's mocking and snarking at each snippet, I think adding pointers to the sources would make it come off as an incitement to make fun of the authors, and rather more mean than he intended. An alternative would be to add direct citations but tone down the snark. I think Ted wanted to focus on the code examples themselves, and to make an overarching point about how "rand" is actually used in practice.
That said, if you want to figure it out, he did say they were "selected examples" from this list of projects: http://marc.info/?l=openbsd-tech&m=141776286105814&w=2 Also, you may be able to google for exact text matches with the code. (You might find other projects that have the exact same problem--but that is likely good enough.)
My favourite bit is: "Take 16 bytes of random data. No, wait, make that 15 bytes. Then hash it to four bytes to really squeeze the entropy in. Then seed."
With all the comments the author makes about nonstandard and crazy behaviour, he actually misses some practical solutions and makes fun of them.
"The one operation that was not observed was substracting the pid from the time. More research into this subject is warranted."
It's actually simple (even if still not effective across pid wraparound): pid numbers grow, at a rate of at least 1 per program execution, and time grows at about 1 second per second. If you subtracted the pid from the time, the two increments could cancel, so there's a good chance you'd get the same seed by running the app twice in a row.
So it's added instead, so that the seed always grows. And we pretend the wraparound happens very rarely.
Broken behaviour? Sure. A practical solution that works for the 99% of cases where non-critical randomness is required? Definitely.
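The pattern in question is essentially this (a minimal sketch; the function name is mine, and the seed is of course still low-entropy):

    #include <cstdlib>
    #include <ctime>
    #include <unistd.h>

    void seed_nondeterministically() {
        // time() advances ~1/sec and pids advance ~1 per exec, so their
        // difference can repeat across back-to-back runs; their sum
        // always grows (until wraparound).
        std::srand((unsigned)std::time(nullptr) + (unsigned)getpid());
    }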
> If you're not using it for cryptographic purposes, then I don't see why it matters. :)
There was a program I was using that used libtommath, and used that library's mp_rand function. libtommath's mp_rand does various trickery but essentially adds two results of rand() together[0].
If you call rand() multiple times on OpenBSD you will get a sequence that alternates between odd and even, which is completely standards compliant. Because of this the result of adding two rand() results together will always be odd (odd + even = odd).
The program didn't need cryptographic levels of randomness, but only returning odd numbers turned out to be a bit of a problem.
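Why the parity alternates: an LCG with a power-of-two modulus, odd multiplier and odd increment flips its low bit on every step. A toy sketch (classic LCG constants, not necessarily OpenBSD's actual rand()):

    #include <cstdio>

    int main() {
        unsigned x = 1;  // seed
        for (int i = 0; i < 5; i++) {
            x = x * 1103515245u + 12345u;   // low bit flips each call
            unsigned r1 = x & 0x7fffffffu;
            x = x * 1103515245u + 12345u;
            unsigned r2 = x & 0x7fffffffu;
            std::printf("parities %u + %u -> sum parity %u (always odd)\n",
                        r1 % 2, r2 % 2, (r1 + r2) % 2);
        }
    }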
There are all kinds of apps that might not need crypto quality randomness, but do at least need the random number streams to be different between invocations of the program! A Monte Carlo simulation that always gets the same results isn't going to be too hot. A game that seeds the RNG with 16 bits of entropy will be fine for a single player, but not for a community. (Think of FreeCell or Minesweeper in Windows, though those might even have been just 15 bits).
> A Monte Carlo simulation that always gets the same results isn't going to be too hot.
A Monte Carlo simulation absolutely needs to be able to produce the same numbers every time if you want to have a hope of debugging it. In a release version, yes, sure, vary the seed. But even there, people will still have concerns about replicating other people's results, so being careful with random seeds is important.
There's a difference between "always uses the same seed" and "can be seeded for reproducibility". You probably want the latter and probably don't want the former.
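Something like this (a sketch; the helper name is mine, not from any particular codebase):

    #include <cstdio>
    #include <random>

    std::mt19937 make_engine(bool reproducible, unsigned seed = 0) {
        if (reproducible)
            return std::mt19937(seed);      // same stream every run
        std::random_device rd;
        return std::mt19937(rd());          // different stream every run
    }

    int main() {
        auto eng = make_engine(true, 42);
        std::printf("%u\n", eng());         // prints the same number each run
    }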
> There are all kinds of apps that might not need crypto quality randomness, but do at least need the random number streams to be different between invocations of the program!
If you need high quality randomness, you should probably use urandom. If nowhere else, then at least to seed when the program is first executed.
I've noticed that Ruby's RNG is not that great. I've got a couple of small IRC bot scripts, one is a dice roller, the other is one that takes random items from a Google spreadsheet (using simple Random#rand in the first case, and Array#sample in the second).
The dice seem biased towards certain numbers, and the spreadsheet has the same items coming up more often than mundane statistics would suggest. (A 1-in-50 roll landing on the same item 2-3 times in a row, multiple times per day?)
It's gotten to the point where my users don't want to use my bot for RPGs. I'll have to see if using SecureRandom#random_number produces any different results.