My own favourite random()-in-the-wild bug is one that I've come across many, many times:
/* Generate a random number 0...5 inclusive */
int r = random() % 6;
The problem is that this results in a bias towards certain values, because the number of values random() can return is probably not a whole multiple of the number you're modding by. It's easier to see if you imagine RAND_MAX were 10. Then the results 0, 1, 2, 3 and 4 each have two opportunities to be selected for r (e.g. both 1 and 7 map to 1), but 5 has only one. So 5 is half as likely as any other number. Terrible!
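You can see the bias directly by tallying the residues over the whole toy range; a minimal sketch (the "RAND_MAX == 10" generator is just the example above, not a real implementation):

    #include <cstdio>

    int main() {
        // Toy generator: every value 0..10 (RAND_MAX == 10) is equally likely.
        int counts[6] = {0};
        for (int v = 0; v <= 10; v++)
            counts[v % 6]++;              // what "random() % 6" yields for each v
        for (int r = 0; r < 6; r++)
            std::printf("r = %d occurs %d times out of 11\n", r, counts[r]);
    }

Running it prints two hits each for 0..4 and a single hit for 5.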
Using a float/double random() doesn't always help either, because floating-point multiplication has non-uniform errors. Instead, what you really need is:
int random_n(int top) {
    if (top <= 0) {
        return -1;
    }
    /* Largest multiple of top not exceeding RAND_MAX; rejecting any
       draw at or above it removes the modulo bias. */
    long limit = RAND_MAX - (RAND_MAX % top);
    while (1) {
        long r = random();
        if (r < limit) {
            return (int)(r % top);
        }
    }
}
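For example, rolling the unbiased 0..5 die from the snippet at the top:

    int die = random_n(6);   /* uniform over 0..5, or -1 on bad input */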
Although it's better to replace random() itself with something that uses real entropy.
On my machine RAND_MAX is 2^31-1. So getting the probability wrong in the way you describe means a relative error of one in two billion. That's really quite small for non-critical applications.
It's not so simple. The larger the modulus, the larger the error, up to a limit. If the modulus is 2^30 + 2^29, for example, then values of r in the range 0 ... 2^29 - 1 will be produced twice as often as values in the range 2^29 ... 2^30 + 2^29 - 1. That's a much more significant error than one in two billion (it's close to 1 in 3, roughly nine orders of magnitude more significant).
Subtleties like this are hard to catch in code review, and even in testing ... the real lesson may be that the very interface of random(void) is just broken. It really should be random(int), taking an upper bound and handling all of this for you, as many other languages/libraries do.
No, it does matter. If RAND_MAX is really large, the relative error in "incorrect" probabilities is really small. Unless you have a critical need for it to be precisely correct, the error is basically negligible.
This sort of code is one of the motivations behind improving the random number generation facilities available in C++. If you're using C++ or you're using C and have the option to link with C++, I strongly recommend looking at the random number library introduced in C++11 [1]. In addition to letting you specify a statistical distribution (including a proper uniform distribution for both integral and floating-point types), it lets you choose between various PRNG engines with various trade-offs. It also provides a way to source hardware entropy with which to seed an engine, and it's all pretty easy to use.
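A minimal sketch of what that looks like (the engine choice and bounds here are just for illustration):

    #include <random>
    #include <iostream>

    int main() {
        std::random_device rd;                        // entropy source for seeding
        std::mt19937 engine(rd());                    // Mersenne Twister engine
        std::uniform_int_distribution<int> die(0, 5); // properly uniform 0..5
        for (int i = 0; i < 10; i++)
            std::cout << die(engine) << '\n';
    }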
It's nice to make it clear what distribution you're going to get from a PRNG, or what the performance characteristics are going to be. But for cryptographic randomness (the subtext of 'tedunangst's post), what you don't want are a bunch of knobs and dials. What you generally need is a guarantee that your RNG interface is feeding you data from the system secure random number generator --- either urandom or CryptGenRandom.
I'm also not sure what "hardware" has to do with that guarantee.
One example I can think of where a standard library got this right is Golang: crypto/rand just reads from the system's secure random number generator.
FWIW, the C++ random number library is very much not meant for cryptography. This is obvious when you see the weak guarantees given by std::random_device: on Windows, libstdc++ will actually use the Mersenne Twister with a hardcoded seed, and this is standard-compliant! Additionally, none of the engines included in the library is cryptographic at all.
The Boost random library actually does the right thing, and states that boost::random_device must not be implemented where a proper source of random bits is not available. The standard tries to be all things to all people, and std::random_device ends up being useless for any serious application.
As a mathematical random number generation library, the C++ library is much superior to its Go counterpart, math/rand.
> what you don't want are a bunch of knobs and dials. What you generally need is a guarantee that your RNG interface is feeding you data from the system secure random number generator
The two are not necessarily incompatible. The distributions in the C++ standard library are not married to the standard engines. You can quite straightforwardly write an engine that uses urandom internally and which is compatible with the standard distributions; see the last couple dozen lines of libc++'s random.cpp.
Now, whether there are instances where using a statistical adapter with cryptographically secure random numbers is appropriate is another question, and one which I am not qualified to answer. However, such adapters as std::uniform_int_distribution and std::independent_bits_engine, if it can be verified that they are implemented correctly (and that is admittedly a big "if"), would probably still be useful.
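Roughly what such an engine looks like (a minimal sketch, assuming a POSIX /dev/urandom and omitting error handling; this is not libc++'s actual code):

    #include <cstdio>
    #include <cstdint>
    #include <random>

    // An engine only has to satisfy the bit-generator requirements:
    // result_type, min(), max(), and operator().
    class urandom_engine {
        std::FILE* f_;
    public:
        using result_type = std::uint32_t;
        urandom_engine() : f_(std::fopen("/dev/urandom", "rb")) {}
        ~urandom_engine() { if (f_) std::fclose(f_); }
        static constexpr result_type min() { return 0; }
        static constexpr result_type max() { return UINT32_MAX; }
        result_type operator()() {
            result_type r = 0;
            std::fread(&r, sizeof r, 1, f_);   // unchecked read, for brevity
            return r;
        }
    };

    int main() {
        urandom_engine eng;
        std::uniform_int_distribution<int> die(0, 5);
        std::printf("%d\n", die(eng));
    }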
You are correct; and of course, any hardware-level behaviour is likely to be implementation defined. I should note that MSVC [1] guarantees "non-deterministic and cryptographically secure" behaviour; and that libc++ [2] uses the "cryptographically secure" rand_s [3] on Windows, NaCl on other platforms where NaCl is available, and /dev/urandom where it isn't.
I would've appreciated it if you'd annotated each snippet with its source, to make it easier for us to find the program it came from. One interesting question to ask next would be: what do these programs do with their random numbers? And so, does the quality of the stream, or its non-repeatability, matter at all?
Since he's mocking and snarking at each snippet, I think adding pointers to the sources would make it come off as an incitement to make fun of the authors, and rather more mean than he intended. An alternative would be to add direct citations but tone down the snark. I think Ted wanted to focus on the code examples themselves, and to make an overarching point about how "rand" is actually used in practice.
That said, if you want to figure it out, he did say they were "selected examples" from this list of projects: http://marc.info/?l=openbsd-tech&m=141776286105814&w=2 Also, you may be able to google for exact text matches with the code. (You might find other projects that have the exact same problem--but that is likely good enough.)
My favourite bit is: "Take 16 bytes of random data. No, wait, make that 15 bytes. Then hash it to four bytes to really squeeze the entropy in. Then seed."
With all the comments the author makes about nonstandard and crazy behaviour, he actually misses some practical solutions and makes fun of them.
"The one operation that was not observed was substracting the pid from the time. More research into this subject is warranted."
It's actually simple (even if still not effective across pid wraparound): pid numbers grow, at a rate of at least 1 per program execution, and time grows at about 1 second per second. If you subtracted the pid from the time, the two increments could cancel, so there's a good chance you'd get the same seed by running the app twice in a row.
So it's added instead, so that the seed always grows. And we pretend the wraparound happens very rarely.
Broken behaviour? Sure. A practical solution that works for the 99% of cases where non-critical randomness is required? Definitely.
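The pattern in question is essentially this (a minimal sketch; the function name is mine, and the seed is of course still low-entropy):

    #include <cstdlib>
    #include <ctime>
    #include <unistd.h>

    void seed_nondeterministically() {
        // time() advances ~1/sec and pids advance ~1 per exec, so their
        // difference can repeat across back-to-back runs; their sum
        // always grows (until wraparound).
        std::srand((unsigned)std::time(nullptr) + (unsigned)getpid());
    }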
> If you're not using it for cryptographic purposes, then I don't see why it matters. :)
There was a program I was using that used libtommath, and used that library's mp_rand function. libtommath's mp_rand does various trickery but essentially adds two results of rand() together[0].
If you call rand() multiple times on OpenBSD you will get a sequence that alternates between odd and even, which is completely standards compliant. Because of this the result of adding two rand() results together will always be odd (odd + even = odd).
The program didn't need cryptographic levels of randomness, but only returning odd numbers turned out to be a bit of a problem.
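Why the parity alternates: an LCG with a power-of-two modulus, odd multiplier and odd increment flips its low bit on every step. A toy sketch (classic LCG constants, not necessarily OpenBSD's actual rand()):

    #include <cstdio>

    int main() {
        unsigned x = 1;  // seed
        for (int i = 0; i < 5; i++) {
            x = x * 1103515245u + 12345u;   // low bit flips each call
            unsigned r1 = x & 0x7fffffffu;
            x = x * 1103515245u + 12345u;
            unsigned r2 = x & 0x7fffffffu;
            std::printf("parities %u + %u -> sum parity %u (always odd)\n",
                        r1 % 2, r2 % 2, (r1 + r2) % 2);
        }
    }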
There are all kinds of apps that might not need crypto quality randomness, but do at least need the random number streams to be different between invocations of the program! A Monte Carlo simulation that always gets the same results isn't going to be too hot. A game that seeds the RNG with 16 bits of entropy will be fine for a single player, but not for a community. (Think of FreeCell or Minesweeper in Windows, though those might even have been just 15 bits).
> A Monte Carlo simulation that always gets the same results isn't going to be too hot.
A Monte Carlo simulation absolutely needs to be able to produce the same numbers every time if you want to have a hope of debugging it. In a release version, yes, sure, vary the seed. But even there, people will still have concerns about replicating other people's results, so being careful with random seeds is important.
There's a difference between "always uses the same seed" and "can be seeded for reproducibility". You probably want the latter and probably don't want the former.
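Something like this (a sketch; the helper name is mine, not from any particular codebase):

    #include <cstdio>
    #include <random>

    std::mt19937 make_engine(bool reproducible, unsigned seed = 0) {
        if (reproducible)
            return std::mt19937(seed);      // same stream every run
        std::random_device rd;
        return std::mt19937(rd());          // different stream every run
    }

    int main() {
        auto eng = make_engine(true, 42);
        std::printf("%u\n", eng());         // prints the same number each run
    }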
> There are all kinds of apps that might not need crypto quality randomness, but do at least need the random number streams to be different between invocations of the program!
If you need high quality randomness, you should probably use urandom. If nowhere else, then at least to seed when the program is first executed.
I've noticed that Ruby's RNG is not that great. I've got a couple of small IRC bot scripts, one is a dice roller, the other is one that takes random items from a Google spreadsheet (using simple Random#rand in the first case, and Array#sample in the second).
The dice seem biased towards certain numbers, and the spreadsheet has the same items coming up more often than mundane statistics would suggest. (A 1-in-50 roll landing on the same item 2-3 times in a row, multiple times per day?)
It's gotten to the point where my users don't want to use my bot for RPGs. I'll have to see if using SecureRandom#random_number produces any different results.