A Java Programmer’s Guide to Random Numbers, Part 3: Seeding

Posted in Java by Dan on April 10th, 2008

The trilogy concludes.

This is the last of three articles about random numbers in Java programs. The series covers the following topics and also serves as an introduction to the random numbers package of the Uncommons Maths library:

  • True Random Numbers and Pseudorandom Numbers
  • Statistical Quality (making sure the dice are not loaded)
  • Performance (because sometimes you really do need half a million random numbers a second)
  • Different kinds of randomness (because not everything can be modelled by tossing a coin)
  • Degrees of Freedom (is every possible outcome actually possible?)
  • Security (when not losing money depends on nobody knowing what happens next)

Part 3: Seeding

Part 1 of this series discussed different kinds of random number generators (RNGs), highlighted the issues with using the default Java RNGs (java.util.Random and java.security.SecureRandom) and introduced the Uncommons Maths library, which provides three fast, high-quality alternative Java RNGs. Part 2 showed how to generate random numbers from different probability distributions using the classes provided by Uncommons Maths.

One aspect of pseudorandom number generators (PRNGs) that we glossed over in part 1 is the issue of seeding. In this article, we’ll look at two important issues related to the seeding of PRNGs.

Degrees of Freedom

Suppose we want to shuffle a standard deck of 52 playing cards. The Java API provides an implementation of a very good algorithm for shuffling a list (see Collections.shuffle). It even provides a variant of this method that allows us to specify our own RNG to be used. So shuffling a deck of cards (assuming it is represented as a List) is easy in Java.
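
For example, a minimal sketch in the same fragment style as the other snippets in this series (the deck is represented as a list of strings purely for brevity):

List<String> deck = new ArrayList<String>();
// ... add the 52 cards to the list ...
Random rng = new MersenneTwisterRNG();
Collections.shuffle(deck, rng);   // The shuffle uses the supplied RNG for every swap.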

Unfortunately, even though the algorithm is a perfect shuffle and we can choose an RNG that exhibits no bias, our shuffling code is still likely to be flawed in one respect. The problem is that the RNG is almost certainly incapable of generating every possible ordering of the deck. There are many orderings that simply cannot occur: the RNG has insufficient degrees of freedom. To understand why, we need to consider how many ways there are of ordering a deck of 52 cards. I’ll spare you the maths and tell you that the answer is 52! (that’s 52 factorial) or, in longhand, 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000.

That’s a pretty big number.

Recall from the first article in this series that when we seed a PRNG we get the same sequence of output each time that we use the same initial seed. The maximum number of possible initial states for the PRNG is equal to the number of possible seed values. If each RNG state gives us a different shuffle, the number of different deck orderings we can generate is determined by the range of the RNG seed.

The CellularAutomatonRNG is very fast, but it only uses a 32-bit seed. So using it for shuffling restricts the result of any given shuffle to one of 4,294,967,296 outcomes. That’s 4.3 billion possibilities but it may as well be zero given the number of outcomes that should be possible but aren’t.  [Note that a second shuffle may well allow a different set of 4,294,967,296 permutations than the first (so over time a larger number of outcomes are possible) but for any nth shuffle the number of possible outcomes is restricted by the number of starting states.]

We would probably want to avoid using java.util.Random but, if we did use it, we’d give it a 64-bit long as its seed. The algorithm only actually uses 48 of those bits for seeding so that gives us 281,474,976,710,656 possible initial RNG states. 281 trillion is an improvement, but still woefully inadequate.

The MersenneTwisterRNG uses a 64-bit seed. That’s 18,446,744,073,709,551,616 possible outcomes for any given shuffle. Even this huge number is a miserable, tiny, insignificant fraction of one percent of the full set of 52! possibilities that we ought to be able to generate.

SecureRandom may use different algorithms internally, depending on the platform. On Windows, the default is the SHA1PRNG provider, which gives us a 160-bit seed. Ignoring SecureRandom’s slowness, this is getting us closer to where we want to be (if you can call less than one trillionth of one percent of the possible outcomes ‘close’).

One more RNG…

It turns out that we need a seed of at least 226 bits in order to achieve the necessary degrees of freedom, since 52! is a little under 2^226 and any shorter seed cannot address every possible ordering. This rules out most PRNGs that you might otherwise consider. There is at least one option however: Uncommons Maths’ AESCounterRNG. The AESCounterRNG is based on the AES block cipher. AES is the Advanced Encryption Standard adopted by the US government. The characteristics of AES that make it a good encryption algorithm also coincidentally make it a good RNG. When we use AES as an RNG, the seed is the encryption key. AES supports 128-bit, 192-bit and 256-bit keys.

Using AESCounterRNG with a 256-bit seed, we have the necessary degrees of freedom so that all 52! orderings are possible outcomes of a shuffle. AESCounterRNG, while still much faster than java.security.SecureRandom, is not quite as fast as the other RNGs we’ve been looking at. It does however have other advantages that we shall see later.
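
Putting this together, and reusing the deck list from the earlier sketch, it might look something like this. This is a sketch only: I’m assuming the AESCounterRNG constructor that accepts an explicit seed array, and some of the constructors declare checked exceptions, so check the Uncommons Maths Javadoc for the exact signatures:

byte[] seed = new SecureRandom().generateSeed(32);   // 32 bytes = 256 bits.
Random rng = new AESCounterRNG(seed);
Collections.shuffle(deck, rng);   // All 52! orderings are now reachable.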

Note: Using encryption keys longer than 128 bits in Java requires that you obtain and install the unlimited strength cryptography policy files from Sun.

Security

Before you rush off to set up your new online casino, there’s one further issue to be considered. Remember that pseudorandom number generators are deterministic. That means that if we know which PRNG algorithm is in use and we know what seed was used, then we can predict with absolute certainty every single value that the RNG will ever produce. No problem for non-sensitive applications, but if people are gambling on the outcomes, you’d better be certain that they don’t have this kind of inside information.

Forget about seeding your RNG from the system clock – your seed data needs to be random itself or it is vulnerable (see How We Learned to Cheat at Online Poker: A Study in Software Security for a real-world account of how RNG weaknesses can be exploited in online poker). Even if an attacker doesn’t know the seed, with many algorithms it can be reverse-engineered by observing the RNG’s output and applying a little computing power.
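
To see just how weak a clock-based seed is, here is a toy sketch (purely illustrative). If an attacker knows that a java.util.Random instance was seeded from the millisecond clock at some point in the last hour, there are only 3.6 million candidate seeds, and a couple of observed outputs are enough to identify the right one:

// The 'victim' seeds its RNG from the clock...
long secretSeed = System.currentTimeMillis();
Random victim = new Random(secretSeed);
int observed1 = victim.nextInt();
int observed2 = victim.nextInt();

// ...and the attacker, knowing roughly when that happened, simply tries
// every millisecond in the window until the observed outputs match.
long now = System.currentTimeMillis();
for (long candidate = now - 3600000; candidate <= now; candidate++)
{
    Random guess = new Random(candidate);
    if (guess.nextInt() == observed1 && guess.nextInt() == observed2)
    {
        System.out.println("Recovered seed: " + candidate);
        break;
    }
}

Once the seed is recovered, every value the victim’s RNG will ever produce is known in advance.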

To address these security concerns, the first step is to pick a PRNG implementation that is not susceptible to reverse-engineering. That brings us back to the AESCounterRNG. Being based on AES means that reverse-engineering the seed would involve cracking the cipher. Good luck doing that in a timely manner.

Disclaimer: You should not rely on the AESCounterRNG to address all of your security concerns. A fuller understanding of the issues can be gained by reading Ferguson and Schneier’s Practical Cryptography – particularly the description of their Fortuna PRNG design.

So where do we get the seed data from? We have a bit of a bootstrapping problem: in order to generate good quality, unpredictable random numbers, we first need some good quality, unpredictable random data. By default, all of the Uncommons Maths RNGs seed themselves from /dev/random on Unix systems. On Windows systems, seed data will be downloaded from random.org. If neither of those strategies is successful (perhaps the SecurityManager prevents file or network access), they fall back on the seeding strategy provided by SecureRandom.generateSeed. However, seeding strategies are pluggable, so you can provide your own. One possible strategy is to use a true hardware RNG to bootstrap the faster PRNG that will generate the actual output.
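
As a sketch of what a pluggable strategy might look like (assuming I’ve remembered the SeedGenerator API correctly – a single generateSeed(int) method that returns the requested number of bytes and throws SeedException on failure – and with a purely illustrative device path):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.uncommons.maths.random.SeedException;
import org.uncommons.maths.random.SeedGenerator;

// Reads seed data from a hardware RNG exposed as a device file.
public class HardwareDeviceSeedGenerator implements SeedGenerator
{
    private static final String DEVICE_PATH = "/dev/hwrng";   // Illustrative path.

    public byte[] generateSeed(int length) throws SeedException
    {
        try
        {
            DataInputStream in = new DataInputStream(new FileInputStream(DEVICE_PATH));
            try
            {
                byte[] seed = new byte[length];
                in.readFully(seed);
                return seed;
            }
            finally
            {
                in.close();
            }
        }
        catch (IOException ex)
        {
            throw new SeedException("Failed to read seed data from " + DEVICE_PATH, ex);
        }
    }
}

An RNG could then be built with something like new MersenneTwisterRNG(new HardwareDeviceSeedGenerator()) – the seed-generator constructors declare a checked SeedException, so wrap the call accordingly.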

The End

And that’s pretty much everything I can tell you about using random number generators in Java. If you have any feedback on Uncommons Maths, please use the issue tracker for bugs and suggestions. If you want to know more about pseudorandom number generators, the following books cover the theoretical and practical issues in much more depth, and much more precisely, than I have done.

  • The Art of Computer Programming (Donald Knuth)
  • Practical Cryptography (Niels Ferguson and Bruce Schneier)

Free Genetic Programming Book

Posted in Books, Evolutionary Computation by Dan on April 7th, 2008

I just spotted this on comp.ai.genetic.  The authors of a new book called A Field Guide to Genetic Programming have made it available for download in PDF form free of charge.  Weighing in at around 200 pages, it looks like a reasonably concise introduction to the topic (unlike some of the huge and hideously expensive GP books you can buy on Amazon).

This is good timing for me because I’ve recently started hacking together a GP example application to include in the next release of the Watchmaker Framework for Evolutionary Computation.  So I can catch up on a bit of background reading to make sure I’m doing things sensibly.  Watchmaker is a general purpose evolution framework, intended to address the full range of evolutionary algorithms.  I’ve been claiming for a while that you can use it for genetic programming, so I thought it was about time I demonstrated this.  I’m not aware of anybody having used Watchmaker for GP so far.  I’d love to hear from anybody who has done so.

Genetic programming is also covered in an accessible way in Toby Segaran’s excellent book, Programming Collective Intelligence, which includes GP examples in Python.

A Java Programmer’s Guide to Random Numbers, Part 2: Not just coins and dice

Posted in Java by Dan on April 6th, 2008

More than you ever wanted to know about randomness in Java programs.

This is the second in a series of articles about random numbers in Java programs. The series covers the following topics and also serves as an introduction to the random numbers package of the Uncommons Maths library:

  • True Random Numbers and Pseudorandom Numbers
  • Statistical Quality (making sure the dice are not loaded)
  • Performance (because sometimes you really do need half a million random numbers a second)
  • Different kinds of randomness (because not everything can be modelled by tossing a coin)
  • Degrees of Freedom (is every possible outcome actually possible?)
  • Security and Integrity (when not losing money depends on nobody knowing what happens next)

Part 2: Not just coins and dice

Part 1 of this series discussed different kinds of random number generators (RNGs), highlighted the issues with using the default Java RNGs (java.util.Random and java.security.SecureRandom) and introduced the Uncommons Maths library, which provides three fast, high-quality alternative Java RNGs.

But there is more to the random numbers package than just replacement RNGs. Uncommons Maths also provides convenient classes for working with several different probability distributions, which is pretty much essential for modelling anything other than the most simple random processes.

The Uncommons Maths random numbers demo application (WebStart required) demonstrates many of the concepts discussed in this article. It also serves as an example of the performance differences between different RNGs as discussed in the previous article.

Different Kinds of Randomness

Uniform Distribution

The Uniform Distribution is what you get by default from java.util.Random (and its sub-classes – including the Uncommons Maths RNGs). The discrete uniform distribution says that there are a certain number of possible outcomes and each is equally likely. The obvious example is rolling a die. You can get a 1, 2, 3, 4, 5 or 6 and no single outcome is more likely than any of the others. If you record the results of 6 million dice rolls you would expect to get pretty close to a million of each number.
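
As a quick illustration of the dice claim, in the same fragment style as the other snippets:

Random rng = new MersenneTwisterRNG();
int[] counts = new int[6];
for (int i = 0; i < 6000000; i++)
{
    counts[rng.nextInt(6)]++;   // nextInt(6) returns 0-5, each equally likely.
}
for (int i = 0; i < 6; i++)
{
    System.out.println((i + 1) + "s rolled: " + counts[i]);
}

Each of the six counts should come out within a fraction of a percent of one million.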

The uniform distribution is useful for some problems, such as shuffling a deck of cards, but it is not a good choice for many other scenarios that don’t involve a set of equally likely outcomes.

Normal (Gaussian) Distribution

The Normal Distribution (also known as the Gaussian Distribution) is the familiar bell curve. It is a good model for many real world phenomena. In a normal distribution, the average outcome is more likely than any other outcome. A slightly above or below average outcome is almost as likely, and extreme outcomes, while still possible, are very unlikely.

An example of this would be the distribution of IQ test scores. Most people would achieve somewhere close to an average score (maybe slightly above or below), but a few people would achieve very low scores and a similarly small number would achieve very high scores.

The nextGaussian() method of the java.util.Random class provides rudimentary support for obtaining normally-distributed values. Uncommons Maths’ GaussianGenerator class builds on this, making it easy to create distributions with different means and standard deviations. The standard deviation parameter controls how spread out the values are. A low standard deviation means most values will be very close to the mean (on either side). A high standard deviation increases the likelihood that a value will be a long way from the mean.

GaussianGenerator is a wrapper around a standard RNG (you can use any of the Uncommons Maths RNGs, or even one of the standard Java ones if you so choose). Once you have created a GaussianGenerator by specifying a mean, standard deviation and source RNG, you can call the nextValue() method ad infinitum:

Random rng = new MersenneTwisterRNG();
GaussianGenerator gen = new GaussianGenerator(100, 15, rng);
while (true)
{
    System.out.println(gen.nextValue());
}

If we were to generate thousands of simulated IQ test scores with this code, we would see that the vast majority of scores would be close to the average (100). Around 68% of the values would be within one standard deviation (15) of the mean (i.e. in the range 85 – 115), and approximately 95% would be within two standard deviations (between 70 and 130). These percentages are a well-known property of normal distributions.

The demo shows how the distribution of generated values compares to a theoretical normal distribution (the more values you generate, the closer the fit will be).

Binomial Distribution

The probability of a coin flip coming up heads is 0.5. This is pretty simple – we can just use the uniform distribution to simulate a single coin flip. But what is the probability of getting exactly 7 heads from 10 coin flips? This is where the Binomial Distribution comes in. It provides probabilities for the outcome of a series of trials, where each trial has two possible results – success or failure. The binomial distribution tells us that the probability of obtaining heads exactly 7 times from 10 fair coin flips is about 0.12 (there are C(10,7) = 120 ways to arrange the 7 heads among the 10 flips, each with probability 0.5^10, so 120/1024 ≈ 0.117).

Suppose, for example, we wanted to randomly generate the number of times the number 6 occurred in 100 dice rolls. We could simulate the full series of trials using a uniform RNG, but this is cumbersome:

Random rng = new MersenneTwisterRNG();
int sixes = 0;
for (int i = 0; i < 100; i++)
{
    int diceRoll = rng.nextInt(6) + 1;
    if (diceRoll == 6)
    {
        ++sixes;
    }
}
System.out.println("No. of 6s from 100 rolls is " + sixes);

It is much easier, and more efficient, to use the binomial distribution. If we consider an occurrence of a 6 to be a “success”, then the probability of success is 1/6 (or ~0.1667). If we plug this value and the number of trials (100) into the Uncommons Maths BinomialGenerator class, we can get our answer from a single method call:

Random rng = new MersenneTwisterRNG();
BinomialGenerator gen = new BinomialGenerator(100, 1d/6, rng);
int sixes = gen.nextValue();
System.out.println("No. of 6s from 100 rolls is " + sixes);

Poisson Distribution

Suppose a telephone call centre receives calls at an average rate of three per minute between 2pm and 4pm. What is the probability that we receive exactly five calls in the next minute? This is the kind of question that we can answer with the Poisson Distribution. The Poisson distribution is not dissimilar to the Binomial distribution, as you will see if you plot them. The difference is that the Poisson distribution is concerned with how many independent events occur within a set interval. Whereas values from a binomial distribution have an upper bound determined by the number of trials, there is no upper bound for a Poisson distribution. According to the Poisson distribution, the probability of exactly five calls in a minute when the average is three is about 0.1 (e^-3 × 3^5 / 5! ≈ 0.101).

In simulation applications we can use the PoissonGenerator class (its distribution parameter is the mean number of events per interval) to determine how many times something happens within a given interval.
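
For the call-centre example, in the same style as the other snippets:

Random rng = new MersenneTwisterRNG();
PoissonGenerator gen = new PoissonGenerator(3, rng);   // A mean of 3 calls per minute.
int callsThisMinute = gen.nextValue();
System.out.println("Calls received this minute: " + callsThisMinute);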

Exponential Distribution

The Exponential Distribution is related to the Poisson Distribution. Rather than modelling how many events occur in a fixed period of time, it models the period of time between two independent events that occur at a given average rate.

Suppose you wanted to simulate some random event that happened on average 10 times a minute – perhaps you are simulating load for a server. You could simply generate events at the constant rate of exactly one every 6 seconds. But the real world is rarely this metronomic. A more realistic approach would be to generate events randomly, using an exponential distribution (provided by the ExponentialGenerator class) to determine the intervals between events:

final long oneMinute = 60000;
Random rng = new MersenneTwisterRNG();

// Generate events at an average rate of 10 per minute.
ExponentialGenerator gen = new ExponentialGenerator(10, rng);
boolean running = true;
while (running)
{
    // nextValue() returns the interval until the next event, in minutes.
    long interval = Math.round(gen.nextValue() * oneMinute);
    Thread.sleep(interval);

    // Fire event here.
}

In a future article: Shuffling the deck – degrees of freedom, seeding and security concerns.

A Java Programmer’s Guide to Random Numbers, Part 1: Beyond java.util.Random

Posted in Evolutionary Computation, Java by Dan on April 3rd, 2008

More than you ever wanted to know about randomness in Java programs.

This is the first in a series of articles about random numbers in Java programs. The series covers the following topics and also serves as an introduction to the random numbers package of the Uncommons Maths library:

  • True Random Numbers and Pseudorandom Numbers
  • Statistical Quality (making sure the dice are not loaded)
  • Performance (because sometimes you really do need half a million random numbers a second)
  • Different kinds of randomness (because not everything can be modelled by tossing a coin)
  • Degrees of Freedom (is every possible outcome actually possible?)
  • Security and Integrity (when not losing money depends on nobody knowing what happens next)

Part 1: Beyond java.util.Random

Random numbers are useful in a wide variety of software applications. They provide a crucial element of uncertainty in an otherwise deterministic world. Without random numbers in computers, the hugely popular online poker industry would not exist, video games would be boring and predictable, iTunes would have no shuffle, cryptography would be much more difficult, and many innovative algorithms, such as those used in artificial intelligence and evolutionary computation, simply couldn’t work.

True Random Numbers and Pseudorandom Numbers

“Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” – John von Neumann

Before continuing, it is important to make the distinction between so-called “true” random numbers and pseudorandom numbers. Though it may often seem otherwise to us as programmers, computer systems are infuriatingly deterministic. They are generally incapable of doing things randomly. To get a computer to behave in a way that is truly random, it is necessary to introduce some non-deterministic input. We could have somebody constantly rolling dice or flipping coins and typing in the results, but that is not very practical. A slightly more feasible approach is to construct a device that observes some real-world phenomenon that is known to be unpredictable, such as radioactive decay or atmospheric noise. Data extracted from these events can be used as a source of entropy for our applications. You could purchase a device that plugs into a serial port or USB port. To access these devices from Java you’d probably have to use C/C++ and JNI. Alternatively, you could get true random numbers indirectly – from an online service such as Random.org or Hotbits.

Since we can get truly unpredictable random numbers from this kind of hardware, why don’t we satisfy all of our randomness requirements in this way? Well the primary problem is throughput. These devices are quite limited in the quantity of randomness they can produce. They simply aren’t fast enough for many uses.

Pseudorandom numbers are not really random at all. They are the result of deterministic mathematical formulae. The best of these algorithms have been devised so that their output is statistically indistinguishable from true random numbers. PRNGs start with a single numeric seed value. The algorithm is applied to this seed to generate the output and a new, updated seed that is used to generate the next value. The mathematics involved is beyond the scope of this article – the definitive guide is the second volume of Donald Knuth’s The Art of Computer Programming.
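
To make the seed-and-update idea concrete, here is a stripped-down sketch of one of the simplest PRNG designs, a linear congruential generator. The multiplier, increment and bit-masking below are the ones documented for java.util.Random; everything else is pared down for illustration:

// Each call to nextInt() transforms the 48-bit internal seed and derives the
// output from the new seed value.  Same starting seed, same sequence - always.
public class TinyLCG
{
    private long seed;

    public TinyLCG(long seed)
    {
        this.seed = (seed ^ 0x5DEECE66DL) & ((1L << 48) - 1);
    }

    public int nextInt()
    {
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        return (int) (seed >>> 16);
    }
}

Two instances constructed with the same seed will produce identical output forever, which is exactly the property described in the next paragraph.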

An interesting property of this approach is that if you always start with the same seed value, you will always get the same sequence of “random” numbers. Though this can occasionally be useful, you would normally strive to avoid using the same seed value in the interests of unpredictability. A simple approach that is sufficient in many cases is to seed the PRNG from the current value of the system clock.

Aside from speed, another advantage of pseudorandom number generators (PRNGs) over true random number generators (TRNGs) is that they are more predictably unpredictable. That is, the statistical properties of PRNGs are more reliable than is the case with TRNGs.

Statistical Quality

[Comic: xkcd.com]

Why not java.util.Random?

The Java API helpfully provides a PRNG that we can use, the java.util.Random class. If you are a Java programmer you will almost certainly have used it at some point. For non-critical random numbers, e.g. adding some unpredictability to a game, it’s fine. It’s even pretty quick. Unfortunately, it’s not random enough – not by the standards required for more serious applications.

The problem of deciding whether a sequence of numbers is random is not an easy one to solve. You can’t simply look at the output and decide that it doesn’t look random enough. After all, if you toss a coin ten times, it could randomly come up heads every time, even though the probability of this sequence is pretty small. To get any kind of meaningful evaluation requires a large sample of the RNG output (perhaps millions of generated values). This sample can then be subjected to sophisticated statistical analysis.

Probably the best-known test suite for random number generators is George Marsaglia’s Diehard Battery of Tests of Randomness. Diehard says that java.util.Random is not sufficiently random. But you don’t have to interpret Diehard’s complicated reports to see for yourself – this applet demonstrates the problem visually.

Performance

So if not java.util.Random, how about Java’s other RNG, java.security.SecureRandom? SecureRandom is built for cryptography so it is specifically designed to avoid any such flaws. Diehard reports no issues with its output. Unfortunately, this quality comes at a high price – performance. In benchmarks, SecureRandom can be up to 60 times slower at generating random numbers than java.util.Random. This is bearable if you are only generating random values infrequently, but if your program relies on generating random numbers non-stop and as fast as possible (as many simulations do) then it’s a show-stopping bottleneck.

The good news is that, beyond the core API, there are random number generators as fast as (or faster than) java.util.Random with statistical properties as good as SecureRandom. The Uncommons Maths project provides a comprehensive open source random numbers package. Uncommons Maths provides three Random Number Generators, with different properties, for a wide variety of applications. Unlike java.util.Random, each of these RNGs completes the Diehard suite without any problems. Additionally, each of the Uncommons Maths RNGs is a sub-class of java.util.Random, which means any of them can be used as a drop-in replacement with minimal effort.

A good general-purpose PRNG is the MersenneTwisterRNG class. It is a pure Java port of Makoto Matsumoto and Takuji Nishimura’s proven and ultra-fast Mersenne Twister PRNG for C. Even faster is CellularAutomatonRNG, the Java port of Tony Pasqualoni’s experimental cellular automaton PRNG.
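
Switching is usually a one-line change, since the replacements honour the java.util.Random API (a trivial sketch):

// Before: Random rng = new Random();
Random rng = new CellularAutomatonRNG();   // or new MersenneTwisterRNG()
int diceRoll = rng.nextInt(6) + 1;         // exactly the same calls as before
double probability = rng.nextDouble();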

In the next article: Beyond dice and coins – using the right probability distribution.

Getters, Setters and the Great Coverage Conspiracy

Posted in Java, Software Development by Dan on April 1st, 2008

A frequent topic of Java-related blogs is whether it is worthwhile to write unit tests for simple getters and setters. This posting that I came across today proposes a reflection-based trick for eliminating much of the work in writing these tests. Maybe this is an improvement over other approaches, but what bothers me most is the motivation for wanting to test getters and setters in the first place.

It seems that many of those advocating unit testing simple getters and setters are driven by a desire to improve their coverage scores, with the actual utility of the tests being a secondary concern.

Firstly, I should state that I am absolutely in favour of measuring coverage for test suites. In fact, I think it’s pretty much essential. If you are writing automated tests but not measuring code coverage then you are just scratching around in the dark. What’s great about coverage reports, particularly those that show branch coverage as well as line coverage, is that you get to see exactly where your tests are neglecting certain scenarios. Coverage reports can also be useful in highlighting code that is not used and can be removed.

The problem with code coverage is that it only shows where your tests are weak. It does not prove that your tests are good, even if the coverage is 100%. So writing tests with the sole aim of improving the coverage score is merely an exercise in self-deception. It’s the tail wagging the dog.

If you need to add tests for all your getters and setters in order to achieve x% code coverage, where x is some mandated target, there are two questions you need to ask:

  1. Do you have too many getters and setters?
  2. Are you avoiding testing difficult code?

I could go on for pages about the first point. There are far too many getters and setters in most Java code. Too many developers think encapsulation is simply a case of making fields private and providing access to them with getters and setters. It would be better to aim for making fields private and not providing access to them with getters and setters. Favouring constructor-based dependency injection over setter-based DI is something else to consider (although that’s a whole other article in the making…).
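
To illustrate the constructor-injection point (the class names and the DataSource dependency are invented for the example – a sketch of the style, not a prescription):

import javax.sql.DataSource;

// Setter-based injection: the field needs an accessor and the object can
// exist in a half-configured state.
class SetterInjectedReportGenerator
{
    private DataSource dataSource;

    public void setDataSource(DataSource dataSource)
    {
        this.dataSource = dataSource;
    }
}

// Constructor-based injection: the dependency is supplied exactly once,
// the field can be final, and no getter or setter is required.
class ConstructorInjectedReportGenerator
{
    private final DataSource dataSource;

    public ConstructorInjectedReportGenerator(DataSource dataSource)
    {
        this.dataSource = dataSource;
    }
}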

How do you know if you have too many getters and setters? Well your coverage reports are a good starting point. If the getters and setters are essential to your application, it will be just about impossible to avoid exercising them indirectly from other tests. If you have good coverage elsewhere but the getters and setters aren’t touched, chances are they aren’t needed. Adding more tests is not the only way of improving your test coverage. Another way is to remove code so that you have less to test.

The second question above is also important. If you require your team to achieve a rigid 75% test coverage target then you are almost guaranteeing that you will get tests for the 75% of the application that is easiest to test. Writing tests for getters and setters helps to fulfil the 75% requirement without needing to think about how to test the difficult bits of the system. Unfortunately, the other 25% is probably the code that really needs testing/refactoring.

For me it’s pretty clear. Don’t write unit tests for getters and setters. Better still, don’t write getters and setters (except where necessary). And don’t confuse test-driven development with coverage-driven development.