Java Archaeology: Revisiting 20th Century Code

Posted in Java by Dan on June 20th, 2008

Java 1.1 was released in 1997. How does code from this era compare to today’s Java code?

Following my decision to resurrect an old project, I was faced with the prospect of converting the codebase from Java 1.1 source code to something a little more civilised. The oldest parts of this software date back nearly 9 years. Even then Java 1.1 was a little old, with 1.2.2 being the latest and greatest. Though I re-wrote much of the code 4 or 5 years ago, when Java 1.4.2 was state-of-the-art, it still targeted Java 1.1 VMs for compatibility reasons (i.e. Microsoft’s VM that used to be bundled with IE).

Trawling through this code was an interesting look back to a time when Java developers could choose to use any type of collection as long as it was Vector, and pretty much the only GUI toolkit available was AWT. It was also interesting to revisit the implementation decisions made back when I didn’t know nearly as much as I thought I did (though, to be honest, that’s probably still the case).

Pre-historic Collections

The first thing I noticed was the clunky use of collections and arrays in the code. Not only did this software pre-date generics, being tied to Java 1.1 meant that even the Collections Framework was off-limits. Arrays, Vectors and Hashtables were all there were.

My original code used arrays wherever a fixed length was acceptable and Vectors everywhere else. Preoccupied by the difficulty of keeping track of the type of the Vector’s contents, I had written many of the methods so that they converted their internal Vectors into arrays before returning. In modernising the code, I had similar problems remembering what types went in many of the collections that I had declared several years before. Don’t let anyone tell you that generics were not worth it. They’re invaluable for communication of intent alone.
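
For illustration, the old pattern looked something like this minimal sketch (Player and the field name are hypothetical, not the actual FSA code):

// Java 1.1 style: the Vector gives the caller no clue what it holds, so the
// method copies its contents into a typed array on the way out.
private Vector scorers = new Vector(); // contains Player objects... probably

public Player[] getScorers()
{
    Player[] result = new Player[scorers.size()];
    scorers.copyInto(result);
    return result;
}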

On a similar note, the richer variety of collections that arrived with Java 2 allows code to be written that more accurately conveys the thinking of the developer. The same method that returned an array or Vector in the legacy code could be rewritten to return List, Set, SortedSet, or just Collection. Each of these options communicates a different notion of what to expect from the method.
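
As a sketch of the kind of signatures the modernised code can use (the domain types are invented for illustration):

// Hypothetical interface: each generic return type promises something different.
public interface SeasonStatistics
{
    List<Goal> getGoals();             // ordered, duplicates allowed
    Set<Team> getTeams();              // unique, no particular order
    SortedSet<Player> getTopScorers(); // unique and sorted
    Collection<Match> getMatches();    // no promises beyond "some matches"
}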

Sorting

The purists wept as one of the pillars of Computer Science education crumbled before them and a thousand copies of The Art of Computer Programming were consigned to the shelf indefinitely. The Collections Framework in Java 2 introduced a general purpose sort method that was good enough for the vast majority of your sorting needs. Rather than worrying about remembering how to implement QuickSort and why it was better than an Insertion Sort or a Bubble Sort, just use the modified Merge Sort that Josh Bloch and his colleagues had helpfully provided. The Comparator interface meant that this one tried-and-tested implementation was applicable to just about any problem.
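
Something along these lines is all that is needed (the list contents are purely illustrative):

// The library sort plus a Comparator covers almost every sorting need.
List<String> teams = new ArrayList<String>(
        Arrays.asList("Everton", "Arsenal", "Wolves"));

Collections.sort(teams); // natural (alphabetical) ordering

// A custom ordering just needs a Comparator - here, shortest name first.
Collections.sort(teams, new Comparator<String>()
{
    public int compare(String a, String b)
    {
        return a.length() - b.length();
    }
});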

The first things to go as I brought my code up-to-date were my plagiarised sort method and my custom version of the Comparator interface.

Enumerated Types

The lack of enumerated types in the early versions of Java was a curious oversight. I found my old code littered with integer constants. Effective Java remained unwritten in those days so I was ignorant of the type-safe enum pattern. I compounded my ignorance by writing methods that took three of these integer constants as arguments, each representing a conceptually different type. It was very easy to get the arguments the wrong way round and introduce all manner of bugs.
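
A rough before-and-after, with just two of the concepts and invented names for illustration:

// Before: integer constants. Nothing stops a caller writing
// recordResult(WIN, HOME) with the arguments the wrong way round.
public static final int HOME = 0;
public static final int AWAY = 1;
public static final int WIN = 0;
public static final int DRAW = 1;
public static final int DEFEAT = 2;

public void recordResult(int venue, int outcome) { /* ... */ }

// After (Java 5): each concept is its own type, so mixed-up arguments
// simply will not compile.
public enum Venue { HOME, AWAY }
public enum Outcome { WIN, DRAW, DEFEAT }

public void recordResult(Venue venue, Outcome outcome) { /* ... */ }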

User Interface

It amazes me that people ever achieved anything worthwhile with just AWT. While I would use my mastery of the GridBagLayout to earn the respect of my peers, the lack of a rich set of GUI controls was very limiting. No tables, trees, tab sets or sliders for a start.

Swing has its critics, but I’m not one of them. There were some performance and look-and-feel issues in the early years, but I think that on the whole they got a lot more right than they got wrong. Now that I am not tied to Java 1.1, converting my project’s UI from AWT to Swing is the next item on the agenda. It will be with great pleasure that I rip out the sluggish and horrific table-simulating code that I built from a mess of GridLayouts and Labels.

[Swing was available as an external library for use with Java 1.1, but I had originally steered clear of this option because of the effect this would have had on the software’s download size]

Tools

When I began this project in 1999, Apache Ant was still a little-known part of the Tomcat project. I think JUnit was probably around then but it was not so well-known (certainly not to me). I wrote the original source code in the venerable PFE, a capable editor but one without even syntax highlighting. I built the code using a Windows 98 batch file. The idea of writing automated unit tests did not even occur to me. These days IntelliJ IDEA, Ant and TestNG are essential to my development efforts.

I have no doubt that people will still be writing Java code in another decade, though how significant it will be, and how far removed from today’s code, remains to be seen.  Maybe the rest of us will have all switched to JavaNG?

Software That Won’t Die: FSA goes Open Source

Posted in Java by Dan on June 18th, 2008

Having previously declared it deader than a Norwegian Blue, I’ve decided to resurrect my earliest, and unquestionably most obsolete, public Java project. The Football Statistics Applet (FSA) has been providing client-side league tables and statistics to numerous football websites for 8 or 9 years now.

Applets? Java 1.1? AWT? So last century. But despite that, it continues to be undoubtedly the most popular piece of software that I’ve written, even though I might have preferred one of my other projects to have claimed that tepid distinction.

A server-side solution (probably PHP and MySQL) to the same problem would be much more sensible today, but FSA was born of an age (1999/2000) when even a static web hosting account cost more than hosting with server-side processing and a database does now.

So, other than a developer who is not able to let go of one of his creations, why not let it die?

Firstly, applets might just be about to make something of a comeback (possibly? …no? OK then). Sun is finally addressing some of the issues that have made them unpopular and is positioning Java as a viable contender in the RIA space. But even if this new momentum comes to nothing, the core of FSA could be useful in other applications. Maybe a servlet-based web-app, or it could be converted to a WebStart application.

Secondly, I receive enough mail about FSA to convince me that there are still people interested in seeing it revived. There are a number of minor enhancements and major new features that have been requested over the years.

Previously I avoided open-sourcing FSA because I didn’t feel that the source code represented the best that I was capable of. I was insecure. I didn’t want to be judged as a shoddy programmer. Now I don’t care what you think, so FSA is on Java.net under the terms of the GPL. I’ve started dragging it into the 21st century. The first thing that had to go was the dependency on Java 1.1. I’ve converted everything to use Java 5.0 collections and language features and made the code a bit more OO (it’s now not so much C-in-disguise as it was before). The next task is to replace the ugly and rather basic AWT GUI with Swing.

Anyone wanting to join the effort of improving the mediocre code (no unit tests) or the rather sparse documentation is welcome to request a role on the project via Java.net.

The Database State: 42 Days? Try 149 Trillion Years

Posted in The Real World by Dan on June 12th, 2008

The big story in the UK in the last couple of days has been the vote in the House of Commons on increasing the maximum period that somebody can be detained on suspicion of terrorism without charge to 42 days. When the Labour Party came to power in 1997 you could not be held without charge for more than 7 days. Under Tony Blair this was doubled to 14 days. It was later doubled again, to 28 days, but only after an attempt to extend it to 90 days was narrowly defeated.

Yesterday the government’s bill to extend detention without charge to 42 days was passed by the House of Commons (though it may yet be rejected by the House of Lords or ruled illegal by the European Court of Human Rights). This prompted prominent opposition MP David Davis to resign from Parliament this morning (more on him later).

Throughout the many debates on this issue over the last few years the justification in favour of the increase has been that terrorist plots are becoming ever more complex. We are constantly told about cases that require the police to examine hundreds of computers and thousands of CDs in the search for evidence. The extra time is needed, so we’re told, to allow police to access these “encrypted” files. The Prime Minister himself mentioned encryption in his press conference this morning (towards the end of the video clip embedded on that page):

“…certainly involving encrypted computers and everything else… that they will need more time to deal with that.” – Gordon Brown

Any terrorist who allows their encryption to be cracked by the police within 42 days was not paying attention at terrorist school. The real world is not like a badly written crime drama where the stereotypical nerd cracks the bad guys’ encryption in less than an hour using a desktop PC and a 3D screensaver. Even a 128-bit AES key would take trillions of years to crack:

16. What is the chance that someone could use the “DES Cracker”-like hardware to crack an AES key?

In the late 1990s, specialized “DES Cracker” machines were built that could recover a DES key after a few hours. In other words, by trying possible key values, the hardware could determine which key was used to encrypt a message.

Assuming that one could build a machine that could recover a DES key in a second (i.e., try 2^55 keys per second), then it would take that machine approximately 149 thousand-billion (149 trillion) years to crack a 128-bit AES key. To put that into perspective, the universe is believed to be less than 20 billion years old.
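
Out of curiosity, that figure is easy to reproduce with some back-of-the-envelope Java, assuming (as the quote above does) a brute-force search of half the 2^128 key space at 2^55 keys per second:

double keysPerSecond = Math.pow(2, 55);
double averageKeysToTry = Math.pow(2, 127); // half of the 2^128 key space
double seconds = averageKeysToTry / keysPerSecond; // 2^72 seconds
double years = seconds / (60.0 * 60 * 24 * 365.25);
System.out.println(years / 1e12 + " trillion years"); // roughly 149.6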

So it seems to me that the debate should be about detaining suspects without charge for up to 149 trillion years (and let’s just hope that the terrorists don’t think to use 256-bit keys). Anything less would be an ineffective compromise. David Davis touched on this in his resignation speech:

“…because the generic security arguments relied on will never go away – technology, development and complexity and so on, we’ll next see 56 days, 70 days, 90 days…” – David Davis

David Davis is a Computer Science graduate, so he is probably more aware than other MPs of the absurdity of the idea of detaining suspects while their encryption is cracked.

Davis has chosen to wage war on the Labour government’s approach to civil liberties. In his criticism of the national DNA database and proposed ID cards he coined the term “The Database State”. I think this is a phrase we will be hearing a lot more of in the coming months. Especially given recent failures to protect sensitive data.

Great Innovations in Computing

Posted in Software Development by Dan on June 8th, 2008

Jeff Atwood poses the question “What do you think the single greatest invention in computer science is [other than the computer itself]?”

Of course, all of this depends on how strictly you define Computer Science and whether or not you include innovations that could more accurately be classified as Software Engineering or Electronic Engineering achievements.

“Computer science is no more about computers than astronomy is about telescopes.” – Edsger Dijkstra

Jeff himself, quoting Steve McConnell, opts for the “routine” (as in procedure/method/function) as the greatest invention. The comments on Coding Horror suggest alternatives, with the compiler being perhaps the most compelling contender.

My vote goes elsewhere. A strong case could be made for the Internet (in general, or the World Wide Web specifically) as the greatest invention. But, like Jeff, I’m going for something much more fundamental. The question is framed to exclude the computer itself, but I would identify one particular refinement to the design of the modern computer as being arguably the most significant milestone in Computer Science. Or perhaps more accurately, it is an advance in the design of the machines that allow practical application of a science that Dijkstra suggested did not really need machines at all.

[Book cover: Prisoner's Dilemma: John Von Neumann, Game Theory and the Puzzle of the Bomb]

In February 1946, The University of Pennsylvania unveiled the ENIAC, the first general-purpose, Turing-complete electronic computer. It was programmed by manipulating switches and cables. By hand. The ENIAC project had been in progress since 1943 and, by the time of its unveiling, work was already underway on its successor, the EDVAC. In June 1945, John von Neumann distributed a document entitled First Draft of a Report on the EDVAC. This 101-page document, written before the end of the Second World War, contained the outline of pretty much every general-purpose computer built since. The design of the EDVAC differed from the ENIAC in two fundamental ways. Firstly, it used binary numbers rather than decimal. Perhaps even more significantly, it was to be a stored-program computer.

By defining an instruction set and treating computations as sequences of these instructions, stored in the machine’s memory just like data, a stored-program computer is considerably more flexible than previous hard-wired machines.

Like all the best ideas it seems fairly obvious in hindsight, but the implications of this innovation are difficult to overstate. As well as the powerful flexibility it offered, the advent of the stored-program computer meant that it was now possible to write programs that wrote programs. This was an advance that would eventually lead to that other great computing invention: the compiler. Code was data.

The stored-program approach came to be known as the von Neumann architecture, though this ignores the contributions of von Neumann’s collaborators and contemporaries. Many of the ideas were reportedly established before von Neumann became involved in the project. Nor was the EDVAC the first working stored-program computer. That honour goes to the University of Manchester’s Small-Scale Experimental Machine (SSEM or “Baby”), which became operational in June 1948.

Visual SourceSafe: A Public Service Announcement

Posted in Software Development by Dan on May 23rd, 2008

Microsoft’s Visual SourceSafe has never been a good option for revision control. Even before the emergence of the likes of Subversion, Mercurial and Git, there have always been better free solutions available.

The fact that people still use VSS in 2008 scares me.

I’m not particularly anti-Microsoft. If you choose to use Windows, or Office, or even IE – fine. But a decision to use Visual SourceSafe is not one that can be rationally defended. Too often it gets picked because it comes bundled with a Microsoft development subscription that has to be paid anyway:

“Well, we’ve paid for it, so we might as well use it.”

“After all, something we paid for is going to be better than something free, right?”

No. No. No. DO NOT USE VISUAL SOURCESAFE FOR ANYTHING. Ever.

How Bad Can It Be?

Visual SourceSafe and I go back a long way. In my first job out of university, we used VSS. I had not been exposed to version control systems before. They didn’t teach source code management at my university in those days despite it being arguably the most important tool for professional software development.

I got on fine with VSS. It seemed to do the job. Sure it was slow but we could wait. And it only worked on Windows but we were all running NT 4.0 anyway. And sometimes the repository would get corrupted but we had backups. And we couldn’t easily check out two working copies of the same project on the same machine but we could work around this. And the exclusive checkout model was a bit restrictive but it seemed to make sense and we could just take turns editing the crucial files.

Then somebody insisted that all future projects would use CVS. This was bad – or so I thought. VSS had a reasonably friendly GUI, albeit with a few odd quirks. CVS had the intimidating CLI or some really awkward GUIs (it was a while before I would discover the excellent SmartCVS). All that merging and branching was very complicated too. Surely that wasn’t a good idea? But after a period of complaining I came to respect CVS even though it would sometimes tell me “I HATE YOU”. It was clearly a product of the Sticky-Tape and String school of software development but it worked pretty well.

Visual SourceSafe was less sticky-tape and string and more sticky-tape and a severe blow to the head.

“Visual SourceSafe? It would be safer to print out all your code, run it through a shredder, and set it on fire.” – (Attributed to an unidentified Microsoft employee).

Masquerading as a serious professional tool, VSS exhibits a staggeringly inappropriate architecture. There is no server process for VSS. Everything is based on Windows file-sharing. This obviously has serious implications for security. Client software is trusted implicitly. It also explains the poor performance and susceptibility to repository corruption. Everything is stored on the network share, even the dimensions for each user’s VSS Explorer window (there is nothing to stop you from editing your colleague’s local preferences in the shared .INI file). Maybe if you have a fast, reliable network, daily backups, and you’re very trusting, you can use VSS without major problems. Maybe.

It Gets Worse

That of course assumes that you don’t need to access your source code remotely. Unfortunately, my pre-CVS days were not the last time that I encountered SourceSafe.

I later joined another company who hadn’t got the memo. At least this time we had a solution for remote access. I say “solution”; it was more a proof-of-concept. By “proof-of-concept” I mean that somebody saw it working once. Probably.

Clearly we couldn’t just expose the network share on the Internet, so the idea was to establish a VPN connection to the office and then use the VSS client normally. Even with a broadband connection this was intolerable. The inefficiencies of SourceSafe were not always apparent on a 100Mb LAN but they were all too obvious over ADSL. Since people were usually not away from the office for more than a couple of days at a time, it was easier just to plan your work in advance and check in your changes when you got back (just make sure nobody else is going to need to edit the same files while you are away). To add insult to injury, we were increasingly developing for Solaris. We couldn’t access SourceSafe from the Sparc machines but at least we had scp to bridge the gap.

Anyway, to cut a long story short, we eventually ditched VSS in favour of Subversion but not before I discovered my two favourite “features”. Perhaps the most entertaining is the timezone problem. If you have clients in different timezones – or even on the same LAN but with clocks out-of-sync – when one client checks in a change from “the future”, the other can’t see it until its clock has caught up.

The other problem is that, once you have deleted a file, you cannot then re-add another file with the same name in the same location without purging the history of the first file. This situation will happen at least once in the life of a project (after all, developers sometimes change their minds). Once you have purged the history of the original you then cannot retrieve a complete snapshot of the project for any version that included that file. That point is worth emphasising:

VISUAL SOURCESAFE DOES NOT MAINTAIN A HISTORY OF ALL COPIES OF ALL FILES UNDER ALL CIRCUMSTANCES.

And really, if that’s the case, why bother with version control at all? Even ignoring the several other problems, SourceSafe is not fit for purpose.

Use Subversion, use Git, buy Perforce, buy BitKeeper, even use CVS if you must (though Subversion should be considered first). Just don’t use Visual SourceSafe.

NOTE: In the interests of accuracy, it is worth mentioning that my experience is primarily with version 6.0 and earlier of SourceSafe. Visual SourceSafe 2005 introduces a server process as a helper to address some of the issues, though it is really just papering over the cracks as the underlying architecture remains unchanged. It is also worth noting that there are 3rd party add-ons to SourceSafe to improve the remote access situation. But why pay for a patch for a defective version control system when you can get one that works for free?

Why are you still not using Hudson?

Posted in Java, Software Development by Dan on May 9th, 2008

This week Hudson was awarded the Duke’s Choice Award in the Developer Solutions category at JavaOne.

In the space of a couple of years, Hudson has come from nowhere to become the leading contender among Continuous Integration servers. It’s head and shoulders above the other free alternatives, and arguably at least as good as the commercial offerings.

The venerable Cruise Control led the way for many years, but it suffers for having been first. Configuring builds is a lot more work than it needs to be. Continuum improved on this by allowing everything to be controlled via the web interface. Continuum is simple and useful. For a while I used it and I was happy.

Then JetBrains gave away free TeamCity licences with IntelliJ IDEA 6.0 and opened my eyes to a world beyond the fairly basic functionality of Continuum. I was impressed (pre-tested commits are a neat feature), but because you needed licences for every user, I was never able to fully commit to it.

NOTE: JetBrains have since introduced a free professional edition of TeamCity.

Anyway, at some point last year I took a serious look at Hudson. I’d been vaguely aware of it for a little while but never been compelled to try it out.

Hudson is impressive. It is ridiculously easy to install. It has the same ease of configuration that makes Continuum so simple, but combines it with high-end features, such as distributed builds, that are usually only found in commercial offerings like TeamCity and Atlassian’s Bamboo.

Hudson is primarily the work of Sun Microsystems’ Kohsuke Kawaguchi. Kohsuke is a prolific Open Source developer. With Hudson he has designed a system with well thought-out extension points that have enabled an army of plug-in writers to deliver a bewildering array of capabilities.

Out of the box, Hudson supports CVS and Subversion repositories. Plug-ins extend this list to include Git, Mercurial, Perforce, ClearCase, BitKeeper, StarTeam, Accurev and Visual SourceSafe. Hudson also supports pretty much any type of build script (including Ant, Maven, shell scripts, Windows batch files, Ruby, Groovy and MSBuild).

In addition to e-mail and RSS, Hudson can also notify you of build events via IRC and Jabber as well as via a system tray/dock applet. Of course, all of these were too mundane for Kohsuke, so he built his own glowing orb.

But Hudson is much more than just a Continuous Integration server. It’s a complete build management and tracking solution. As well as publishing Javadocs, archiving build artifacts, and monitoring and graphing JUnit/TestNG results over time, you can also track and plot code coverage (Cobertura, EMMA and Clover are all supported) and coding violations (via FindBugs, Checkstyle and/or PMD). And if all that’s not enough, you can play the Continuous Integration game.

So why are you still not using Hudson?

Eat, Sleep and Drink Software Development: Finding The Zone

Posted in Software Development by Dan on May 5th, 2008

Tired Programmers Damage Your Project

I completely agree with David Heinemeier Hansson’s recent article, Sleep Deprivation is not a Badge of Honor.

I’ve seen many examples of this kind of counter-productive attitude in software development. From developers on an hourly rate contributing 100-hour-plus work weeks to maximise their pay, to teams being expected to work evenings and weekends just to be seen to be doing something to rescue late projects, even though that “something” is ultimately detrimental both to the project and to the team.

The accepted wisdom seems to be that if we can do X amount of work in 40 hours then in 80 hours we ought to be able to do, if not 2 * X, then at least more than X. This is based on the dubious assumption that some extra work is always better than no extra work. We may expend more effort but it’s quite likely that the effort will be wasted introducing bugs, making poor design decisions and doing other things that will invariably cause more work later on.

The ever-infallible Wikipedia lists confusion, loss of concentration, impatience, memory lapses, depression and psychosis among the many effects of sleep-deprivation. These don’t sound like ideal traits for a software developer.

Speaking from personal experience, the impact of tiredness on my concentration is the real productivity killer. If you put in extra hours at the beginning of the week you may find it impossible to repay the debt until the weekend, which means you’ll be working at a sub-optimal level for days.

At a company where I worked several years ago we would routinely work late into the evening to try to get more done. Except there was one developer who was extremely reluctant to put in additional hours and generally managed to avoid doing so. This was tolerated because he was consistently the most productive member of the team and produced the best quality code. It didn’t occur to me at the time but the reason he produced the best work was probably largely because he wasn’t working stupid hours, unlike the burnt-out hackers who were checking in poorly thought-out, error-strewn code every night.

It is vital to be able to disengage from the task at hand. To go away and come back with a fresh perspective and new insights. You can’t see the big picture with your nose constantly pushed up against the canvas.

An Industry of Addicts

Tiredness often leads to programmers relying on stimulants, usually caffeine, as a substitute for adequate rest. It would appear that developers who don’t have a caffeine dependency are in the minority. I’ve had colleagues who need two cans of Red Bull just to kick-start their brains in the morning and others who drink several cups of ridiculously strong black coffee to keep them going through the day. Of course, here in Blighty, the delivery method of choice is the humble cup of tea – backbone of the British Empire.

High caffeine consumption has a whole host of nasty side-effects that complement the effects of sleep-deprivation perfectly. Insomnia is the real kicker. You need to sleep, you are knackered, but you can’t sleep because of the caffeine. So you either go to bed later or you lie awake for hours. When the alarm goes in the morning you are not recharged. You drag yourself out of bed and drink that first coffee/tea/Red Bull to get you started for the day. You are an addict, just in a slightly more socially-acceptable way than if you were smoking crack in the back alley.

“The Zone” and Peak Mental Performance

All developers are familiar with “The Zone”: that elusive state of mind where the code flows and in an hour we achieve more than we could in a week outside of the zone. What is less clear is how we get into the zone in the first place. Lack of sleep doesn’t help. If you are tired you won’t find the zone. Too much caffeine probably has a similar effect.

So what else affects our mental performance? I am ignorant in these matters but it seems reasonable that diet and general well-being would play a part. This makes Google’s strategy of free meals and snacks particularly interesting. They are not just providing a perk that may encourage people to work there, nor merely encouraging their employees to lunch together for the sake of team-building and the sharing of ideas. They are also taking control of their workers’ nutrition, rather than leaving them to subsist on Mars Bars and Coke. It would be fascinating to see a study into what kind of impact this had on staff performance and whether it came close to offsetting the apparently huge cost. The best athletes leave nothing to chance in their preparation. Nutrition is part of this. Maybe it’s the same for more intellectual endeavours?

A Java Programmer’s Guide to Random Numbers, Part 3: Seeding

Posted in Java by Dan on April 10th, 2008

The trilogy concludes.

This is the last of three articles about random numbers in Java programs. The series covers the following topics and also serves as an introduction to the random numbers package of the Uncommons Maths library:

  • True Random Numbers and Pseudorandom Numbers
  • Statistical Quality (making sure the dice are not loaded)
  • Performance (because sometimes you really do need half a million random numbers a second)
  • Different kinds of randomness (because not everything can be modelled by tossing a coin)
  • Degrees of Freedom (is every possible outcome actually possible?)
  • Security (when not losing money depends on nobody knowing what happens next)

Part 3: Seeding

Part 1 of this series discussed different kinds of random number generators (RNGs), highlighted the issues with using the default Java RNGs (java.util.Random and java.security.SecureRandom) and introduced the Uncommons Maths library, which provides three fast, high-quality alternative Java RNGs. Part 2 showed how to generate random numbers from different probability distributions using the classes provided by Uncommons Maths.

One aspect of pseudorandom number generators (PRNGs) that we glossed over in part 1 is the issue of seeding. In this article, we’ll look at two important issues related to the seeding of PRNGs.

Degrees of Freedom

Suppose we want to shuffle a standard deck of 52 playing cards. The Java API provides an implementation of a very good algorithm for shuffling a list (see Collections.shuffle). It even provides a variant of this method that allows us to specify our own RNG to be used. So shuffling a deck of cards (assuming it is represented as a List) is easy in Java.
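
In code it is a one-liner; Card and createDeck() here are hypothetical stand-ins for however the deck is represented:

// Shuffle using a caller-supplied RNG (here one of the Uncommons Maths RNGs).
List<Card> deck = createDeck(); // 52 Card objects
Random rng = new MersenneTwisterRNG();
Collections.shuffle(deck, rng);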

Unfortunately, even though the algorithm is a perfect shuffle and we can choose an RNG that exhibits no bias, our shuffling code is still likely to be flawed in one respect. The problem is that the RNG is almost certainly not able to generate every possible outcome at each state transition. There are many orderings of the deck that cannot occur. The RNG has insufficient degrees of freedom. To understand why this is, we need to consider how many ways there are of ordering a deck of 52 cards. I’ll spare you the maths and tell you that the answer is 52! (that’s 52 factorial) or, in longhand, 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000.

That’s a pretty big number.
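
If you want to check it (and its size in bits, which becomes relevant shortly), BigInteger will oblige:

// 52! computed exactly, plus the number of bits needed to represent it.
BigInteger permutations = BigInteger.ONE;
for (int i = 2; i <= 52; i++)
{
    permutations = permutations.multiply(BigInteger.valueOf(i));
}
System.out.println(permutations);             // the 68-digit number above
System.out.println(permutations.bitLength()); // 226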

Recall from the first article in this series that when we seed a PRNG we get the same sequence of output each time that we use the same initial seed. The maximum number of possible initial states for the PRNG is equal to the number of possible seed values. If each RNG state gives us a different shuffle, the number of different deck orderings we can generate is determined by the range of the RNG seed.

The CellularAutomatonRNG is very fast, but it only uses a 32-bit seed. So using it for shuffling restricts the result of any given shuffle to one of 4,294,967,296 outcomes. That’s 4.3 billion possibilities but it may as well be zero given the number of outcomes that should be possible but aren’t.  [Note that a second shuffle may well allow a different set of 4,294,967,296 permutations than the first (so over time a larger number of outcomes are possible) but for any nth shuffle the number of possible outcomes is restricted by the number of starting states.]

We would probably want to avoid using java.util.Random but, if we did use it, we’d give it a 64-bit long as its seed. The algorithm only actually uses 48 of those bits for seeding so that gives us 281,474,976,710,656 possible initial RNG states. 281 trillion is an improvement, but still woefully inadequate.

The MersenneTwisterRNG uses a 64-bit seed. That’s 18,446,744,073,709,551,616 possible outcomes for any given shuffle. Even this huge number is a miserable, tiny, insignificant fraction of one percent of the full set of 52! possibilities that we ought to be able to generate.

SecureRandom may use different algorithms internally, depending on the platform. On Windows, the default is the SHA1PRNG provider, which gives us a 160-bit seed. Ignoring SecureRandom’s slowness, this is getting us closer to where we want to be (if you can call less than one trillionth of one percent of the possible outcomes ‘close’).

One more RNG…

It turns out that we need a seed of at least 226 bits in order to achieve the necessary degrees of freedom. This rules out most PRNGs that you might otherwise consider. There is at least one option however: Uncommons Maths’ AESCounterRNG. The AESCounterRNG is based on the AES block cipher. AES is the Advanced Encryption Standard adopted by the US government. The characteristics of AES that make it a good encryption algorithm also coincidentally make it a good RNG. When we use AES as an RNG, the seed is the encryption key. AES supports 128-bit, 192-bit and 256-bit keys.

Using AESCounterRNG with a 256-bit seed, we have the necessary degrees of freedom so that all 52! orderings are possible outcomes of a shuffle. AESCounterRNG, while still much faster than java.security.SecureRandom, is not quite as fast as the other RNGs we’ve been looking at. It does however have other advantages that we shall see later.
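
A minimal sketch of the shuffling code with a 256-bit seed might look like this. The seed-size-in-bytes constructor is an assumption on my part (check the Uncommons Maths Javadocs for the exact signatures, and note that the constructors may declare checked exceptions that a real program would have to handle); Card and createDeck() are hypothetical as before.

// Ask for a 32-byte (256-bit) seed so that all 52! orderings are reachable.
Random rng = new AESCounterRNG(32); // assumed constructor - see the Javadocs
List<Card> deck = createDeck();
Collections.shuffle(deck, rng);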

Note: Using encryption keys longer than 128 bits in Java requires that you obtain and install the unlimited strength cryptography policy files from Sun.

Security

Before you rush off to set up your new online casino, there’s one further issue to be considered. Remember that pseudorandom number generators are deterministic. That means that if we know which PRNG algorithm is in use and we know what seed was used, then we can predict with absolute certainty every single value that the RNG will ever produce. No problem for non-sensitive applications, but if people are gambling on the outcomes, you’d better be certain that they don’t have this kind of inside information.

Forget about seeding your RNG from the system clock – your seed data needs to be random itself or it is vulnerable (see How We Learned to Cheat at Online Poker: A Study in Software Security for a real-world account of how RNG weaknesses can be exploited in online poker). Even if an attacker doesn’t know the seed, with many algorithms it can be reverse-engineered by observing the RNG’s output and applying a little computing power.

To address these security concerns, the first step is to pick a PRNG implementation that is not susceptible to reverse-engineering. That brings us back to the AESCounterRNG. Being based on AES means that reverse-engineering the seed would involve cracking the cipher. Good luck doing that in a timely manner.

Disclaimer: You should not rely on the AESCounterRNG to address all of your security concerns. A fuller understanding of the issues can be gained by reading Ferguson and Schneier’s Practical Cryptography – particularly the description of their Fortuna PRNG design.

So where do we get the seed data from? We have a bit of a bootstrapping problem. In order to generate good quality, unpredictable random numbers we first need some good quality, unpredictable random data. By default, all of the Uncommons Maths RNGs seed themselves from /dev/random on Unix systems. On Windows systems, seed data will be downloaded from random.org, and if neither of those strategies is successful (perhaps the SecurityManager prevents file or network access) then it will fall back on the seeding strategy provided by SecureRandom.generateSeed. However, seeding strategies are pluggable so that you can provide your own. One possible strategy is to use a true hardware RNG to bootstrap the faster PRNG that will generate the actual output.
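
As a sketch of a pluggable strategy, assuming the SeedGenerator interface exposes a single generateSeed(int) method and a SeedException with a (message, cause) constructor (check the Javadocs for the exact signatures), and using an entirely hypothetical hardware device path:

// Hypothetical seeding strategy that reads seed bytes from a hardware RNG device.
public class HardwareSeedGenerator implements SeedGenerator
{
    public byte[] generateSeed(int length) throws SeedException
    {
        DataInputStream in = null;
        try
        {
            byte[] seed = new byte[length];
            in = new DataInputStream(new FileInputStream("/dev/hwrng")); // hypothetical device
            in.readFully(seed);
            return seed;
        }
        catch (IOException ex)
        {
            throw new SeedException("Hardware RNG unavailable.", ex);
        }
        finally
        {
            if (in != null)
            {
                try { in.close(); } catch (IOException ex) { /* ignore */ }
            }
        }
    }
}

// Plug it into an RNG (this constructor may also declare SeedException):
Random rng = new MersenneTwisterRNG(new HardwareSeedGenerator());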

The End

And that’s pretty much everything I can tell you about using random number generators in Java. If you have any feedback on Uncommons Maths, please use the issue tracker for bugs and suggestions. If you want to know more about pseudorandom number generators, the following books cover the theoretical and practical issues in much more depth, and much more precisely, than I have done.

The Art of Computer Programming by Donald Knuth, and Practical Cryptography by Niels Ferguson and Bruce Schneier.

Free Genetic Programming Book

Posted in Books, Evolutionary Computation by Dan on April 7th, 2008

I just spotted this on comp.ai.genetic.  The authors of a new book called A Field Guide to Genetic Programming have made it available for download in PDF form free of charge.  Weighing in at around 200 pages, it looks like a reasonably concise introduction to the topic (unlike some of the huge and hideously expensive GP books you can buy on Amazon).

This is good timing for me because I’ve recently started hacking together a GP example application to include in the next release of the Watchmaker Framework for Evolutionary Computation.  So I can catch up on a bit of background reading to make sure I’m doing things sensibly.  Watchmaker is a general purpose evolution framework, intended to address the full range of evolutionary algorithms.  I’ve been claiming for a while that you can use it for genetic programming, so I thought it was about time I demonstrated this.  I’m not aware of anybody having used Watchmaker for GP so far.  I’d love to hear from anybody who has done so.

Genetic programming is also covered in an accessible way in Toby Segaran’s excellent book, Programming Collective Intelligence, which includes GP examples in Python.

A Java Programmer’s Guide to Random Numbers, Part 2: Not just coins and dice

Posted in Java by Dan on April 6th, 2008

More than you ever wanted to know about randomness in Java programs.

This is the second in a series of articles about random numbers in Java programs. The series covers the following topics and also serves as an introduction to the random numbers package of the Uncommons Maths library:

  • True Random Numbers and Pseudorandom Numbers
  • Statistical Quality (making sure the dice are not loaded)
  • Performance (because sometimes you really do need half a million random numbers a second)
  • Different kinds of randomness (because not everything can be modelled by tossing a coin)
  • Degrees of Freedom (is every possible outcome actually possible?)
  • Security and Integrity (when not losing money depends on nobody knowing what happens next)

Part 2: Not just coins and dice

Part 1 of this series discussed different kinds of random number generators (RNGs), highlighted the issues with using the default Java RNGs (java.util.Random and java.security.SecureRandom) and introduced the Uncommons Maths library, which provides three fast, high-quality alternative Java RNGs.

But there is more to the random numbers package than just replacement RNGs. Uncommons Maths also provides convenient classes for working with several different probability distributions, which is pretty much essential for modelling anything other than the most simple random processes.

The Uncommons Maths random numbers demo application (WebStart required) demonstrates many of the concepts discussed in this article. It also serves as an example of the performance differences between different RNGs as discussed in the previous article.

Different Kinds of Randomness

Uniform Distribution

The Uniform Distribution is what you get by default from java.util.Random (and its sub-classes – including the Uncommons Maths RNGs). The discrete uniform distribution says that there are a certain number of possible outcomes and each is equally likely. The obvious example is rolling a dice. You can get a 1, 2, 3, 4, 5 or 6 and no single outcome is more likely than any of the others. If you record the results of 6 million dice rolls you would expect to get pretty close to a million of each number.
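
A quick sketch of that experiment:

// Tally 6 million simulated dice rolls; each count should end up close to a million.
Random rng = new MersenneTwisterRNG();
int[] counts = new int[6];
for (int i = 0; i < 6000000; i++)
{
    counts[rng.nextInt(6)]++; // nextInt(6) returns 0-5, each equally likely
}
System.out.println(Arrays.toString(counts));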

The uniform distribution is useful for some problems, such as shuffling a deck of cards, but it is not a good choice for many other scenarios that don’t involve a set of equally likely outcomes.

Normal (Gaussian) Distribution

The Normal Distribution (also known as the Gaussian Distribution) is the familiar bell curve. It is a good model for many real world phenomena. In a normal distribution, the average outcome is more likely than any other outcome. A slightly above or below average outcome is almost as likely, and extreme outcomes, while still possible, are very unlikely.

An example of this would be the distribution of IQ test scores. Most people would achieve somewhere close to an average score (maybe slightly above or below), but a few people would achieve very low scores and a similarly small number would achieve very high scores.

The nextGaussian() method of the java.util.Random class provides rudimentary support for obtaining normally-distributed values. Uncommons Maths’ GaussianGenerator class builds on this, making it easy to create distributions with different means and standard deviations. The standard deviation parameter controls how spread out the values are. A low standard deviation means most values will be very close to the mean (on either side). A high standard deviation increases the likelihood that a value will be a long way from the mean.

GaussianGenerator is a wrapper around a standard RNG (you can use any of the Uncommons Maths RNGs, or even one of the standard Java ones if you so choose). Once you have created a GaussianGenerator by specifying a mean, standard deviation and source RNG, you can call the nextValue() method ad infinitum:

Random rng = new MersenneTwisterRNG();
GaussianGenerator gen = new GaussianGenerator(100, 15, rng);
while (true)
{
    System.out.println(gen.nextValue());
}

If we were to generate thousands of simulated IQ test scores with this code, we would see that the vast majority of scores would be close to the average (100). Around 68% of the values would be within one standard deviation (15) of the mean (i.e. in the range 85 – 115), and approximately 95% would be within two standard deviations (between 70 and 130). These percentages are a well-known property of normal distributions.

The demo shows how the distribution of generated values compares to a theoretical normal distribution (the more values you generate, the closer the fit will be).

Binomial Distribution

The probability of a coin flip coming up heads is 0.5. This is pretty simple – we can just use the uniform distribution to simulate a single coin flip. But what is the probability of getting exactly 7 heads from 10 coin flips? This is where the Binomial Distribution comes in. It provides probabilities for the outcome of a series of trials, where each trial has two possible results – success or failure. The binomial distribution tells us that the probability of obtaining heads 7 times from 10 fair coin flips is about 0.12.

Suppose, for example, we wanted to randomly generate the number of times the number 6 occurred in 100 dice rolls. We could simulate the full series of trials using a uniform RNG, but this is cumbersome:

Random rng = new MersenneTwisterRNG();
int sixes = 0;
for (int i = 0; i < 100; i++)
{
    int diceRoll = rng.nextInt(6) + 1;
    if (diceRoll == 6)
    {
        ++sixes;
    }
}
System.out.println("No. of 6s from 100 rolls is " + sixes);

It is much easier, and more efficient, to use the binomial distribution. If we consider an occurrence of a 6 to be a “success”, then the probability of success is 1/6 (or ~0.1667). If we plug this value and the number of trials (100) into the Uncommons Maths BinomialGenerator class, we can get our answer from a single method call:

Random rng = new MersenneTwisterRNG();
BinomialGenerator gen = new BinomialGenerator(100, 1d/6, rng);
int sixes = gen.nextValue();
System.out.println("No. of 6s from 100 rolls is " + sixes);

Poisson Distribution

Suppose that, between 2pm and 4pm in the afternoon, a telephone call centre receives calls at an average rate of three per minute. What is the probability that we receive exactly five calls in the next minute? This is the kind of question that we can answer with the Poisson Distribution. The Poisson distribution is not dissimilar to the Binomial distribution, as you will see if you plot them. The difference is that the Poisson distribution is concerned with how many independent events occur within a set interval. Whereas values from a binomial distribution have an upper bound determined by the number of trials, there is no upper bound for a Poisson distribution. The probability of exactly five calls in a minute when the average is three is around 0.1 according to the Poisson distribution.

In simulation applications we can use the PoissonGenerator class (its single parameter is the mean) to determine how many times something happens within a given interval.
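
For example, something like this sketch would simulate the call centre described above (as with the other generators, a source RNG is passed in alongside the mean):

Random rng = new MersenneTwisterRNG();
PoissonGenerator gen = new PoissonGenerator(3, rng); // average of 3 calls per minute
for (int minute = 1; minute <= 5; minute++)
{
    System.out.println("Minute " + minute + ": " + gen.nextValue() + " calls");
}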

Exponential Distribution

The Exponential Distribution is related to the Poisson Distribution. Rather than modelling how many events occur in a fixed period of time, it models the period of time between two independent events that occur at a given average rate.

Suppose you wanted to simulate some random event that happened on average 10 times a minute – perhaps you are simulating load for a server. You could simply generate events at the constant rate of exactly one every 6 seconds. But the real world is rarely this metronomic. A more realistic approach would be to generate events randomly, using an exponential distribution (provided by the ExponentialGenerator class) to determine the intervals between events:

final long oneMinute = 60000;
Random rng = new MersenneTwisterRNG();
 
// Generate events at an average rate of 10 per minute.
ExponentialGenerator gen = new ExponentialGenerator(10, rng);
boolean running = true;
while (running)
{
    long interval = Math.round(gen.nextValue() * oneMinute);
    Thread.sleep(interval); // NB: sleep() throws InterruptedException - handle or declare it in real code.
 
    // Fire event here.
}

In a future article: Shuffling the deck – degrees of freedom, seeding and security concerns.
