Distributed Evolutionary Algorithms with Watchmaker and Hadoop

Posted in Evolutionary Computation, Java by Dan on October 1st, 2008

One feature that has been on the TODO list of the Watchmaker Framework for Evolutionary Computation for some time is the ability to distribute the evolution across several machines.  Some time last year I started on a RMI-based solution, but I wasn’t happy with it so I deleted it and put the idea on the back burner while I concentrated on other things.  At some point I wanted to investigate using Terracotta, or possibly Hadoop, to distribute the computations.

However, it’s often the case with Open Source software that somebody smarter comes along and does the hard work for you.  I was delighted to find out today that Abdel Hakim Deneche has been busy integrating Watchmaker with the Apache Mahout project as part of Google’s Summer of Code programme.

I’d never heard of Mahout before.  According to Wikipedia, a Mahout is somebody who drives an elephant.  Apache Mahout is a sub-project of Lucene, the Java text search and indexing engine.  The Mahout project is focused on building scalable machine-learning libraries using Hadoop (presumably where the elephant connection comes in).

I haven’t yet tried using the Mahout software, but it looks like it provides a pretty straightforward way to distribute the fitness evaluations for just about any evolutionary algorithm implemented using Watchmaker.

More thoughts on Stackoverflow.com

Posted in Software Development, The Internet by Dan on September 26th, 2008

Since my previous post on the subject, Stackoverflow.com has moved from private beta to public beta. I’ve had more time to use the site and have some more thoughts.  The criticisms here are meant to be constructive. Hopefully the feedback from users will help the Stackoverflow team to make a good site even better.

Performance

First the good news. The site has transitioned from private to public very well. Jeff and his team seem to have got it right in terms of architecture and infrastructure because, even with the increased load, it remains blindingly fast.

Front Page

In terms of usability, I think there’s more that could be done to help me find the content that I’m interested in. The default front page is, to be honest, not very useful. New questions are coming in so fast and on so many topics that displaying the most recent questions is just noise.

I would prefer to have a personalised home page that shows me relevant questions based on my previous answering/voting history.  I realise that this is major new functionality and I’m not criticising the Stackoverflow team for not having this in the initial version, it makes sense to get the site up and running first. However, it would be great if this could be implemented at some point. I’m not alone on this one, it’s the second most popular requested feature at the moment.

Presently I’m finding stuff that I want to look at by going to the tags page and clicking on interesting topics. But I’m sure I’m missing out on questions that would be of interest if only I could find them.

Tag Cloud

The tag cloud on the right of the front page isn’t very helpful either. It’s ordered with the most recent first. If I just wanted to view questions tagged “html”, I’m going to struggle to find the tag in the cloud. An alphabetical ordering would be more usable. Unfortunately, this has already been suggested and rejected.

Voting and Reputation

I outlined my concerns on the voting mechanism previously. In the interests of being constructive, rather than just a whiny blogger, I’ve opened new issues on the Stackoverflow Uservoice page. If you agree with me, please vote on these issues:

Addressing each of these will help in resolving The Fastest Gun in the West Problem (currently the number one voted-on issue). The problem is that early answers get the votes and later, better answers are largely ignored. Removing the penalty for down-voting will encourage more down votes where they are deserved (so an early answer that is later shown to be wrong is less likely to retain a high score). Also, if a down vote was as powerful as an up vote, people might be more careful in crafting good answers as opposed to quick answers.

Source Control and Backups – More than just a good idea

Posted in Software Development by Dan on September 25th, 2008

Are there really software development teams out there that don’t use any form of proper source control at all, even the bad kind?  I’d like to think that it wasn’t the case but I’m not so naive.

There’s a reason that “Do you use source control?” is the first question on the Joel Test.  It’s because it’s the most important.  If you answer “no” to this question you shouldn’t be allowed to answer subsequent questions.  Even if the rest of your process is perfect, you score zero.  You failed at software development.  I could say that if your team doesn’t use source control it is a disaster waiting to happen, but more likely the disaster already happened and you haven’t noticed yet.

Of course, you and I aren’t nearly dumb enough to try developing anything more complex than “Hello World” without version control in place.  I’m sure I’m preaching to the converted.  The kind of people who read obscure software development blogs probably already know a few things about effective software development.

But how good are your back-ups?

You do have a back-up, don’t you?

If you don’t have a back-up you are one accidental key-stroke or one hardware failure away from scoring zero on the Joel Test (under my rules)… and failing at software development.  Hardware will fail, people will screw-up, disgruntled former employees will set fire to the building.  None of these is a problem but a failure to anticipate and prepare is.

How often do you back-up?

There is only one right answer to this: every day.  Weekly back-ups are too costly.  Can you really afford to have your whole team redo an entire week’s work?  The first time you lose a week’s work you will switch to daily back-ups, so why not just do it now?

A melted back-up is no back-up at all

Off-Site Storage. You could physically take tapes to another location or you could upload files to a remote server.  Just don’t leave them here.

Does it actually work?

Honestly, have you ever tried restoring your source control back-up onto a different machine?  The most comprehensive back-up plan imaginable is useless if you can’t restore the back-ups.  If you haven’t seen it working (recently) then it doesn’t work.  There’s a good time and a bad time to find out that your back-ups don’t work.  15 minutes after your source control server spontaneously combusted is the bad time.

Are you still here?  You should be checking those back-up tapes…

UPDATE: The good people of Stackoverflow are discussing what could possibly be a good excuse for not using source control.

Stackoverflow.com – First Impressions

Posted in Software Development, The Internet by Dan on September 12th, 2008

Over the last few days I’ve been playing with the beta of Stack Overflow. In case you are unaware, Stack Overflow is a joint venture between Jeff Atwood of Coding Horror and Joel Spolsky of Joel on Software fame. It’s basically a question and answers site for software developers. A mixture of Experts Exchange, Proggit and Wikipedia. The site is scheduled to come out of beta on Monday when it will open its doors to everyone.

From initial impressions I think it’s fair to say that the site will be a success, initially at least. Being A-list bloggers (and now podcasters too), Jeff and Joel have been able to generate a lot of exposure for their project.

Like Jeff’s blog, the minimalist site design is clean and bold and so far the whole system is very responsive (we’ll see if that’s still the case when the traffic spikes on Monday). The beta audience are already posting thousands of questions, almost all of which generate extremely prompt answers (of varying quality).

However, I think the site suffers a little from the ambitions of trying to be too many different things; is it a programming forum, or is it a Wiki?. There are a lot of different ideas in the implementation that interact via a quite complicated set of rules that have evolved over the course of the beta.

Reputation & Badges

Stack Overflow has two mechanisms for measuring a user’s standing within the community. Firstly, each user has a reputation score. This starts at 1 and increases as you make positive contributions (posting questions and answers that get voted up).  As you reach various milestones you get more privileges within the community, such as being able to vote on answers or tag other people’s questions.

Your reputation can be diminished if you get voted down or reported for abuse, but it can’t go below 1 and on the whole it’s heavily biased in favour of upward movement.

The second incentive for users to contribute is the ability to collect “badges”. This works exactly like the Cub Scouts. Some badges are easy to achieve (just fill in your profile or post your first question), and others are much harder to obtain (get 100 up votes for one of your answers).

Voting

Voting is one area of the site that I think could do with an overhaul. It’s unbalanced and not transparent enough. If your answer gets voted up you gain 10 points of reputation. But if your answer gets voted down you only lose 2 points. So if you post something that sounds plausible to the uninformed masses but is actually wrong, you could get 5 up votes and 6 down votes for a net score of -1 yet still gain a 38-point reputation boost. An up vote should have equal weight to a down vote, just like on DZone or Slashdot. It also might be better to show both the number of up votes and the number of down votes (as on DZone) rather than just the net total. This would make it easier to identify controversial content (something with 10 down votes and 12 up votes is not quite the same as something with no down votes and 2 up votes).

Another problem with the voting is that down votes penalise the voter as well as the user whose answer is being voted on. So if you post something wrong like “Java passes objects by reference”, I can either ignore it or lose 1 point of reputation for giving you the down vote that you deserve (even then it will take five of us to fully cancel out the one up vote that you got from someone who didn’t know better).

When I queried the justification for penalising down-voters I was told that it was to combat attempts to game the system. Apparently, earlier in the beta, users were posting answers to questions and then voting down everybody else’s answers so that their answer would appear at the top. The idea was that by making users pay to vote down this behaviour would be discouraged. A better solution to this problem would have been to remove the conflict of interest by not allowing users to answer and vote on the same question (which is how Slashdot’s mod points work), rather than punishing all down votes across the whole site.

The net effect of this voting system is that everybody’s reputation increases pretty quickly. Beyond the minimum score required to get full privileges the numbers can become meaningless. To avoid having to rename the site Integer Overflow there are a couple of artifical limits that restrict the number of votes you can cast and the number of reputation points you can earn each day.

Other Thoughts

Aside from my reservations about the voting, my impressions of Stackoverflow are mostly positive. The fact that it has already attracted hundreds of enthusiastic participants suggests that it has genuinely found a niche. However, I do feel that it is probably more elaborate than it needs to be (I don’t really get the need for the Wiki functionality).

Further Reading

Denton Gentry and Sara Chipps have also written about their impressions of Stack Overflow. Or, on a less positive note, you could try Crap Overflow.

Avoid NIO, Get Better Throughput

Posted in Java by Dan on September 3rd, 2008

The Java NIO (new/non-blocking I/O) API introduced in Java 1.4 is arguably the most arcane part of the standard library. With channels, selectors, byte buffers and all the associated flipping, marking, compacting, event-handling and registering/de-registering of read/write interest, it’s an entirely different level of complexity to the old-fashioned, straightforward blocking I/O. And if you want to use SSL with NIO then it’s a whole new world of pain.

Few have mastered NIO. For most it provides an opportunity to really get to know your debugger. “Should this buffer be flipped before I pass it to this method, or should the method flip it?”. BufferOverflowExceptions and memory leaks abound.

So, in the spirit of doing the simplest thing that could possibly work, writing your own NIO code is usually best avoided unless you have a compelling reason. Fortunately, some masochistic individuals have done a lot of the hard work so that we don’t have to. Projects such as Grizzly and QuickServer provide proven, reusable non-blocking server components.

However, in most instances, maybe non-blocking I/O is not necessary at all? In fact, maybe it is detrimental to performance?

That’s the point that Paul Tyma makes. He attacks some of the received wisdom about the relative merits of blocking and non-blocking servers in Java. The characteristics of JVMs and threading libraries change as new advances are made. Good advice often becomes bad advice over time, demonstrating the importance of making your own measurements rather than falling back on superstitions.

Paul’s experiments show that higher throughput is achieved with blocking I/O, that thread-per-socket designs actually scale well, and that the costs of context-switching and synchronisation aren’t always significant. Paul’s slides from his talk “Thousands of Threads and Blocking I/O: The Old Way to Write Java Servers Is New Again (and Way Better)” are well worth a look.

If you are writing your own multi-threaded servers in Java, Esmond Pitt’s Fundamental Java Networking and Java Concurrency in Practice by Brian Goetz et al. are essential reading.

Real World Haskell

Posted in Books, Haskell by Dan on September 2nd, 2008

The book Real World Haskell by Bryan O’Sullivan, Don Stewart, and John Goerzen, will be available to buy from November.

The content is also freely available online already and is well worth a look if, like me, you are keen to learn more about developing actual useful programs with Haskell.

I first mentioned Real World Haskell last year. At the time I also highlighted GHC’s LGPL problems as an obstacle that could potentially discourage the wider adoption of Haskell. It seems that some progress is being made on that front. At present though, GHC will still statically link to GMP, which means that developers who distribute GHC-compiled binaries are distributing “derivative works” as defined by the LGPL.

More Stupid Java Tricks

Posted in Java by Dan on August 26th, 2008

My previous post was reasonably popular so I decided to follow-up with some more stupid Java tricks. It should go without saying that you shouldn’t use these techniques in any serious code, unless of course your objective is to write unmaintainable code.

It was pointed out to me by a couple of people that the puzzle I posed previously is also in Josh Bloch and Neal Gafter’s excellent Java Puzzlers book. If you are interested in all of the ugly corners of the Java platform, this book is well worth buying.

Unchecked Checked Exceptions

Throw checked exceptions without the hassle of dealing with them

There’s a lot of debate about the relative merits of checked and unchecked exceptions in Java. You can have the best of both worlds by throwing checked exceptions without either catching them or declaring them in a throws clause. The deprecated Thread.stop(Throwable) method is one way:

private void throwSomething() // Look, no throws clause!
{
    Thread.currentThread().stop(new IOException("This won't be caught."));
}

The slightly more socially-acceptable Class.newInstance() method is another way. It’s not deprecated but it does have one well-documented flaw:

Note that this method propagates any exception thrown by the nullary constructor, including a checked exception. Use of this method effectively bypasses the compile-time exception checking that would otherwise be performed by the compiler.

Invasion of Privacy

Strings aren’t really immutable, even literals can’t be trusted

This is a bit of a golden oldie but it still has great potential for messing with your co-workers’ sanity.

All high-level programming languages have rules. These rules are there to protect you, to stop you from really messing things up. Fortunately, Java allows you bypass some of these rules, via the dark art of reflection, and have some fun by changing things that you really shouldn’t be allowed to change.  Your confused colleagues might never trust a machine again. It’s all about the String constants cached by the JVM.

All it takes is a few lines (suitably hidden in some seemingly unrelated class)

java.lang.reflect.Field valueField = String.class.getDeclaredField("value");
valueField.setAccessible(true);
valueField.set("Hello World!", "Goodbye     ".toCharArray());

then this

System.out.println("Hello World!");

prints this

Goodbye

Escape to Victory

A language feature that appears to exist primarily to break syntax-highlighers

Of course, if your co-workers are made of sterner stuff you may need to be a little more devious to send them insane. They won’t be able to track down the offending code above by searching for “Hello World!”, “Goodbye” or even “setAccessible” if these Strings don’t appear anywhere in the source.

Unicode escape sequences (such as 'u0020') can be used not only to construct character and String literals but can actually be used anywhere in the source that their equivalent characters can be used. So entire statements, even entire classes can be written solely with these escape sequences.

Questionable Parentage

Instantiating inner class objects outside of the parent class

Inner classes (that is non-static nested classes) have an implicit reference to an instance of their containing class, but they don’t necessarily have to be instantiated within the scope of the parent class. Daniel Pitts shows another way. File this one under “ugly syntax” and “surely that’s not supposed to work?”.

Boxing Clever

From stupid Java tricks to stupid Java pitfalls

This one is not a trick as such but it is a good example of Java code that behaves completely unintuitively. The introduction of auto-boxing and auto-unboxing in Java 5 has created a whole new set of opportunities for bugs.

This is my personal favourite:

Integer n = 128;
Integer m = 128;
assert n <= m; // This works.
assert n >= m; // This works.
assert n == m; // This fails.

It gets better. Try changing n and m to both be 127 and it works, which, to be fair, is exactly how it’s supposed to work according to the JLS:

If the value p being boxed is true, false, a byte, a char in the range u0000 to u007f, or an int or short number between -128 and 127, then let r1 and r2 be the results of any two boxing conversions of p. It is always the case that r1 == r2.

But it’s not exactly what you would expect if you hadn’t read the spec.

Any more?

That’s all the Java stupidity I can think of right now. Feel free to add your own stupid Java tricks in the comments.

A Java Syntax Quirk

Posted in Java by Dan on August 24th, 2008

This little trick is shamelessly stolen from Daniele Futtorovic’s post on comp.lang.java.programmer.

This is legal, compilable Java:

public class Oddity
{
    public static void main(String[] args)
    {
        http://blog.uncommons.org
        System.out.println("Why is the URL allowed above?");
    }
}

Why doesn’t the URL being in there upset the compiler?  If you’re not sure why it’s valid, click “show” for a spoiler.

show

Uncommons Maths 1.1: Java Random Number Generators and Mathematical Utility Classes

Posted in Java by Dan on August 22nd, 2008

Uncommons Maths is a set of Java classes for working with random numbers, combinatorics and basic statistics.  The random number package is its most compelling feature, providing three advanced random number generators and support for several probability distributions (see A Java Programmer’s Guide to Random Numbers for more info, or have a play with the demo application).

It’s been a while since the initial release.  During this time I’ve made several minor tweaks and enhancements that I probably ought to share, so I’ve just uploaded a version 1.1 to Java.net.

If you are an existing user, please consult the changelog.  There is one change since version 1.0.2 that may cause backwards compatibility issues.  The combinatorics and number classes have been moved to their own packages to avoid cluttering the org.uncommons.maths base package.  So you made need to update your import statements accordingly (see the API documentation).

Revisiting the Comments Debate: The Self-Documenting Code Contest

Posted in Haskell, Software Development by Dan on August 5th, 2008

The great commenting debate generated a lot of disagreement, both here and elsewhere, about the value of code comments with respect to writing self-explanatory code.  If you are of the opinion that good code does not need comments, here is your chance to prove it.  Laurie Cheers has created The Self-Documenting Code Contest.  The idea is to solve a given problem using the most understandabe code that you can write.  No comments are allowed.  The problem is to generate all two-word anagrams of the word “documenting”.

Although I’ve clearly stated my opinion in favour of comments, I decided to give it a shot.  I’ve already submitted my Haskell solution and, to be honest, you’d do well to improve on its readability.  I believe that it satisfies all of the requirements using a brilliantly simple algorithm.

I’ve hidden my code in case you want to try for yourself first.  Click “show” to reveal my solution: show

UPDATE: I got disqualified from the contest for “being a smartass” 🙁

« Older Posts