New Adventures in Software » Software Development

Practical Evolutionary Computation: An Introduction

Posted in Evolutionary Computation, Software Development by Dan on January 20th, 2009

Software is normally developed in a very precise, deterministic way. The behaviour of a computer is governed by strict logical rules. A computer invariably does exactly what it is told to do.

When writing a program to solve a particular problem, software developers will identify the necessary sub-tasks that the program must perform. Algorithms are chosen and implemented for each task. The completed program becomes a detailed specification of exactly how to get from A to B. Every aspect is carefully designed by its developers who must understand how the various components interact to deliver the program’s functionality.

This prescriptive approach to solving problems with computers has served us well and is responsible for most of the software applications that we use today. However, it is not without limitations. Solutions to problems are constrained by the intuition, knowledge and prejudices of those who develop the software. The programmers have to know exactly how to solve the problem.

Another characteristic of the prescriptive approach that is sometimes problematic is that it is best suited to finding exact answers. Not all problems have exact solutions, and some that do may be too computationally expensive to solve. Sometimes it is more useful to be able to find an approximate answer quickly than to waste time searching for a better solution.

What are Evolutionary Algorithms?

Evolutionary algorithms (EAs) are inspired by the biological model of evolution and natural selection first proposed by Charles Darwin in 1859. In the natural world, evolution helps species adapt to their environments. Environmental factors that influence the survival prospects of an organism include climate, availability of food and the dangers of predators.

Species change over the course of many generations. Mutations occur randomly. Some mutations will be advantageous, but many will be useless or detrimental. Progress comes from the feedback provided by non-random natural selection. For example, organisms that can survive for long periods without water will be more likely to thrive in dry conditions than those that can’t. Likewise, animals that can run fast will be more successful at evading predators than their slower rivals. If a random genetic modification helps an organism to survive and to reproduce, that modification will itself survive and spread throughout the population, via the organism’s offspring.

Evolutionary algorithms are based on a simplified model of this biological evolution. To solve a particular problem we create an environment in which potential solutions can evolve. The environment is shaped by the parameters of the problem and encourages the evolution of good solutions.

The field of Evolutionary Computation encompasses several types of evolutionary algorithm. These include Genetic Algorithms (GAs), Evolution Strategies, Genetic Programming (GP), Evolutionary Programming and Learning Classifier Systems.

The most common type of evolutionary algorithm is the generational genetic algorithm. The basic outline of a generational GA is as follows (most other EA variants are broadly similar). A population of candidate solutions is iteratively evolved over many generations. Mimicking the concept of natural selection in biology, the survival of candidates (or their offspring) from generation to generation in an EA is governed by a fitness function that evaluates each candidate according to how close it is to the desired outcome, and a selection strategy that favours the better solutions. Over time, the quality of the solutions in the population should improve. If the program is successful, we can terminate the evolution once it has found a solution that is good enough.

An Example

Now that we have introduced the basic concepts and terminology, I will attempt to illustrate by way of an example. Suppose that we want to use evolution to generate a particular character string, for example “HELLO WORLD”. This is a contrived example inasmuch as it assumes that we don’t know how to create such a string and that evolution is the best approach available to us. However, bear with me as this simple example is useful for demonstrating exactly how the evolutionary approach works.

Each candidate solution in our population will be a string. We’ll use a fixed-length representation so that each string is 11 characters long. Each character in a string will be one of the 27 valid characters (the upper case letters ‘A’ to ‘Z’ plus the space character).

For the fitness function we’ll use the simple approach of assigning a candidate solution one point for each position in the string that has the correct character. For the string “HELLO WORLD” this gives a maximum possible fitness score of 11 (the length of the string).
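
As a minimal sketch of that scoring rule (the function name `fitness` and the sample string “HELLO THERE” are illustrative, not taken from any framework), the fitness function could be written in Python like this:

```python
TARGET = "HELLO WORLD"


def fitness(candidate, target=TARGET):
    """Score one point for each position that holds the correct character."""
    return sum(1 for c, t in zip(candidate, target) if c == t)


print(fitness("HELLO WORLD"))  # 11, the maximum possible score
print(fitness("HELLO THERE"))  # 6, only the first six positions match
```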

The first task for the evolutionary algorithm is to randomly generate the initial population. We can use any size population that we choose. Typical EA population sizes can vary from tens to thousands of individuals. For this example we will use a population size of 10. After the initialisation of the population we might have the following candidates (fitness scores in brackets):

  1.  GERZUNFXCEN  (1)
  2.  HSFDAHDMUYZ  (1)
  3.  UQ IGARHGJN  (0)
  4.  ZASIB WSUVP  (2)
  5.  XIXROIUAZBH  (1)
  6.  VDLGCWMBFYA  (1)
  7.  SY YUHYRSEE  (0)
  8.  EUSVBIVFHFK  (0)
  9.  HHENRFZAMZH  (1)
  10. UJBBDFZPLCN  (0)

None of these candidate solutions is particularly good. The best (number 4) has just two characters out of eleven that match the target string (the space character and the ‘W’).
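
A population like the one above could be generated along these lines. This is only a sketch: the names `random_candidate` and `initial_population` are made up for illustration, and the seed is there purely to make runs repeatable, so it will not reproduce the ten strings listed above.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "  # the 27 valid characters
LENGTH = 11                              # length of "HELLO WORLD"


def random_candidate(rng):
    """Build one fixed-length string of randomly chosen valid characters."""
    return "".join(rng.choice(ALPHABET) for _ in range(LENGTH))


def initial_population(rng, size=10):
    """Create the initial population entirely at random."""
    return [random_candidate(rng) for _ in range(size)]


rng = random.Random(42)  # arbitrary seed, used only for repeatability
for candidate in initial_population(rng):
    print(candidate)
```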

The next step is to select candidates based on their fitness and use them to create a new generation. One technique for favouring the selection of fitter candidates over weaker candidates is to assign each candidate a selection probability proportionate to its fitness.

If we use fitness-proportionate selection, none of the candidates with zero fitness will be selected and the candidate with a fitness of 2 is twice as likely to be selected as any of the candidates with a fitness of 1. For the next step we need to select 10 parents, so it is obvious that some of the fit candidates are going to be selected multiple times.
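
Fitness-proportionate selection can be sketched with the standard library’s weighted sampling. The helper name `select_parents` is hypothetical; the candidates and scores are those of the initial population above.

```python
import random

# The initial population from the example, with its fitness scores.
SCORED = {
    "GERZUNFXCEN": 1, "HSFDAHDMUYZ": 1, "UQ IGARHGJN": 0, "ZASIB WSUVP": 2,
    "XIXROIUAZBH": 1, "VDLGCWMBFYA": 1, "SY YUHYRSEE": 0, "EUSVBIVFHFK": 0,
    "HHENRFZAMZH": 1, "UJBBDFZPLCN": 0,
}


def select_parents(scored, count, rng):
    """Pick `count` parents, each with probability proportional to fitness.

    Sampling is with replacement, so fit candidates can be picked repeatedly.
    """
    candidates = list(scored)
    weights = [scored[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=count)


total = sum(SCORED.values())  # 7
probabilities = {c: score / total for c, score in SCORED.items()}
print(probabilities["ZASIB WSUVP"])  # 2/7, twice the 1/7 of a fitness-1 candidate

parents = select_parents(SCORED, 10, random.Random(1))  # arbitrary seed
print(parents)
```

Note that this approach breaks down if every candidate scores zero, since there is then no weight to distribute; a real implementation would need a fallback for that case.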

Now that we have some parents, we can breed the next generation. We do this via a process called cross-over, which is analogous to sexual reproduction in biology. For each pair of parents, a cross-over point is selected randomly. Assuming that the first two randomly selected parents are numbers 2 and 4, if the cross-over occurs after the first four characters, we will get the following offspring:

  Parent 1:     HSFDAHDMUYZ
  Parent 2:     ZASIB WSUVP
  Offspring 1:  HSFDB WSUVP
  Offspring 2:  ZASIAHDMUYZ

This recombination has given us two new candidates for the next generation, one of which is better than either of the parents (offspring 1 has a fitness score of 3). This shows how cross-over can lead towards better solutions. However, looking at the initial population as a whole, we can see that no combination of cross-overs will ever result in a candidate with a fitness higher than 6. This is because, among all 10 original candidates, there are only 6 positions in which we have the correct character.
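
The cross-over step, and the ceiling of 6 on what cross-over alone can achieve here, can both be checked with a short sketch (the function names `crossover` and `covered_positions` are illustrative):

```python
TARGET = "HELLO WORLD"

# The ten initial candidates from the example.
POPULATION = [
    "GERZUNFXCEN", "HSFDAHDMUYZ", "UQ IGARHGJN", "ZASIB WSUVP", "XIXROIUAZBH",
    "VDLGCWMBFYA", "SY YUHYRSEE", "EUSVBIVFHFK", "HHENRFZAMZH", "UJBBDFZPLCN",
]


def crossover(parent1, parent2, point):
    """Single-point cross-over: swap the tails after `point` characters."""
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]


def covered_positions(population, target=TARGET):
    """Positions at which at least one candidate has the correct character."""
    return [i for i, t in enumerate(target)
            if any(c[i] == t for c in population)]


child1, child2 = crossover(POPULATION[1], POPULATION[3], 4)
print(child1)  # HSFDB WSUVP
print(child2)  # ZASIAHDMUYZ
print(covered_positions(POPULATION))  # [0, 1, 2, 4, 5, 6]
```

Cross-over only rearranges characters that already exist at a given position, which is why the six covered positions put a hard limit on the best fitness reachable without mutation.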

This can be mitigated to some extent by increasing the size of the population. With 100 individuals in the initial population we would be much more likely to have the necessary building blocks for a perfect solution, but there is no guarantee. This is where mutation comes in.

Mutation is implemented by modifying each character in a string according to some small probability, say 0.02 or 0.05. This means that any single individual will be changed only slightly by mutation, or perhaps not at all.

By applying mutation to each of the offspring produced by cross-over, we will occasionally introduce correct characters in new positions. We will also occasionally remove correct characters but these bad mutations are unlikely to survive selection in the next generation, so this is not a big problem. Advantageous mutations will be propagated by cross-over and selection and will quickly spread throughout the population.
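
A per-character mutation operator might look like the following sketch. The name `mutate` is illustrative, and one simplifying assumption is made: the replacement character is drawn from the whole alphabet, so it can occasionally be the same as the character it replaces.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "  # the 27 valid characters


def mutate(candidate, rate, rng):
    """Replace each character, independently, with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else ch
                   for ch in candidate)


rng = random.Random(7)  # arbitrary seed, used only for repeatability
print(mutate("HSFDB WSUVP", 0.02, rng))

# With a rate of 0.02, an 11-character string has no position picked for
# mutation with probability 0.98 ** 11, which is roughly 0.80.
print(round(0.98 ** 11, 2))
```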

After repeating this process for dozens or perhaps even hundreds of generations we will eventually converge on our desired solution.

This is a convoluted process for finding a string that we already knew to start with. However, as we shall see later, the evolutionary approach generalises to deal with problems where we don’t know what the best solution is and therefore can’t encode that knowledge in our fitness function.

The important point demonstrated by this example is that we can arrive at a satisfactory solution without having to enumerate every possible candidate in the search space. Even for this trivial example, a brute force search would involve generating and checking approximately 5.6 quadrillion strings.
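
The size of that search space follows directly from the representation: 27 choices for each of 11 positions. As a quick check:

```python
ALPHABET_SIZE = 27  # 'A' to 'Z' plus the space character
LENGTH = 11         # length of "HELLO WORLD"

search_space = ALPHABET_SIZE ** LENGTH
print(f"{search_space:,}")  # 5,559,060,566,555,523
```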

The Outline of an Evolutionary Algorithm

  1.  Genesis – Create an initial set (population) of n candidate solutions. This may be done entirely randomly or the population may be seeded with some hand-picked candidates.
  2.  Evaluation – Evaluate each member of the population using some fitness function.
  3.  Survival of the Fittest – Select a number of members of the evaluated population, favouring those with higher fitness scores. These will be the parents of the next generation.
  4.  Evolution – Generate a new population of offspring by randomly altering and/or combining elements of the parent candidates. The evolution is performed by one or more evolutionary operators. The most common operators are cross-over and mutation. Cross-over takes two parents, cuts them each into two or more pieces and recombines the pieces to create two new offspring. Mutation copies an individual but with small, random modifications (such as flipping a bit from zero to one).
  5.  Iteration – Repeat steps 2-4 until a satisfactory solution is found or some other termination condition is met (such as the number of generations or elapsed time).
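
Put together, the outline above can be sketched as a small generational GA for the “HELLO WORLD” example. Everything here is an illustrative assumption rather than a reference implementation: the function names, the population size of 100, the mutation rate of 0.02 and the cap of 1,000 generations. Whether and how quickly it reaches the target depends on those parameters and on the random seed.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "  # the 27 valid characters
TARGET = "HELLO WORLD"


def fitness(candidate):
    """One point per position that matches the target."""
    return sum(1 for c, t in zip(candidate, TARGET) if c == t)


def mutate(candidate, rate, rng):
    """Replace each character with a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else ch
                   for ch in candidate)


def evolve(rng, population_size=100, mutation_rate=0.02, max_generations=1000):
    """Return (best candidate, generations used); population_size must be even."""
    # Genesis: a purely random initial population.
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET)
                  for _ in range(population_size)]
    for generation in range(max_generations):
        # Evaluation.
        scores = [fitness(c) for c in population]
        best_score = max(scores)
        if best_score == len(TARGET):
            return population[scores.index(best_score)], generation
        # Survival of the fittest: fitness-proportionate selection,
        # falling back to uniform selection if every score is zero.
        weights = scores if best_score > 0 else None
        parents = rng.choices(population, weights=weights, k=population_size)
        # Evolution: single-point cross-over followed by mutation.
        offspring = []
        for p1, p2 in zip(parents[0::2], parents[1::2]):
            point = rng.randrange(1, len(TARGET))
            offspring.append(mutate(p1[:point] + p2[point:], mutation_rate, rng))
            offspring.append(mutate(p2[:point] + p1[point:], mutation_rate, rng))
        # Iteration: the offspring become the next generation.
        population = offspring
    # Termination condition reached without a perfect match.
    return max(population, key=fitness), max_generations


if __name__ == "__main__":
    best, generations = evolve(random.Random(2009))  # arbitrary seed
    print(best, fitness(best), generations)
```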

When are Evolutionary Algorithms Useful?

Evolutionary algorithms are typically used to provide good approximate solutions to problems that cannot be solved easily using other techniques. Many optimisation problems fall into this category. It may be too computationally intensive to find an exact solution but sometimes a near-optimal solution is sufficient. In these situations evolutionary techniques can be effective. Due to their random nature, evolutionary algorithms are never guaranteed to find an optimal solution for any problem, but they will often find a good solution if one exists.

One example of this kind of optimisation problem is the challenge of timetabling. Schools and universities must arrange room and staff allocations to suit the needs of their curriculum. There are several constraints that must be satisfied. A member of staff can only be in one place at a time, they can only teach classes that are in their area of expertise, rooms cannot host lessons if they are already occupied, and classes must not clash with other classes taken by the same students. This is a combinatorial problem and known to be NP-hard. It is not feasible to exhaustively search for the optimal timetable due to the huge amount of computation involved. Instead, heuristics must be used. Genetic algorithms have proven to be a successful way of generating satisfactory solutions to many scheduling problems.

Evolutionary algorithms can also be used to tackle problems that humans don’t really know how to solve. An EA, free of any human preconceptions or biases, can generate surprising solutions that are comparable to, or better than, the best human-generated efforts. It is merely necessary that we can recognise a good solution if it were presented to us, even if we don’t know how to create a good solution. In other words, we need to be able to formulate an effective fitness function.

[Image: NASA ESG evolved antenna]

Engineers working for NASA know a lot about physics. They know exactly which characteristics make for a good communications antenna. But the process of designing an antenna so that it has the necessary properties is hard. Even though the engineers know what is required from the final antenna, they may not know how to design the antenna so that it satisfies those requirements.

NASA’s Evolvable Systems Group has used evolutionary algorithms to successfully evolve antennas for use on satellites. These evolved antennas (pictured) have irregular shapes with no obvious symmetry. It is unlikely that a human expert would have arrived at such an unconventional design. Despite this, when tested these antennas proved to be extremely well adapted to their purpose.

Other Examples of Evolutionary Computation in Action

Evolving the Mona Lisa. How well can you approximate Leonardo da Vinci’s Mona Lisa using only 50 polygons?
Evolving a buggy to ride a randomly-generated landscape (Flash movie).
Evolving clocks (video).
Evolving Lego bridges.
Solving Sudoku with evolution (Java applet).

Pre-requisites

There are two requirements that must be met before an evolutionary algorithm can be used for a particular problem. Firstly, we need a way to encode candidate solutions to the problem. The simplest encoding, and that used by many genetic algorithms, is a bit string. Each candidate is simply a sequence of zeros and ones. This encoding makes cross-over and mutation very straightforward, but that does not mean that you cannot use more complicated representations. In fact, most of the examples listed in the previous section used more sophisticated candidate representations. As long as we can devise a scheme for evolving the candidates, there really is no restriction on the types that we can use. Genetic programming (GP) is a good example of this. GP evolves computer programs represented as syntax trees.

The second requirement for applying evolutionary algorithms is that there must be a way of evaluating partial solutions to the problem – the fitness function. It is not sufficient to evaluate solutions as right or wrong; the fitness score needs to indicate how right or, if your glass is half empty, how wrong a candidate solution is. So a function that returns either 0 or 1 is useless. A function that returns a score on a scale of 1 to 100 is better. We need shades of grey, not just black and white, since this is how the algorithm guides the random evolution to find increasingly better solutions.

This is the first in a short series of articles on practical Evolutionary Computation. The text is taken from the work-in-progress documentation for the Watchmaker Framework for Evolutionary Computation. The next article will demonstrate how to implement evolutionary algorithms in Java using the Watchmaker Framework.

The Value of a Degree

Posted in Software Development by Dan on December 31st, 2008

Bill the Lizard (if that is his real name) wrote an interesting post revisiting the perennial debate of whether a formal Computer Science education is worthwhile for programmers or not. Bill makes several good points in the post and the comments. I’m paraphrasing here but he basically accuses self-taught programmers who dismiss a university education of arguing from a position of ignorance. If you haven’t experienced it for yourself, how do you know it wouldn’t have been beneficial?

“Education: that which reveals to the wise, and conceals from the stupid, the vast limits of their knowledge.” – Mark Twain

There are comments, both in the Reddit discussion that Bill references and in response to his own article, that suggest that a CS degree is actually an indicator of a poor programmer. As a CS graduate myself, I cannot accept this hypothesis. I’ll accept that whether or not a person has a degree is not a reliable indicator of programming aptitude but I would be stunned if there was not at least some positive correlation between formal education and performance. There will always be exceptions that buck the trend. I’ve worked with some excellent developers who have not been to university and I’ve worked with people who have the degree but don’t know how to use it.

Self-learning is a vital skill for a programmer. Even if you’ve got your degree, you can’t stop learning there if you are going to continue to be useful. I do believe that it is possible to learn through self study anything that you could learn at university, but the problem with a home-brew education is that the teacher is as ignorant as the student. You tend to learn whatever you need to know to solve the current problem. It’s a piecemeal approach. Over time you’ll learn lots but the danger is that, because you don’t know what you don’t know, there may be blindspots in your knowledge. A good university course will be structured to teach what you need to know, even if it seems irrelevant at the time. It will provide a broader education, introducing concepts and programming paradigms that may not seem particularly useful but which all contribute to building a deeper understanding of the field.

The vital word in the preceding paragraph is the one emphasised: “good”. All of this debate ignores the crucial fact that degrees are not all equal. There are good degrees, bad degrees and many points in between. This fact is perhaps under-appreciated by those who have not been through the university process. If we could factor in the quality of the degree it would make for a more reliable indicator of developer aptitude.

Hiring Graduates

If you are responsible for hiring programmers you should familiarise yourself with which universities rate best for computer science education (this list may be quite different from the overall list of top universities). Something else to consider is the content of the course. The clue is normally in the name. If it’s not called just “Computer Science” or “Software Engineering” beware. It may be watered down by some other subject or it may be called something like “Management Information Systems”, which might suggest that more time was spent writing essays than writing code.

Q: What do you call somebody who graduates bottom of their class at medical school?

A: Doctor.

Perhaps the biggest mistake when considering the value of a degree as an indicator of programmer aptitude is treating it as a binary proposition: any degree = good, no degree = bad. This is simplistic and wrong. Getting a degree is easy as long as you can last the distance. Here in the UK, many universities will award a third class honours degree for an overall grade of 40%. In fact, you can get a pass degree (no honours) with just 35%. Think about that for a while before calling them in for an interview. Over the course of 3 or 4 years, almost two thirds of everything that person did they got wrong and they have the certificate to prove it.

For senior developers, degrees are mostly irrelevant, since they will have copious real-world experience on which they can be judged, but being able to judge the worth of a degree is useful when hiring junior developers. All else being equal, somebody who graduated top of their Computer Science class at MIT will be a better bet than somebody who has a third class degree in HTML Programming from the South Staines Internet University.

More thoughts on Stackoverflow.com

Posted in Software Development, The Internet by Dan on September 26th, 2008

Since my previous post on the subject, Stackoverflow.com has moved from private beta to public beta. I’ve had more time to use the site and have some more thoughts. The criticisms here are meant to be constructive. Hopefully the feedback from users will help the Stackoverflow team to make a good site even better.

Performance

First the good news. The site has transitioned from private to public very well. Jeff and his team seem to have got it right in terms of architecture and infrastructure because, even with the increased load, it remains blindingly fast.

Front Page

In terms of usability, I think there’s more that could be done to help me find the content that I’m interested in. The default front page is, to be honest, not very useful. New questions are coming in so fast and on so many topics that displaying the most recent questions is just noise.

I would prefer to have a personalised home page that shows me relevant questions based on my previous answering/voting history. I realise that this is major new functionality and I’m not criticising the Stackoverflow team for not having this in the initial version; it makes sense to get the site up and running first. However, it would be great if this could be implemented at some point. I’m not alone on this one: it’s the second most popular requested feature at the moment.

Presently I’m finding stuff that I want to look at by going to the tags page and clicking on interesting topics. But I’m sure I’m missing out on questions that would be of interest if only I could find them.

Voting and Reputation

I outlined my concerns on the voting mechanism previously. In the interests of being constructive, rather than just a whiny blogger, I’ve opened new issues on the Stackoverflow Uservoice page. If you agree with me, please vote on these issues:

Don’t allow users to answer and vote on the same question (and remove the -1 penalty for down-voting)
Make down votes equivalent to up votes

Addressing each of these will help in resolving The Fastest Gun in the West Problem (currently the number one voted-on issue). The problem is that early answers get the votes and later, better answers are largely ignored. Removing the penalty for down-voting will encourage more down votes where they are deserved (so an early answer that is later shown to be wrong is less likely to retain a high score). Also, if a down vote was as powerful as an up vote, people might be more careful in crafting good answers as opposed to quick answers.

Source Control and Backups – More than just a good idea

Posted in Software Development by Dan on September 25th, 2008

Are there really software development teams out there that don’t use any form of proper source control at all, even the bad kind? I’d like to think that it wasn’t the case but I’m not so naive.

There’s a reason that “Do you use source control?” is the first question on the Joel Test. It’s because it’s the most important. If you answer “no” to this question you shouldn’t be allowed to answer subsequent questions. Even if the rest of your process is perfect, you score zero. You failed at software development. I could say that if your team doesn’t use source control it is a disaster waiting to happen, but more likely the disaster already happened and you haven’t noticed yet.

Of course, you and I aren’t nearly dumb enough to try developing anything more complex than “Hello World” without version control in place. I’m sure I’m preaching to the converted. The kind of people who read obscure software development blogs probably already know a few things about effective software development.

But how good are your back-ups?

You do have a back-up, don’t you?

If you don’t have a back-up you are one accidental key-stroke or one hardware failure away from scoring zero on the Joel Test (under my rules)… and failing at software development. Hardware will fail, people will screw-up, disgruntled former employees will set fire to the building. None of these is a problem but a failure to anticipate and prepare is.

How often do you back-up?

There is only one right answer to this: every day. Weekly back-ups are too costly. Can you really afford to have your whole team redo an entire week’s work? The first time you lose a week’s work you will switch to daily back-ups, so why not just do it now?

A melted back-up is no back-up at all

Off-Site Storage. You could physically take tapes to another location or you could upload files to a remote server. Just don’t leave them here.

Does it actually work?

Honestly, have you ever tried restoring your source control back-up onto a different machine? The most comprehensive back-up plan imaginable is useless if you can’t restore the back-ups. If you haven’t seen it working (recently) then it doesn’t work. There’s a good time and a bad time to find out that your back-ups don’t work. 15 minutes after your source control server spontaneously combusted is the bad time.

Are you still here? You should be checking those back-up tapes…

UPDATE: The good people of Stackoverflow are discussing what could possibly be a good excuse for not using source control.

Stackoverflow.com – First Impressions

Posted in Software Development, The Internet by Dan on September 12th, 2008

Over the last few days I’ve been playing with the beta of Stack Overflow. In case you are unaware, Stack Overflow is a joint venture between Jeff Atwood of Coding Horror and Joel Spolsky of Joel on Software fame. It’s basically a question-and-answer site for software developers: a mixture of Experts Exchange, Proggit and Wikipedia. The site is scheduled to come out of beta on Monday when it will open its doors to everyone.

From initial impressions I think it’s fair to say that the site will be a success, initially at least. Being A-list bloggers (and now podcasters too), Jeff and Joel have been able to generate a lot of exposure for their project.

Like Jeff’s blog, the minimalist site design is clean and bold and so far the whole system is very responsive (we’ll see if that’s still the case when the traffic spikes on Monday). The beta audience are already posting thousands of questions, almost all of which generate extremely prompt answers (of varying quality).

However, I think the site suffers a little from the ambitions of trying to be too many different things: is it a programming forum, or is it a Wiki? There are a lot of different ideas in the implementation that interact via a quite complicated set of rules that have evolved over the course of the beta.

Reputation & Badges

Stack Overflow has two mechanisms for measuring a user’s standing within the community. Firstly, each user has a reputation score. This starts at 1 and increases as you make positive contributions (posting questions and answers that get voted up). As you reach various milestones you get more privileges within the community, such as being able to vote on answers or tag other people’s questions.

Your reputation can be diminished if you get voted down or reported for abuse, but it can’t go below 1 and on the whole it’s heavily biased in favour of upward movement.

The second incentive for users to contribute is the ability to collect “badges”. This works exactly like the Cub Scouts. Some badges are easy to achieve (just fill in your profile or post your first question), and others are much harder to obtain (get 100 up votes for one of your answers).

Voting

Voting is one area of the site that I think could do with an overhaul. It’s unbalanced and not transparent enough. If your answer gets voted up you gain 10 points of reputation. But if your answer gets voted down you only lose 2 points. So if you post something that sounds plausible to the uninformed masses but is actually wrong, you could get 5 up votes and 6 down votes for a net score of -1 yet still gain a 38-point reputation boost. An up vote should have equal weight to a down vote, just like on DZone or Slashdot. It also might be better to show both the number of up votes and the number of down votes (as on DZone) rather than just the net total. This would make it easier to identify controversial content (something with 10 down votes and 12 up votes is not quite the same as something with no down votes and 2 up votes).

Another problem with the voting is that down votes penalise the voter as well as the user whose answer is being voted on. So if you post something wrong like “Java passes objects by reference”, I can either ignore it or lose 1 point of reputation for giving you the down vote that you deserve (even then it will take five of us to fully cancel out the one up vote that you got from someone who didn’t know better).
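
The arithmetic behind these complaints is easy to check. The sketch below uses only the point values quoted in this post (10 per up vote, 2 per down vote) and deliberately ignores the floor of 1 and the daily caps; the function names are illustrative.

```python
UP_VOTE_GAIN = 10    # reputation gained per up vote, as quoted in the post
DOWN_VOTE_LOSS = 2   # reputation lost per down vote, as quoted in the post


def net_score(up_votes, down_votes):
    """The score displayed next to an answer."""
    return up_votes - down_votes


def reputation_change(up_votes, down_votes):
    """The effect of those same votes on the author's reputation."""
    return up_votes * UP_VOTE_GAIN - down_votes * DOWN_VOTE_LOSS


print(net_score(5, 6))          # -1
print(reputation_change(5, 6))  # 38
print(reputation_change(1, 5))  # 0, five down votes cancel one up vote
```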

When I queried the justification for penalising down-voters I was told that it was to combat attempts to game the system. Apparently, earlier in the beta, users were posting answers to questions and then voting down everybody else’s answers so that their answer would appear at the top. The idea was that by making users pay to vote down this behaviour would be discouraged. A better solution to this problem would have been to remove the conflict of interest by not allowing users to answer and vote on the same question (which is how Slashdot’s mod points work), rather than punishing all down votes across the whole site.

The net effect of this voting system is that everybody’s reputation increases pretty quickly. Beyond the minimum score required to get full privileges the numbers can become meaningless. To avoid having to rename the site Integer Overflow there are a couple of artificial limits that restrict the number of votes you can cast and the number of reputation points you can earn each day.

Other Thoughts

Aside from my reservations about the voting, my impressions of Stackoverflow are mostly positive. The fact that it has already attracted hundreds of enthusiastic participants suggests that it has genuinely found a niche. However, I do feel that it is probably more elaborate than it needs to be (I don’t really get the need for the Wiki functionality).

Revisiting the Comments Debate: The Self-Documenting Code Contest

Posted in Haskell, Software Development by Dan on August 5th, 2008

The great commenting debate generated a lot of disagreement, both here and elsewhere, about the value of code comments with respect to writing self-explanatory code. If you are of the opinion that good code does not need comments, here is your chance to prove it. Laurie Cheers has created The Self-Documenting Code Contest. The idea is to solve a given problem using the most understandable code that you can write. No comments are allowed. The problem is to generate all two-word anagrams of the word “documenting”.

Although I’ve clearly stated my opinion in favour of comments, I decided to give it a shot. I’ve already submitted my Haskell solution and, to be honest, you’d do well to improve on its readability. I believe that it satisfies all of the requirements using a brilliantly simple algorithm.

I’ve hidden my code in case you want to try for yourself first.

UPDATE: I got disqualified from the contest for “being a smartass” 🙁

DZone RSS Tricks

Posted in Software Development, The Internet by Dan on July 31st, 2008

A comment from “Dmitryx” on DZone about how DZone could provide better options for filtering its “Popular Links” got me thinking about how Yahoo! Pipes (my new favourite toy) could be used to filter DZone RSS feeds in interesting ways. Helpfully, DZone provides a lot of information in its RSS feeds including the number of votes for and against, the click count, the number of comments, the username of the submitter, the URL of a thumbnail image, a list of tags and more. So if you want to go down the Pipes route, there are a lot of possibilities.

However, something else that is not immediately obvious is that DZone provides a lot of functionality of its own for filtering the feeds that it serves. Most DZone users will be aware that they can subscribe to a feed of “Popular Links” (those that make it to the front page) or subscribe to the “New Links” feed (those that have recently been submitted and have not yet attracted enough votes to qualify for the front page). The respective URLs for these feeds are:

http://www.dzone.com/links/feed/frontpage/rss.xml
http://www.dzone.com/links/feed/queue/rss.xml

But these two feeds are not the only ones available. There are also feeds for each of the tags. If, for example, you want to subscribe only to Python articles you would use one of the following feeds (again they are divided into “Popular” and “New” articles):

http://www.dzone.com/links/feed/frontpage/python/rss.xml
http://www.dzone.com/links/feed/queue/python/rss.xml

This is great if the articles you want neatly fit into one of the 48 categories that DZone provides, but what if you want to restrict the feed to articles about Haskell, which doesn’t have its own tag (it is lumped together with Lisp, Erlang, Scala and the rest under the “Other Languages” tag)? Fortunately DZone provides a solution for this as well. You can create a feed for any search phrase. A Haskell feed (for both new and popular links) has the following URL:

http://www.dzone.com/links/feed/search/haskell/rss.xml

Kevin Pang has also discovered that you can provide query parameters to DZone’s feed URLs to manipulate the results (although DZone main man Rick Ross warns that these are not guaranteed to be supported in the future).

It’s not just topics that you can filter on. You can also subscribe to a feed of links submitted by a particular user. First you need to find out that user’s numeric ID (it’s part of the URL for their profile page), and then use that to construct the feed URL:

http://www.dzone.com/links/feed/user/<user_id>/rss.xml

Likewise for that user’s shared links:

http://www.dzone.com/links/feed/shared/<user_id>/rss.xml

If these options alone aren’t enough, by using Yahoo! Pipes to combine, sort and filter multiple DZone feeds you should be able to tailor your subscription to match your interests precisely.
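
The URL patterns above are regular enough to assemble in code. Here is a minimal Java sketch that does so. The class and method names (`DzoneFeeds`, `popular`, `queue`, `search`, `submittedBy`, `sharedBy`) are my own invention for illustration, not part of any DZone API. The sketch only builds strings, so it does not check whether DZone still serves these feeds, and it assumes a single-word search phrase like the Haskell example, since how multi-word phrases should be encoded is not covered here.

```java
/** Hypothetical helper that assembles the DZone feed URLs listed above. */
final class DzoneFeeds
{
    private static final String BASE = "http://www.dzone.com/links/feed/";

    /** "Popular Links" feed; pass null for all tags, or a tag such as "python". */
    static String popular(String tag)
    {
        return BASE + "frontpage/" + tagSegment(tag) + "rss.xml";
    }

    /** "New Links" feed; pass null for all tags, or a tag such as "python". */
    static String queue(String tag)
    {
        return BASE + "queue/" + tagSegment(tag) + "rss.xml";
    }

    /** Search feed. Assumes a single-word phrase that needs no encoding. */
    static String search(String phrase)
    {
        return BASE + "search/" + phrase + "/rss.xml";
    }

    /** Links submitted by the user with the given numeric ID. */
    static String submittedBy(long userId)
    {
        return BASE + "user/" + userId + "/rss.xml";
    }

    /** Links shared by the user with the given numeric ID. */
    static String sharedBy(long userId)
    {
        return BASE + "shared/" + userId + "/rss.xml";
    }

    private static String tagSegment(String tag)
    {
        return (tag == null || tag.isEmpty()) ? "" : tag + "/";
    }

    public static void main(String[] args)
    {
        System.out.println(popular(null));
        System.out.println(queue("python"));
        System.out.println(search("haskell"));
    }
}
```

The resulting strings can be pasted straight into a feed reader or used as the input to a Pipes "Fetch Feed" module.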

No, your code is not so great that it doesn’t need comments

Posted in Software Development by Dan on July 25th, 2008

Code-commenting is so basic and so universal that every programmer, regardless of the language that they practise, thinks that they know all there is to know and that their way is the only sensible approach (I am no different in this respect). I guess that’s why there are so many blog postings offering advice on commenting (you can add this one to the list).

Even A-list programmer bloggers are having their say. Steve Yegge covered it and, more recently, so did Jeff Atwood. Jeff’s basic advice – that you wouldn’t need so many comments if you wrote the code to be more self-explanatory – is sound but the idea that we should be aiming for some kind of perfect code that has no need for any comments is dangerous.

It’s not a sensible goal for beginners and inexperienced developers. Tell them that they should write good code without any comments and they will deliver on the second part but struggle with the first. Even among experienced developers, assuming for a moment that it is possible to write perfect code that doesn’t require comments, there will be far fewer who are capable of this than there are who think that they are.

The other arguments against commenting are even weaker in my opinion. Yes, poor comments are …well… poor. So don’t write poor comments, write good ones. And yes, if comments become out-of-sync with the code then they are not helpful. So don’t let the comments become out-of-sync; they are part of your code and should be maintained/refactored along with the code itself.

I don’t believe that I’ve read a piece of code and thought “wow, this has far too many comments”. Unfortunately, I’ve had the opposite reaction all too often. I don’t for one moment believe that it is possible to write quality code without any comments. Take Jeff’s own example:

Here’s some code with no comments whatsoever:
r = n / 2;
while ( abs( r - (n/r) ) > t ) {
    r = 0.5 * ( r + (n/r) );
}
System.out.println( "r = " + r );
Any idea what that bit of code does? It’s perfectly readable, but what the heck does it do?

Let’s add a comment.
// square root of n with Newton-Raphson approximation
r = n / 2;
while ( abs( r - (n/r) ) > t ) {
    r = 0.5 * ( r + (n/r) );
}
System.out.println( "r = " + r );
That must be what I was getting at, right? Some sort of pleasant, middle-of-the-road compromise between the two polar extremes of no comments whatsoever and carefully formatted epic poems every second line of code?

Not exactly. Rather than add a comment, I’d refactor to this:
private double SquareRootApproximation(n) {
    r = n / 2;
    while ( abs( r - (n/r) ) > t ) {
        r = 0.5 * ( r + (n/r) );
    }
    return r;
}
System.out.println( "r = " + SquareRootApproximation(r) );
I haven’t added a single comment, and yet this mysterious bit of code is now perfectly understandable.

Sorry Jeff, but that’s not “perfectly understandable”. I agree with extracting the square root code into a separate method with an appropriate name but your second version – the one with the comment – was more informative since it mentioned which algorithm you were using. In your version the maintainer is going to have to figure that out for themselves. Also, we’re still left with at least two poorly-named variables. We can forgive the use of n for the parameter since that’s kind of a convention but what the hell are r and t?

In my opinion, this is better:

/**
 * Approximate the square root of n, to within the specified tolerance,
 * using the Newton-Raphson method.
 */
private double approximateSquareRoot(double n, double tolerance)
{
    double root = n / 2;
    while (Math.abs(root - (n / root)) > tolerance)
    {
        root = 0.5 * (root + (n / root));
    }
    return root;
}

Alternatively, if you don’t like the verbose comment at the top, you could either rename the method to something like newtonRaphsonSquareRoot (if you are happy for the method name to be tied to the implementation) or put an inline comment in the body explaining that this is the Newton-Raphson method. Any of the three variations will communicate useful extra information to the maintenance programmer, who can then Google “Newton-Raphson” if they want to find out more about it. Remember that code is written only once but read many times. It should be tailored for the reader rather than the writer.
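
For anyone who wants to try the documented version, here is a self-contained sketch that wraps it in a class and compares the result with the library function. The wrapper class `SquareRootDemo`, the `main` method and the chosen tolerance are my additions for illustration, and the sketch assumes that `n` and `tolerance` are both positive (an `n` of zero would divide by zero). The update step comes from applying Newton-Raphson to f(x) = x² - n, which gives x - (x² - n) / 2x = ½(x + n/x).

```java
final class SquareRootDemo
{
    /**
     * Approximate the square root of n, to within the specified tolerance,
     * using the Newton-Raphson method.
     * Assumes n > 0 and tolerance > 0.
     */
    static double approximateSquareRoot(double n, double tolerance)
    {
        double root = n / 2;
        // The true square root always lies between root and n / root, so
        // once those two values agree to within the tolerance, so does root.
        while (Math.abs(root - (n / root)) > tolerance)
        {
            // Newton-Raphson step for f(x) = x^2 - n.
            root = 0.5 * (root + (n / root));
        }
        return root;
    }

    public static void main(String[] args)
    {
        double n = 2.0;
        double approximation = approximateSquareRoot(n, 1e-9);
        System.out.println("approximation = " + approximation);
        System.out.println("Math.sqrt     = " + Math.sqrt(n));
    }
}
```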

This is all very well but we’re still lacking some information. Why the hell is Jeff calculating square roots in this way? Why is he not using the library function? Is it because he doesn’t like the answers it gives him? Is it for performance? Who knows?

Well-written code will often answer the “what?” and “how?” questions with few or no comments, but you often also need to answer the “why?” question too. Avi Pilosof covers this in his response to Jeff’s post. Avi argues that rather than comment the code you should comment the business justification for writing the code that way. This may mean inserting references to particular requirements or issue reports.

So yes, favour code that is self-explanatory but I don’t believe that you can always achieve the necessary clarity without a few well-placed comments to aid understanding. Code that is obvious to the author today is rarely obvious to the maintainer next year (or even to the author next month).

And if you still really believe that your code does not need any comments, then I hope I never have to maintain it.

Fun with Yahoo! Pipes and Last.fm

Posted in Software Development, The Internet by Dan on July 24th, 2008

So I might be about 18 months late, but I finally got around to playing with Yahoo! Pipes today. I was aware of the basic concept but I was not aware of how impressive the implementation is. It’s an incredibly powerful tool with a slick UI that allows you to perform some pretty complex data processing without doing any real programming.

For my first experimental pipe, I just had it aggregate and sort the feed from this blog, my Google Reader shared links feed and my Flickr photostream feed. Easy-peasy.

Things got a bit more interesting when I tried to add my Last.fm “loved tracks” (favourites) to this super DanFeed. This is because Last.fm doesn’t actually have a feed for “loved tracks”. It has a feed for all recently played tracks, but I can’t really see the point of this because, with one entry every 3 or 4 minutes, it’s too much information for even the most dedicated stalker to digest.

Last.fm does however have a REST API to provide access to its data. Yahoo! Pipes is not restricted to processing RSS and Atom feeds. It can also extract data from JSON, CSV, arbitrary XML and even plain old HTML pages, so it didn’t take very long to get at the data I wanted.

After a little bit of trial-and-error I was able to include album art thumbnails in the feed too (for feed-readers that will display them). The only thing that wasn’t intuitive was how Pipes deals with dates for RSS entries. There was a lot of head-scratching before I finally succeeded in getting the dates from the Last.fm XML into the Pipes RSS.

The result of all of this is that I have published my first (semi-)useful Pipe, one that allows you to share your favourite tracks with your friends. In effect, they can subscribe to your recommendations. The pipe is here. Just type in a Last.fm username and press the button. You can get a link to your personalised RSS feed from the output page. If you want to embed the feed (including the thumbnail images) on your website/blog/Facebook/whatever, just click on “Get as a Badge” after generating your custom feed.

Optimising Computer Programs for Performance

Posted in Java, Software Development by Dan on July 23rd, 2008

I’ve recently been working on a small Java simulation program that is going to take a long time to execute each time it runs. Basically it does the same thing around a billion times with different random inputs for each iteration. I calculated that for my first working version of the program it would take 22 and a half hours to complete (based on it completing one million iterations in 81 seconds).

This got me thinking about how to optimise the code for performance, which meant revisiting the various rules of optimisation that I’ve learned from my previous programming experiences. So that’s what this post is about: rules of thumb for optimising computer programs for performance (some of this is Java-specific but most of it is generally applicable).

After optimisations, my program will complete in 3 hours and 5 minutes on the same machine (I still have a few ideas left to try that may reduce this further).
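
The arithmetic behind these figures is simple enough to check in a few lines. This is only a sketch: the class name `RuntimeEstimate` is made up for illustration, and it assumes that “around a billion” means exactly 1,000,000,000 iterations and that run time scales linearly with the iteration count.

```java
final class RuntimeEstimate
{
    /** Extrapolates a measured batch time to the full run and returns hours. */
    static double estimatedHours(double secondsPerBatch, long batchSize, long totalIterations)
    {
        double batches = (double) totalIterations / batchSize;
        return secondsPerBatch * batches / 3600.0;
    }

    public static void main(String[] args)
    {
        // One million iterations took 81 seconds in the first working version.
        double before = estimatedHours(81.0, 1_000_000L, 1_000_000_000L);
        // 3 hours and 5 minutes after optimisation.
        double after = 3.0 + 5.0 / 60.0;
        System.out.printf("before: %.1f hours, after: %.2f hours, speed-up: %.1fx%n",
                          before, after, before / after);
    }
}
```

On those assumptions the original estimate works out at 22.5 hours, and the optimised run is roughly seven times faster.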

1. “Premature optimisation is the root of all evil”
  No discussion of optimisation is complete without somebody inevitably quoting Donald Knuth so let’s get it out of the way up front. Knuth, as usual, is right. Optimisation ahead of time is at best speculative. Furthermore, optimisation is invariably a process of sacrificing readability, portability and general maintainability for performance. It’s better to refrain from making these compromises until it proves to be necessary. More often than not your simple, unoptimised application will be fast enough anyway. Spending time converting your application into a heap of dung in exchange for an unnecessary, and potentially negligible (or even negative), speed boost is not a winning proposition.
2. “There’s a difference between ‘Premature Optimisation’ and ‘Doing things right in the first place’”
  So stated a former colleague of mine in one of his less profane moments. If you’re planning to sort a million records you wouldn’t choose to implement a Bubble Sort. Some things are just wrong from the start. Theo Schlossnagle argues that this ability to effectively determine what constitutes premature optimisation and what is merely common sense is what separates the senior developers from their less experienced colleagues.
3. “You can guess, or you can know”
  If you really understood why your program performs so unacceptably slowly you wouldn’t have written it that way in the first place. So don’t put too much faith in your intuition. If you want to fumble around in the dark in the hope that you’ll eventually guess what you did wrong, go ahead. But if you want to know where you suck at programming, ask the computer. A profiler is an essential tool for any optimisation effort. If you’re coding Java, JProfiler is an excellent choice. If you want something for nothing, the NetBeans Profiler is pretty good too, though not quite as slick. A profiler will quickly identify bottlenecks in your program and the best places to start looking for potential optimisations. Just remember to measure the performance before and after any changes that you make so that you can evaluate their impact.
4. Hardware solutions to software problems
  Your application uses too much memory. You can either lead a crack team of four developers for 5 months and optimise the code until it fits in the available RAM… or you can buy more RAM for less than £50. Ask yourself, what would Wilhelm do? And then do the opposite. In the world of paid-for software development those performance problems that would go away with better hardware are usually best solved by buying better hardware. Even to the extent of replacing entire servers, it can be more cost-effective than non-trivial code changes.
  As well as buying better hardware you should make sure that you are taking full advantage of what is already available to you. My 81-second trial simulation completed in 51 seconds after I split the work between two threads in order to take advantage of my dual core CPU.
5. Optimisations at lower levels are often easier and can have a bigger impact
  The lower the level of the optimisation the more opportunity it provides for improved performance since everything built on top of that layer can take advantage of it. For example, switching to a faster JVM potentially makes all of your classes faster without having to change any of them. In my case I switched from Apple’s Java 5 to the SoyLatte version of Java 6 to take advantage of Sun’s on-going performance work and I got a 20% speed boost without modifying my application. Other improvements in this vein would include upgrading your Linux kernel or replacing a library with a faster implementation (such as switching from Castor XML to JiBX rather than addressing the problem at a higher level by trying to reduce the size of the XML in order to squeeze better performance from Castor).
6. Optimise algorithms not code
  This is where that Computer Science education comes in useful. A basic understanding of complexity theory and big O notation will help you select the best algorithm for the job. A common mistake of inexperienced programmers is to fixate on micro-optimisations. “Maybe if I use direct field access instead of a getter, it will be quicker?” It doesn’t matter. It especially doesn’t matter if your code is slow because you chose an O(n²) algorithm instead of the O(n log n) alternative.
7. Avoid superstition
  This is related to the previous advice. Don’t do something just because someone told you it might be faster or you read it on the Internet. There are dozens of pages of Java performance tips (micro-optimisations mostly) on the web. Most of these tips are well past their sell-by-date. They are irrelevant with modern JVMs (the JIT compiler generally does a better job than misguided hand-optimised code). Some of them were never sensible in the first place. “Make all methods final for performance”, “iterate over arrays backwards because the comparison with zero is cheaper” they say. Yet these superstitious idioms are still religiously applied by some developers incapable of critical thinking. Critical thinking means taking measurements and evaluating for yourself what the impact is.
8. Don’t waste your time.
  The profiler tells you that the two parts of your application consume 95% and 5% of CPU resources respectively. You know that the 5% is far too high and that it should be possible to complete that work in less than 1% of the total time. The problem is, even if you achieve this impressive five-fold performance boost in this part of the code, nobody is going to notice since overall application performance has improved by just 4%. Unless that 4% improvement represents the difference between success and failure it’s not worth the effort. Instead you should be focusing on the other 95% of the application since that is the only place where you might be able to achieve a significant improvement, even if it is more work to do so. My rule of thumb is that for anything less than a 20% improvement it’s generally not worth making my code more dung-like.
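
The calculation in rule 8 generalises, and it is the reasoning usually formalised as Amdahl’s law. A minimal sketch follows. The class and method names are hypothetical, and the second call (halving the time spent in the 95% part) is a made-up comparison rather than a measurement from my program.

```java
final class OptimisationPayoff
{
    /**
     * Fraction of the total run time saved when a part that accounts for
     * shareOfTotal of the run time is made localSpeedUp times faster.
     */
    static double fractionOfTotalSaved(double shareOfTotal, double localSpeedUp)
    {
        return shareOfTotal * (1.0 - 1.0 / localSpeedUp);
    }

    public static void main(String[] args)
    {
        // The 5% part made five times faster: it now takes 1% of the original time.
        System.out.printf("5%% part, 5x faster:  %.1f%% of total time saved%n",
                          100 * fractionOfTotalSaved(0.05, 5.0));
        // Hypothetical comparison: the 95% part made only twice as fast.
        System.out.printf("95%% part, 2x faster: %.1f%% of total time saved%n",
                          100 * fractionOfTotalSaved(0.95, 2.0));
    }
}
```

Even a modest improvement to the dominant part of the program dwarfs a spectacular improvement to the minor part.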

Hopefully this has been useful. If you remember only one sentence from this article, make sure it’s this one: “You can guess, or you can know”. Measure everything. Optimisation is science not witchcraft.
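
To make “measure everything” concrete, here is a rough sketch of timing a piece of work before and after a change, where the change is the two-thread split mentioned in rule 4. Everything in it is a stand-in: `simulate` is a made-up workload rather than my actual simulation, the seeds and iteration count are arbitrary, and the code uses lambdas from a more recent Java than the versions discussed above. A single `System.nanoTime` measurement like this is crude (it ignores JIT warm-up, for instance), so treat it as a sanity check rather than a substitute for a profiler.

```java
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TimedSimulation
{
    /** Stand-in workload: counts how many random draws fall below 0.5. */
    static long simulate(long iterations, long seed)
    {
        Random rng = new Random(seed);
        long hits = 0;
        for (long i = 0; i < iterations; i++)
        {
            if (rng.nextDouble() < 0.5)
            {
                hits++;
            }
        }
        return hits;
    }

    /** Runs both halves of the work one after the other on the calling thread. */
    static long runSequentially(long iterations)
    {
        long half = iterations / 2;
        return simulate(half, 1L) + simulate(iterations - half, 2L);
    }

    /** Runs the same two halves concurrently, each with its own Random. */
    static long runOnTwoThreads(long iterations)
    {
        long half = iterations / 2;
        Callable<Long> firstHalf = () -> simulate(half, 1L);
        Callable<Long> secondHalf = () -> simulate(iterations - half, 2L);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try
        {
            Future<Long> first = pool.submit(firstHalf);
            Future<Long> second = pool.submit(secondHalf);
            return first.get() + second.get();
        }
        catch (InterruptedException | ExecutionException e)
        {
            throw new IllegalStateException("Simulation failed", e);
        }
        finally
        {
            pool.shutdown();
        }
    }

    public static void main(String[] args)
    {
        long iterations = 10_000_000L;

        long start = System.nanoTime();
        long sequentialResult = runSequentially(iterations);
        long sequentialMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        long parallelResult = runOnTwoThreads(iterations);
        long parallelMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("sequential: " + sequentialMillis + " ms, result " + sequentialResult);
        System.out.println("two threads: " + parallelMillis + " ms, result " + parallelResult);
    }
}
```

Giving each half its own seeded `Random` keeps the threads from contending over a shared generator, and it means both versions produce the same answer, so the timings compare like with like.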

New Adventures in Software by Dan Dyer