DZone RSS Tricks

Posted in Software Development, The Internet by Dan on July 31st, 2008

A comment from “Dmitryx” on DZone about how DZone could provide better options for filtering its “Popular Links” got me thinking about how Yahoo! Pipes (my new favourite toy) could be used to filter DZone RSS feeds in interesting ways.  Helpfully, DZone provides a lot of information in its RSS feeds including the number of votes for and against, the click count, the number of comments, the username of the submitter, the URL of a thumbnail image, a list of tags and more.  So if you want to go down the Pipes route, there are a lot of possibilities.

However, something else that is not immediately obvious is that DZone provides a lot of functionality of its own for filtering the feeds that it serves.  Most DZone users will be aware that they can subscribe to a feed of “Popular Links” (those that make it to the front page) or subscribe to the “New Links” feed (those that have recently been submitted and have not yet attracted enough votes to qualify for the front page).  The respective URLs for these feeds are:

http://www.dzone.com/links/feed/frontpage/rss.xml
http://www.dzone.com/links/feed/queue/rss.xml

But these two feeds are not the only ones available.  There are also feeds for each of the tags.  If, for example, you want to subscribe only to Python articles you would use one of the following feeds (again they are divided into “Popular” and “New” articles):

http://www.dzone.com/links/feed/frontpage/python/rss.xml
http://www.dzone.com/links/feed/queue/python/rss.xml

This is great if the articles you want neatly fit into one of the 48 categories that DZone provides, but what if you want to restrict the feed to articles about Haskell, which doesn’t have its own tag (it is lumped together with Lisp, Erlang, Scala and the rest under the “Other Languages” tag)?  Fortunately, DZone provides a solution for this as well.  You can create a feed for any search phrase.  A Haskell feed (for both new and popular links) has the following URL:

http://www.dzone.com/links/feed/search/haskell/rss.xml

Kevin Pang has also discovered that you can provide query parameters to DZone’s feed URLs to manipulate the results (although DZone main man Rick Ross warns that these are not guaranteed to be supported in the future).

It’s not just topics that you can filter on.  You can also subscribe to a feed of links submitted by a particular user.  First you need to find out that user’s numeric ID (it’s part of the URL for their profile page), and then use that to construct the feed URL:

http://www.dzone.com/links/feed/user/<user_id>/rss.xml

Likewise for that user’s shared links:

http://www.dzone.com/links/feed/shared/<user_id>/rss.xml

If these options alone aren’t enough, by using Yahoo! Pipes to combine, sort and filter multiple DZone feeds you should be able to tailor your subscription to match your interests precisely.
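
If you’d rather do the plumbing yourself, the same kind of combining and sorting is straightforward in Java using the open source ROME feed library (a sketch only: the choice of feeds is illustrative and error handling is omitted):

import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class DZoneFeedMerger
{
    public static void main(String[] args) throws Exception
    {
        String[] feedUrls = {"http://www.dzone.com/links/feed/frontpage/python/rss.xml",
                             "http://www.dzone.com/links/feed/search/haskell/rss.xml"};
        List<SyndEntry> entries = new ArrayList<SyndEntry>();
        SyndFeedInput input = new SyndFeedInput();
        for (String url : feedUrls)
        {
            SyndFeed feed = input.build(new XmlReader(new URL(url)));
            for (Object entry : feed.getEntries()) // ROME pre-dates generics.
            {
                entries.add((SyndEntry) entry);
            }
        }
        // Sort the combined entries newest first (assumes every entry has a date).
        Collections.sort(entries, new Comparator<SyndEntry>()
        {
            public int compare(SyndEntry a, SyndEntry b)
            {
                return b.getPublishedDate().compareTo(a.getPublishedDate());
            }
        });
        for (SyndEntry entry : entries)
        {
            System.out.println(entry.getTitle() + " - " + entry.getLink());
        }
    }
}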

Melting Virtual Servers: The Digg Effect

Posted in PHP, The Internet by Dan on July 28th, 2008

I was right about everybody having strong opinions about code commenting. The popularity of my previous post on the subject (perhaps due to its provocative title) brought this site to its knees.

The Calm Before the Storm

I was receiving higher-than-average traffic from the regular sources, such as DZone and JavaBlogs, but this traffic was still pretty trivial and nothing that the server couldn’t handle. The real problem occurred sometime on Sunday evening. While I was busy playing football (we lost 4-0), my post was rapidly gaining attention on Digg.

The entire Uncommons.org empire is hosted on a 256MB VPS (Virtual Private Server) from Slicehost. Those 256 megabytes of RAM are shared between Apache, PHP, MySQL and Jetty. Though this is a modest amount of memory, it is sufficient for the traffic that I normally attract. I was not entirely unprepared for a traffic spike. I had installed the WP-Cache plugin for WordPress some time ago to avoid unnecessary MySQL queries. I’d also tweaked the Apache configuration to make the most of the limited resources available to it (or so I thought – I guess I should take more notice of my own advice about optimisations).

Blissful Ignorance

For about two hours after the post hit the front page of Digg, I was completely oblivious to the pain being inflicted on my poor VPS. It wasn’t until I tried to visit my WordPress dashboard that I realised something wasn’t quite right. The server was responding very sluggishly. SSH access was still available, though a little slow. The first thing I checked was the load average. It was high, but not massively so. Next I checked memory usage. Pretty predictably, all of the RAM and all of the swap space were in use.

I only use Jetty for running Hudson and, while it is pretty lightweight for a Java server, it still accounts for a lot of the memory usage on a 256MB slice. So I shut it down to release some resources.

I didn’t have to do much analysis to figure out where the traffic was coming from as my first guess was right; I quickly found my post in the Technology section of Digg. Initially I didn’t realise that it was also on the front page.

Some Figures

The post was on Digg’s front page for just over three and a quarter hours. I served up around 10,000 page impressions to around 6,000 unique visitors in that time. I’ve served over 9,000 pages so far today, at a more steady rate, and continue to receive hundreds of hits per hour. I’ve transferred over 3.5GB of data in total. Throughout this time the server has stayed up, but the site has been very slow to respond and I am sure that many visitors gave up before they got a response (some of the comments on Digg are redirecting readers to cached versions of the page because of this).  If the site had been up to it, it probably would have served thousands more pages. The article has received over 2,000 “Diggs”, over half of them after it dropped off the front page.

It could have been worse.  At least I wasn’t linked to by Yahoo!.

Remedial Action

Even after the worst of the spike, the response times were terrible. CPU load was negligible, but there was a bottleneck elsewhere, ultimately down to insufficient RAM.

Searching for potential solutions, one idea I liked a lot was using mod_rewrite to redirect all traffic from digg.com to a Coral cached version of the page. I did set this up, but I couldn’t get Coral to cache the page because its requests to my server were timing out.
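
The kind of rule I was attempting looks something like this (a sketch based on the commonly-published Coral recipe rather than my exact configuration; Coral works by appending .nyud.net to the hostname):

RewriteEngine On
# Only redirect visitors arriving from Digg, not regular readers.
RewriteCond %{HTTP_REFERER} digg\.com [NC]
# Let Coral's own requests through, otherwise it can never fill its cache.
RewriteCond %{HTTP_USER_AGENT} !^CoralWebPrx
RewriteRule ^(.*)$ http://%{HTTP_HOST}.nyud.net/$1 [R=302,L]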

Eventually I decided to replace Apache… with Apache. The default Apache install for Ubuntu is the prefork MPM (multi-processing module) version. Apparently this process-based implementation is the more robust option, but it’s considerably more RAM-hungry than the alternative thread-based worker MPM. In my memory-constrained environment, the worker MPM seemed worth a punt.
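
For the curious, the worker MPM’s appetite is governed by a handful of directives in the Apache configuration. Something along these lines (an illustrative low-memory configuration, not my actual settings) keeps the process and thread counts modest:

<IfModule mpm_worker_module>
    StartServers          2
    MaxClients           64
    MinSpareThreads      16
    MaxSpareThreads      32
    ThreadsPerChild      16
    MaxRequestsPerChild 500
</IfModule>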

Fortunately, Slicehost makes back-ups ridiculously simple, so I took a one-click snapshot of the server and then installed the worker version of Apache:

sudo apt-get install apache2-mpm-worker

This was more work than I was expecting because it involved uninstalling PHP. To use the worker MPM I would have to run PHP via FastCGI. I found these instructions on how to set that up.

Once I’d finally got the configuration right (the web server was down for about 20 minutes), there was an immediate improvement. The server seems to be dealing with requests very well now, though I still have not restarted Jetty. Whether this configuration would have stood up to the peak force of Digg’s DDoS remains to be seen. I was also considering a switch to Lighttpd, but that would have been a bit more drastic since I’ve no experience with it.

Other Tweaks

The next item on my list is to replace WP-Cache with WP-Super-Cache so that WordPress can serve fully static pages without having to load the PHP engine each time. Other suggestions that I found include optimising the WordPress MySQL tables and installing a PHP compiler cache. Maybe I also need to set up some kind of monitoring to alert me when there is a huge traffic spike?

No, your code is not so great that it doesn’t need comments

Posted in Software Development by Dan on July 25th, 2008

Code-commenting is so basic and so universal that every programmer, regardless of the language that they practise, thinks that they know all there is to know and that their way is the only sensible approach (I am no different in this respect).  I guess that’s why there are so many blog postings offering advice on commenting (you can add this one to the list).

Even A-list programmer bloggers are having their say. Steve Yegge covered it and, more recently, so did Jeff Atwood. Jeff’s basic advice – that you wouldn’t need so many comments if you wrote the code to be more self-explanatory – is sound, but the idea that we should be aiming for some kind of perfect code that has no need for any comments is dangerous.

It’s not a sensible goal for beginners and inexperienced developers. Tell them that they should write good code without any comments and they will deliver on the second part but struggle with the first. Even among experienced developers, assuming for a moment that it is possible to write perfect code that doesn’t require comments, there will be far fewer who are capable of this than there are who think that they are.

The other arguments against commenting are even weaker in my opinion. Yes, poor comments are …well… poor. So don’t write poor comments, write good ones. And yes, if comments become out-of-sync with the code then they are not helpful.  So don’t let the comments become out-of-sync; they are part of your code and should be maintained/refactored along with the code itself.

I don’t believe that I’ve read a piece of code and thought “wow, this has far too many comments”. Unfortunately, I’ve had the opposite reaction all too often. I don’t for one moment believe that it is possible to write quality code without any comments. Take Jeff’s own example:

Here’s some code with no comments whatsoever:

r = n / 2;
while ( abs( r - (n/r) ) > t ) {
    r = 0.5 * ( r + (n/r) );
}
System.out.println( "r = " + r );

Any idea what that bit of code does? It’s perfectly readable, but what the heck does it do?

Let’s add a comment.

// square root of n with Newton-Raphson approximation
r = n / 2;
while ( abs( r - (n/r) ) > t ) {
    r = 0.5 * ( r + (n/r) );
}
System.out.println( "r = " + r );

That must be what I was getting at, right? Some sort of pleasant, middle-of-the-road compromise between the two polar extremes of no comments whatsoever and carefully formatted epic poems every second line of code?

Not exactly. Rather than add a comment, I’d refactor to this:

private double SquareRootApproximation(n) {
    r = n / 2;
    while ( abs( r - (n/r) ) > t ) {
        r = 0.5 * ( r + (n/r) );
    }
    return r;
}
System.out.println( "r = " + SquareRootApproximation(r) );

I haven’t added a single comment, and yet this mysterious bit of code is now perfectly understandable.

Sorry Jeff, but that’s not “perfectly understandable”. I agree with extracting the square root code into a separate method with an appropriate name but your second version – the one with the comment – was more informative since it mentioned which algorithm you were using. In your version the maintainer is going to have to figure that out for themselves. Also, we’re still left with at least two poorly-named variables.  We can forgive the use of n for the parameter since that’s kind of a convention but what the hell are r and t?

In my opinion, this is better:

/**
 * Approximate the square root of n, to within the specified tolerance,
 * using the Newton-Raphson method.
 */
private double approximateSquareRoot(double n, double tolerance)
{
    double root = n / 2;
    while (Math.abs(root - (n / root)) > tolerance)
    {
        root = 0.5 * (root + (n / root));
    }
    return root;
}

Alternatively, if you don’t like the verbose comment at the top, you could either rename the method to something like newtonRaphsonSquareRoot (if you are happy for the method name to be tied to the implementation) or put an inline comment in the body explaining that this is the Newton-Raphson method. Any of the three variations will communicate useful extra information to the maintenance programmer, who can then Google “Newton-Raphson” if they want to find out more about it. Remember that code is written only once but read many times. It should be tailored for the reader rather than the writer.

This is all very well but we’re still lacking some information. Why the hell is Jeff calculating square roots in this way? Why is he not using the library function? Is it because he doesn’t like the answers it gives him? Is it for performance? Who knows?

Well-written code will often answer the “what?” and “how?” questions with few or no comments, but you often need to answer the “why?” question too. Avi Pilosof covers this in his response to Jeff’s post. Avi argues that rather than comment the code, you should comment the business justification for writing the code that way. This may mean inserting references to particular requirements or issue reports.
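
For instance, a “why?” comment on the method above might look something like this (the requirement reference is invented purely for illustration):

/**
 * Approximate the square root of n, to within the specified tolerance,
 * using the Newton-Raphson method.
 *
 * Why not Math.sqrt? Requirement SIM-42 (a hypothetical reference) demands
 * bit-for-bit reproducibility with the legacy simulation engine, which uses
 * exactly this iteration, so we deliberately mirror it here.
 */
private double approximateSquareRoot(double n, double tolerance)
{
    double root = n / 2;
    while (Math.abs(root - (n / root)) > tolerance)
    {
        root = 0.5 * (root + (n / root));
    }
    return root;
}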

So yes, favour code that is self-explanatory but I don’t believe that you can always achieve the necessary clarity without a few well-placed comments to aid understanding. Code that is obvious to the author today is rarely obvious to the maintainer next year (or even to the author next month).

And if you still really believe that your code does not need any comments, then I hope I never have to maintain it.

Fun with Yahoo! Pipes and Last.fm

Posted in Software Development, The Internet by Dan on July 24th, 2008

So I might be about 18 months late, but I finally got around to playing with Yahoo! Pipes today. I was aware of the basic concept but I was not aware of how impressive the implementation is. It’s an incredibly powerful tool with a slick UI that allows you to perform some pretty complex data processing without doing any real programming.

For my first experimental pipe, I just had it aggregate and sort the feed from this blog, my Google Reader shared links feed and my Flickr photostream feed. Easy-peasy.

Things got a bit more interesting when I tried to add my Last.fm “loved tracks” (favourites) to this super DanFeed.  This is because Last.fm doesn’t actually have a feed for “loved tracks”. It has a feed for all recently played tracks, but I can’t really see the point of this because, with one entry every 3 or 4 minutes, it’s too much information for even the most dedicated stalker to digest.

Last.fm does, however, have a REST API that provides access to its data. Yahoo! Pipes is not restricted to processing RSS and Atom feeds. It can also extract data from JSON, CSV, arbitrary XML and even plain old HTML pages, so it didn’t take very long to get at the data I wanted.

After a little bit of trial-and-error I was able to include album art thumbnails in the feed too (for feed-readers that will display them). The only thing that wasn’t intuitive was how Pipes deals with dates for RSS entries. There was a lot of head-scratching before I finally succeeded in getting the dates from the Last.fm XML into the Pipes RSS.

The result of all of this is that I have published my first (semi-)useful Pipe, one that allows you to share your favourite tracks with your friends. In effect, they can subscribe to your recommendations. The pipe is here. Just type in a Last.fm username and press the button. You can get a link to your personalised RSS feed from the output page. If you want to embed the feed (including the thumbnail images) on your website/blog/Facebook/whatever, just click on “Get as a Badge” after generating your custom feed.

Optimising Computer Programs for Performance

Posted in Java, Software Development by Dan on July 23rd, 2008

I’ve recently been working on a small Java simulation program that is going to take a long time to execute each time it runs. Basically it does the same thing around a billion times with different random inputs for each iteration. I calculated that for my first working version of the program it would take 22 and a half hours to complete (based on it completing one million iterations in 81 seconds).

This got me thinking about how to optimise the code for performance, which meant revisiting the various rules of optimisation that I’ve learned from my previous programming experiences. So that’s what this post is about: rules of thumb for optimising computer programs for performance (some of this is Java-specific but most of it is generally applicable).

After optimisations, my program will complete in 3 hours and 5 minutes on the same machine (I still have a few ideas left to try that may reduce this further).

    1. “Premature optimisation is the root of all evil”
      No discussion of optimisation is complete without somebody inevitably quoting Donald Knuth so let’s get it out of the way up front.  Knuth, as usual, is right. Optimisation ahead of time is at best speculative. Furthermore, optimisation is invariably a process of sacrificing readability, portability and general maintainability for performance. It’s better to refrain from making these compromises until it proves to be necessary. More often than not your simple, unoptimised application will be fast enough anyway. Spending time converting your application into a heap of dung in exchange for an unnecessary, and potentially negligible (or even negative), speed boost is not a winning proposition.
    2. “There’s a difference between ‘Premature Optimisation’ and ‘Doing things right in the first place’”
      So stated a former colleague of mine in one of his less profane moments. If you’re planning to sort a million records you wouldn’t choose to implement a Bubble Sort. Some things are just wrong from the start. Theo Schlossnagle argues that this ability to effectively determine what constitutes premature optimisation and what is merely common sense is what separates the senior developers from their less experienced colleagues.
    3. “You can guess, or you can know”
      If you really understood why your program performs so unacceptably slowly you wouldn’t have written it that way in the first place. So don’t put too much faith in your intuition. If you want to fumble around in the dark in the hope that you’ll eventually guess what you did wrong, go ahead. But if you want to know where you suck at programming, ask the computer. A profiler is an essential tool for any optimisation effort. If you’re coding in Java, JProfiler is an excellent choice. If you want something for nothing, the NetBeans Profiler is pretty good too, though not quite as slick. A profiler will quickly identify bottlenecks in your program and the best places to start looking for potential optimisations. Just remember to measure the performance before and after any changes that you make so that you can evaluate their impact.
    4. Hardware solutions to software problems
      Your application uses too much memory. You can either lead a crack team of four developers for 5 months and optimise the code until it fits in the available RAM… or you can buy more RAM for less than £50. Ask yourself, what would Wilhelm do? And then do the opposite. In the world of paid-for software development those performance problems that would go away with better hardware are usually best solved by buying better hardware. Even to the extent of replacing entire servers, it can be more cost-effective than non-trivial code changes.
      As well as buying better hardware you should make sure that you are taking full advantage of what is already available to you. My 81-second trial simulation completed in 51 seconds after I split the work between two threads in order to take advantage of my dual core CPU (there’s a sketch of this approach at the end of this list).
    5. Optimisations at lower levels are often easier and can have a bigger impact
      The lower the level of the optimisation the more opportunity it provides for improved performance since everything built on top of that layer can take advantage of it. For example, switching to a faster JVM potentially makes all of your classes faster without having to change any of them. In my case I switched from Apple’s Java 5 to the SoyLatte version of Java 6 to take advantage of Sun’s on-going performance work and I got a 20% speed boost without modifying my application. Other improvements in this vein would include upgrading your Linux kernel or replacing a library with a faster implementation (such as switching from Castor XML to JiBX rather than addressing the problem at a higher level by trying to reduce the size of the XML in order to squeeze better performance from Castor).
    6. Optimise algorithms not code
      This is where that Computer Science education comes in useful. A basic understanding of complexity theory and big O notation will help you select the best algorithm for the job. A common mistake of inexperienced programmers is to fixate on micro-optimisations. “Maybe if I use direct field access instead of a getter, it will be quicker?” It doesn’t matter. It especially doesn’t matter if your code is slow because you chose an O(n²) algorithm instead of the O(n log n) alternative.
    7. Avoid superstition
      This is related to the previous advice. Don’t do something just because someone told you it might be faster or you read it on the Internet. There are dozens of pages of Java performance tips (micro-optimisations mostly) on the web. Most of these tips are well past their sell-by date. They are irrelevant with modern JVMs (the JIT compiler generally does a better job than misguided hand-optimised code). Some of them were never sensible in the first place. “Make all methods final for performance”, “iterate over arrays backwards because the comparison with zero is cheaper”, they say. Yet these superstitious idioms are still religiously applied by some developers incapable of critical thinking. Critical thinking means taking measurements and evaluating for yourself what the impact is.
    8. Don’t waste your time.
      The profiler tells you that the two parts of your application consume 95% and 5% of CPU resources respectively. You know that the 5% is far too high and that it should be possible to complete that work in less than 1% of the total time. The problem is, even if you achieve this impressive five-fold performance boost in this part of the code, nobody is going to notice since overall application performance has improved by just 4%. Unless that 4% improvement represents the difference between success and failure it’s not worth the effort. Instead you should be focusing on the other 95% of the application since that is the only place where you might be able to achieve a significant improvement, even if it is more work to do so. My rule of thumb is that for anything less than a 20% improvement it’s generally not worth making my code more dung-like.
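
To expand on the threading anecdote in point 4, splitting an embarrassingly parallel simulation across several cores can be as simple as the sketch below (runOneIteration() is a made-up stand-in for the real per-iteration work, and the division remainder is ignored for brevity):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSimulation
{
    public static void main(String[] args) throws Exception
    {
        final long totalIterations = 1000000000L;
        int threads = Runtime.getRuntime().availableProcessors();
        final long share = totalIterations / threads;
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<Double>> results = new ArrayList<Future<Double>>();
        for (int i = 0; i < threads; i++)
        {
            results.add(executor.submit(new Callable<Double>()
            {
                public Double call()
                {
                    double total = 0;
                    for (long j = 0; j < share; j++)
                    {
                        total += runOneIteration();
                    }
                    return total;
                }
            }));
        }
        double grandTotal = 0;
        for (Future<Double> result : results)
        {
            grandTotal += result.get(); // Blocks until that worker finishes.
        }
        executor.shutdown();
        System.out.println("Total: " + grandTotal);
    }

    // Stand-in for the real simulation work (invented for this sketch).
    private static double runOneIteration()
    {
        return Math.random();
    }
}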

Hopefully this has been useful. If you remember only one sentence from this article, make sure it’s this one: “You can guess, or you can know”. Measure everything. Optimisation is science not witchcraft.

Introductory REST articles

Posted in Software Development by Dan on July 8th, 2008

I found Stefan Tilkov’s REST Anti-Patterns post via DZone.  It’s the third article in a series of posts on the topic of RESTful applications.  Together they serve as a useful introduction.