WordPress Headaches: Caching and FeedBurner

Posted in PHP, The Internet by Dan on August 1st, 2008

Still suffering a few teething problems following my attempts to Digg-proof this blog.  It seems that WP Super Cache, in its default configuration at least, is incompatible with FeedBurner and its WordPress plug-in.  So if you’ve been having trouble accessing my feed, this is why (for some reason it has been serving up the front page HTML instead of the feed RSS).

I tried a few things to fix the problem. Each fix seemed to work for a while, but things soon went wrong again. Rather than waste time figuring out exactly what is going wrong, I’ve switched back to WP Cache.

I have to agree with Jeff Atwood that caching really ought to be core functionality for a blog publishing platform like WordPress.  Then we wouldn’t have to mess around configuring different plug-ins and trying to get them to play nicely together.

In terms of functionality, WordPress still appears to be the best option for self-hosted blogging but it’s not without its annoyances.  If I were to switch from WordPress to something else, these are some of the features I would like to see:

  • Built-in page caching.
  • Support for multiple blogs with a single installation of the software (with WordPress, you have to use a separate branch of the software, WordPress MU, to achieve this).
  • Support for databases other than MySQL (PostgreSQL as a minimum, but really any mainstream SQL database should be usable).
  • A better approach to themes (I shouldn’t have to write PHP to develop themes.  With appropriately structured pages, I could probably achieve everything that I want using just CSS).
  • Integrated support for popular advertising services such as AdSense (I shouldn’t have to cut-and-paste JavaScript into PHP files).
  • Ability to import posts and comments from WordPress.

Does such a platform exist, or will I have to write it myself?

DZone RSS Tricks

Posted in Software Development, The Internet by Dan on July 31st, 2008

A comment from “Dmitryx” on DZone about how DZone could provide better options for filtering its “Popular Links” got me thinking about how Yahoo! Pipes (my new favourite toy) could be used to filter DZone RSS feeds in interesting ways.  Helpfully, DZone provides a lot of information in its RSS feeds including the number of votes for and against, the click count, the number of comments, the username of the submitter, the URL of a thumbnail image, a list of tags and more.  So if you want to go down the Pipes route, there are a lot of possibilities.

However, something else that is not immediately obvious is that DZone provides a lot of functionality of its own for filtering the feeds that it serves.  Most DZone users will be aware that they can subscribe to a feed of “Popular Links” (those that make it to the front page) or subscribe to the “New Links” feed (those that have recently been submitted and have not yet attracted enough votes to qualify for the front page).  The respective URLs for these feeds are:

http://www.dzone.com/links/feed/frontpage/rss.xml
http://www.dzone.com/links/feed/queue/rss.xml

But these two feeds are not the only ones available. There are also feeds for each of the tags. If, for example, you want to subscribe only to Python articles, you would use one of the following feeds (again, divided into “Popular” and “New”):

http://www.dzone.com/links/feed/frontpage/python/rss.xml
http://www.dzone.com/links/feed/queue/python/rss.xml

This is great if the articles you want fit neatly into one of the 48 categories that DZone provides, but what if you want to restrict the feed to articles about Haskell, which doesn’t have its own tag (it is lumped together with Lisp, Erlang, Scala and the rest under the “Other Languages” tag)? Fortunately, DZone provides a solution for this as well. You can create a feed for any search phrase. A Haskell feed (covering both new and popular links) has the following URL:

http://www.dzone.com/links/feed/search/haskell/rss.xml

Kevin Pang has also discovered that you can provide query parameters to DZone’s feed URLs to manipulate the results (although DZone main man Rick Ross warns that these are not guaranteed to be supported in the future).

It’s not just topics that you can filter on.  You can also subscribe to a feed of links submitted by a particular user.  First you need to find out that user’s numeric ID (it’s part of the URL for their profile page), and then use that to construct the feed URL:

http://www.dzone.com/links/feed/user/<user_id>/rss.xml

Likewise for that user’s shared links:

http://www.dzone.com/links/feed/shared/<user_id>/rss.xml

If these options alone aren’t enough, by using Yahoo! Pipes to combine, sort and filter multiple DZone feeds you should be able to tailor your subscription to match your interests precisely.
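
If you would rather not use Pipes at all, the same combine-and-sort idea only takes a few lines of Python. This is just a rough sketch, not a polished tool: it assumes the third-party feedparser library is available, and the feed URLs are simply examples from the list above.

# Sketch: merge several DZone feeds and list the items newest-first.
# Requires the third-party "feedparser" library.
import feedparser

FEEDS = [
    "http://www.dzone.com/links/feed/frontpage/python/rss.xml",
    "http://www.dzone.com/links/feed/search/haskell/rss.xml",
]

def merged_items(urls):
    items = []
    for url in urls:
        items.extend(feedparser.parse(url).entries)
    # Sort by publication date; entries without a date sink to the bottom.
    items.sort(key=lambda e: tuple(e.get("published_parsed") or (0,) * 9),
               reverse=True)
    return items

if __name__ == "__main__":
    for entry in merged_items(FEEDS)[:20]:
        print(entry.title, "-", entry.link)

From there it is easy to add filtering (by submitter, tag or anything else the feed exposes) before printing or re-publishing the merged list.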

Melting Virtual Servers: The Digg Effect

Posted in PHP, The Internet by Dan on July 28th, 2008

I was right about everybody having strong opinions about code commenting. The popularity of my previous post on the subject (perhaps due to its provocative title) brought this site to its knees.

The Calm Before the Storm

I was receiving higher-than-average traffic from the regular sources, such as DZone and JavaBlogs, but this traffic was still pretty trivial and nothing that the server couldn’t handle. The real problem occurred sometime on Sunday evening. While I was busy playing football (we lost 4-0), my post was rapidly gaining attention on Digg.

The entire Uncommons.org empire is hosted on a 256MB VPS (Virtual Private Server) from Slicehost. Those 256 megabytes of RAM are shared between Apache, PHP, MySQL and Jetty. Though this is a modest amount of memory, it is sufficient for the traffic that I normally attract. I was not entirely unprepared for a traffic spike: I had installed the WP-Cache plugin for WordPress some time ago to avoid unnecessary MySQL queries, and I’d also tweaked the Apache configuration to make the most of the limited resources available to it (or so I thought – I guess I should take more notice of my own advice about optimisations).

Blissful Ignorance

For about two hours after the post hit the front page of Digg, I was completely oblivious to the pain being inflicted on my poor VPS. It wasn’t until I tried to visit my WordPress dashboard that I realised something wasn’t quite right. The server was not responding very promptly at all. SSH access was still available, though a little slow. The first thing I checked was the load average. It was high but not massively so. Next I checked memory usage. Pretty predictably, all of the RAM and all of the swap space were in use.

I only use Jetty for running Hudson and, while it is pretty lightweight for a Java server, it still accounts for a lot of the memory usage on a 256MB slice. So I shut it down to release some resources.

I didn’t have to do much analysis to figure out where the traffic was coming from as my first guess was right; I quickly found my post in the Technology section of Digg. Initially I didn’t realise that it was also on the front page.

Some Figures

The post was on Digg’s front page for just over three and a quarter hours. I served up around 10,000 page impressions to around 6,000 unique visitors in that time. I’ve served over 9,000 pages so far today, at a steadier rate, and continue to receive hundreds of hits per hour. I’ve transferred over 3.5GB of data in total. Throughout this time the server has stayed up, but the site has been very slow to respond, and I am sure that many visitors gave up before they got a response (some of the comments on Digg are redirecting readers to cached versions of the page because of this). If the site had been up to it, it probably would have served thousands more pages. The article has received over 2,000 “Diggs”, more than half of them after it dropped off the front page.

It could have been worse.  At least I wasn’t linked to by Yahoo!.

Remedial Action

Even after the worst of the spike, the response times were terrible. CPU load was negligible, but there was a bottleneck elsewhere, ultimately down to insufficient RAM.

Searching for potential solutions, one idea I liked a lot was using mod_rewrite to redirect all traffic from digg.com to a Coral-cached version of the page. I did set this up, but I couldn’t get Coral to cache the page because its requests to my server were timing out.
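
For anyone who wants to try the same trick, the rule set looks roughly like this. Treat it as an illustrative sketch based on the commonly circulated recipe (the digg.com referrer pattern, the CoralWebPrx user-agent check and the .nyud.net suffix are the usual ingredients), not as my exact configuration:

# Sketch: bounce visitors arriving from Digg to the Coral CDN copy of the page.
RewriteEngine On
# Don't redirect Coral's own fetches, otherwise it can never cache the page.
RewriteCond %{HTTP_USER_AGENT} !^CoralWebPrx
# Only redirect requests referred from Digg.
RewriteCond %{HTTP_REFERER} digg\.com [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}.nyud.net/$1 [R=302,L]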

Eventually I decided to replace Apache… with Apache. The default Apache install for Ubuntu is the prefork MPM (multi-processing module) version. Apparently this process-based implementation is the more robust option, but it’s considerably more RAM-hungry than the alternative thread-based worker MPM. In my memory-constrained environment, the worker MPM seemed worth a punt.

Fortunately, Slicehost makes back-ups ridiculously simple, so I took a one-click snapshot of the server and then installed the worker version of Apache:

sudo apt-get install apache2-mpm-worker

This was more work than I was expecting because it involved uninstalling PHP. To use the worker MPM I would have to run PHP via FastCGI. I found these instructions on how to set that up.

Once I’d finally got the configuration right (the web server was down for about 20 minutes), there was an immediate improvement. The server seems to be dealing with requests very well now, though I still have not restarted Jetty. Whether this configuration would have stood up to the peak force of Digg’s DDoS remains to be seen. I was also considering a switch to Lighttpd, but that would have been a bit more drastic since I’ve no experience with it.

Other Tweaks

The next item on my list is to replace WP-Cache with WP-Super-Cache so that WordPress can serve fully static pages without having to load the PHP engine each time. Other suggestions that I found include optimising the WordPress MySQL tables and installing a PHP compiler cache. Maybe I also need to set up some kind of monitoring to alert me when there is a huge traffic spike?

Fun with Yahoo! Pipes and Last.fm

Posted in Software Development, The Internet by Dan on July 24th, 2008

So I might be about 18 months late, but I finally got around to playing with Yahoo! Pipes today. I was aware of the basic concept but I was not aware of how impressive the implementation is. It’s an incredibly powerful tool with a slick UI that allows you to perform some pretty complex data processing without doing any real programming.

For my first experimental pipe, I just had it aggregate and sort the feed from this blog, my Google Reader shared links feed and my Flickr photostream feed. Easy-peasy.

Things got a bit more interesting when I tried to add my Last.fm “loved tracks” (favourites) to this super DanFeed.  This is because Last.fm doesn’t actually have a feed for “loved tracks”. It has a feed for all recently played tracks, but I can’t really see the point of this because, with one entry every 3 or 4 minutes, it’s too much information for even the most dedicated stalker to digest.

Last.fm does, however, have a REST API to provide access to its data. Yahoo! Pipes is not restricted to processing RSS and Atom feeds. It can also extract data from JSON, CSV, arbitrary XML and even plain old HTML pages, so it didn’t take very long to get at the data I wanted.

After a little bit of trial-and-error I was able to include album art thumbnails in the feed too (for feed-readers that will display them). The only thing that wasn’t intuitive was how Pipes deals with dates for RSS entries. There was a lot of head-scratching before I finally succeeded in getting the dates from the Last.fm XML into the Pipes RSS.
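
For anyone attempting the same thing outside of Pipes, the awkward step (turning Last.fm’s Unix timestamps into the RFC-822 dates that RSS readers expect) is only a couple of lines of Python. This is just a sketch: the element and attribute names used here (“track”, “name”, “url”, “date”, “uts”) are illustrative and should be checked against an actual Last.fm API response.

# Sketch: convert Unix timestamps from Last.fm-style XML into RFC-822 dates
# suitable for an RSS <pubDate> element. Element/attribute names are illustrative.
import xml.etree.ElementTree as ET
from email.utils import formatdate

def rss_items(lastfm_xml):
    root = ET.fromstring(lastfm_xml)
    for track in root.findall("track"):
        uts = int(track.find("date").get("uts"))
        yield {
            "title": track.findtext("name"),
            "link": track.findtext("url"),
            # e.g. "Thu, 24 Jul 2008 19:30:00 GMT"
            "pubDate": formatdate(uts, usegmt=True),
        }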

The result of all of this is that I have published my first (semi-)useful Pipe, one that allows you to share your favourite tracks with your friends. In effect, they can subscribe to your recommendations. The pipe is here. Just type in a Last.fm username and press the button. You can get a link to your personalised RSS feed from the output page. If you want to embed the feed (including the thumbnail images) on your website/blog/Facebook/whatever, just click on “Get as a Badge” after generating your custom feed.

Teach Yourself with University CS Resources

Posted in Software Development, The Internet by Dan on June 23rd, 2008

Over at DZone, I saw an article titled “Who Needs a Computer Science Degree When There’s Wikipedia?”. It suggests that you can learn as much from Wikipedia as you can by pursuing a formal university education in Computer Science. Sure, Wikipedia can be extremely informative (at least as an initial resource), but a random walk through the Wikipedia jungle could take you anywhere. It’s not a very structured syllabus.

I’ve been through a university CS education. I’m not going to argue the pros and cons of it here. Instead I’m more interested in how to acquire similar knowledge freely via the web. I’m certain that there are better approaches than trawling through Wikipedia (though Wikipedia would remain invaluable for background reading and finding references to more authoritative sources).

For me, the most obvious place to start is the universities themselves. Have a look at the Computer Science department websites and you will find that many of them provide access to course materials for anyone to download. One of the perils of teaching yourself is that you often don’t know what you don’t know. Unlike Wikipedia, the university content will be from a structured course, designed to teach the important stuff and avoid leaving huge blind spots in your knowledge.

Unlike going to university for real, you don’t have to worry about fees, academic records or geography. You get to pick from the best universities worldwide to provide your education. Leading the way is MIT with its OpenCourseWare initiative, which provides high-quality content tailored for remote learning. But there are many other universities that provide access to lecture notes (or videos) and exercises.

I was thinking how useful it would be if there was a website that collated links to the best online CS course materials. Then, quite by accident, I stumbled across Google’s CS Curriculum Search. This is a Google web search restricted to the CS departments of universities. It categorises the results into “Lectures”, “Assignments” and “Reference”. It seems to be a very useful tool.

The Curriculum Search is part of the Google Code University, which includes their own content related to CS topics that are important to them (e.g. distributed computing and web security).

Another resource that may prove useful is Scholarpedia, which I have mentioned before.

Google takes on Wikipedia

Posted in The Internet by Dan on December 15th, 2007

The BBC brings news of Google’s plans for an online encyclopedia to rival Wikipedia.

The new project, called Knol, attempts to address some of Wikipedia’s shortcomings by putting more emphasis on respected authors and peer-reviewed content. In exchange for contributing, authors will receive a share of the ad revenue for their pages. Meanwhile, Wikipedia steadfastly refuses to display adverts, and instead relies on charitable donations to cover its costs.

This sounds a lot like the Scholarpedia project that I wrote about previously. But Scholarpedia lacks the considerable backing of the Google machine or the financial incentives of AdSense.

So will Google crush Wikipedia? Will Wikipedia have to adapt to survive? Or is it too entrenched already for Google’s efforts to have any real impact?

Wikipedia’s ad hoc editing certainly results in some interesting articles. During this year’s World Cup I found 3 separate pages detailing rugby player Jonny Wilkinson’s international points-scoring record, each with a wildly different number (including one that put him hundreds of points ahead of all-time record holder Neil Jenkins). Other things Wikipedia has taught me in the last year are that Clash frontman Joe Strummer was in favour of AIDS and global warming (or perhaps it was just a poorly constructed sentence), and that billionaire Chelsea Football Club owner Roman Abramovich is in fact a dustman.

Blog moved

Posted in The Internet by Dan on May 14th, 2007

I’ve moved from Blogger to my own site so that I can have more control (via WordPress). I’ve managed to import all of the old articles but seem to have lost the comments in the process. I’ve also updated my JavaBlogs entry, which has had the unfortunate side-effect of spamming the front page with old articles. Sorry about that…