I was right about everybody having strong opinions about code commenting. The popularity of my previous post on the subject (perhaps due to its provocative title) brought this site to its knees.
The Calm Before the Storm
I was receiving higher-than-average traffic from the regular sources, such as DZone and JavaBlogs, but this traffic was still pretty trivial and nothing that the server couldn’t handle. The real problem occurred sometime on Sunday evening. While I was busy playing football (we lost 4-0), my post was rapidly gaining attention on Digg.
The entire Uncommons.org empire is hosted on a 256mb VPS (Virtual Private Server) from Slicehost. These 256 megabytes of RAM are shared between Apache, PHP, MySQL and Jetty.Though this is a modest amount of memory, it is sufficient for the traffic that I normally attract. I was not entirely unprepared for a traffic spike. I had installed the WP-Cache plugin for WordPress sometime ago to avoid unnecessary MySQL queries. I’d also tweaked the Apache configuration to make the most of the limited resources available to it (or so I thought – I guess I should take more notice of my own advice about optimisations).
For about two hours after the post hit the front page of Digg, I was completely oblivious to the pain being inflicted on my poor VPS. It wasn’t until I tried to visit my WordPress dashboard that I realised something wasn’t quite right. The server was not responding very promptly at all. SSH access was still available, though a little slow. First thing I checked was the load average. It was high but not massively so. Next I checked memory usage. Pretty predictably, all of the RAM and all of the swap space was being used.
I only use Jetty for running Hudson and, while it is pretty lightweight for a Java server, it still accounts for a lot of the memory usage on a 256mb slice. So I shut it down to release some resources.
I didn’t have to do much analysis to figure out where the traffic was coming from as my first guess was right; I quickly found my post in the Technology section of Digg. Initially I didn’t realise that it was also on the front page.
The post was on Digg’s front page for just over three and a quarter hours. I served up around 10,000 page impressions to around 6,000 unique visitors in that time. I’ve served over 9,000 pages so far today, at a more steady rate, and continue to receive hundreds of hits per hour. I’ve transferred over 3.5Gb of data in total. Throughout this time the server has stayed up but the site has been very slow to respond and I am sure that many visitors gave up before they got a response (some of the comments on Digg are redirecting readers to cached versions of the page because of this). If the site had been up to it, it probably would have served thousands more pages. The article has received over 2000 “Diggs”, over half of them after it dropped off the front page.
It could have been worse. At least I wasn’t linked to by Yahoo!.
Even after the worst of the spike the response times were terrible. CPU load was negligible but there was a bottle-neck elsewhere, ultimately down to insufficient RAM.
Searching for potential solutions, one idea I liked a lot was using mod_rewrite to redirect all traffic from digg.com to a Coral cached version of the page. I did set this up, but I couldn’t get Coral to cache the page because its requests to my server were being timed out.
Eventually I decided to replace Apache… with Apache. The default Apache install for Ubuntu is the prefork MPM (multi-processing module) version. Apparently this process-based implementation is the more robust option but it’s considerably more RAM-hungry than the alternative thread-based worker MPM. In my memory-constrained enviroment, the worker MPM seemed worth a punt.
Fortunately, Slicehost makes back-ups ridiculously simple, so I took a one-click snapshot of the server and then installed the worker version of Apache:
sudo apt-get install apache2-mpm-worker
This was more work than I was expecting because it involved uninstalling PHP. To use the worker MPM I would have to run PHP via FastCGI. I found these instructions on how to set that up.
Once I’d finally got the configuration right (the web server was down for about 20 minutes), there was an immediate improvement. The server seems to be dealing with requests very well now, though I still have not restarted Jetty. Whether this configuration would have stood up to the peak force of Digg’s DDOS remains to be seen. I was also considering a switch to Lighttpd, but that would have been a bit more drastic since I’ve no experience with it.
The next item on my list is to replace WP-Cache with WP-Super-Cache so that WordPress can serve fully static pages without having to load the PHP engine each time. Other suggestions that I found include optimising the WordPress MySQL tables and installing a PHP compiler cache. Maybe I also need to set-up some kind of monitoring to alert me when there is a huge traffic spike?