A few days ago the server went bonkers and locked up. At the time I was not sure why. I restarted the HTTP server (in our case, Lighttpd) and everything seemed to return to normal. That’s usually how you know everything is completely broken.
Today the server went nuts again. For the nerds amongst you, it had an I/O wait of 95% and a load average of around 20. That’s not the worst in history, I grant you, but for a modest 256MB VPS it’s Bad News.
More incredibly nerdy details below the fold.
It seems that I had mucked up a previous bit of preventative maintenance. Lighttpd on Ubuntu reads its configuration files out of /etc/lighttpd/conf-enabled, a directory of symlinks to files in /etc/lighttpd/conf-available. Files are loaded in the same fashion as System V init scripts: each name starts with a number, and they are loaded in ascending order.
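Turns out I had assumed the numbers would be read the way a person reads them, as whole numbers. They are actually sorted System V style, character by character in ASCII order, so a name beginning with 10 loads before one beginning with 9.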
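To make the ordering trap concrete, here is a minimal Python sketch comparing the two sort orders. The file names are hypothetical, made up for illustration and not the actual files on my server, and the sketch assumes the loader does a plain byte-wise sort of the names.

```python
import re

# Hypothetical file names, for illustration only.
names = ["5-cgi.conf", "10-fastcgi.conf", "15-php.conf", "100-late.conf"]


def ascii_order(filenames):
    """Sort names character by character, as a plain string sort does."""
    return sorted(filenames)


def numeric_order(filenames):
    """Sort names by the integer value of their leading digits."""
    def key(name):
        match = re.match(r"\d+", name)
        # Names without a leading number go last.
        number = int(match.group()) if match else float("inf")
        return (number, name)
    return sorted(filenames, key=key)


if __name__ == "__main__":
    print("ASCII order:  ", ascii_order(names))
    print("Numeric order:", numeric_order(names))
```

In ASCII order the hypothetical 5-cgi.conf lands last, after 10, 100 and 15; in numeric order it lands first. Zero-padding the prefixes (05, 10, 15) is the usual way to make the two orders agree.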
The upshot is that instead of using FastCGI, where several copies of PHP are started and kept in memory to serve requests, Lighttpd was using plain old slow CGI, where it starts up and destroys a full copy of PHP every time someone visits a page. When there are only a few visitors this is not a problem. If there’s a sudden burst (usually a group of search engines deciding simultaneously to index you, say), then the overhead of constantly swapping copies of PHP in and out of RAM leads to thrashing on the virtual disk. Requests then begin to pile up in Lighttpd, which means that it just keeps getting slower and slower.
This is, at least, my current best theory. I’ve renamed the various configuration files into the order I want them to load — correctly giving preference to FastCGI over plain CGI. I am hopeful that this will alleviate, and indeed prevent, the problem in future. A quick check of performance using the Apache benchmark tool ab2 suggests that it’s working as it should do under load.