An application like Seasonality that relies on online services needs those services to be reliable. This means any server I host has to be online as close to 100% of the time as possible. Website and email services are pretty easy to hand off to a shared hosting provider for around $10-20/month. It’s inexpensive, and you can leave the server management to the hosting provider. For most software companies, this is as far as you need to go.
This also worked okay when Seasonality was simply grabbing some general data from various sources. As soon as I began supporting international locations, I stepped out of the bounds of shared hosting. The international forecasts need to be hosted on a pretty heavy-duty server. Generating the forecasts pegs a CPU for about an hour, and the server updates them twice a day. Furthermore, the dataset is pretty large, so a fast disk subsystem is needed.
So I have a colocated server, which I’ve talked about before. It worked out pretty well until earlier this week, when one of the 4 disks in the RAID died. Usually, when a disk in a RAID dies, the system should remain online and continue working (as long as you aren’t using RAID 0). This time, though, the server crashed, and I was a bit puzzled as to why.
After doing some research, I found that the server most likely crashed because of an additional partition on the failed disk: a swap partition. When setting up the server, I configured swap across all four disks, with the hope that if I ever did go into swap a little bit, it would be much faster than hammering a single disk with activity. The logic seemed good at the time, but looking back, that was a really bad move, because losing any one of the four disks also takes out part of swap. In the future, I’ll stick to having swap on just a single disk (probably the same one as the / partition). Then only 1 of the 4 disks can take swap down with it, which reduces the chances of a disk failure crashing the system by 75%.
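To make the 75% figure concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes that each disk is equally likely to be the one that fails and that losing any disk holding swap crashes the system; the function name `swap_exposure` is made up for illustration and is not part of any tool.

```python
from fractions import Fraction


def swap_exposure(total_disks: int, disks_with_swap: int) -> Fraction:
    """Fraction of single-disk failures that land on a disk carrying swap.

    Assumes every disk is equally likely to fail (a simplification).
    """
    if total_disks <= 0 or not 0 <= disks_with_swap <= total_disks:
        raise ValueError("need 0 <= disks_with_swap <= total_disks and total_disks > 0")
    return Fraction(disks_with_swap, total_disks)


# Swap spread across all four disks: every disk failure hits swap.
striped = swap_exposure(4, 4)

# Swap on a single disk: only one failure in four hits swap.
single = swap_exposure(4, 1)

# Relative drop in the chance that a disk failure takes swap down with it.
reduction = (striped - single) / striped

print(f"striped: {striped}, single: {single}, reduction: {float(reduction):.0%}")
```

The trade-off is that swap on one disk gives up whatever speed came from spreading swap I/O across four spindles, which only matters if the server actually swaps.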
After getting a new disk overnighted from Newegg, I replaced the failed mechanism and added it back into the RAID, so the system is back up and running again.
This brings up the question of how likely something like this is to happen again. The server is about two and a half years old, so a disk failure at this age is not surprising, especially considering the substantial load on the disks in this server (blinky lights, all day long). At this point, I’m thinking of just replacing the other 3 disks. That way, I will have scheduled downtime instead of unexpected downtime. With the constantly dropping cost of storage, I’ll be able to replace the 300GB disks with 750GB models. It’s not that I actually need the extra space (the current 300s are only about half full), but I need at least 4 mechanisms to get acceptable database performance.
In the future, I will probably look toward getting hot-swappable storage. I’ve had to replace 2 disks now since I built the server, and to have the option of just sliding one disk out and replacing it with a new drive without taking the server offline is very appealing.