Lessons Learned from the Massive Westhost Outage this Week
If you didn’t know, this has been a tumultuous week for clients of Westhost, my internet service provider. Their primary data center is located in Utah, and they share that space with a sister brand, VPS.net. The datacenter is a Tier IV center managed by Consonus. Saturday afternoon there was a yearly test of the fire equipment, alarm and suppression system. The third-party technician failed to follow procedures, and one actuator remained on the output system for the gas that is designed to suppress fires in the building. When the system was re-armed there was a sudden release of the gaseous fire suppressant. At that same moment hundreds of hard drives died. Now, Inergen is what was used, and the gases themselves shouldn’t be a problem. In this case, judging from what I’ve read, the problem was the sudden and intense change in air pressure caused by the release. That point is somewhat moot, though. The end result is hundreds of dead and damaged hard drives.
Now, I have an account with VPS.net as well as Westhost: one VPS in the Utah datacenter with VPS.net and several (12 or so) VPSs with Westhost. Westhost uses Sphera as its virtual machine platform, while VPS.net is a Xen-based platform. One of the advantages VPS.net had is that their architecture is cloud based, and another, from what I can see, is that they have a smaller hardware footprint at the DC than Westhost. VPS.net first reported some sort of potential power issue at the DC early on, before finally getting the details that power was not the issue; fire suppressant was. Fortunately for them, they were able to get enough drives replaced to be operational by Saturday night/Sunday morning and were back to 100% not long afterwards.
Westhost has not been nearly so fortunate. The first reports out of Westhost were that Sphera was acting up, a statement that in hindsight was raked over the coals by many as a lie. In all fairness, if you looked at your network status monitor and saw EVERYTHING going down, wouldn’t you wonder whether your platform itself was the issue? I would. Information was slow coming out of Westhost, I think in part due to the size and scale of the damage. I have been lucky and have only seen sporadic downtime, maybe a day at the most on Saturday. Some systems have not been so lucky and are still offline.
Among the issues: multiple drives in the RAID arrays on individual machines were hit. In some cases that requires restoring from backups. Well, the servers that host the backups were hit too. (Backups in the same building are a traditional IT no-no, but realistically in many hosting situations that is par for the course.) For one group of servers they have now brought in data recovery experts to recover drives to restore from.
In all fairness, I think they’ve had a monumental repair job and have been handling it quite well. I would have appreciated more information up front, but being a self-employed tech I know that talking doesn’t solve problems with hardware, and I can understand why it took so long to gather the facts of what happened and pass them along.
The whole event though has had me thinking seriously about webhosting and the way we treat our data in the cloud. We assume that if our webhost does backups then we shouldn’t worry about making backups. But what if a comet hits their datacenter?
For many people a week of outage may not faze them. Maybe you just have a hobby site, or “it’s just a brochure” or “most of our leads come from local referrals anyway”. But many others make hundreds or thousands a day from their site (or tens of thousands). What then? First off, if your sites are making that much for you, does it make sense to be using cheap hosting? I think it’s a serious question to ask. I’m not saying that cheap is unreliable, mind you, but with cheaper hosting you get fewer frills (backups not stored offsite, for instance, or perhaps a single datacenter). There are services out there that are more expensive but can better assure uptime. There are services where you can have your site mirrored across several data centers and load balanced to whichever one is currently the most responsive.
So here is what I’ve come to: if your site is THAT important to you, here’s what you should do.
1) Look at and consider load-balanced hosting with multiple datacenters.
2) Have a registrar for your domain name that is different from your hosting company. Why? The registrar is where you set the DNS server information. If your hosting company is also your registrar and it goes down, you won’t be able to switch DNS to another location.
3) Have a copy of your DNS settings available offsite so that you can reconstruct the DNS elsewhere if necessary. This is especially useful for hosted Google Docs/Gmail setups.
4) Have an offsite backup of all your essential configuration files (mail accounts and passwords, for instance, or mail aliases, and perhaps the virtual server settings for Apache).
5) Have an offsite backup for any databases you use. You can script these to be automatically made and sent to an offsite location fairly easily, or pay someone to do so.
6) Have an offsite backup for any install files, HTML files or other tools that you make use of on your site.
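To give a rough idea of how little scripting items 4 through 6 take, here is a minimal Python sketch. It bundles whatever files and folders you list into one timestamped archive and computes a checksum so you can later confirm the copy is intact. The function names `make_backup` and `sha256_of` are my own, and any paths you pass in are up to you; this is an illustration under those assumptions, not a finished backup tool.

```python
import hashlib
import tarfile
import time
from pathlib import Path


def make_backup(sources, dest_dir, label="site"):
    """Bundle the given files/directories into one timestamped .tar.gz archive.

    `sources` is a list of paths chosen by the caller (hypothetical examples:
    a config directory, the web root, a database dump file).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{label}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for src in sources:
            src = Path(src)
            # Store each source under its own name, not its absolute path.
            # Note: two sources with the same final name would collide.
            tar.add(str(src), arcname=src.name)
    return archive


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

You might call it from a nightly cron job as something like `make_backup(["/path/to/configs", "/path/to/webroot", "/path/to/db-dump.sql"], "/path/to/backups")`, where every path is a placeholder for your own layout and the database dump is produced beforehand by your database’s own dump tool. The archive still has to be moved somewhere offsite to count.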
Most sites are going to find this kind of contingency planning fairly easy. It’s only hard WHEN YOUR HOST IS DOWN and you don’t have an offsite copy. Some would say, “Well, the hosting company should do this for me.” My only response is that you then shouldn’t complain when your site is down. Take responsibility for the planning and you won’t regret it.
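As an example of how easy, the copy of your DNS settings from item 3 can be as simple as a small file you keep somewhere other than your host. The sketch below stores records as JSON and prints them back in zone-file style (name, TTL, class, type, value) so they can be re-entered at another provider. The function names, the default TTL of 3600 and the example.com records are all made up for illustration; you would fill in your own records by hand or from your DNS provider’s export.

```python
import json


def to_zone_lines(records):
    """Render record dicts as zone-file-style lines: name TTL IN TYPE value."""
    ordered = sorted(records, key=lambda r: (r["name"], r["type"], r["value"]))
    return [
        # A missing "ttl" falls back to 3600 seconds (an assumed default).
        f'{r["name"]} {r.get("ttl", 3600)} IN {r["type"]} {r["value"]}'
        for r in ordered
    ]


def save_dns_snapshot(records, path):
    """Write the records to a JSON file that can be copied offsite."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2, sort_keys=True)


def load_dns_snapshot(path):
    """Read records back from a JSON snapshot file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical records using reserved example names and addresses.
EXAMPLE_RECORDS = [
    {"name": "www.example.com.", "type": "A", "value": "192.0.2.10", "ttl": 300},
    {"name": "example.com.", "type": "MX", "value": "10 mail.example.com."},
]
```

Keeping the MX and any mail-related CNAME records in the snapshot is what makes it useful for hosted mail setups, since those are the ones people tend to forget.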
Now, I don’t mean to suggest that everyone should abandon their hosting the minute there is an outage. In fact, I see many moves like this as a temporary “bug out” until things settle down with your main provider. Actually, as a VAR myself, I’m thinking this is a service I should be offering my hosting clients so that they don’t have to deal with the mess of moving firsthand.
So, what do I mean by offsite? Well, you could of course load all of the backups onto a server or desktop in your office or business. That’s a possibility, but you could also make use of online storage options like jungledisk, or go direct to Amazon’s S3 storage. You may want to just use a spare slicehost, vps.net or linode server as your backup server. My main preference would be to at least host it in a different datacenter from your primary VPS or other hosting, and potentially to set it up with a different company. At the very least, you should research fallback hosting options from other providers in case yours is knocked out by an outage like this one.
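Whichever destination you pick, the two things I would want from the copy step are a check that the file arrived intact and some limit on how many old backups pile up. Here is a minimal Python sketch of both. It assumes the offsite location is reachable as an ordinary directory (a mounted network share, a synced folder and so on) and that backup file names contain a sortable timestamp; the function names and the default of keeping 7 archives are my own choices, not anything a particular service requires.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(archive, offsite_dir):
    """Copy an archive into the offsite directory and verify the copy."""
    offsite = Path(offsite_dir)
    offsite.mkdir(parents=True, exist_ok=True)
    target = offsite / Path(archive).name
    shutil.copy2(str(archive), str(target))
    if file_digest(archive) != file_digest(target):
        raise IOError(f"checksum mismatch for {target}")
    return target


def prune_old(offsite_dir, keep=7, pattern="*.tar.gz"):
    """Delete all but the newest `keep` archives and return the removed paths.

    Assumes file names sort chronologically (e.g. a YYYYMMDD-HHMMSS stamp).
    """
    archives = sorted(Path(offsite_dir).glob(pattern), key=lambda p: p.name)
    old = archives[: max(len(archives) - keep, 0)]
    for path in old:
        path.unlink()
    return old
```

If your offsite storage is an object store or a remote server rather than a mounted directory, the same two ideas apply, but the copy would go through that service’s own client or something like rsync instead of `shutil`.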
With respect to Westhost, I’m impressed with their work at this point. They’ve been one of the most responsive companies I’ve ever dealt with, and they will get through this. Yes, some customers may go elsewhere, and some likely already have. That happens. It’s human nature: after an incident like this, people will always tend to look elsewhere for what they think is best for them. With proper planning, though, a temporary site can be up quickly and easily elsewhere while the dust settles. Then, if you like, you can transition back to your previous hosting, or keep it as a spare in case your NEW location has a meltdown.
It would probably pay to do the contingency planning for all of your sites today!