Who commits to five nines (99.999%) availability?

Can anyone point me to a cloud hosting provider who will really commit to 99.999% availability? From what I’ve seen, some offer this but back it up only with meaningless compensation if they fail, such as refunding a month’s hosting fees. For any business that really needs 99.999% availability, that refund is out of all proportion to the damage unscheduled downtime brings. The other catch you often see is 99.999% planned availability, which means they can schedule as much downtime as they like as long as they plan it and let you know in advance.

I ask this question because, although I’m very much in favour of cloud hosting, I increasingly feel that clients who really need 99.999% availability need to look at a hybrid solution. That could combine your own hosting with scaling out to the cloud, which has been done successfully. It could mean hosting in multiple availability zones with your chosen cloud platform. Or, ultimately, it could be a multi-cloud solution that spreads the hosting over two or more cloud providers, which is the only option if you really want to engineer out single points of failure.

So while contracts often call for five nines, can the client afford that extra nine, and are they prepared for the complications it brings? And even if they are, is it more of a target than something anyone will actually commit to in application hosting today?

Any application with 99.999% availability will need to have been designed from the ground up with that availability level in mind. You can’t take an existing application and easily retrofit this level of high, or rather continuous, availability. In reality, if you want to upgrade an existing application from 99.99% availability to 99.999%, you are going to have to engage in a serious refactoring project.

Google Service Status Page – A great example of best practice.

Following the brief outage of Gmail on September 1st, I was reminded that Google publish a status page, or dashboard, showing the status of all their services. You can find it at www.google.com/appsstatus. I mention this because it’s an excellent example of providing visibility, and therefore accountability, about the services you provide, which is essential if you’re being paid to provide a service. If you’re responsible for providing IT services to your business or customers, then you really need to consider how you can create this type of service dashboard or status page.

If you’re involved in providing online services, then you need to have formally agreed service up-time levels and planned maintenance times. When agreeing up-time SLAs, you need to get people to understand the cost of moving from 98% to 99%, then to 99.99%, and on to 99.999% (five nines) up-time. Have a think about it: the level of engineering needed to deliver 99% is quite different from that needed for 99.999%. The table below shows how much downtime each level allows.

Availability   per day      per month    per year
99.999%        00:00:00.9   00:00:26     00:05:16
99.99%         00:00:09     00:04:23     00:52:36
99.9%          00:01:26     00:43:50     08:45:57
99%            00:14:24     07:18:17     87:39:30
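
If you want to work out figures like these for your own SLA levels, the arithmetic is simply the length of the period multiplied by the fraction of time you are allowed to be down. Here is a minimal sketch in Python. The function name `downtime_budget` and the period lengths (a 365.25-day year, and a month taken as one twelfth of that) are my own assumptions for illustration, so the results may differ from the table by a few seconds depending on rounding.

```python
from datetime import timedelta

# Assumed period lengths (not from any SLA standard): a 365.25-day year
# and a month taken as one twelfth of that year.
PERIODS = {
    "day": timedelta(days=1),
    "month": timedelta(days=365.25 / 12),
    "year": timedelta(days=365.25),
}


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    # Print the allowance for each availability level and period.
    for level in (99.999, 99.99, 99.9, 99.0):
        row = ", ".join(
            f"per {name}: {downtime_budget(level, length)}"
            for name, length in PERIODS.items()
        )
        print(f"{level}% -> {row}")
```

Running it for a level you are being asked to sign up to is a quick way to show people what an extra nine really means in minutes and seconds.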

If you commit to 99.999% up-time, you’re allowed about 5 minutes of downtime a year. That’s not enough time to do anything, so you need your application to be running on a distributed system over two or more sites, with instant fail-over and probably a load-balanced workload. In contrast, 99% up-time allows you more than 87 hours of downtime a year, which means that you can stick with simpler technologies like RAID and mirrored servers.
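
To see why a second site helps so much, here is a rough sketch of the standard calculation for redundant systems: the service is down only when every site is down at once. It assumes that site failures are completely independent and that fail-over is instant and always works. Neither is quite true in practice (shared software bugs, DNS and the network all link sites together), so treat the result as an upper bound. The function name `combined_availability` is just an illustrative choice.

```python
def combined_availability(site_availability_percent: float, sites: int) -> float:
    """Availability (in percent) of `sites` redundant sites.

    Assumes independent failures and perfect, instant fail-over, so the
    service is unavailable only when all sites are down at the same time.
    """
    if sites < 1:
        raise ValueError("at least one site is required")
    if not 0 <= site_availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    failure_probability = 1 - site_availability_percent / 100
    # Probability that every site is down simultaneously.
    all_down = failure_probability ** sites
    return (1 - all_down) * 100


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(f"{count} site(s) at 99.9% each: {combined_availability(99.9, count):.7f}%")
```

On paper, two sites that each manage three nines get you past five nines, which is why the engineering effort goes into making the fail-over itself reliable.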

Let me know what you think and how you approach up-time SLAs.