Following the brief Gmail outage on September 1st, I was reminded that Google publish a status dashboard showing the state of all their services. You can find it at www.google.com/appsstatus. I mention this because it’s an excellent example of providing visibility, and therefore accountability, for the services you provide, which is essential if you’re being paid to provide them. If you’re responsible for providing IT services to your business or customers, you really need to consider how you can create this type of service dashboard or status page.
If you’re involved in providing online services then you need formally agreed up-time levels and planned maintenance windows. When agreeing up-time SLAs you need to get people to understand the cost of moving from 98% to 99%, then to 99.99%, and finally to 99.999% (five nines) up-time. Have a think about it: the level of engineering needed to deliver 99% is quite different from what 99.999% demands.
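|Availability||Downtime per day||Downtime per month (30 days)||Downtime per year (365 days)|
|98%||28.8 minutes||14.4 hours||7.3 days|
|99%||14.4 minutes||7.2 hours||3.65 days|
|99.9%||1.44 minutes||43.2 minutes||8.76 hours|
|99.99%||8.64 seconds||4.32 minutes||52.56 minutes|
|99.999%||0.86 seconds||25.9 seconds||5.26 minutes|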
If you commit to 99.999% up-time, you’re allowed just over 5 minutes of downtime a year. That’s not enough time to do anything, so you need your application to be running on a distributed system across two or more sites, with instant failover and probably a load-balanced workload. In contrast, 99% up-time allows you about 87.6 hours (roughly 3.65 days) of downtime a year, which means you can stick with simpler technologies like RAID and mirrored servers.
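If you want to check the downtime allowance for an up-time level yourself, the arithmetic is simple enough to script. The following is a minimal sketch, not a definitive tool: the function name `downtime_budget`, the `PERIODS` table, and the choice of a 30-day month and a 365-day year are all my own assumptions for illustration.

```python
from datetime import timedelta

# Assumed period lengths (hypothetical choices): a 30-day month, a 365-day year.
PERIODS = {
    "day": timedelta(days=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Fraction of the period during which the service may be down.
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    # Print the allowance per day, month and year for common SLA levels.
    for level in (98, 99, 99.9, 99.99, 99.999):
        budgets = ", ".join(
            f"{downtime_budget(level, length)} per {name}"
            for name, length in PERIODS.items()
        )
        print(f"{level}%: {budgets}")
```

Under these assumptions, 99% over a 365-day year works out to about 87.6 hours of downtime, while 99.999% leaves roughly 5 minutes and 15 seconds. Whether planned maintenance counts against that budget is something your SLA needs to state explicitly.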
Let me know what you think and how you approach up-time SLAs.