The term "three nines" means that a system is operating correctly 99.9% of the time. Similarly, "four nines" means 99.99% of the time, and so on.
Three nines is a good target for an average internet web service, but even this may be difficult to attain.
Three nines allows about 8.8 hours of downtime per year, which works out to about 44 minutes per month or 10 minutes per week.
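The downtime figures above are simple arithmetic on the fraction of time a system is allowed to be down. A minimal sketch in Python follows; the function name `downtime_budget` is made up for illustration, and it assumes a 365-day year with a month counted as one twelfth of a year.

```python
def downtime_budget(availability: float) -> dict:
    """Return the downtime allowed by an availability fraction such as 0.999.

    Assumes a 365-day year; a "month" is one twelfth of that year.
    """
    down_fraction = 1.0 - availability
    minutes_per_year = 365 * 24 * 60
    return {
        "hours_per_year": down_fraction * minutes_per_year / 60,
        "minutes_per_month": down_fraction * minutes_per_year / 12,
        "minutes_per_week": down_fraction * 7 * 24 * 60,
    }


if __name__ == "__main__":
    for label, availability in [("three nines", 0.999), ("four nines", 0.9999)]:
        budget = downtime_budget(availability)
        print(
            f"{label}: {budget['hours_per_year']:.2f} h/year, "
            f"{budget['minutes_per_month']:.1f} min/month, "
            f"{budget['minutes_per_week']:.1f} min/week"
        )
```

Each extra nine divides the budget by ten, so four nines leaves under an hour of downtime per year.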
You're right, measuring three nines by time is not fair. It would be better to measure the percentage of user requests that actually succeed versus fail. This way you get credit for doing your maintenance at 2 am. Call it "reliability" or something similar instead of "uptime", but make a note of your computation method when claiming this.
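To show how the two measures can differ, here is a rough sketch that computes both. The function names and the traffic numbers are invented for illustration; in practice the request counts would come from your own server logs.

```python
def time_availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of wall-clock time the system was up."""
    if period_minutes <= 0:
        raise ValueError("period must be positive")
    return 1.0 - downtime_minutes / period_minutes


def request_reliability(succeeded: int, failed: int) -> float:
    """Fraction of user requests that succeeded."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no requests observed")
    return succeeded / total


if __name__ == "__main__":
    # Hypothetical week: 30 minutes of maintenance at 2 am,
    # during which only 12 of 1,000,000 requests arrived and failed.
    week_minutes = 7 * 24 * 60
    by_time = time_availability(30, week_minutes)
    by_request = request_reliability(1_000_000 - 12, 12)
    print(f"by time:     {by_time:.4%}")
    print(f"by requests: {by_request:.4%}")
```

With these made-up numbers the week misses three nines when measured by time but comfortably exceeds it when measured by requests, which is why the computation method should be stated alongside the figure.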
A great question. You want to increase MTBF (mean time between failures) and decrease MTTR (mean time to repair). This can be achieved through redundancy and automated repair. For example, use RAID to avoid the problem of disk failure. (Even better, use RAID with a hot spare.) Set up alarms: scripts that continuously check whether your systems are working, and email or text message you when problems are detected. You might also have the scripts restart the affected daemons or servers. Alarms could also monitor log files. Finally, when systems do go down, see if you can identify the specific cause and how to avoid or mitigate it in the future.
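An alarm script of the kind described above might look roughly like the following. This is only a sketch under stated assumptions: the names `http_probe`, `check_once`, and `monitor` are invented here, the failure threshold and polling interval are arbitrary, and the `restart` and `notify` actions are passed in as plain callables so that you can plug in whatever suits your systems (for example a command that restarts the service, or code that sends an email).

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection errors, timeouts, HTTP error statuses, and bad URLs
        # all count as a failed check.
        return False


def check_once(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    failures: int,
    threshold: int = 3,
) -> int:
    """Run one check and return the updated count of consecutive failures.

    When the count reaches the threshold, notify once and attempt a restart.
    """
    if probe():
        return 0
    failures += 1
    if failures == threshold:
        notify(f"{failures} consecutive checks failed; attempting restart")
        restart()
    return failures


def monitor(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    interval_seconds: float = 60.0,
    threshold: int = 3,
) -> None:
    """Poll forever, sleeping between checks."""
    failures = 0
    while True:
        failures = check_once(probe, restart, notify, failures, threshold)
        time.sleep(interval_seconds)
```

Requiring several consecutive failures before acting is one way to avoid restarting a service over a single dropped packet; the right threshold depends on how quickly you need to react. To use the sketch, call `monitor` with a probe such as `lambda: http_probe("http://localhost/")` (a placeholder URL) and your own restart and notify functions.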
© 2002-2009 Greg Briggs except where attributed otherwise