Wednesday, May 20, 2009

Business Continuity Series, pt 3 - Service metrics - What are your goals?

Although we try our best to avoid failures with methodologies such as Six Sigma (which aims to keep process output within six standard deviations of the target, or approximately 3.4 defects per million opportunities), some failures will still occur and need to be dealt with.

In the event of a system failure, two key metrics are good indicators of resiliency: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

RPO refers to the maximum amount of data, measured in time, that can be lost when a service is restored from backup. For instance, an RPO of 24 hours for a database server means that if there is a failure (server crash, hard drive failure, building burns down), the data that is restored is at most 24 hours old (in other words, in the worst case all data created in the last 24 hours is lost). RPO describes how current the information in your backups or other auxiliary sources is.
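To make the RPO arithmetic concrete, here is a minimal Python sketch. The function names and the timestamps are hypothetical, chosen only for illustration; it assumes that everything written after the most recent backup is lost when a failure occurs.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much recent data is lost if we restore from last_backup.

    Assumption: everything written after the last backup is unrecoverable.
    """
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data loss window stays within the RPO target."""
    return data_loss_window(last_backup, failure) <= rpo


# Hypothetical database server with nightly backups and a 24-hour RPO.
rpo = timedelta(hours=24)
last_backup = datetime(2009, 5, 19, 2, 0)   # backup finished at 02:00
failure = datetime(2009, 5, 19, 23, 30)     # crash later the same day

loss = data_loss_window(last_backup, failure)
within_target = meets_rpo(last_backup, failure, rpo)
```

In this made-up case the loss window is 21.5 hours, which is inside the 24-hour target. If the nightly backup had been skipped, the window would grow past 24 hours and the RPO would be missed.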

RTO refers to the amount of time that a process or service is unavailable (the time until service resumes). An RTO of 48 hours for cable television means that if the cable TV signal is disrupted (damaged line, transmitter failure, etc.), it will take the cable company up to 48 hours to restore service to your house.

The counterbalance to achieving excellent RPOs and RTOs is cost. Generally speaking, the shorter the RPO and RTO required, the more costly the solution, and the cost tends to rise steeply as the targets approach zero (an inverse relationship).

In project management terms, the RTO of a full system recovery is determined by the critical path of recovering its services (which in turn depends heavily on the system module with the longest RTO). And because most data is useless without its proper context, the weakest RPO in the system usually reflects the RPO of the system as a whole (a series relationship).
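The critical-path idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the hour figures, and the dependency map are all hypothetical, and it assumes a component can begin recovery only once everything it depends on is back, while independent components recover in parallel.

```python
from functools import lru_cache


def system_rto(rto_hours, depends_on):
    """Length of the longest (critical) recovery path, in hours.

    rto_hours:  component name -> hours to recover that component alone
    depends_on: component name -> components that must be recovered first
    """
    @lru_cache(maxsize=None)
    def finish(name):
        # A component finishes after its slowest dependency plus its own RTO.
        deps = depends_on.get(name, ())
        return rto_hours[name] + max((finish(d) for d in deps), default=0)

    return max(finish(name) for name in rto_hours)


def system_rpo(rpo_hours):
    """The weakest (largest) component RPO bounds the whole system."""
    return max(rpo_hours.values())


# Hypothetical system: the application needs the database, which needs storage.
rto_hours = {"storage": 4, "database": 6, "application": 2, "email": 3}
depends_on = {"database": ("storage",), "application": ("database",)}
rpo_hours = {"storage": 1, "database": 24, "application": 4, "email": 12}

overall_rto = system_rto(rto_hours, depends_on)   # storage -> database -> application
overall_rpo = system_rpo(rpo_hours)
```

With these made-up numbers the critical path is storage, then database, then application, for 12 hours in total, even though no single component takes longer than 6 hours. The system RPO is 24 hours, set by the database, which has the weakest recovery point.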

Email example: A consultant backs up their email locally on their laptop every month, and their office mail server experiences an outage for 3 hours. The RPO in this scenario is one month (the copy on their laptop can be up to a month out of date) and the RTO is however long it takes the IT staff to restore email service (3 hours).
