Tuesday, May 19, 2009

Business Continuity Series, pt 2 - Parallel versus Serial Failure and Resiliency

Before we can delve into the world of business continuity, we need to understand the underlying logic of systems design and the probability mechanics of describing failure. Taking a systems approach to redundancy planning, let's look at the mathematics behind failure probabilities of parallel systems and systems in series.

First let's look at a system in series:
The system above contains three modules in series, each with an 80% success rate. Each is independent of the others. The success rate of the system is the joint probability that all three modules succeed (an intersection, not a union); in other words, for this system to work, you must traverse all three modules successfully. Because the modules are independent, the probability of success is the product of the individual success rates:

Success = 80% x 80% x 80% = 51.2%
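
If you want to play with the numbers yourself, here is a minimal Python sketch of the same arithmetic. The function name series_success is my own invention for illustration, and it assumes the modules fail independently, just as in the example above.

```python
from math import prod

def series_success(rates):
    """Chance that a chain of independent modules all work.

    rates: success probabilities, one per module (hypothetical helper,
    assumes independence between modules).
    """
    return prod(rates)

# Three modules in series, each with an 80% success rate
print(f"{series_success([0.8, 0.8, 0.8]):.1%}")  # 51.2%
```

Try adding a fourth or fifth module to the list and watch how quickly the overall success rate drops.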

Look familiar? It should. This is the exact same model I used for my posts about the failure of communication between organizational levels and why smart people say stupid things, with CEOs on the left and mid-level managers on the right.

Note that even though each individual module has a fairly high success rate (80%), each additional module is another potential failure, and those failures compound to drag down the overall success rate of the system. In series, all modules have to work in order for the system to work. This means that every module in a series system is a single point of failure. If any one point in the process goes down, the whole system shuts down.

In human resources planning or even individual career development, being irreplaceable is identical to being a single point of failure.

Next let's look at a system in parallel:
The assumption here is that each module is interchangeable with any other. That is to say, if one module fails, the other modules will pick up the slack. Here each module is fairly mediocre, with a 60% success rate (or a 40% failure rate). However, for the system to fail, all three modules have to fail simultaneously. Because the modules are independent, the probability of that happening is the product of the individual failure rates:

Failure = 40% x 40% x 40% = 6.4%
Success = 1 - Failure = 93.6%
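
The same calculation as a minimal Python sketch. Again, parallel_success is just a name I made up for illustration, and it assumes the modules are independent and fully interchangeable, as described above.

```python
from math import prod

def parallel_success(rates):
    """Chance that at least one of several independent modules works.

    rates: success probabilities, one per module (hypothetical helper,
    assumes independence and that any one module can do the whole job).
    """
    # The system fails only if every module fails at the same time
    failure = prod(1 - r for r in rates)
    return 1 - failure

# Three mediocre modules in parallel, each with a 60% success rate
print(f"{parallel_success([0.6, 0.6, 0.6]):.1%}")  # 93.6%
```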

Notice that even though each individual component is not of particularly good quality, the components collectively cover for each other in the event of individual failures.

This model is analogous to electrical circuits (and the idea of resistance and conductance):
  • Modules are equivalent to resistors (viewed from the perspective of conductance, where conductance is a process channel).
  • Electrical current is work done.
  • Voltage differential is potential work waiting to be done.
Remember that the formulas for electrical circuits are analogous to those of fluid mechanics (if you come from a chemical or mechanical engineering background and feel more comfortable with those terms).
  • Modules are pipes
  • Water flow is work done
  • Pressure is potential work
As these analogies suggest, there are also problems associated with capacity. Although an individual failure might not shut down a system with parallel components, if the system as a whole is operating at 90% capacity, the loss of one third of its capacity is still a serious problem: the remaining modules are now being asked to handle 135% of what they can process. The system is over capacity, and this will manifest in a variety of ways:
  • Unstable queue growth (work is coming in faster than you can process it)
  • Large (and growing) delay times (backlog)
  • Mechanical failures / server crashes / employee sickness (overworked)
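To make the capacity problem concrete, here is a rough Python sketch. All of the numbers are hypothetical and chosen only to match the 90% example above: three modules that can each process 10 units of work per hour, with 27 units arriving per hour. The function names are mine, and the model is deliberately crude (steady arrivals, no randomness).

```python
def utilization(arrival_rate, module_rate, working_modules):
    """Fraction of available capacity in use (over 1.0 means overloaded)."""
    return arrival_rate / (module_rate * working_modules)

def backlog_after(hours, arrival_rate, module_rate, working_modules):
    """Unprocessed work left after a number of hours of steady arrivals."""
    capacity = module_rate * working_modules
    backlog = 0.0
    for _ in range(hours):
        # Work arrives, the working modules clear what they can,
        # and anything left over carries into the next hour
        backlog = max(0.0, backlog + arrival_rate - capacity)
    return backlog

# Hypothetical figures: 27 units/hour arriving, 10 units/hour per module
print(f"{utilization(27, 10, 3):.0%}")  # all three modules up: 90%
print(f"{utilization(27, 10, 2):.0%}")  # one module down: 135%
print(backlog_after(8, 27, 10, 2))      # backlog after an 8-hour day: 56.0
```

With all three modules working the backlog stays at zero. Lose one and it grows by 7 units every hour, for as long as the failure lasts.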
In the next section, we will look at the goals of continuity planning, how to set goals and understand how to measure performance in an environment where an anticipated failure has occurred.
