We Have Met the Enemy
of Site Uptime and He Is Us!
By Kenneth G. Brill
At the recent fall meeting, I presented data showing that 54% of all reported site infrastructure failures were coincident with human activity. It would seem obvious, then, that we are our own worst enemy!

I have since had a chance to reflect on this startling conclusion, and I am increasingly convinced that more procedures and better training will not alone solve the problem. Most companies reporting these human-induced failures already have good procedures in place, and many have made major investments in training. So, in spite of strong preventive measures, why do these failures occur?

We can find some insight in an incident that occurred early in the history of the Network. Three people were involved in taking a multimodule UPS off line — the UPS manufacturer’s service technician, the person reading the checklist, and a person observing what was happening. The service technician skipped ahead of the procedure, and before anyone could stop him, he had opened the system output breaker before closing the maintenance bypass breaker. A critical bus failure resulted. I discussed this incident in great detail with our members, and no one could imagine how something so stupid could happen, yet all agreed it was possible.

Our abnormal incident report (AIR) data says this event was not at all atypical and has since been repeated on other occasions at other sites. Many of you have heard the layman’s definition of insanity as, “continuing to do what you’ve always done, but expecting different results.” I therefore ask you the question, “What is going on when a carefully written procedure is not followed by people who know better?”

After I had pondered and discussed this incident over the last few years, someone recently came up with a tactical solution that would have prevented the specific problem that caused the failure. Quite simply, the fix is to prevent anyone from getting ahead of the procedure. In this case, an additional step in the process would require the technician, the procedure reader, and the reviewer each to sign off on every step as it is completed. This would have slowed down the process and kept the technician from getting ahead. Some people I have talked with would go further and have all three people not only initial the preceding step to confirm it was completed, but also agree on what the next step should be.
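The sign-off rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool used by any Network member: the class name `GatedProcedure`, the role labels, and the two breaker steps (taken from the incident described earlier) are assumptions made for the example. The point it shows is that no one can record progress on a later step until all three people have signed the current one.

```python
from dataclasses import dataclass, field

# Hypothetical role labels for the three people in the room.
ROLES = ("technician", "reader", "reviewer")


@dataclass
class Step:
    description: str
    signoffs: set = field(default_factory=set)

    def is_complete(self) -> bool:
        # A step is complete only when every role has signed it.
        return self.signoffs == set(ROLES)


class GatedProcedure:
    """Checklist in which step N+1 stays locked until step N is fully signed."""

    def __init__(self, descriptions):
        self.steps = [Step(d) for d in descriptions]
        self.current = 0  # index of the only step that may be signed right now

    def sign_off(self, step_index: int, role: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if self.current >= len(self.steps):
            raise RuntimeError("procedure is already finished")
        if step_index != self.current:
            # This is the check that stops anyone from skipping ahead.
            raise RuntimeError(
                f"step {step_index} is locked; step {self.current} is not signed off"
            )
        step = self.steps[step_index]
        step.signoffs.add(role)
        if step.is_complete():
            self.current += 1


if __name__ == "__main__":
    procedure = GatedProcedure(
        ["Close maintenance bypass breaker", "Open system output breaker"]
    )
    try:
        procedure.sign_off(1, "technician")  # technician tries to skip ahead
    except RuntimeError as err:
        print("Blocked:", err)
    for role in ROLES:
        procedure.sign_off(0, role)
    print("Next step:", procedure.steps[procedure.current].description)
```

A sketch like this enforces only the order of signatures. It cannot stop a hand from reaching the wrong breaker, which is why the sign-off is a tactical fix and not a systemic one.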

However, initialing such a procedure is not a systemic answer to the underlying problem. The reason this site paid to bring in an outside service technician was that they didn’t have confidence in their internal staff and wanted to transfer any possible blame for failure to an outsider. The team went through the motions of a risk management process, but ultimately depended upon an outside expert to perform the one step that was guaranteed to cause a critical bus failure if anything was done incorrectly. What does this thought process say about the internal staff’s ability to deal with a power system malfunction when a service technician is not available? It is one of several examples in the AIR database where management sadly discovered that an outside vendor has no skin in the game when something goes wrong. I believe higher management bought into a feel-good process that, in the aftermath of the failure, was revealed to be hollow and without substance.

I also believe the underlying issue is not with the people doing the work, but rather with management’s low level of commitment to doing the job as it should be done. At a meeting I recently attended, Tom Martin of Air Canada explained how their site infrastructure outage rate had been reduced from once every eleven months to once every four years. And they accomplished this impressive magnitude of improvement despite an infrastructure upgrade program that presented a significantly higher opportunity for failures. Having been a pilot, Tom took a lesson from his flight training whereby pilots keep a takeoff and landing checklist with them in the cockpit. In conjunction with the facility’s designers, Air Canada developed checklist procedures for critical site infrastructure activities. Each time a facility modification is made, the procedures are reviewed and updated in collaboration with the facility’s designers. Critical activities are scheduled through the IT change management system. A rehearsal with all the players is conducted several days prior to every critical activity, and a manager is present even if the critical activity is performed in the middle of the night. And a postmortem is conducted after completion of the activity to discuss any lessons learned and to document any changes which need to be incorporated into future activities.
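To put the Air Canada figures above in perspective, the short calculation below converts the two outage intervals into annual rates. It uses only the numbers quoted in the text (one outage every eleven months, then one every four years). The variable names are illustrative, and the sketch assumes those intervals are long-run averages.

```python
MONTHS_PER_YEAR = 12

# Intervals between outages, as quoted in the text.
before_interval_months = 11
after_interval_months = 4 * MONTHS_PER_YEAR  # four years = 48 months

# Convert each interval into an expected number of outages per year.
before_rate = MONTHS_PER_YEAR / before_interval_months
after_rate = MONTHS_PER_YEAR / after_interval_months

# How many times longer the site now runs between outages (48 / 11).
improvement_factor = after_interval_months / before_interval_months

# Percentage drop in the annual outage rate.
reduction_pct = (1 - after_rate / before_rate) * 100

print(f"Outages per year before: {before_rate:.2f}")
print(f"Outages per year after:  {after_rate:.2f}")
print(f"Improvement factor:      {improvement_factor:.1f}x")
print(f"Reduction in outage rate: {reduction_pct:.0f}%")
```

On these figures the interval between outages grew by a factor of roughly 4.4, which corresponds to a reduction of about 77% in the annual outage rate.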

What this process demonstrates is commitment, especially when a senior manager must be present when the work is performed. It also demonstrates how simple steps can produce dramatic and totally predictable results. I suspect that is why every military and commercial airplane in the world uses checklists. If the results are so predictable and easy, why don’t we all implement a similar set of processes for critical site infrastructure activities?

My best explanation for this contradiction goes back to the insanity definition. By human nature, we continue to do what we have always done because familiarity is comfortable. Doing something differently is uncomfortable and involves an element of personal risk. One approach could be to claim the Network’s data is invalid, and therefore we don’t have to change. But the data holds across four years of history and multiple sites, and the obvious conclusion is that perfect procedures and training will not by themselves significantly reduce the 54% of failures that are currently coincident with human activity. To be truly effective, management must actually talk the talk and walk the walk of implementing the steps required to achieve activity reliability. And as a result, the people performing the activity must feel both empowered and accountable for making risk management processes work.

Empowering people and trusting them to protect the business are counter to all the downsizing and outsourcing we have undergone in recent years. I anticipate these ideas will be extremely difficult to get across because they involve organizational changes in compensation, a shift from short-term to longer-term management strategies, the potential for having to increase staffing, actions to reduce staff turnover, changing training processes, certifying staff competencies, creating partnerships with vendors, changes in procurement practices to get the best result instead of the lowest price, and most importantly, empowering the people closest to the activity to make decisions.

These are tough and uncomfortable changes, however, and they fall outside the comfort zone of most people (and their management). But what is the consequence of a site infrastructure failure for your facility or company? If the downtime consequences are not great, do nothing and don’t worry too much about your future. If the consequences are great, inaction probably means you will eventually be at risk of being replaced as a result of a site activity failure. The reality is rather clear. If you are not part of the solution, you will probably be seen as part of the problem.

It is sometimes useful to study other industries to see how they have handled similar issues. The military is a very hierarchical organization, but its traditional approach didn’t work in missile launch silos, which had lots of complex electronic and mechanical systems. When it came to achieving the military’s uptime objectives, the lowest-ranking technician often knew much more than the officers. For this particular job function, the Air Force evolved a more level form of organization. Outside the silo, the normal rules applied. However, within the silo, special personnel empowerment rules applied which gave the lowest-ranking technician the authority necessary to assure launch availability. Perhaps the same thing could happen in data center facility organizations.

I realize that I am preaching to the choir, and my message about the necessity of top-down and bottom-up commitment is not news. But sometimes it is useful when an outsider like The Uptime Institute says what may already be obvious. For many years I have emphasized the fact that we are in the process of creating the profession of site uptime. We have a fundamentally different mission and therefore require a different level of empowerment than is appropriate for office buildings and other facilities. As true professionals, we need to tell higher management what is needed to protect the business of our companies. As other professions have learned before us, we need to take positions on the right way to do things so we are not our own worst enemy. The consistent finding reported by our Site Uptime Network® members, that 54% of failures are coincident with human activity, should provide you with all the statistics needed to make your business case for change.