How to Plan for Data Center Outages
First and foremost, when talking about data center outages, it’s important to clarify exactly what we mean, as there may be some confusion. An “outage”, as defined here, is the loss of computing for any part or function of the data center. Anything short of that can be labeled an “incident”, which may include a save. In this article, I’ll be discussing outages specifically: how to prevent them and, in the event that one occurs, how to handle the situation.
Prevention or mitigation of an outage or incident is best accomplished via a High Reliability Organization. A High Reliability Organization (HRO) is defined as an organization that has succeeded in avoiding catastrophes in an environment where normal accidents (or failures) can be expected due to risk factors and complexity. You may say, “Hey, I don’t have high risk or complexity in my data center!” Well, I encourage you to think again, because you most certainly do. What is the price of a downtime event in your data center, to you and your customers? Be assured that the social or political ramifications can have a bigger impact than the immediate financial loss.
HRO is a key ingredient of your data center’s governance. Governance is simply how the site is managed, how to protect the overall data center investment and how to achieve continuous availability goals.
Exactly what is an HRO? Three characteristics best define it:
- A Fierce Commitment to a Common Objective (Availability)
- Preoccupation with Failure (What is the next thing waiting to fail?)
- Unparalleled Attention to Detail (How will my next action affect the status of the computer room?)
What to Do When the Outage Happens
Key to managing the outage or incident is a prebuilt Incident Management program. Such a program identifies who does what, how, why and when. For example, it is imperative to ensure that effective communications occur during such an event. You don’t want everybody (clients, colocators, etc.) all calling the same person who is trying to get the place back running properly. Instead, assign someone to be the “Incident Manager” who handles all communications. This person is responsible for keeping everyone updated – perhaps through a conference bridge – and simultaneously communicating with those who are working the event.
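The "who does what" part of such a program can be written down ahead of time. The following is only a minimal sketch, not a prescribed tool: the `IncidentPlan` class, its field names, and the sample names in the usage note are all hypothetical, chosen to illustrate the idea that every inquiry goes to the Incident Manager while updates go out to everyone from one place.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IncidentPlan:
    """Hypothetical roster for an incident (all names are illustrative)."""

    incident_manager: str  # single point of contact for all communications
    responders: list[str] = field(default_factory=list)    # people working the event
    stakeholders: list[str] = field(default_factory=list)  # clients, colocators, etc.
    updates: list[str] = field(default_factory=list)       # log of updates sent so far

    def route_inquiry(self, caller: str) -> str:
        """Return who should answer `caller`.

        Every inquiry goes to the incident manager, so the people
        restoring service are not interrupted.
        """
        return self.incident_manager

    def broadcast(self, message: str) -> list[tuple[str, str]]:
        """Record an update and return one (recipient, message) pair per person."""
        self.updates.append(message)
        recipients = self.responders + self.stakeholders
        return [(person, message) for person in recipients]
```

For example, `IncidentPlan("alice", ["bob"], ["client-a"]).route_inquiry("client-a")` returns `"alice"`, so the client is never pointed at the person who is trying to get the site running again.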
The second most important aspect of managing the event is to get the site stable and then STOP! Leave the root cause analysis to a team that can take the time to dig into the problem. Many times, what initially seems to be the root of the problem is not actually the cause of the outage or incident. In fact, trying to troubleshoot further may actually exacerbate the problem.
Finally, determine lessons learned from the root cause analysis. What should change? Allow your analysis team to thoroughly do their job and then do yours. Once the problem is tracked to the root cause, perhaps equipment failure or a bad maintenance procedure, make the necessary changes. Be honest about what needs to be fixed. This is important!
Outages or incidents can and should be expected. Be prepared!