Airflow Management in Colocation and Cloud: Who is Responsible?
How quick were you to answer that question? Is it the tenant? Is it the colocation or cloud provider? Or, does it all need to be spelled out in an SLA? If you picked one of those responses, you’re not wrong. But you may need some clarification.
Over the past few years I’ve seen easy-to-fix efficiency and even resiliency issues within data center environmental management go unnoticed. Will some of these issues cause an outage? No, probably not. But, will a missing blanking panel or an improperly sized air handler cause other issues? Most likely it will.
At a recent Data Center Dynamics conference in San Francisco, I was involved in very lively conversations around data center management and responsibility. Just because airflow management (AFM) is spelled out in an SLA doesn’t mean it’s actually designed properly. Sure, the data center facility itself is operating well, but who is in charge of your racks? Have you really done a good job creating an efficient airflow architecture for your data center?
There are Different Kinds of Data Center Partners
I’ve learned quickly that there are a lot of data centers out there. Many offer unique services like data migration or backup, while others position themselves as leading hyperscale cloud providers. The point is that none of them are built the same, and each could have its own unique management structure. I’ve seen data center partners that are super hands-on during the entire migration and engineering process. Others simply send a security guard along with you to your cage and let you work. They’ll provide the space, power, and cooling. But the rack design and buildout are all up to you.
It’s a Joint Responsibility
The big caveat is that the customer or tenant needs to remain constantly vigilant, or be sure to work with a partner that has efficiency and design in their DNA. It’s important to remember that not everything is always within your control. In 2017, the Azure Cloud in Japan experienced a massive outage. What happened?
The design of the cooling system and the power distribution system had typical redundancy built in for backup. The cooling system was N+1, meaning there was an extra cooling unit available in case one failed. The power distribution system was running at N+2, but one UPS in the parallel N+2 lineup failed and power was cut off to the entire cooling system in the data center.
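To make the N+1 and N+2 notation concrete, here is a minimal Python sketch. The unit counts are hypothetical and are not taken from the incident report; the class and method names are made up for illustration. It only shows the arithmetic: a group with spares tolerates that many independent unit failures, but spares do not help if every unit loses its upstream power at once.

```python
from dataclasses import dataclass


@dataclass
class RedundantGroup:
    """A group of identical units sized as N + spares (hypothetical model)."""
    name: str
    required: int  # N: units needed to carry the full load
    spares: int    # the "+1" or "+2" in N+1 / N+2

    @property
    def installed(self) -> int:
        return self.required + self.spares

    def carries_load(self, failed_units: int) -> bool:
        """True if the remaining units can still carry the full load."""
        return self.installed - failed_units >= self.required


# Hypothetical sizing: 4 cooling units needed, 1 spare (N+1).
cooling = RedundantGroup("cooling", required=4, spares=1)

print(cooling.carries_load(1))                  # True: the spare covers one failure
print(cooling.carries_load(2))                  # False: two failures exceed N+1
print(cooling.carries_load(cooling.installed))  # False: all units lost power together
```

The last line is the case that matters here: when power to the whole cooling system is cut, every unit counts as failed, so the cooling redundancy level no longer applies.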
From there, a long list of services was impacted. This includes both storage and virtual machines, along with many more cloud services, such as Web Apps, Backup, HDInsight, Key Vault, and Site Recovery. Issues included unavailability of virtual machines and VM reboots.
“Engineers have identified the underlying cause as loss of cooling which caused some resources to undergo an automated shutdown to avoid overheating and ensure data integrity and resilience,” the Microsoft Azure service status page stated shortly after the outage.
It’s important to note that the data center is managed by a third-party vendor, not Microsoft.
A Best Practice Approach
Outages will happen. And, unfortunately, even with an N+2 design, you might have faults. However, never become complacent with your design just because ‘it works.’ Here’s a small list of best practices to help you share the responsibility of a good data center design:
- Review your contracts and SLAs. Know where your responsibility ends and where your partner’s role starts. It’s very important to have a clear delineation of responsibility.
- Communicate! If you’re not sure, ask! Whether in the cloud or colocation, you need to have regular meetings and conversations with your data center and facilities teams.
- Treat AFM and environmental management as a science. Designs change, racks evolve, and things need to be updated. If you’re planning on putting a new converged infrastructure solution into your racks, take a second to think about how that’ll impact your overall design. Do you need another blanking panel? If this is a major rollout, should you do a computational fluid dynamics study? Are there areas where you can improve your design?
- Leverage good partners. Don’t just leverage them; challenge them as well. Make sure they can meet your standards and grow with you. These are the kinds of partnerships that can drive meaningful changes in your data center, which will improve efficiency and have a positive impact on your facility’s bottom line.
- Look for areas of improvement. Your design is ever-evolving. This means you need to review your infrastructure to see where you can increase cooling capacity, identify isolated data center airflow issues, optimize cooling airflow, reduce bypass air leakage, and even leverage better tools required for airflow and data center environmental management.
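The blanking panel question from the list above can be turned into a simple, repeatable check. The following is a minimal Python sketch, assuming you keep a record of where equipment sits in each rack; the function names and the sample rack layout are hypothetical. It lists the open rack units that would need blanking panels to limit bypass air.

```python
def open_rack_units(rack_height_u, equipment):
    """Return the rack unit positions not covered by any equipment.

    equipment is a list of (start_u, height_u) tuples, where start_u is
    the lowest rack unit the device occupies (1-based).
    """
    occupied = set()
    for start_u, height_u in equipment:
        occupied.update(range(start_u, start_u + height_u))
    return [u for u in range(1, rack_height_u + 1) if u not in occupied]


def group_gaps(units):
    """Collapse a sorted list of open units into (first_u, last_u) gaps."""
    gaps = []
    for u in units:
        if gaps and u == gaps[-1][1] + 1:
            gaps[-1] = (gaps[-1][0], u)  # extend the current gap
        else:
            gaps.append((u, u))          # start a new gap
    return gaps


# Hypothetical 12U rack: a 2U device at U1, a 4U device at U5, a 1U device at U10.
layout = [(1, 2), (5, 4), (10, 1)]
open_units = open_rack_units(12, layout)
gaps = group_gaps(open_units)

print(open_units)  # [3, 4, 9, 11, 12]
print(gaps)        # [(3, 4), (9, 9), (11, 12)]
```

Rerunning a check like this whenever gear is added or removed is one way to keep the rack record and the physical blanking panels in step.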
One final note. If you’re operating an enterprise data center on your own, be sure to get help when needed. Far too often I see things fall through the cracks as the business tries to do what it’s best at. And, often it’s not running a data center. In fact, this is a big reason that enterprises want to work with cloud or colocation providers. They want to focus on their business while allowing the professionals to work with their data centers. Even in those cases, you’re not just handing over the keys to your kingdom. As a good tenant, you’re going to have to regularly check your gear and work with your partners to ensure everything isn’t just working, but working optimally.
Finger pointing starts very quickly when there’s a sizeable, and measurable, outage. No one wants to be blamed for losing millions of dollars because critical systems went down. However, my experience has shown me that the most successful and resilient platforms are the ones that proactively share the responsibility around design, maintenance, and deployment of new solutions. A tenant and partner relationship is really the key to ensuring you have the most uptime and the most efficient architecture.
Airflow Management Awareness Month 2019
Industry Analyst | Board Advisory Member | Writer/Blogger/Speaker | Contributing Editor | Executive | Millennial
Bill Kleyman is an award-winning data center, cloud, and digital infrastructure leader. He was ranked globally by an Onalytica Study as one of the leading executives in cloud computing and data security. He has spent more than 15 years specializing in the cybersecurity, virtualization, cloud, and data center industry. As an award-winning technologist, his most recent efforts with the Infrastructure Masons were recognized when he received the 2020 IM100 Award and the 2021 iMasons Education Champion Award for his work with numerous HBCUs and for helping diversify the digital infrastructure talent pool.
As an industry analyst, speaker, and author, Bill helps digital infrastructure teams develop new ways to impact data center design, cloud architecture, security models (both physical and software), and how they work with new and emerging technologies.