Understanding the Relationship between Uptime and IT Intake Temperatures
The relationship between uptime and IT intake temperatures can be viewed in two broad categories, both of which should be considered in the overall thermal management and airflow management strategies of data center design and operation. One category can be summarized in the label “thermal runaway.” Thermal runaway refers to the rapid increase of temperature during a mechanical system outage or failure, resulting in either rapid shutdowns of IT equipment in response to the elevated temperatures or actual catastrophic failure in equipment that fails to protect itself. The other category can be described as “ASHRAE allowable thresholds.” The ASHRAE TC 9.9 allowable thresholds are distinguished from the recommended environmental envelope, by virtue of both higher and lower temperature limits and a potential impact on IT reliability over time. These two categories are very different, but equally widely misunderstood, frequently resulting in design and operations waste.
A failure of the entire cooling mechanical plant while IT equipment continues to operate on UPS back-up power will result in what we call thermal runaway if cooling is not restored within a few minutes, or even seconds. As the term implies, thermal runaway refers to a very rapid temperature rise that, left unabated, will exceed IT equipment maximum thresholds, leading to shut-down or even failure. A commonly held misconception is that the lower the data center operating temperature is set, the more time there is to recover from a cooling failure and avoid the ultimate conclusion of thermal runaway. Granted, a data center operating with a 65°F supply temperature will take longer to reach a critical shut-down temperature than a data center operating with a 75°F supply temperature. However, that additional time is unlikely to be long enough to bridge to an effective response.
There continues to be debate in the industry, and I’ve heard positions ranging from shut-down temperatures being reached within 5-6 seconds of a cooling shut-down all the way to claims that there is no such thing as thermal runaway. Empirical studies indicate that temperature does rise rapidly after a cooling failure, and that ceiling height, the amount of metal surface area in the data center, hot aisle or chimney containment, and density management mitigate that rise more effectively than a lower supply air temperature setting.
What the Research Says
Based on research reported in a Dell white paper by David Moss titled “Facility Cooling Failure: How Much Time Do You Have?”, at 150 watts per square foot, decreasing the supply temperature from 77°F to 65°F would add about 45 seconds before average IT inlet temperatures hit 95°F. At 250 watts per square foot, the lower temperature would buy an extra 20 seconds; at 350 watts per square foot, 5 seconds; and at 450 watts per square foot, about 2-3 seconds. The Dell paper’s recommendation is that the energy savings from operating at the higher temperature will typically more than offset the cost of putting air movement on UPS, which then increases the thermal ride-through time at 350 watts per square foot from ten seconds to five minutes.
In another study, reported by Kishor Khankari in an ASHRAE transaction titled “Thermal Mass Availability for Cooling Data Centers During Power Shut-Down,” lower density cabinets exhibited better thermal ride-through. For example, a room with 1.5 kW cabinets went well beyond five minutes without approaching either the 95°F or 115°F threshold with either a 65°F or 75°F supply temperature. Cabinets with a 2.5 kW IT load reached the 95°F threshold 3 minutes and 15 seconds after a cooling failure with a 65°F supply and after 1 minute 15 seconds with a 75°F supply; the 115°F threshold was not reached after five minutes. At 5 kW per cabinet, the 95°F threshold was reached in 45 seconds with a 65°F supply and in 25 seconds with a 75°F supply. The 5 kW cabinets took 2 minutes and 5 seconds to reach the 115°F threshold with a 65°F supply temperature and 1 minute 20 seconds with a 75°F supply air temperature.
Of more relevance to today’s newer data centers, Khankari’s study found much shorter ride-through times for higher density data centers. In a room with 10-20 kW cabinets, the 95°F threshold was hit at 35, 37 or 45 seconds, depending on the number of cabinets in the room. These cabinets hit the 115°F threshold between 65 and 80 seconds, unless there were 100 cabinets in the room, in which case ride-through lasted just over three minutes. The implication of the higher density tests appears to be that deploying chimney cabinets or hot aisle containment, with the extra sheet metal for the common hot aisle exhaust duct, provided better thermal ride-through via mass available for thermal absorption than could be achieved by lower set points. Regardless, the data center manager or designer needs to assess what types of failure response activities could be accomplished during the additional seconds that cooler set points might make available, and weigh that against the cost savings of operating at the higher temperatures while running air movement (e.g., CRAH fans) on UPS.
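To make that weighing concrete, here is a minimal Python sketch that uses only the Dell figures quoted above. The names (EXTRA_SECONDS_BY_DENSITY, densities_where_setpoint_helps) and the 30-second and 60-second response times in the usage lines are my own illustrative assumptions, not anything from the Dell paper, and the 450 watts per square foot entry uses the midpoint of the quoted 2-3 second range.

```python
# Extra seconds before average IT inlet temperature reaches 95°F when the
# supply temperature is lowered from 77°F to 65°F, as quoted in this article
# from the Dell white paper. Keys are watts per square foot.
EXTRA_SECONDS_BY_DENSITY = {
    150: 45.0,
    250: 20.0,
    350: 5.0,
    450: 2.5,  # assumption: midpoint of the quoted "2-3 seconds"
}

# Ride-through at 350 W/sq ft, as quoted: ten seconds without air movement
# on UPS versus five minutes with it.
RIDE_THROUGH_350_SECONDS = {"fans_off": 10.0, "fans_on_ups": 300.0}


def densities_where_setpoint_helps(response_seconds):
    """Return the densities at which the lower set point alone buys at
    least `response_seconds` of extra time (hypothetical helper)."""
    return [
        density
        for density, extra in sorted(EXTRA_SECONDS_BY_DENSITY.items())
        if extra >= response_seconds
    ]


def ups_fan_gain_seconds():
    """Extra ride-through from putting air movement on UPS at 350 W/sq ft."""
    return (RIDE_THROUGH_350_SECONDS["fans_on_ups"]
            - RIDE_THROUGH_350_SECONDS["fans_off"])


if __name__ == "__main__":
    # Hypothetical response times, for illustration only.
    print(densities_where_setpoint_helps(30))
    print(densities_where_setpoint_helps(60))
    print(ups_fan_gain_seconds())
```

With these figures, a response that needs 30 seconds is covered by the colder set point only at 150 watts per square foot, and a response that needs a full minute is not covered at any of the listed densities, while fans on UPS add 290 seconds at 350 watts per square foot.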
ASHRAE Allowable Thresholds
The normal operating temperature of the data center also has an impact on IT equipment reliability, and is a source of even more confusion than the thermal ride-through question. ASHRAE TC 9.9 has devised a formula for determining that impact, which they have presented in a white paper and in the 3rd edition of Thermal Guidelines for Data Processing Environments. First, let’s be very clear that you will have a hard time finding anyone who proposes operating a data center all year long at somewhere between 90-95°F, the maximum allowable temperature for class A2 servers. Rather, the concept is to allow the data center temperature to float within the allowable temperature range for whatever class of servers is deployed, following the direction of mother nature, modulated by any approach temperature of whatever economization cooling technology is being used.
To provide guidance to data center operators in determining the effect on IT equipment reliability of operating at these higher temperatures (actually at wider temperature ranges), the IT equipment manufacturers provided the ASHRAE handbook writers with reliability forecasts for different operating temperatures, from which the committee developed their “X” factor tool. Basically, the “X” factor begins with some understood baseline failure rate based on operating all year with a 68°F IT equipment inlet temperature. This failure rate is then adjusted up or down based on variations from this baseline temperature according to the following criteria for Class A2 servers:
Dry Bulb Temperature Average Failure Rate X-Factor
To illustrate how this equipment uptime calculator works, see the following example for Boise, ID. Assuming a standard airside economizer without any approach temperature loss, the annual hours for each temperature group are accumulated and then factored by the reliability forecast factor to determine the effect on IT equipment uptime from allowing the data center operating temperature to follow mother nature.
Boise, ID Uptime Estimate
Temperature (°F)    Hours    “X” Factor    Factored Hours
59                  5460     0.72          3931.20
63.5                572      0.87          497.64
68                  426      1.00          426.00
72.5                504      1.13          569.52
77                  377      1.24          467.48
81.5                437      1.34          585.58
86                  327      1.42          464.34
90.5                271      1.48          401.08
95                  178      1.55          275.90
99.5                165      1.61          265.65
104                 38       1.66          63.08
108.5               5        1.71          8.55
The total factored hours come to 7956. Dividing that by the 8760 hours in a year produces an IT reliability forecast estimate of 91%, which means that operating the data center at these temperatures would yield a 9% reliability improvement over running the data center all year at 68°F. To make that a more meaningful forecast, if we assume a data center with 1000 servers and a normal 99% reliability, or 10 failures per year for whatever cause, this 9% improvement would reduce that failure rate from 10 servers to 9.1 servers. Conversely, if this calculation were done for a data center in a warmer climate and the result of the X-factor calculation was a 10% increase in failures, then a data center of similar size would see one additional failure per year out of 1000 servers. With this data, the data center operator has enough information to make an intelligent decision, weighing the risks and benefits of taking advantage of the allowable temperature ranges for more free cooling hours. In this example, assuming a baseline 1.35 PUE, the expanded free cooling boundary would produce somewhere around $39,000 in annual energy savings (at $0.10 per kWh) to weigh against the cost (or savings) associated with the IT equipment failure rate difference.
I should point out that the X-factor values used in the above example were the average values; the handbook has higher and lower values to cover newer versus older equipment, or high quality versus lower quality equipment. In addition, this same handbook has graphs for calculating server fan energy at the higher inlet temperatures, so the ASHRAE handbook has more information for making more precise estimates.
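The Boise arithmetic can be reproduced in a few lines of Python. This is only a sketch of the hours-weighted averaging described above: the bin data is copied from the Boise table, while the names (BOISE_BINS, net_x_factor, expected_failures) and the 10-failures-per-year baseline for 1000 servers are the illustrative assumptions already used in the text, not part of any ASHRAE tool.

```python
# (bin temperature in °F, annual hours in that bin, average X-factor),
# copied from the Boise, ID table in this article.
BOISE_BINS = [
    (59.0, 5460, 0.72),
    (63.5, 572, 0.87),
    (68.0, 426, 1.00),
    (72.5, 504, 1.13),
    (77.0, 377, 1.24),
    (81.5, 437, 1.34),
    (86.0, 327, 1.42),
    (90.5, 271, 1.48),
    (95.0, 178, 1.55),
    (99.5, 165, 1.61),
    (104.0, 38, 1.66),
    (108.5, 5, 1.71),
]


def factored_hours(bins):
    """Sum of hours in each bin multiplied by that bin's X-factor."""
    return sum(hours * x for _, hours, x in bins)


def net_x_factor(bins):
    """Hours-weighted average X-factor over the whole year."""
    total_hours = sum(hours for _, hours, _ in bins)
    return factored_hours(bins) / total_hours


def expected_failures(baseline_failures_per_year, x_factor):
    """Scale a baseline annual failure count (at a constant 68°F) by the
    net X-factor."""
    return baseline_failures_per_year * x_factor


if __name__ == "__main__":
    x = net_x_factor(BOISE_BINS)
    print(f"Total factored hours: {factored_hours(BOISE_BINS):.0f}")
    print(f"Net X-factor: {x:.2f}")
    # Assumed baseline from the text: 10 failures per year in 1000 servers.
    print(f"Expected failures: {expected_failures(10, x):.1f}")
```

Swapping in the bin hours for another location, or the handbook's higher or lower X-factor values, gives the corresponding estimate for a different climate or equipment class.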
In conclusion, there is a definite relationship between inlet temperature and IT equipment uptime, but not in the way many people conceive of that relationship. Running an extremely cold data center does not offer as much protection against thermal runaway as other strategies such as hot air containment, high ceilings, or putting CRAH fans on UPS. In addition, a data center cooled like a meat locker will not assure better IT equipment uptime than a data center whose temperature fluctuates within the ASHRAE allowable limits, and even very slight improvements may not be cost-justifiable against greatly reducing or altogether eliminating refrigerant cooling in the data center.