Airflow Management Considerations for a New Data Center: Part 6: Server Reliability versus Inlet Temperature
Airflow management considerations will inform the degree to which we can take advantage of our excellent airflow management practices to drive down the operating cost of our data center. In previous installments of this seven-part series, I demonstrated that data centers could be run warmer than conventional wisdom would suggest before increased server fan energy reversed mechanical plant savings, before server performance was adversely affected, and before server price premiums consumed mechanical plant savings. I then suggested chiller-free data centers are much more realistic than conventional wisdom might purport and provided evidence that ICT equipment OEMs generally allow for wider humidity ranges than mainstream standards and industry guidelines. The first five parts of this series provided evidence from manufacturers’ product information, independent lab research results and math models that together make a rather compelling argument for the efficacy of designing, building and operating data centers without chiller plants or refrigerant cooling. Today we will look at the reliability of our ICT equipment in these erstwhile hostile environments.
Clearly, there is some temperature limit after which our servers will start suffering from higher failure rates; otherwise, why would all the manufacturers’ user documentation and various industry standards set upper and lower limits on the temperature of air entering that equipment? Over time, that envelope has expanded, and we were left to wonder what the definition of a short period might be when ASHRAE TC9.9 first created a category of allowable temperature limits: was that measured in minutes, hours or days, and would violations produce catastrophic meltdowns or some accelerated rate of planned failures? Six years ago we got our answer,1 but for some reason, that breakthrough has not yet reached the ho-hum of old news. The breakthrough came when the nineteen major ICT OEMs on the ASHRAE TC9.9 IT Subcommittee figured out how to open their kimonos without giving away everything. Everyone in the business of designing, building and servicing ICT equipment has some kind of database from their warranty experience on equipment failures, and most of that data includes internal footprints on conditions surrounding the failures, including temperature. It would not have been prudent for those open kimonos to make semi-public a statement like “We had 14% failures within eighteen months on platform L when operating at 90˚F for over 30% of that period.” You don’t open your kimono and hand your competition a camera. What they decided they could reveal was that at 68˚F their equipment would experience an X failure rate, something their regular customer base would be able to peg from experience. They could then share their actual experience that at some specific temperature below 68˚F their equipment could be expected to fail at 90% of X (0.9X), and that at some specific temperature above 68˚F it could be expected to fail at 1.15X, or whatever their history showed.
The results of this exercise are summarized in Table 1 below, wherein the baseline is identified as forecasted equipment failures at 68˚F, with variations from that baseline at temperatures above and below the baseline for above average servers (lower bound), average servers, and below average servers (upper bound).
Table 1: Relative Failure Rate X-Factor
The motivation for exploring these limits and thresholds is to determine if a case can be made for designing and operating a data center without a chiller or any refrigerant cooling. As such, we understand that our supply temperature will not be a constant set point but will rather use some form of free cooling to follow Mother Nature, within some reasonable bounds. For example, if we are using air-side economization in Minneapolis or Fargo or Cheyenne, we will capture enough return air in a recirculation mixing box to keep our minimum temperature above some predetermined level during the winter. For the Chicago and Boise examples discussed below, we will not let our minimum server inlet temperature slip below 59˚F, equivalent to the lower allowable boundary for Class A2 servers. With the release of the “X” Factor, ASHRAE presented a case study for Chicago where, because of the number of hours per year under 68˚F (wherein the data center would operate between 59˚F and 68˚F with free cooling and no chiller installed or operating), the server reliability would actually improve by 3% over the 68˚F baseline, as summarized in Table 2 below.
Table 2: Net “X” Factor = 0.97
|Inlet Temperature|The “X” factor|% of Hours|
|59˚F ≤ T ≤ 68˚F|0.865|72.45%|
|68˚F ≤ T ≤ 77˚F|1.13|14.63%|
|77˚F ≤ T ≤ 86˚F|1.335|9.47%|
|86˚F ≤ T ≤ 95˚F|1.482|3.45%|
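For readers who want to check the arithmetic, the net factor in Table 2 is simply an hours-weighted average of the per-bin factors. The sketch below is my own illustration rather than anything published by ASHRAE; the function name `net_x_factor` and the list layout are assumptions, and the only data in it are the Table 2 values.

```python
# (x_factor, share_of_hours) for each inlet temperature bin, copied from Table 2.
CHICAGO_BINS = [
    (0.865, 0.7245),  # 59-68 F
    (1.13, 0.1463),   # 68-77 F
    (1.335, 0.0947),  # 77-86 F
    (1.482, 0.0345),  # 86-95 F
]


def net_x_factor(bins):
    """Return the hours-weighted average X factor.

    Shares are normalized by their sum, so they may be given as
    fractions, percentages, or raw hour counts.
    """
    total_share = sum(share for _, share in bins)
    return sum(x * share for x, share in bins) / total_share


print(round(net_x_factor(CHICAGO_BINS), 2))  # 0.97
```

Because the shares are normalized, the same function works if you feed it hours per bin from your own weather data instead of percentages.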
I have previously developed a similar case study for a data center in Boise that illustrates in more detail how the actual factor is calculated. The calculation process is rolled up in Table 3.
The total factored hours equal 7956; dividing by the 8760 hours in a year produces an IT reliability forecast estimate of 91%, which means that operating the data center at these temperatures would yield a 9% reliability improvement over running the data center all year at 68˚F. Again, besides saving on both the capital and operational expense of some form of mechanical cooling, this data center would see fewer equipment failures than a data center running 24/7 with a 68˚F server inlet temperature. For some of my readers, I suspect that all the previous discussion has been nothing more than a rehash of the standard “X” Factor analytics. I hope you have stuck around, though, because this simple tool can be applied well beyond straightforward comparisons to the 68˚F baseline.
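The Boise roll-up reduces to one division, sketched here using only the two totals quoted above. The function name `reliability_vs_baseline` is hypothetical, and the per-bin hours behind the 7956 figure are not reproduced.

```python
HOURS_PER_YEAR = 8760


def reliability_vs_baseline(factored_hours, total_hours=HOURS_PER_YEAR):
    """Return (net X factor, percent improvement over the 68 F baseline).

    factored_hours is the sum over all bins of hours * X factor.
    A negative improvement would mean more failures than the baseline.
    """
    x = factored_hours / total_hours
    return x, (1.0 - x) * 100.0


x, improvement = reliability_vs_baseline(7956)
print(f"{x:.2f} {improvement:.0f}%")  # 0.91 9%
```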
A project I worked on recently offers a glimpse of different practical applications for the X Factor. In this data center, there were twenty-seven temperature sensors more or less strategically located around the floor, and they had collected data with readings outside the intended range of 68˚F up to 80.6˚F, missing on both the over and the under. The site manager was concerned about how these discrepancies might have affected the reliability of his servers. The sensors were recording readings every ten minutes. For my first pass, I just looked at a period slightly longer than one month that represented the period with the highest count of sensor readings above the desired maximum. The compilation from the total available 20,136 hours (sensors X lines X 6) is captured in Table 4, with a resulting X factor total of 21,671 (hours X “X” Factor). Assuming a 68˚F baseline, we would, therefore, conclude our temperatures had produced some increase in server failures. However, rather than a theoretical baseline of 68˚F, this space had an actual baseline of 68-80.6˚F.
If we equally divide the 20,136 hours over the four bins inside that 68-80.6˚F range, the baseline X factor would actually be 27,710, or an X-factor ratio of 1.38, or 28% higher than the actual ratio based on measured sensor data. Therefore, instead of the actual twelve server failures they experienced in twelve months out of a population of over 1090 available servers, they could have expected to see fifteen failures if they had operated within their design-intent temperature range all year. Obviously this methodology lacks absolute precision, but in this case, where the temperature sensor data would be surrogates for inlet temperatures in either the intended environment scenario or the actual environment scenario, it can provide a useful relative order of magnitude and some reassurance.
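The comparison in this example can be retraced with a few lines of arithmetic. This is a minimal sketch using only the totals quoted in the text; the variable names are mine, and the per-bin detail from Table 4 is not reconstructed.

```python
TOTAL_SENSOR_HOURS = 20_136     # total available hours in the sample
ACTUAL_FACTORED_HOURS = 21_671  # hours * X factor from measured readings
DESIGN_FACTORED_HOURS = 27_710  # same hours spread evenly over 68-80.6 F
ACTUAL_FAILURES = 12            # failures experienced in twelve months

# X-factor ratios relative to a constant 68 F baseline
actual_x = ACTUAL_FACTORED_HOURS / TOTAL_SENSOR_HOURS  # about 1.08
design_x = DESIGN_FACTORED_HOURS / TOTAL_SENSOR_HOURS  # about 1.38

# How much worse the design-intent range scores than what was measured
design_vs_actual = design_x / actual_x                 # about 1.28

# Scale the observed failures to the design-intent scenario
expected_design_failures = ACTUAL_FAILURES * design_vs_actual

print(f"{design_x:.2f} {design_vs_actual:.2f} {expected_design_failures:.0f}")
# 1.38 1.28 15
```

Note that the total hours cancel out of the final ratio, so the comparison depends only on the two factored-hour totals.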
Finally, all the examples I have discussed have conveniently favored the practice of allowing some cooler free cooling temperatures to compensate for occasional excursions into higher temperature excesses. That is obviously not always going to be the case. For example, ASHRAE’s introduction of the X Factor identifies cities such as San Francisco, Seattle, Boston, Denver, Los Angeles and Chicago as locations where chiller-free data centers would have improved server reliability over the 24/7 68˚F baseline, whereas cities such as Houston, Dallas, Phoenix and Miami would see X factors over 1.0, and over 1.2 in the case of Phoenix and Miami.5 Does that mean some geographic areas are off the table for consideration of chiller-free facilities? Yes, but maybe not as many as you might think. For example, consider the impact of a finding that said you would experience a 20% increase in server failures if you built a data center with no mechanical cooling in a particular location. What does that number actually mean? If your experience with 4000 servers has been that you would normally see 10 failures in a year operating at 68˚F, that number would then increase by 2 server failures out of 4000, and you would want to assess that exposure against the savings associated with more free cooling hours or, perhaps, of not including a chiller in the design and construction of your new data center. Likewise, if your normal experience suggests that you would expect three failures in a year operating at 68˚F 24/7, then a 20% increase in failures would take nearly two years to produce an additional failure. That is a business decision.
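To put numbers on that business decision for your own site, the two scenarios above can be sketched as follows. The function names are hypothetical, and the sketch assumes failures scale linearly with the net X factor, which is how the X factor is applied throughout this article.

```python
def added_failures_per_year(baseline_failures, x_factor):
    """Extra failures per year versus the 68 F baseline failure count."""
    return baseline_failures * (x_factor - 1.0)


def years_per_added_failure(baseline_failures, x_factor):
    """Average years before the higher X factor costs one extra failure."""
    added = added_failures_per_year(baseline_failures, x_factor)
    return float("inf") if added <= 0 else 1.0 / added


# 10 failures a year at 68 F, 20% increase: two more failures a year
print(round(added_failures_per_year(10, 1.2), 1))  # 2.0

# 3 failures a year at 68 F, 20% increase: one more failure every ~1.7 years
print(round(years_per_added_failure(3, 1.2), 1))   # 1.7
```

An X factor at or below 1.0 returns infinity from the second function, meaning the site never accrues an extra failure relative to the baseline.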
It goes without saying, or it would except I’m saying it: Your mileage may of course vary, but it will be somewhere between good and wonderful when this exercise includes all standard best practices of airflow management. Conversely, if anyone takes this path and is disappointed with the results, the likely culprit is going to be poorly executed airflow management.
1 “2011 Thermal Guidelines for Data Processing Environments – Expanded Data Center Classes and Usage Guidance,” White Paper, ASHRAE TC9.9
2 Thermal Guidelines for Data Processing Environments, 4th Edition, ASHRAE Technical Committee (TC) 9.9, Mission Critical Facilities, Data Centers, Technology Spaces, and Electronic Equipment, 2015, page 30
3 Thermal Guidelines for Data Processing Environments, 4th Edition, ASHRAE Technical Committee (TC) 9.9, Mission Critical Facilities, Data Centers, Technology Spaces, and Electronic Equipment, 2015, page 111
4 “Understanding the Relationship between Uptime and IT Intake Temperatures,” Ian Seaton, Upsite Technologies Blog, November 19, 2014, pages 3-4
5 Thermal Guidelines for Data Processing Environments, 4th Edition, ASHRAE Technical Committee (TC) 9.9, Mission Critical Facilities, Data Centers, Technology Spaces, and Electronic Equipment, 2015, page 31