Is Liquid Cooling the Way of the Future?
For perspective, and to clarify what we are not discussing, it is useful to step back into data center history a bit and debunk some of the early claims made regarding liquid cooling. First, those early arguments rightly cited the older history of liquid-cooled mainframes that preceded the evolution of file server computing platforms. However, the liquid cooling being promoted 5-10 years ago was not technically liquid cooling at all and has more recently, and more accurately, been referred to as close-coupled cooling.
As a matter of fact, those early liquid cooling systems still relied on convective heat transfer, with air moving across heat sinks inside the servers and carrying that heat to cooling coils attached to the rear doors of cabinets, mounted directly adjacent to cabinets, or located directly above them. The heat was then removed from these cooling coils in basically the same manner as from CRAC/CRAH coils, though with a vastly larger network of plumbing hardware. Interestingly, the proponents of these systems touted the superior cooling capacity of water over air, with claims of superiority ranging anywhere from 60X to 3100X. Besides addressing the liquid cooling that really wasn't liquid cooling, some of this cooling performance rhetoric could also use a little debunking. One difference between water and air is heat capacity. Water has a specific heat capacity of 4184 joules per kilogram per kelvin; air has a specific heat capacity of roughly 700 joules per kilogram per kelvin. Clearly, water is superior in heat capacity, but only by a factor of 5.9X. Another way to slice this pie is to consider cooling capacity as a function of thermal conductivity. Water has a thermal conductivity of 0.6062 watts per meter per kelvin, while air has a thermal conductivity of 0.0262. Clearly, water is a better thermal conductor, but only by a factor of 23X.
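Those two ratios follow directly from the cited material properties; a minimal back-of-envelope sketch:

```python
# Back-of-envelope comparison of water vs. air as a cooling medium,
# using the specific heat and thermal conductivity figures cited above.

WATER_SPECIFIC_HEAT = 4184.0  # J/(kg*K)
AIR_SPECIFIC_HEAT = 700.0     # J/(kg*K), the figure used in the text
WATER_CONDUCTIVITY = 0.6062   # W/(m*K)
AIR_CONDUCTIVITY = 0.0262     # W/(m*K)

heat_capacity_ratio = WATER_SPECIFIC_HEAT / AIR_SPECIFIC_HEAT
conductivity_ratio = WATER_CONDUCTIVITY / AIR_CONDUCTIVITY

print(f"Heat capacity advantage: {heat_capacity_ratio:.1f}X")        # ~5.9-6X
print(f"Thermal conductivity advantage: {conductivity_ratio:.1f}X")  # ~23X
```

Nowhere near the 60X-3100X claims once the units are kept honest.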
One might argue that applying "only" to 5.9X or 23X is trying too hard to make a point. After all, if your salary were 5.9X or 23X what it is today, you probably wouldn't complain that you only got a 490% raise or only got a 2200% raise. Nevertheless, we are not talking about 3100X greater heat removal capacity, so how would 5.9X or 23X translate into practical cooling performance in the data center? If we assume a small 1,000 ft² computer room with ten server cabinets and a 10' high ceiling, we are looking at a volumetric space of 10,000 cubic feet, or 4,720 cubic feet after removing the volume occupied by the ten server cabinets and the cold aisles. By contrast, a thirty-foot run of 4" pipe (significant overkill) from each close-coupled cooling coil to some distribution point, representing roughly the same distance to a chiller as from CRAH units to a chiller, would hold about 2.6 cubic feet of water per run, or roughly 26 cubic feet across all ten runs. So even though water has 5.9X the heat capacity of air and 23X the thermal conductivity, the volume of air available for heat removal in this example exceeds the volume of water by more than 180X, so this mislabeled liquid cooling was typically not as effective as air cooling. Considering that most close-coupled cooling plumbing has been 2" or 1" diameter rather than the 4" in this example, the real-world performance delta has likely been even greater.
However, efficiency is an entirely different metric than effectiveness, and the proponents of these systems claimed efficiency advantages over air-cooled systems. When comparing the efficiency of a close-coupled (liquid) cooling system to a poorly executed legacy hot-aisle/cold-aisle (mostly, kinda sorta) data center, the close-coupled design would always be more efficient. However, once most of the bypass and recirculation had been eliminated by containment and/or other airflow management strategies, the efficiency comparison came down to the fan systems of close-coupled units versus the fan systems of perimeter cooling units, and it became quite a different story. Energy Efficiency Ratios (EER = BTUs of heat removed per watt-hour of energy consumed) for row-based cooling fan systems typically ran in the low to mid 40s, while perimeter cooling units had EERs ranging from the mid 30s to the upper 40s. On a level playing field, the efficiency differences were more dependent on vendor and model differences than on technology differences.
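For readers who think in COP rather than EER, the quoted ranges convert easily (1 watt-hour = 3.412 BTU, so COP = EER / 3.412). A small sketch, with sample EER values picked from the ranges above:

```python
# Convert EER (BTUs of heat removed per watt-hour of input energy)
# to the dimensionless coefficient of performance (COP).
BTU_PER_WATT_HOUR = 3.412

def eer_to_cop(eer: float) -> float:
    """Return COP for a given EER."""
    return eer / BTU_PER_WATT_HOUR

# Illustrative values from the ranges in the text, not vendor data.
for label, eer in [("row-based, low 40s", 42.0),
                   ("perimeter, mid 30s", 35.0),
                   ("perimeter, upper 40s", 48.0)]:
    print(f"{label}: EER {eer} -> COP {eer_to_cop(eer):.1f}")
```

Either way you slice it, the ranges overlap, which is the point: the spread within each technology is wider than the gap between them.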
All that history aside, today, liquid cooling actually means liquid cooling, and that is where the promise of liquid cooling comes much closer to delivering. There are three basic configurations of liquid cooling: immersion, direct contact and partial direct contact.
Probably the most widely deployed today is what I'll call partial direct contact liquid cooling. In these systems, heat sinks on microprocessors, and sometimes on other heat-intensive components, are replaced with liquid cold plates. Liquid moving through these cold plates carries the heat to a nearby liquid-to-liquid heat exchanger, from which it is removed by some heat rejection system, historically a chiller. These are partial direct contact liquid cooling systems because plenty of heat sources in the server chassis are still cooled by air, thereby requiring some variation of a "normal" air-cooling mechanical infrastructure. Because of the parallel infrastructures, there may not be a large efficiency gain in some of these deployments. However, if they are retrofitted into an existing space, they can provide two significant benefits: (1) removing 25-30% of the load on the air cooling infrastructure can yield significant fan energy savings, and (2) extra computing capacity can be added to a space without adding the associated air cooling infrastructure, offering a path to extending the life of an existing space. The manufacturers of these systems have taken pains to simplify the modification of servers to accommodate the liquid cooling cold plates and associated plumbing, but widespread adoption remains inhibited by conservative IT reluctance to "open the box" and risk warranty issues with server vendors. Another benefit of these partial direct contact cooling solutions is that the more effective cooling provides a path to over-clocking processors, thereby getting custom-processor performance at standard-processor prices.
An evolutionary step up from partial direct contact liquid cooling is complete direct contact liquid cooling. In this technology, small, highly conductive metal "houses" are placed over every heat source on a server motherboard. All of these metal shrouds are the same height and press against a large cold plate essentially the size of the motherboard. All the heat from all the components is transferred to the cold plate through these bridging conductors, eliminating the need for any fan-powered air heat removal. This approach eliminates the parallel infrastructure required by partial direct contact liquid cooling, thereby reducing both opex and capex. However, the heat removal path is much more intrusive to the servers and very unlikely to be executable at the user level; this approach is going to be much more closely tied to custom servers designed specifically for the cooling technology.
The farthest departure from what we understand as data center thermal management is total immersion cooling. Quite simply, this technology involves trading the server cabinet for a high-tech horse trough and immersing the servers in an electrically non-conductive but thermally conductive oil bath. There are reports of 200 kW racks (i.e., tubs), so a very obvious benefit of immersion liquid cooling is the ability to support extremely high densities. A frequently cited disadvantage of immersion cooling is the loss of vertical scalability, which could result in lower per-square-foot power densities. While this may apply to standard commercial servers, if you can deploy super-high-density equipment and get over 100 kW per tub, you are probably not making a density sacrifice. More practical concerns have to do with the need for special hard drives and some special supporting infrastructure for vats and liquid storage. In addition, there may be some cultural issues with educated white-collar IT specialists working around, and in, tanks of oil. As with other liquid cooling approaches, immersion provides a significant microprocessor over-clocking performance benefit.
Direct contact liquid cooling and partial direct contact liquid cooling have advantages for boosting the cooling capacity of existing spaces, while immersion cooling appears to make the most TCO investment sense for new construction. All three technologies support over-clocking processors and getting supercomputer performance from standard processors. In general, early adopters have tended to be universities and scientific laboratories where transactional throughput is highly valued. We have also seen some early adopters in the bitcoin mining community, where transactional velocity is also highly valued. Both direct contact and partial direct contact liquid cooling work effectively with warm, even hot, water: 90°F or higher liquid can still effectively cool components and allow for over-clocking. For the time being, adoption will likely remain limited to those applications where we currently see these solutions deployed. However, that could all change with one of the major server OEMs marketing computing hardware specifically designed for one or more of these liquid cooling approaches.
Data Center Consultant