Why Demanding AI Workloads Need to Be Isolated in Their Own Environment Within the Data Center

by Drew Robb | Apr 10, 2024 | Blog

Some data centers have decided to go all-in on high-performance computing (HPC) and generative AI (GenAI). Others are thinking about it. Before pushing ahead with such plans, data center managers are advised to research the potential repercussions.

This begins with understanding the power and compute-density demands of these workloads. A single query to a large language model (LLM) such as ChatGPT, for example, generates roughly 100 times more carbon than a Google search. LLMs also require training that can consume up to 10 gigawatt-hours (GWh) of electricity for a single model. With many companies now building their own LLMs, it has become abundantly clear that GenAI applications place serious demands on power and compute resources.
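To put 10 GWh in perspective, a quick back-of-the-envelope conversion helps. The electricity rate and household-consumption figure below are illustrative assumptions, not figures from this article:

```python
# Rough scale of a 10 GWh LLM training run.
# Assumed figures (illustrative, not from the article):
#   - commercial electricity rate of $0.10 per kWh
#   - average US household use of ~10,500 kWh per year
TRAINING_GWH = 10
KWH_PER_GWH = 1_000_000

training_kwh = TRAINING_GWH * KWH_PER_GWH      # 10,000,000 kWh
electricity_cost = training_kwh * 0.10         # ~$1,000,000 at $0.10/kWh
household_years = training_kwh / 10_500        # roughly 950 household-years

print(f"{training_kwh:,} kWh = about ${electricity_cost:,.0f} in electricity, "
      f"or about {household_years:,.0f} household-years of consumption")
```

In other words, a single training run at that scale consumes about as much electricity as nearly a thousand homes use in a year, before a single query is ever served.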

GenAI Security and Legal Liabilities

There are plenty of risks inherent in utilizing GenAI. Its queries and especially its responses can be highly sensitive. Those using public LLMs must watch out for plagiarism and hallucinatory answers. Organizations also need to be aware that anything sent outside of the organization in a query could be open to attack and exposure.

Alongside security and confidentiality concerns, legal liability is one of the biggest reasons why GenAI applications and hardware should be kept in-house. GenAI has already generated lawsuits. Carrie Goetz, Principal/CTO of StrategITcom and author of books such as Jumpstart Your Career in Data Centers, laid out some of the legal pitfalls and gotchas:

  • Suppose Employee A uploads company literature to a GenAI engine for manipulation/consideration and receives output from it. Who is responsible for copyright infringement?
  • Who is responsible for ensuring trade secrets didn’t become part of the large language model due to that upload?
  • Who is going to ensure that company policies are followed?
  • Suppose there is some liability problem with the output; who exactly is liable? 
  • Suppose that information contains something illegal. Where are the boundaries of liability for transmission, assimilation, and distribution of the data? 
  • What if the outcome causes harm or violates personally identifiable information (PII) laws? 

“Legalities are at the forefront of most thoughts around AI, and the remainder are mostly policy considerations,” said Goetz. “HR policies will need to be adjusted, too.”

Performance Certainty

According to a report from Pure Storage, 88% of those adopting AI experienced a dramatic rise in computing power requirements, and 74% noted that AI necessitates major upgrades or a complete overhaul of IT infrastructure. As many as 47% have at least doubled their computing power since adopting AI.

Part of the latency problem is the use of cloud-based GenAI resources. Anyone seeking to serve AI and HPC workloads over the cloud lacks certainty about where cloud resources are located, and failure to meet SLAs when using cloud-based GenAI services could generate contractual headaches. Hence, some are deciding to develop LLMs internally, adding to the data center burden.

GenAI and HPC, after all, demand very low latency. Their traffic can’t be delayed by inadequate storage, compute, and memory resources, so those resources typically must be located close to the application rather than in the cloud. That means well-designed, rack-dense resources dedicated to AI/HPC with as little latency as possible. The best approach, then, is to implement them internally and isolate them from the rest of the data center behind their own firewall, provisioned with abundant resources to ensure performance, minimize latency, and provide legal protection as well as chain of custody on AI traffic.

Cooling Infrastructure

Another area of isolation is cooling infrastructure. GenAI racks will generate a tremendous amount of heat. Liquid cooling will be needed. But it is expensive and carries its own categories of risk (including the potential for leakage). Thus, expect racks to emerge that are specifically designed for GenAI that incorporate advanced liquid cooling designs. These will be kept separate from the rest of the data center.

“Liquid cooling will be a requirement for those deploying 1,000+ watt chips efficiently and sustainably at scale,” said Lucas Beran, Research Director at Dell’Oro Group.

It all comes down to mechanics. The more power we put into the data center, the more waste heat is generated, and the more cooling is needed. There comes a point where air cooling can no longer cope on its own. And there will be points where so much heat is generated by GPUs and CPUs that liquid has to be part of the equation. That is likely to mean one part of the data center being liquid cooled to cope with GenAI needs and another remaining air cooled for traditional workloads.
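The point where air can no longer cope follows from basic heat-transfer arithmetic: the airflow needed to remove a rack's waste heat is Q = m × cₚ × ΔT. The sketch below illustrates why; the rack power figures and the 15 K temperature rise are illustrative assumptions, not values from this article:

```python
# Airflow required to remove a rack's waste heat: power = m_dot * cp * dT
# Approximate air properties at ~25 C:
AIR_DENSITY = 1.2   # kg/m^3
AIR_CP = 1005       # J/(kg*K), specific heat capacity of air

def airflow_m3_per_h(rack_power_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to carry away rack_power_w of heat
    with an air temperature rise of delta_t_k across the rack."""
    mass_flow = rack_power_w / (AIR_CP * delta_t_k)   # kg/s of air
    return mass_flow / AIR_DENSITY * 3600             # m^3/h

# A traditional 8 kW rack vs. a hypothetical 40 kW GenAI rack, 15 K rise:
for kw in (8, 40):
    print(f"{kw} kW rack needs about "
          f"{airflow_m3_per_h(kw * 1000, 15):,.0f} m^3/h of air")
```

At GenAI densities the required air volume grows to the point where fan power, noise, and airflow distribution become impractical, which is exactly where liquid enters the equation.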

“The industry is incorporating more efficient cooling means by conductive cooling with fluid,” said Bill Estes, General Manager for Anderson Power Products. “In some cases, that is through a cooling plate that runs fluid through a metal plate and there are also immersion cooling environments where you are taking an entire system and drop it into a tank of dielectric fluid.”

Isolating Workloads

Due to legal, performance, and other considerations, then, it is probably best to isolate demanding HPC and GenAI workloads within the data center. This gives IT the opportunity to test and vet deployments before they are fully launched. For smaller LLMs, implementation will be easier, as the outcomes are more predictable. Nevertheless, setting up these workloads behind their own firewall, fully separated from other data center resources, provides a much-needed layer of protection.

“There is a risk avoidance through separation,” said Goetz.

Monitoring is Vital

Due to the vast quantities of compute, storage, and networking resources that can be consumed by GenAI and HPC, monitoring becomes more vital than ever.

“Monitor what happens on the network and servers to help estimate ongoing demand as you scale,” said Goetz. “Look for services like Interact that can pore over your stack and provide you with the most economical servers for the task.”

New Life for Old Data Centers

For some enterprises, GenAI could mean that it is time to take their old data centers out of retirement or semi-retirement. It may be easier and more cost effective to upgrade an existing facility with all-new hardware than to try to run a demanding GenAI service in the cloud. If the business is intent on using GenAI to the fullest extent, the price tag for servers and other hardware powerful enough for GenAI may be less daunting than some of the cloud bills currently being generated. One cloud user reported a $13,000 bill for three hours of HPC.

“Many of the retired facilities that are lingering after a company moved to the cloud or colo may very well find a new purpose in the way of an AI engine,” said Goetz.
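The $13,000-for-three-hours figure makes the break-even arithmetic easy to sketch. The server price below is hypothetical, chosen purely to illustrate the comparison, not a quote from the article:

```python
# Break-even between a per-hour cloud HPC rate and buying hardware outright.
cloud_bill = 13_000   # from the article: $13,000 for three hours of HPC
hours = 3
hourly_rate = cloud_bill / hours   # about $4,333 per hour

# Hypothetical on-prem figure (an assumption, not from the article):
server_cost = 250_000   # a GPU server powerful enough for the workload

breakeven_hours = server_cost / hourly_rate
print(f"Cloud rate is about ${hourly_rate:,.0f}/hour; a ${server_cost:,} "
      f"server pays for itself after about {breakeven_hours:.0f} hours of use")
```

At rates like that, even an expensive on-prem server amortizes within days of sustained use, which is why retired facilities start to look attractive again.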




Drew Robb

Writing and Editing Consultant and Contractor

Drew Robb has been a full-time professional writer and editor for more than twenty years. He currently works freelance for a number of IT publications, including eSecurity Planet and CIO Insight. He is also the editor-in-chief of an international engineering magazine.
