It’s out with the air and in with the liquid as densities continue to ramp up
Over the last few years, generative AI has been met with near-universal excitement as businesses race to harness its potential. Yet for all the fervor, it is important to consider the energy requirements needed to sustain such technology.
Building and training generative AI models demands vast amounts of energy, leading to both a sharp increase in power consumption and a greater need for dense computational resources. Data centers are at the forefront of this exponential rise and are projected to use even more power as this trend continues. The hardware powering generative AI, particularly GPUs, is highly energy-intensive, creating an urgent need for innovative solutions to manage the heat these systems generate.
Why air cooling is being exposed
Energy-intensive GPUs that power AI platforms require five to ten times more energy than CPUs because of their far larger number of transistors, and this is already impacting data centers. New, cost-effective design methodologies incorporating features such as 3D silicon stacking also allow GPU manufacturers to pack more components into a smaller footprint. This again increases power density, meaning data centers need more energy and generate more heat.
Another trend running in parallel is a steady fall in TCase (or case temperature) on the latest chips. TCase is the maximum safe surface temperature for chips such as GPUs, a limit set by the manufacturer to ensure the chip runs smoothly and does not overheat or require throttling, which hurts performance. On newer chips, TCase is coming down from the 90-100 degrees Celsius range to 70-80 degrees, or even lower. This is further driving the demand for new ways to cool GPUs.
Density also matters. Liquid cooling allows us to pack equipment into racks at much higher density. By fully populating those racks, we can use less data center space overall, and less real estate, which is going to be very important for AI.
Meeting the challenge
As generative AI’s energy demands escalate, liquid-cooled systems will emerge as an essential solution to meet the high energy density requirements. These systems not only help businesses optimize energy efficiency but also enable data centers to handle the growing number of GPUs driving future advancements. Given the immense power needs of generative AI, air cooling has become inadequate. The emergence of this technology has placed data centers under the spotlight and exposed them more than ever before. However, this is also a great opportunity to take action and embrace innovative solutions that match the challenge.
As a result of these factors, air cooling is no longer doing the job when it comes to AI. It is not just the power of the components, but the density of those components in the data center. Unless servers become three times bigger than before to spread that heat out, a more efficient means of heat removal is needed. That requires special handling, and liquid cooling will be essential to support the mainstream roll-out of AI.
An emerging trend
Liquid cooling is growing in popularity. Public research institutions were amongst the first users because they usually request the latest and greatest in data center tech to drive high-performance computing and AI. They also tend to have fewer fears around adopting new technology before it is established in the market.
Enterprise customers are more risk-averse. They need to make sure what they deploy will immediately provide a return on investment. We are now seeing more and more financial institutions – often conservative due to regulatory requirements – adopt the technology, alongside the automotive industry.
The latter is a big user of HPC systems to develop new cars. Service providers running colocation data centers are now following. Generative AI has huge power requirements that most enterprises cannot fulfill on their own premises, so they need to go to a colocation data center, to service providers that can deliver those computational resources. Those service providers are now transitioning to new GPU architectures and liquid cooling, and deploying liquid cooling lets them operate much more efficiently.
Why liquid cooling is critical
Liquid cooling delivers results both within individual servers and across larger data centers. By transitioning from a server with fans to a server with liquid cooling, businesses can significantly reduce energy consumption. But this is only at device level; perimeter cooling – removing heat from the data center itself – requires additional energy. That can mean only two-thirds of the energy a data center uses goes towards computing, the task it is designed to do. The rest is used to keep the data center cool.
Power usage effectiveness (PUE) is a measure of how efficient a data center is: the power required to run the whole facility, including the cooling systems, divided by the power drawn by the IT equipment. Some data centers optimized with liquid cooling are achieving a PUE of 1.1, and some even 1.04, meaning very little energy is spent on anything other than the IT load. That is before we even consider the opportunity to take the hot liquid or water coming out of the racks and reuse that heat to do something useful, such as heating the building in winter, which we see some customers doing today.
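To see how these figures connect, here is a minimal sketch of the PUE calculation described above. The power figures are hypothetical examples chosen only for illustration: they show how a PUE of roughly 1.5 corresponds to about two-thirds of energy reaching the compute, while the 1.1 cited for liquid-cooled facilities pushes that share above 90 percent.

```python
# Minimal sketch of the PUE calculation described above.
# The kW figures below are hypothetical examples, not measurements.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

def share_of_energy_for_compute(pue_value: float) -> float:
    """Fraction of the facility's energy that actually reaches the IT equipment."""
    return 1.0 / pue_value

# Example air-cooled facility: 1,500 kW drawn in total to support 1,000 kW of IT load.
air_cooled = pue(total_facility_kw=1500, it_equipment_kw=1000)       # 1.5
print(f"Air-cooled example: PUE {air_cooled:.2f}, "
      f"{share_of_energy_for_compute(air_cooled):.0%} of energy reaches compute")

# Example liquid-cooled facility at the kind of PUE cited above.
liquid_cooled = pue(total_facility_kw=1100, it_equipment_kw=1000)    # 1.1
print(f"Liquid-cooled example: PUE {liquid_cooled:.2f}, "
      f"{share_of_energy_for_compute(liquid_cooled):.0%} of energy reaches compute")
```

Running this prints a PUE of 1.50 with about 67 percent of energy reaching compute for the air-cooled example, versus 1.10 and roughly 91 percent for the liquid-cooled one, which is the gap the article is describing.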
https://www.datacenterdynamics.com/en/opinions/liquid-cooling-tech-represents-the-data-center-of-tomorrow/