- The increasing energy and heat demands of AI are challenging traditional air cooling methods used in data centers.
- With expanding AI server density, liquid cooling is becoming essential.
- Despite their promise, hybrid cooling systems are being adopted only slowly.
The rapid advance of artificial intelligence (AI) is reshaping major industries, from logistics to search. But the shift carries significant hidden costs, especially in data centers: the soaring energy requirements of generative AI are pushing traditional air cooling systems to their limits, raising concerns about whether existing infrastructure can keep up.
To explore these issues, I spoke with Daren Shumate, founder of Shumate Engineering, and Stephen Spinazzola, Director of Mission Critical Services. Both bring decades of experience designing and building large-scale data centers, and both are now focused on the rising energy and cooling demands of AI.
They shared candid perspectives on how the industry must adapt its cooling strategies to meet these emerging challenges.
“A major issue we encounter in cooling data centers concerns power, water access, and physical space,” Spinazzola stated. “The intense computing demands of AI applications produce heat levels that conventional air systems struggle to manage.”
With the rollout of AI, rack densities have surged, often doubling or tripling power requirements, and traditional air cooling is struggling to keep up. Computational fluid dynamics (CFD) simulations across a range of data center layouts show worrying temperature spikes that can exceed 115 degrees Fahrenheit, significantly raising the risk of equipment failure.
While water cooling offers a more efficient and compact solution, it comes with its own set of challenges. A recent report highlighted that a single large facility might consume up to 1.5 million liters of water per day for sufficient cooling and humidity regulation.
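To put that figure in perspective, here is a rough back-of-envelope sketch (not from the report) of the heat that 1.5 million liters per day could reject if every drop were evaporated; real cooling towers also lose water to blowdown and drift, so the effective capacity is lower. The ~2,400 kJ/kg latent heat is a textbook value; the rest is an illustrative assumption.

```python
# Back-of-envelope: what 1.5 million litres/day of water buys in heat rejection.
# Assumptions (illustrative): all of it evaporates, and the latent heat of
# vaporization near ambient temperatures is roughly 2,400 kJ/kg.

LITRES_PER_DAY = 1.5e6
LATENT_KJ_PER_KG = 2400      # ~latent heat of vaporization of water
SECONDS_PER_DAY = 86_400

heat_mw = LITRES_PER_DAY * LATENT_KJ_PER_KG * 1000 / SECONDS_PER_DAY / 1e6
print(f"~{heat_mw:.0f} MW of continuous heat rejection")
# Roughly 40 MW of round-the-clock heat rejection -- which is why water draw
# at this scale quickly becomes a siting constraint.
```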
This complexity poses significant hurdles for engineers who must design the next generation of data centers capable of accommodating AI’s extraordinary requirements.
As Spinazzola pointed out, “Traditional air cooling methods are becoming outdated in the wake of AI’s heat output. Direct liquid cooling (DLC) is vital, particularly when managing clusters of 20 to 30 cabinets, where each unit draws a minimum of 40 kilowatts.”
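To see why loads like that overwhelm air systems, here is a rough sketch using the standard sensible-heat rule of thumb for air (Q in BTU/hr ≈ 1.08 × CFM × ΔT°F). The 25°F supply-to-return temperature rise and the 30-cabinet cluster size are illustrative assumptions, not figures from the interview.

```python
# Back-of-envelope: airflow needed to air-cool the AI clusters described above.
# Assumptions (not from the interview): 25 F air temperature rise across each rack,
# standard sensible-heat rule of thumb Q[BTU/hr] ~= 1.08 * CFM * delta_T[F].

RACK_KW = 40          # minimum draw per cabinet cited above
RACKS = 30            # upper end of the 20-30 cabinet cluster
DELTA_T_F = 25        # assumed supply-to-return air temperature rise

def cfm_required(load_kw: float, delta_t_f: float) -> float:
    """Airflow (CFM) needed to carry away a sensible heat load at a given delta-T."""
    btu_per_hr = load_kw * 1000 * 3.412   # convert kW to BTU/hr
    return btu_per_hr / (1.08 * delta_t_f)

per_rack = cfm_required(RACK_KW, DELTA_T_F)
cluster = per_rack * RACKS
print(f"~{per_rack:,.0f} CFM per rack, ~{cluster:,.0f} CFM for a {RACKS}-cabinet cluster")
# Roughly 5,100 CFM per rack and 150,000 CFM per cluster -- far beyond what
# conventional raised-floor air distribution comfortably delivers.
```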
This shift toward liquid cooling has substantial implications for energy consumption. For context, a single ChatGPT query can consume roughly ten times the energy of a typical Google search, and more complex prompts draw even more processing power from interconnected AI clusters. “We must reevaluate our approach to power distribution and cooling in data centers, moving towards liquid cooling technologies,” Spinazzola observed.
Shumate further identified challenges in power delivery for AI computing, saying, “We are confronted with two main concerns: effectively transporting power from UPS output boards to high-density racks and adeptly managing high-density UPS power supplied by utility sources.”
Today, power distribution typically relies on branch circuits running from PDUs (power distribution units) to individual rack PDUs, or on busways mounted above the racks. As power density rises, each rack needs more circuits, complicating the distribution strategy; varied client requirements for monitoring and distribution layouts add further complexity.
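As a simplified illustration of how circuit counts grow with density, the sketch below assumes 415 V three-phase rack whips, 60 A branch breakers derated to 80% for continuous load, and redundant A/B feeds. None of these parameters come from the interview; they are just common starting points.

```python
# Rough sketch of why circuit counts balloon with rack density.
# Assumptions (illustrative): 415 V three-phase rack whips, 60 A breakers
# derated to 80% for continuous load, and A/B (2N) feeds per rack.
import math

VOLTS_LL = 415        # line-to-line voltage
BREAKER_A = 60        # branch breaker rating
DERATE = 0.8          # continuous-load derating
FEEDS = 2             # A/B redundancy

def circuits_per_rack(rack_kw: float) -> int:
    """Branch circuits needed to serve one rack, including redundant feeds."""
    kw_per_circuit = math.sqrt(3) * VOLTS_LL * BREAKER_A * DERATE / 1000  # ~34.5 kW
    return math.ceil(rack_kw / kw_per_circuit) * FEEDS

for rack_kw in (10, 40, 100):
    print(f"{rack_kw:>3} kW rack -> {circuits_per_rack(rack_kw)} whips")
# 10 kW -> 2 whips, 40 kW -> 4 whips, 100 kW -> 6 whips: each jump in density
# multiplies the branch circuits (or busway tap-offs) the design has to route.
```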
To address these power-delivery challenges, medium-voltage UPS systems are increasingly employed, though they bring complications of their own, such as the need to design and build new substations.
When discussing cooling methods, Spinazzola elaborated on two primary forms of direct liquid cooling currently in use: immersion cooling and cold plate technology. Immersion cooling features servers housed vertically within a tank filled with a non-conductive liquid, which absorbs heat that is then transferred to a chilled water system. Although this method saves space, it requires specific server configurations.
In contrast, cold plate cooling uses liquid-cooled plates mounted against the chip stack to draw off heat, carrying it through a fluid loop to a cooling distribution unit that rejects it to a chilled water system. While effective, this approach requires extensive fluid piping and robust leak detection and prevention.
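For a sense of the fluid volumes involved, the sketch below estimates the coolant flow a cold-plate loop might need from the basic relation Q = ṁ · c_p · ΔT. The 80 kW rack load and 10°C coolant temperature rise are assumptions chosen for illustration, not figures from the article.

```python
# Minimal sketch of the coolant flow a cold-plate loop needs, from Q = m_dot * c_p * dT.
# Assumptions (illustrative): an 80 kW DLC rack, water-like coolant (c_p ~ 4186 J/kg.K),
# and a 10 C temperature rise between cold-plate supply and return.

RACK_KW = 80
CP_J_PER_KG_K = 4186     # specific heat of water
DELTA_T_C = 10           # coolant temperature rise across the cold plates

mass_flow_kg_s = RACK_KW * 1000 / (CP_J_PER_KG_K * DELTA_T_C)
litres_per_min = mass_flow_kg_s * 60          # 1 kg of water ~= 1 litre
print(f"~{litres_per_min:.0f} L/min of coolant for an {RACK_KW} kW rack")
# Roughly 115 L/min through a modest pipe, versus on the order of 10,000 CFM of air
# the same rack would need at a similar temperature rise -- which is why cold plates
# shrink the cooling footprint so dramatically.
```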
The industry has long relied on air cooling, but its inability to handle the high-density loads generated by AI underscores the need for a decisive shift. As the computational fluid dynamics studies noted above illustrate, air-based systems that cannot keep pace with rising loads raise serious reliability concerns.
In response, Spinazzola described a new cooling system design called Hybrid-Dry/Adiabatic Cooling (HDAC). The design delivers two distinct cooling-fluid temperatures through a single closed loop, serving both direct-liquid-cooled servers and traditional air-cooled equipment. HDAC uses 90% less water than a conventional chiller/cooling-tower plant and cuts energy consumption by 50% compared with standard air-cooled chillers, achieving a Power Usage Effectiveness (PUE) as low as 1.1 for the hyperscale data centers built to support AI workloads. By comparison, PUE for most AI data centers today typically ranges from 1.2 to 1.4.
That reduction in PUE translates to roughly a 12% increase in usable IT power from the same size utility feed, an advantage that is both economic and environmental. The primary obstacle, however, remains adoption: many organizations are reluctant to be the first to commit to an unfamiliar cooling approach.
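That figure follows directly from the definition of PUE (total facility power divided by IT power): for a fixed utility feed, the IT load you can support scales as 1/PUE. A small worked sketch, using an assumed 100 MW feed and baseline PUEs of 1.20 and 1.25:

```python
# Worked sketch of how a lower PUE frees usable IT power from the same utility feed.
# PUE = total facility power / IT power, so IT power = feed / PUE.
# The 100 MW feed and the 1.20 / 1.25 baselines are illustrative assumptions.

FEED_MW = 100.0

def it_power_mw(feed_mw: float, pue: float) -> float:
    """IT load a fixed utility feed can support at a given PUE."""
    return feed_mw / pue

hdac = it_power_mw(FEED_MW, 1.10)
for baseline in (1.20, 1.25):
    gain = hdac / it_power_mw(FEED_MW, baseline) - 1
    print(f"PUE {baseline:.2f} -> 1.10: {gain:.1%} more usable IT power")
# Prints roughly 9% and 14% -- bracketing the ~12% figure cited, with the exact
# number depending on which baseline PUE the comparison starts from.
```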
In an ever-evolving technological landscape, balancing efficiency with sustainability has never been more urgent. Insights from industry experts stress the critical need for data centers to transition toward innovative cooling solutions that can keep up with the demands of cutting-edge AI technologies.
This represents a clarion call for engineers, technicians, and industry stakeholders to redefine standards and expand the horizons of what’s achievable in creating more resilient data center infrastructures.