Research Article · Thermal Management and Liquid Cooling
Shaolei Ren, Mohammad A. Islam, Adam Wierman
Published 2026-05-30 · arXiv · Credibility S
Abstract
The deployment of artificial intelligence is increasingly constrained by limited site-level power capacity, which must support both compute systems and non-compute systems (primarily cooling) at all times. Cooling power demand, especially in non-evaporative cooling systems, can increase substantially with ambient temperature in the summer, producing recurring periods of elevated cooling power that often last for multiple hours per day. Therefore, maximizing compute capacity under a limited site-level power budget is an important planning and operational challenge. Sizing the compute system conservatively based on peak cooling power can leave part of the site-level power capacity underutilized when the cooling power is below its peak, particularly in cooler months. On the other hand, sizing the compute system aggressively based on low cooling power can cause the total site-level power demand to exceed the site-level power capacity during hot days in the summer. This paper proposes ComputeAmp (Compute Amplifier), a framework that maximizes the compute capacity by jointly and dynamically leveraging cooling, battery energy storage, and computing-based adaptation. We discuss the opportunities and limitations of ComputeAmp and illustrate its potential to significantly expand usable compute capacity within local power and water resource limits. We also present a problem formulation for ComputeAmp and highlight a few algorithmic and operational challenges.
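Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making liquid cooling, thermal management, and data center energy efficiency key variables in the design of AI computing centers. Problem: The paper addresses how to size compute capacity under a fixed site-level power budget when cooling power rises with ambient temperature. Method: According to the abstract, the authors propose a framework (ComputeAmp) that jointly treats cooling, battery energy storage, and computing-based adaptation, and they present a problem formulation for it. Results: The abstract points to the potential to significantly expand usable compute capacity within local power and water limits; it gives no quantitative figures. Significance: For readers of this daily digest, the paper can inform judgments about cooling options, thermal-management strategies, and the pace of high-density deployment. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.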
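The sizing trade-off described in the abstract can be illustrated with a few lines of arithmetic. The sketch below is not the ComputeAmp formulation: it assumes, purely for illustration, that cooling power is a fixed fraction of compute power in each period, and the site cap and overhead values are hypothetical.

```python
def max_compute_kw(site_cap_kw, cooling_overhead):
    """Largest compute load such that compute + cooling fits under the site cap.

    Assumes cooling power = cooling_overhead * compute power (a simplification).
    """
    return site_cap_kw / (1.0 + cooling_overhead)


def headroom_profile(site_cap_kw, overheads, compute_kw):
    """Per-period headroom: positive means unused capacity, negative means overload."""
    return [site_cap_kw - compute_kw * (1.0 + o) for o in overheads]


# Hypothetical values, not taken from the paper.
SITE_CAP_KW = 10_000.0
OVERHEADS = [0.10, 0.15, 0.40, 0.25]  # e.g. cool night through hot afternoon

# Conservative sizing: plan for the worst (highest) cooling overhead.
conservative_kw = max_compute_kw(SITE_CAP_KW, max(OVERHEADS))
# Aggressive sizing: plan for the best (lowest) cooling overhead.
aggressive_kw = max_compute_kw(SITE_CAP_KW, min(OVERHEADS))

conservative_headroom = headroom_profile(SITE_CAP_KW, OVERHEADS, conservative_kw)
aggressive_headroom = headroom_profile(SITE_CAP_KW, OVERHEADS, aggressive_kw)
```

With these made-up numbers, conservative sizing never exceeds the cap but leaves capacity idle in cooler periods, while aggressive sizing overshoots the cap in the hottest period. That gap is what storage or computing adaptation would have to cover.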
Reference
Shaolei Ren, Mohammad A. Islam, Adam Wierman. Maximizing Compute Capacity in AI Data Centers through Cooling, Energy Storage, and Computing Adaptation[J/OL]. (2026-05-30)[2026-06-24]. http://arxiv.org/abs/2606.00457v1.
Research Article · Compute-Power Coordination
Soham Ghosh, Anik Goswami, Krishna Kumba
Published 2026-06-11 · arXiv · Credibility S
Abstract
Floating solar photovoltaic (FSPV) systems provide a land-efficient pathway to expand clean electricity access in energy-poor regions. South America has among the highest global FSPV potential (approximately 38.26 TWh per million acres of water surface), yet deployment remains limited. This study presents a techno-socio-economic framework to assess FSPV for energy access, water security, and grid flexibility, with case studies in Nicaragua, Honduras, and Guyana. Estimated yields for 50 to 398 MW systems exceed 1,500 to 2,000 kWh per kW annually with capacity factors above 20 percent. At El Cajon, FSPV could significantly reduce emissions relative to fossil generation. Results show competitive costs with land-based PV when accounting for avoided land use, shared hydropower infrastructure, and water benefits. The framework also highlights co-location with hydropower and AI data centers, offering a scalable model for deployment in underserved regions.
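Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making the coordination of compute load with grid-side resources a key variable in the design of AI computing centers. Problem: The paper addresses why floating solar photovoltaic (FSPV) deployment remains limited in a region with high potential. Method: According to the abstract, the authors build a techno-socio-economic framework and apply it to case studies in Nicaragua, Honduras, and Guyana. Results: The abstract reports annual yields above 1,500 to 2,000 kWh per kW, capacity factors above 20 percent, and costs competitive with land-based PV once avoided land use, shared hydropower infrastructure, and water benefits are counted. Significance: For readers of this daily digest, the relevant point is the proposed co-location of FSPV with hydropower and AI data centers as a supply option where grid capacity constrains new AI computing centers. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.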
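The yield and capacity-factor figures quoted in the abstract are linked by simple arithmetic, which the following sketch makes explicit. It assumes an 8,760-hour year; the function names are hypothetical, and any energy totals it produces are back-of-envelope values rather than results from the paper.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def capacity_factor(specific_yield_kwh_per_kw):
    """Capacity factor implied by an annual specific yield (kWh per kW installed)."""
    return specific_yield_kwh_per_kw / HOURS_PER_YEAR


def annual_energy_gwh(capacity_mw, specific_yield_kwh_per_kw):
    """Annual energy in GWh: MW * 1000 kW/MW * kWh/kW gives kWh, then divide by 1e6."""
    return capacity_mw * 1000 * specific_yield_kwh_per_kw / 1e6


# Yield bounds quoted in the abstract, applied to its smallest system size.
cf_low = capacity_factor(1500)
cf_high = capacity_factor(2000)
energy_50mw_low = annual_energy_gwh(50, 1500)
```

Under this assumption, a capacity factor of 20 percent corresponds to a specific yield of about 1,752 kWh per kW per year.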
Reference
Soham Ghosh, Anik Goswami, Krishna Kumba. Pushing the Frontiers for Floating Solar Photovoltaics -- The Case for South America[J/OL]. (2026-06-11)[2026-06-24]. http://arxiv.org/abs/2606.12798v1.
Research Article · AI Operations Optimization
Zihan Yu, Xianling Zeng, Zhiming Xue, Yalun Qi, Sichen Zhao
Published 2026-06-19 · arXiv · Credibility S
Abstract
The rapid growth of large-scale AI workloads, particularly Large Language Model (LLM) training and inference, is fundamentally reshaping the operational dynamics of hyperscale data centers. Unlike traditional cloud workloads, AI-driven jobs exhibit bursty, high-intensity, and rapidly shifting resource demands, often leading to sudden capacity stress that cannot be effectively handled by reactive threshold-based mechanisms. In this paper, we propose a deployment-oriented, burst-aware early warning framework for proactive capacity stress prediction under AI workload surges. We formulate the problem as a high-recall forecasting task over multivariate telemetry windows, with the explicit goal of enabling operational intervention before system degradation occurs. The proposed framework integrates workload intensity, temporal variation, and system pressure signals, and employs a lightweight tree-based learning model to capture nonlinear interactions in highly imbalanced environments. To evaluate the system under realistic conditions, we introduce an AI workload surge injection methodology that simulates burst-driven demand patterns observed in large-scale AI systems. Our XGBoost-based model achieves an ROC AUC of 0.697 and an AP of 0.670, significantly outperforming baseline methods. Under deployment-oriented threshold selection, the framework achieves a Recall of 0.914, enabling the detection of the majority of stress-prone periods with acceptable false-alarm cost. Beyond predictive performance, we show how the proposed framework can be integrated into operational control loops to support proactive actions such as workload throttling and resource scaling. Our results highlight the practical value of high-recall, learning-based early warning systems in enabling resilient and adaptive data center operations in the era of AI-driven workloads.
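Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making AI operations, load forecasting, and facility tuning key variables in the design of AI computing centers. Problem: The paper targets sudden capacity stress from bursty AI workloads, which reactive threshold-based mechanisms handle poorly. Method: According to the abstract, the authors frame the task as high-recall forecasting over multivariate telemetry windows, train a lightweight tree-based (XGBoost) model, and evaluate it with an AI workload surge injection methodology. Results: The abstract reports an ROC AUC of 0.697, an AP of 0.670, and a recall of 0.914 under deployment-oriented threshold selection. Significance: For readers of this daily digest, the paper helps judge whether AI tools can reduce operational complexity and improve availability. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.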
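The abstract mentions deployment-oriented threshold selection without giving the procedure. One common way to do this, shown here as an assumption and not as the authors' method, is to pick the highest score threshold that still reaches a target recall and then inspect the false-alarm count. The scores and labels below are invented toy data.

```python
def pick_threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold whose recall meets target_recall.

    A sample is flagged when score >= threshold. Returns None if there are
    no positive labels or the target cannot be met.
    """
    positives = sum(labels)
    if positives == 0:
        return None
    # Recall can only grow as the threshold drops, so the first candidate
    # (scanning from high to low) that meets the target is the highest one.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        if tp / positives >= target_recall:
            return t
    return None  # only reached when target_recall > 1


def recall_and_false_alarms(scores, labels, threshold):
    """Recall and false-positive count at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / sum(labels), fp


# Toy data (hypothetical): model scores and true stress labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 1, 0]

threshold = pick_threshold_for_recall(scores, labels, target_recall=0.75)
recall, false_alarms = recall_and_false_alarms(scores, labels, threshold)
```

Raising the recall target lowers the chosen threshold and usually increases false alarms, which is the cost the abstract refers to as "acceptable false-alarm cost".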
Reference
Zihan Yu, Xianling Zeng, Zhiming Xue, et al. Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers[J/OL]. (2026-06-19)[2026-06-24]. http://arxiv.org/abs/2606.21130v1.
Research Article · Chips and Compute
Salvatore Cielo, Elmira Birang, Alexander Pöppl, Sajad Azizi, Plamen Dobrev, Margarita Egelhofer, Ivan Pribec, Gerald Mathias
Published 2026-06-22 · arXiv · Credibility S
Abstract
We present a systematic performance and energy-efficiency characterization of five flagship scientific workloads on SuperMUC-NG phase 2, the 28 PetaFLOPs system at the Leibniz Supercomputing Center (LRZ) equipped with Intel Xeon Platinum 8480+ and Intel Data Center GPU Max 1550 (Ponte Vecchio, PVC) accelerators. The selected codes span molecular dynamics (gromacs, lammps), astrophysics and cosmology (OpenGadget3, AthenaK), and finite-element PDE solvers from the dealii-X Center of Excellence. For each code we measure throughput and energy efficiency expressed as compute-elements per wall-clock second (or per Joule of consumed energy) on a single compute node, comparing CPU-only (SPR) against combined CPU+GPU (SPR+PVC) configurations where available. Energy measurements rely on lightweight code instrumentation with p3em, or the Energy Aware Runtime (EAR) present on the system. Our results show that GPU offload yields 4-12× higher throughput and up to 15× better energy efficiency compared to CPU-only execution, with lammps and AthenaK benefiting most. However, both throughput and energy gains are sensitive to problem granularity: insufficient work per GPU tile erodes the accelerator advantage, as clearly observed in AthenaK at small mesh-block sizes. The power-budget utilization is systematically lower for CPUs than it is for GPUs, indicating that even at peak useful-work rate, most applications running on CPUs leave a significant fraction of the node's thermal envelope unused.
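Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making chips, servers, and high-density compute deployment key variables in the design of AI computing centers. Problem: The paper asks how much performance and energy efficiency real scientific codes gain from GPU offload on a production system. Method: According to the abstract, the authors measure throughput and energy efficiency for five scientific codes on a single compute node of SuperMUC-NG phase 2, comparing CPU-only with CPU+GPU configurations. Results: The abstract reports 4-12× higher throughput and up to 15× better energy efficiency with GPU offload, with gains that shrink when the work per GPU tile is too small. Significance: For readers of this daily digest, the paper helps judge how chip choices and server density carry through to facility design. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.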
Reference
Salvatore Cielo, Elmira Birang, Alexander Pöppl, et al. Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2[J/OL]. (2026-06-22)[2026-06-24]. http://arxiv.org/abs/2606.23265v1.
Research Article · Thermal Management and Liquid Cooling
Basit A. Akinade, Amobichukwu C. Amanambu, Jonathan M. Frame, Shaolei Ren
Published 2026-06-20 · arXiv · Credibility S
Abstract
AI data centres consume water for cooling, water scarcity constrains siting, and AI tools can improve water system efficiency. These dynamics are studied separately yet form a feedback loop. This review formalises the Water and AI Feedback Loop, introduces the Water Consumption Impact index to quantify community-scale utility burden, and demonstrates across ten US sites that burden spans three orders of magnitude, from 0.2% to 134% of host capacity.
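Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making liquid cooling, thermal management, and data center energy efficiency key variables in the design of AI computing centers. Problem: The paper observes that data center water consumption, water-constrained siting, and AI-driven water system efficiency are studied separately even though they form a feedback loop. Method: According to the abstract, the authors conduct a review, formalise the Water and AI Feedback Loop, and introduce the Water Consumption Impact index to compare community-scale utility burden. Results: Across ten US sites, the abstract reports a burden spanning three orders of magnitude, from 0.2% to 134% of host capacity. Significance: For readers of this daily digest, the paper can inform judgments about cooling options, thermal-management strategies, and the pace of high-density deployment where water is scarce. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.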
Reference
Basit A. Akinade, Amobichukwu C. Amanambu, Jonathan M. Frame, et al. AI Data Centers and the Water Use Feedback Loop[J/OL]. (2026-06-20)[2026-06-24]. http://arxiv.org/abs/2606.21760v1.
Research Article · Compute-Power Coordination
Jason Crop, Hayden Moore, Sudeep Pasricha
Published 2026-06-10 · arXiv · Credibility S
Abstract
As data center energy demand approaches grid-level constraints, optimizing conventional server infrastructure is essential for sustainable growth. The long-standing assumption that "cooler is better", i.e., lower CPU temperatures reduce power, does not fully hold for modern low-voltage CPUs, where inverse temperature dependence (ITD) drives higher supply voltages at lower temperatures. This creates a non-monotonic performance-per-watt curve where efficiency peaks at an intermediate thermal point. In this paper, for the first time, we empirically characterize ITD on production Intel Xeon CPUs and demonstrate that efficiency-optimal temperatures are CPU part-specific, and frequently higher than typical data center operating conditions. Measurements from commercial cloud data center platforms (Amazon, Equinix) reveal that approximately half of modern high-power CPUs operate about 10°C below their efficiency-optimal thermal point. By implementing ITD-aware thermal grouping of CPUs and inlet temperature adjustments, data center operators can optimize facility-level cooling and overall sustainability. Our case study shows that this approach can reduce total data center energy by 4-13% without sacrificing performance or reliability.
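Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making the coordination of compute load with grid-side resources a key variable in the design of AI computing centers. Problem: The paper challenges the assumption that lower CPU temperatures always reduce power, which fails for modern low-voltage CPUs affected by inverse temperature dependence (ITD). Method: According to the abstract, the authors characterize ITD empirically on production Intel Xeon CPUs, take measurements on commercial cloud data center platforms, and evaluate ITD-aware thermal grouping and inlet temperature adjustment in a case study. Results: The abstract reports that about half of modern high-power CPUs run about 10°C below their efficiency-optimal point, and that the approach can reduce total data center energy by 4-13% without sacrificing performance or reliability. Significance: For readers of this daily digest, the paper helps judge how much energy headroom inlet temperature policy can recover at sites constrained by grid capacity. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.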
Reference
Jason Crop, Hayden Moore, Sudeep Pasricha. Revisiting "Cooler is Better": ITD-Aware Per-CPU Thermal Optimization for Sustainable Data Center Operation[J/OL]. (2026-06-10)[2026-06-24]. http://arxiv.org/abs/2606.11163v1.
Research Article · Compute-Power Coordination
Yuhao Huang, Novarun Deb, Hamidreza Zareipour
Published 2026-06-19 · arXiv · Credibility S
Abstract
The rapid expansion of artificial intelligence (AI) has driven unprecedented growth in data center electricity demand. The scale and pace of this load growth carry significant implications for the sustainability of electric power systems. On the one hand, rapid, spatially concentrated data center load growth is outpacing clean energy deployment in several major regions, raising emissions and challenging both grid flexibility and reliability. On the other hand, this fast-developing and capital-intensive sector offers abundant opportunities to advance sustainability through clean energy integration and operational innovations. This article provides an overview of the mechanisms through which data centers affect power system sustainability, underscoring both the risks and the potential. Specifically, this article (i) characterizes AI data center load behavior, categorizes electricity supply configurations by function and sustainability profile, and situates these loads within global and regional electricity demand trends; (ii) analyzes sustainability impacts across short-run operational and long-run planning mechanisms, evaluates effects on grid carbon emissions and renewable energy utilization, and assesses the feasibility of offering system flexibility and participating in ancillary services; and (iii) evaluates real-world corporate sustainability pathways, highlighting both the system benefits and the feasibility limits of current carbon accounting practices. The goal of this work is to synthesize existing knowledge and technological developments and to guide research and development toward a more sustainable integration of AI data centers and electric power systems.
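Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making the coordination of compute load with grid-side resources a key variable in the design of AI computing centers. Problem: The paper addresses data center load growth that outpaces clean energy deployment in several major regions, raising emissions and straining grid flexibility and reliability. Method: According to the abstract, this is an overview article that characterizes AI data center load behavior, categorizes electricity supply configurations, and reviews operational, planning, and carbon-accounting mechanisms. Results: The abstract describes a synthesis of risks and opportunities, including the feasibility of offering flexibility and ancillary services; it gives no quantitative figures. Significance: For readers of this daily digest, the paper helps judge whether AI computing center construction is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.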
Reference
Yuhao Huang, Novarun Deb, Hamidreza Zareipour. AI Data Centers and Power System Sustainability: Understanding the Sustainability Implications of AI-Driven Data Centers on Power Systems[J/OL]. (2026-06-19)[2026-06-24]. http://arxiv.org/abs/2606.21064v1.
Research Article · Compute-Power Coordination
Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang
Published 2026-06-17 · arXiv · Credibility S
Abstract
The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.
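Interpretation
Background: AI data center load, power density, and energy constraints are rising together, making the coordination of compute load with grid-side resources a key variable in the design of AI computing centers. Problem: The paper argues that conventional energy management treats LLM inference as an aggregate load and so misses the flexibility offered by model quantization. Method: According to the abstract, the authors combine a quantization-to-power model, a two-stage demand response (DR) model covering instance switching, request routing, and precision selection, and a multi-campus co-optimization that uses electricity and carbon signals. Results: In the case studies, the abstract reports a 34.3% reduction in total data-center operating cost without curtailing served token volume. Significance: For readers of this daily digest, the paper helps judge whether AI computing center construction is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The claims still need to be checked against the experimental conditions, sample scope, and cost assumptions in the full text.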
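The idea of precision selection as a flexibility lever can be sketched in a few lines. The example below is a deliberately simplified illustration, not the paper's two-stage DR model: it ignores switching cost, routing, and accuracy, and the configuration names, throughput figures, and power figures are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantConfig:
    name: str            # hypothetical precision label
    tokens_per_s: float  # assumed throughput per model instance
    power_kw: float      # assumed power draw per model instance


def select_precision(configs, demand_tokens_per_s, power_cap_kw):
    """Return (name, instances, total_kw) for the first config, in the given
    preference order, that serves the demand within the power cap.

    Returns None when no configuration fits under the cap.
    """
    for c in configs:
        instances = math.ceil(demand_tokens_per_s / c.tokens_per_s)
        total_kw = instances * c.power_kw
        if total_kw <= power_cap_kw:
            return (c.name, instances, total_kw)
    return None


# Hypothetical configurations, ordered from highest to lowest precision.
CONFIGS = [
    QuantConfig("fp16", tokens_per_s=1000.0, power_kw=5.0),
    QuantConfig("int8", tokens_per_s=1600.0, power_kw=4.5),
    QuantConfig("int4", tokens_per_s=2200.0, power_kw=4.0),
]

DEMAND = 20_000.0  # tokens per second to be served in full

normal = select_precision(CONFIGS, DEMAND, power_cap_kw=120.0)
dr_event = select_precision(CONFIGS, DEMAND, power_cap_kw=60.0)
```

In this toy setting the full token demand is still served during the DR event; only the precision changes. That mirrors the abstract's claim of cost reduction "without curtailing served token volume", although the sketch says nothing about the accuracy impact of lower precision.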
Reference
Bojun Du, Xiaoyi Fan, Ershun Du, et al. From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads[J/OL]. (2026-06-17)[2026-06-24]. http://arxiv.org/abs/2606.18851v1.