Research Article · Compute-Power Coordination
Johnny R. Zhang, Gaoyuan Du, Qianyi Sun, Shiqi Wang, Jiaxuan Li, Xian Sun
Published 2026-05-05 · arXiv · Credibility S
Abstract
AI data centers are increasingly becoming tightly coupled compute-energy systems, where workload placement, cooling demand, electricity procurement, storage operation, and carbon emissions interact over time. This paper studies carbon-aware compute-power scheduling for geographically distributed AI data centers with microgrid prosumer capabilities. We propose a mixed-integer linear programming (MILP) framework that jointly schedules rigid training jobs, routes elastic inference workloads, dispatches local generation and battery storage, and manages bidirectional grid interaction under latency, continuity, power-balance, and carbon-budget constraints. The model captures two key features of emerging AI infrastructure: heterogeneous workload flexibility and site-level energy prosumer operation. Experiments on synthetic yet practically motivated instances show that the proposed joint MILP substantially improves total operational benefit over compute-only and energy-only baselines while reducing emissions. The results further indicate that inference-routing flexibility is a major source of value, battery storage provides useful temporal flexibility, and local-generation-rich settings are particularly favorable. The framework provides a tractable optimization abstraction for sustainable and grid-interactive AI data centers.
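The continuity constraint on rigid training jobs can be illustrated with a toy search. The sketch below is not the paper's MILP: it simply enumerates every site and contiguous start slot for a single job and keeps the placement with the lowest emissions, assuming hourly slots, a constant power draw, and made-up carbon intensities. The function name `best_training_placement` and all numbers are hypothetical.

```python
def best_training_placement(carbon, duration, power_kw):
    """Pick the site and start slot that minimise emissions for one rigid job.

    carbon[site][t] is an assumed grid carbon intensity in gCO2/kWh for hourly
    slot t. The job must run for `duration` consecutive slots at one site
    (the continuity requirement) and draws a constant `power_kw`.
    Returns (emissions_g, site, start), or None if the job fits nowhere.
    """
    best = None
    for site, series in enumerate(carbon):
        for start in range(len(series) - duration + 1):
            # Energy per hourly slot is power_kw * 1 h, so emissions are
            # power_kw times the summed intensity over the window.
            emissions = power_kw * sum(series[start:start + duration])
            candidate = (emissions, site, start)
            if best is None or candidate < best:
                best = candidate
    return best


# Hypothetical intensities for two sites over five hourly slots.
carbon = [
    [400, 300, 200, 200, 500],
    [350, 350, 350, 100, 100],
]
placement = best_training_placement(carbon, duration=2, power_kw=100)
print(placement)  # (20000, 1, 3)
```

A realistic instance adds elastic inference routing, storage, and grid exchange, which couple the time slots and make plain enumeration impractical; that coupling is what motivates a solver-based MILP formulation.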
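Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute workloads with grid-side resources is becoming a key variable in AI computing center design. Problem: workload placement, cooling demand, electricity procurement, storage operation, and carbon emissions interact over time, so scheduling compute or energy alone leaves value unused. Method: per the abstract, the authors build an MILP that jointly schedules rigid training jobs, routes elastic inference, dispatches local generation and batteries, and manages bidirectional grid exchange under latency, continuity, power-balance, and carbon-budget constraints. Results: on synthetic instances the joint model improves total operational benefit over compute-only and energy-only baselines while reducing emissions, with inference-routing flexibility a major source of value. Significance: for readers of this daily digest, it helps judge how far AI computing center build-out is constrained by grid capacity, workload flexibility, and scheduling mechanisms. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.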
Reference
Johnny R. Zhang, Gaoyuan Du, Qianyi Sun, et al. Carbon-Aware Compute-Power Scheduling for AI Data Centers with Microgrid Prosumer Operations[J/OL]. (2026-05-05)[2026-06-01]. http://arxiv.org/abs/2605.03751v2.
Research Article · Compute-Power Coordination
Fiaz Hossain, Nilanjan Ray Chaudhuri, Alok Sinha, Sai Gopal Vennelaganti, Mohammed E. Nassar
Published 2026-05-02 · arXiv · Credibility S
Abstract
A framework is established that assesses the impact of variations in artificial intelligence (AI) data center (DC) loads on the fatigue damage of steam/gas turbines of the synchronous generators (SGs) from torsional oscillations. Next, a simple three-step process that is supported by frequency-domain analysis is laid out to quantify the limits on fluctuations in AI DC loads. In the first step, the maximum allowable variation in electrical power output at each SG terminal is independently determined from first principles. This step needs only a lumped multi-mass model of the mechanical side of the SG. In the second step, we propose a new approach that relies on load flow to determine the so-called algebraic 'interaction factor' that maps the change in AI DC load at a given bus to the corresponding change in each of the SG power outputs. In the third step, we propose a screening method to rank the candidate buses to site AI DCs and solve an optimization problem to determine the optimal allowable fluctuations in the AI DCs. We demonstrate the applicability of the proposed approach through frequency-domain and time-domain analyses in the modified IEEE 4-machine and IEEE-68 bus systems using a dynamic phasor framework. Finally, we demonstrate the scalability of the proposed approach on the synthetic 2000-bus Texas system.
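As a rough illustration of how the second and third steps fit together, the sketch below treats the interaction factors as a fixed linear map from a load change at one bus to the change in each generator's output, then finds the largest load swing that keeps every generator within its allowable variation. The factor values, the limits, and the function name `max_load_fluctuation` are hypothetical; the paper derives the factors from load flow and solves a multi-bus optimization, neither of which is reproduced here.

```python
def max_load_fluctuation(interaction_factors, sg_limits_mw):
    """Largest load swing (MW) at one bus that respects every generator limit.

    Assumes generator g sees a power change of k_g * dP for a load change dP,
    and that its allowable variation is sg_limits_mw[g]. The binding
    generator is the one with the smallest limit_g / |k_g|.
    """
    if len(interaction_factors) != len(sg_limits_mw):
        raise ValueError("one limit is required per interaction factor")
    bounds = [
        limit / abs(k)
        for k, limit in zip(interaction_factors, sg_limits_mw)
        if k != 0  # a zero factor means this generator never binds
    ]
    return min(bounds) if bounds else float("inf")


# Hypothetical values for three generators and one candidate bus.
factors = [0.5, 0.25, 0.25]
limits_mw = [10.0, 8.0, 4.0]
allowed_mw = max_load_fluctuation(factors, limits_mw)
print(allowed_mw)  # 16.0
```

Repeating this per candidate bus gives a simple ranking signal: buses whose binding bound is larger tolerate bigger swings under these assumptions.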
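Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute workloads with grid-side resources is becoming a key variable in AI computing center design. Problem: fluctuating AI data center loads can drive torsional oscillations in synchronous generators and accumulate fatigue damage in steam and gas turbines, so the question is how large those fluctuations may be. Method: per the abstract, the authors lay out a three-step, frequency-domain process: a per-generator allowable power variation from a lumped multi-mass model, load-flow-based interaction factors, and then bus screening plus an optimization for allowable fluctuations. Results: the approach is demonstrated on the modified IEEE 4-machine and IEEE 68-bus systems and shown to scale to the synthetic 2000-bus Texas system. Significance: for readers of this daily digest, it helps judge whether AI computing center siting is constrained by grid capacity, load fluctuation, and the fatigue life of nearby generators. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.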
Reference
Fiaz Hossain, Nilanjan Ray Chaudhuri, Alok Sinha, et al. Limiting the Impact of AI Data Centers on Fatigue Life of Thermal Turbine Generators in the Grid: A Frequency-Domain Approach[J/OL]. (2026-05-02)[2026-06-01]. http://arxiv.org/abs/2605.01173v1.
Research Article · Thermal Management and Liquid Cooling
Jacob Morrison, Noah A. Smith, Emma Strubell
Published 2026-05-02 · arXiv · Credibility S
Abstract
Modern language model development extends far beyond pretraining, yet environmental reporting remains narrowly focused on the cost of training a single final model. In this work, we provide the first detailed breakdown of the environmental impact of a full model development pipeline, from pretraining through supervised fine-tuning, preference optimization, and reinforcement learning, for Olmo 3, a family of 7 billion and 32 billion parameter models in both instruction-following and reasoning variants. We find that reasoning models are 17x more expensive to post-train than their instruction-tuned counterparts in terms of datacenter energy, driven by reinforcement learning rollout generation. Development costs (including experimentation, failed runs, and ablations) account for 82.2% of total compute, a roughly 65% increase over the ~50% reported for pretraining-focused pipelines in prior work. In total, we estimate our model development process consumed ~12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and consumed 15,887 kL of water, with water consumption driven entirely by power generation infrastructure rather than data center cooling. These costs, which are almost entirely unreported by model developers, are growing rapidly as post-training pipelines become more complex, and must be accounted for in environmental reporting standards and by the research community working to reduce AI's environmental impact.
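The headline figures can be cross-checked with simple arithmetic. The sketch below recomputes the relative increase in development share and derives average carbon and water intensities per kWh. Those two intensities are not reported in the abstract; they are implied averages computed here from the rounded totals (energy is given only as "~12.3 GWh"), so they should be read as rough.

```python
# Figures quoted in the abstract.
dev_share_pct = 82.2      # development share of total compute
prior_share_pct = 50.0    # approximate share reported in prior work
energy_kwh = 12.3e6       # ~12.3 GWh
emissions_g = 4251 * 1e6  # 4,251 tCO2eq in grams
water_l = 15887 * 1e3     # 15,887 kL in litres

# Relative increase of 82.2% over ~50%.
relative_increase_pct = (dev_share_pct - prior_share_pct) / prior_share_pct * 100

# Implied averages (derived here, not stated in the abstract).
carbon_intensity = emissions_g / energy_kwh  # gCO2eq per kWh
water_intensity = water_l / energy_kwh       # litres per kWh

print(round(relative_increase_pct, 1))                      # 64.4
print(round(carbon_intensity), round(water_intensity, 2))   # 346 1.29
```

The 64.4% result is consistent with the abstract's "roughly 65% increase".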
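Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data center energy efficiency are becoming key variables in AI computing center design. Problem: environmental reporting for language models usually covers only the training of a single final model and leaves out the rest of the development pipeline. Method: per the abstract, the authors break down energy, carbon, and water use across the full Olmo 3 pipeline, from pretraining through supervised fine-tuning, preference optimization, and reinforcement learning. Results: reasoning models cost 17x more datacenter energy to post-train than instruction-tuned ones; development work accounts for 82.2% of total compute; the whole process is estimated at ~12.3 GWh, 4,251 tCO2eq, and 15,887 kL of water, with water use driven by power generation rather than data center cooling. Significance: for readers of this daily digest, it helps judge how much energy and water demand post-training adds on top of pretraining, and it shows that the water footprint here does not come from the cooling plant. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.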
Reference
Jacob Morrison, Noah A. Smith, Emma Strubell. The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining[J/OL]. (2026-05-02)[2026-06-01]. http://arxiv.org/abs/2605.01158v1.
Research Article · Compute-Power Coordination
Jiyong Lee, Melody Agustin, Joanne Langsdorf, Erhan Kutanolgu, Michael Baldea, Ilias Mitrai
Published 2026-05-28 · arXiv · Credibility S
Abstract
In this paper, we consider the expansion of power grids under emerging large loads from data centers and electrified manufacturing. We develop a multi-period grid capacity expansion model to determine optimal investment profiles for power generation, storage, and transmission capacity while accounting for hourly power dispatch, such that electricity demand is satisfied and the total planning and operation cost is minimized. We also propose a new modeling approach regarding the spatial distribution of demand from large loads. The model is used to analyze the expansion of a synthetic grid that follows key characteristics of the ERCOT system over a seven-year planning horizon, under loads from data centers and electrified oil refining, which account for 17.5% and 4.7% of total annual electricity demand by the end of the planning horizon. The optimal investment policy leads to an 83.6% increase in generation capacity and exploits the short construction times of solar and storage as well as the operational flexibility of thermal generators. Finally, sensitivity analysis reveals that the construction time of grid assets substantially impacts investment timing, generation technology mix, and transmission capacity expansion. The proposed modeling framework is general and can be extended to other grid systems, enabling the exploration of diverse demand scenarios, policy assumptions, and regional characteristics.
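Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute workloads with grid-side resources is becoming a key variable in AI computing center design. Problem: large new loads from data centers and electrified manufacturing force grid planners to decide when and where to add generation, storage, and transmission. Method: per the abstract, the authors formulate a multi-period capacity expansion model with hourly dispatch and a new representation of the spatial distribution of large-load demand, and apply it to a synthetic grid with ERCOT-like characteristics over a seven-year horizon. Results: with data centers and electrified oil refining reaching 17.5% and 4.7% of annual demand, the optimal plan raises generation capacity by 83.6% and relies on fast-to-build solar and storage plus flexible thermal units; asset construction time strongly affects investment timing, technology mix, and transmission expansion. Significance: for readers of this daily digest, it helps judge whether AI computing center build-out is constrained by grid capacity and by how quickly new grid assets can be built. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.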
Reference
Jiyong Lee, Melody Agustin, Joanne Langsdorf, et al. Grid Capacity Expansion under Data Centers and Electrified Manufacturing Large Loads[J/OL]. (2026-05-28)[2026-06-01]. http://arxiv.org/abs/2605.29053v1.
Research Article · Compute-Power Coordination
Denisa-Andreea Constantinescu, David Atienza
Published 2026-05-26 · arXiv · Credibility S
Abstract
At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar generation. For multi-megawatt AI/HPC facilities, the key unresolved question is practical and measurable: how quickly can the software stack translate a grid request into a real change in GPU power at the facility meter, where commitments are settled? We answer this on real hardware with GridPilot, a three-tier predictive controller operating across milliseconds, seconds, and hours, augmented by a deterministic safety-island bypass for fast response. On a three-GPU NVIDIA V100 testbed, GridPilot achieves a measured end-to-end trigger-to-target response of 97.2 ms, which is 6.9x faster than the 700 ms requirement of Nordic Fast Frequency Reserve. We further incorporate an instantaneous Power Usage Effectiveness (PUE) correction so dispatched commitments remain robust at meter level rather than only at IT load level. In replay experiments across six representative European grids (from Sweden to Poland), the PUE-aware controller closes 2.5-5.8 percentage points of cooling-overhead drag. GridPilot is released as open source and serves as a proof of concept that MW-scale AI/HPC demand can be engineered as controllable, grid-responsive flexibility by design.
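The reason a PUE correction matters at the meter can be shown with one line of arithmetic. The sketch below assumes the simple relation meter power = PUE × IT power, with PUE held constant over the dispatch interval; this is a simplification, and GridPilot's actual correction is not described in the abstract. The function name `it_delta_for_meter_target` and the numbers are illustrative.

```python
def it_delta_for_meter_target(meter_delta_kw, pue):
    """IT power change needed to realise a given change at the facility meter.

    Assumes meter_power = pue * it_power with a constant PUE, so cooling and
    other overhead scale proportionally with IT load.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return meter_delta_kw / pue


# Grid request: shed 125 kW at the meter, with an assumed PUE of 1.25.
it_delta_kw = it_delta_for_meter_target(-125.0, pue=1.25)
print(it_delta_kw)  # -100.0

# Ignoring PUE and cutting IT load by the full 125 kW would move the meter
# by more than was committed under the same assumption.
uncorrected_meter_kw = -125.0 * 1.25
print(uncorrected_meter_kw)  # -156.25
```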
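Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute workloads with grid-side resources is becoming a key variable in AI computing center design. Problem: how quickly the software stack can turn a grid request into a real change in GPU power at the facility meter, where commitments are settled. Method: per the abstract, the authors validate on real hardware: GridPilot, a three-tier predictive controller with a deterministic safety-island bypass, is measured on a three-GPU NVIDIA V100 testbed and replayed against six European grids. Results: a measured trigger-to-target response of 97.2 ms against the 700 ms Nordic Fast Frequency Reserve requirement, and a PUE-aware controller that closes 2.5-5.8 percentage points of cooling-overhead drag. Significance: for readers of this daily digest, it helps judge whether AI computing centers can act as fast, controllable grid loads; the testbed is small, so the MW-scale claim remains a proof of concept. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.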
Reference
Denisa-Andreea Constantinescu, David Atienza. GridPilot: Real-Time Grid-Responsive Control for AI Supercomputers[J/OL]. (2026-05-26)[2026-06-01]. http://arxiv.org/abs/2605.26384v1.
Research Article · AI Operations Optimization
Roblex Nana Tchakoute, Claude Tadonki
Published 2026-05-23 · arXiv · Credibility S
Abstract
High-Performance Computing (HPC) has recently entered the Exascale era, and considerable efforts are being made to fully harness this potential power for large-scale applications, such as cutting-edge generative AI (training and exploitation). The corresponding energy consumption is very high, and forecasts are alarming, making this metric a critical systemic bottleneck. Addressing this issue presents a genuine challenge for the entire cloud-edge-HPC continuum at all scales, from low-power IoT microcontrollers to multi-megawatt data centers. Beyond financial costs, green computing is driven by considerations related to climate change and environmental concerns such as carbon footprint (CO2e), as well as constraints on energy production and supply, leading to a real need to regulate information and communication technology (ICT) activities. This article presents a comprehensive overview of energy-efficient computing, taking into account the most recent and significant contributions. Based on this exploration of the state of the art, we design and describe a holistic taxonomy of the aforementioned publications, structured around various perspectives, including hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling. Particular emphasis is placed on large-scale AI, which receives significant attention due to its considerable resource requirements. We conclude with an analysis of a forward-looking roadmap that considers the main perspectives of sustainable computing.
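Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and AI operations, load forecasting, and facility tuning are becoming key variables in AI computing center design. Problem: energy consumption across the cloud-edge-HPC continuum, from IoT microcontrollers to multi-megawatt data centers, has become a systemic bottleneck. Method: per the abstract, this is a survey: the authors review recent work on energy-efficient computing and organize it into a taxonomy covering hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling. Results: the output is a systematic map of the field with particular emphasis on large-scale AI, plus a forward-looking roadmap, rather than new experimental results. Significance: for readers of this daily digest, it serves as an entry point for locating energy-efficiency techniques and measurement approaches relevant to operations. It still needs to be checked against the full text's coverage, sample scope, and cost assumptions.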
Reference
Roblex Nana Tchakoute, Claude Tadonki. Energy-Aware Computing in the Year 2026[J/OL]. (2026-05-23)[2026-06-01]. http://arxiv.org/abs/2605.24569v1.
Research Article · Chips and Compute
Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito
Published 2026-05-23 · arXiv · Credibility S
Abstract
The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct an in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.
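Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and chips, servers, and high-density compute deployment are becoming key variables in AI computing center design. Problem: large language model training now spans multiple data center buildings and regions, which enlarges the design space across model architectures, heterogeneous hardware, and communication patterns. Method: per the abstract, the authors characterize three design dimensions (parallelism placement, parallelism scheduling, and network layer technologies) and propose ScaleAcross Explorer, an optimizer evaluated with testbed experiments and simulations. Results: up to 64.62% training speedup over the production configuration and up to 37.59% over the state-of-the-art baseline. Significance: for readers of this daily digest, it helps judge how multi-building and multi-region training deployments translate into network and facility design requirements. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.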
Reference
Minghao Li, Alicia Golden, Samuel Hsia, et al. ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training[J/OL]. (2026-05-23)[2026-06-01]. http://arxiv.org/abs/2605.24326v1.
Research Article · Thermal Management and Liquid Cooling
Shrenik Jadhav, Zheng Liu
Published 2026-05-15 · arXiv · Credibility S
Abstract
Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow fraction allocation, and the per-timestep co-design optimization of total flow rate and supply temperature subject to per-subloop thermal safety constraints. The Modelica simulation model is built from data on the Frontier exascale supercomputer at Oak Ridge National Laboratory. By developing a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full-year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow control optimization to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18 percentage points above the current three-subloop Frontier design at 35.30%. Flow fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.
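The outer layer of such a framework searches over integer partitions of the CDUs. The sketch below enumerates the ways to split a number of identical units across at most a given number of subloops. It is a generic partition generator, not the paper's method: the feasibility rules that yield the 611 partitions are not stated in the abstract, so this code makes no attempt to reproduce that count, and the function name `partitions` is an assumption.

```python
def partitions(total, max_parts, max_size=None):
    """Yield non-increasing tuples of positive ints that sum to `total`.

    Each tuple has at most `max_parts` entries; an entry is the number of
    CDUs assigned to one subloop. Subloops are treated as interchangeable,
    so (3, 2) and (2, 3) count as the same design.
    """
    if max_size is None:
        max_size = total
    if total == 0:
        yield ()
        return
    if max_parts == 0:
        return
    for first in range(min(total, max_size), 0, -1):
        # Later parts may not exceed `first`, which removes reorderings.
        for rest in partitions(total - first, max_parts - 1, first):
            yield (first,) + rest


small = list(partitions(5, 2))
print(small)  # [(5,), (4, 1), (3, 2)]

# Candidate CDU splits for 25 units across at most three subloops,
# before any thermal or hydraulic feasibility filter is applied.
candidates = list(partitions(25, 3))
print(len(candidates))
```

In a full study each candidate would then be scored by the inner layers (flow fractions, total flow rate, supply temperature), which is where a fast surrogate model becomes useful.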
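Interpretation
Background: AI data center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data center energy efficiency are becoming key variables in AI computing center design. Problem: for exascale liquid-cooled plants, how to allocate coolant distribution units (CDUs) across parallel subloops and how to split flow among them has not been addressed systematically. Method: per the abstract, the authors build a Modelica model from Frontier data, derive a reduced-order surrogate, and run a three-layer optimization over the CDU partition, the flow fractions, and the per-timestep total flow rate and supply temperature. Results: the best design is a two-subloop plant with 35.48% annual cooling energy savings, 0.18 percentage points above the existing three-subloop layout at 35.30%; flow fraction optimization reduces design sensitivity by 93%, which points to a software-only route on existing hardware. Significance: for readers of this daily digest, it helps judge liquid cooling options, thermal management routes, and the pace of high-density deployment. It still needs to be checked against the full text's experimental conditions, sample scope, and cost assumptions.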
Reference
Shrenik Jadhav, Zheng Liu. Co-Design Optimization for Data Center Cooling System via Digital Twin[J/OL]. (2026-05-15)[2026-06-01]. http://arxiv.org/abs/2605.15516v1.