Research Article · Compute-Power Coordination
Denisa-Andreea Constantinescu, David Atienza
Published 2026-05-26 · arXiv · Credibility S
Abstract
At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar generation. For multi-megawatt AI/HPC facilities, the key unresolved question is practical and measurable: how quickly can the software stack translate a grid request into a real change in GPU power at the facility meter, where commitments are settled? We answer this on real hardware with GridPilot, a three-tier predictive controller operating across milliseconds, seconds, and hours, augmented by a deterministic safety-island bypass for fast response. On a three-GPU NVIDIA V100 testbed, GridPilot achieves a measured end-to-end trigger-to-target response of 97.2 ms, which is 6.9x faster than the 700 ms requirement of Nordic Fast Frequency Reserve. We further incorporate an instantaneous Power Usage Effectiveness (PUE) correction so dispatched commitments remain robust at meter level rather than only at IT load level. In replay experiments across six representative European grids (from Sweden to Poland), the PUE-aware controller closes 2.5-5.8 percentage points of cooling-overhead drag. GridPilot is released as open source and serves as a proof of concept that MW-scale AI/HPC demand can be engineered as controllable, grid-responsive flexibility by design.
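The PUE correction mentioned in the abstract can be illustrated with a small calculation. The sketch below is not GridPilot's code. It assumes the simplest possible relation, meter power = PUE × IT power, and the function names, the PUE value, and the power figures are all hypothetical. It shows why a reduction requested at the meter maps to a smaller IT-side change once cooling overhead is counted, and it restates the latency headroom implied by the two figures quoted in the abstract.

```python
def it_setpoint_for_meter_target(meter_target_kw: float, pue: float) -> float:
    """IT-level power (kW) that yields the requested meter-level power.

    Assumes meter_power = pue * it_power with pue >= 1 (a simplifying
    assumption for illustration, not the paper's model).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return meter_target_kw / pue


def latency_headroom_ms(measured_ms: float, requirement_ms: float) -> float:
    """Milliseconds left between the measured response and the requirement."""
    return requirement_ms - measured_ms


# Hypothetical facility: 1200 kW at the meter, instantaneous PUE of 1.25.
it_now = it_setpoint_for_meter_target(1200.0, 1.25)

# The grid asks for a 100 kW reduction, measured at the meter.
it_after = it_setpoint_for_meter_target(1100.0, 1.25)

# Change the controller must produce at IT level to deliver 100 kW at the meter.
it_delta = it_now - it_after

# Figures quoted in the abstract: 97.2 ms measured, 700 ms required.
headroom = latency_headroom_ms(97.2, 700.0)
```

Under these assumed numbers the IT load moves from 960 kW to 880 kW, so an 80 kW IT-side change delivers the 100 kW committed at the meter.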
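Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and coordinated dispatch of compute load and grid-side resources is becoming a key variable in AI computing-center design. Problem: the paper asks how quickly the software stack can turn a grid request into a real change in GPU power at the facility meter. Method: per the abstract, the authors measure a three-tier predictive controller with a PUE correction on real hardware (a three-GPU NVIDIA V100 testbed) and run replay experiments across six European grids. Result: the reported trigger-to-target response is 97.2 ms against the 700 ms Nordic Fast Frequency Reserve requirement, and the PUE-aware controller closes 2.5-5.8 percentage points of cooling-overhead drag. Significance: for readers of this daily digest, it helps judge whether AI computing-center build-out is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.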
Reference
Denisa-Andreea Constantinescu, David Atienza. GridPilot: Real-Time Grid-Responsive Control for AI Supercomputers[J/OL]. (2026-05-26)[2026-06-04]. http://arxiv.org/abs/2605.26384v1.
Research Article · AI Operations Optimization
Roblex Nana Tchakoute, Claude Tadonki
Published 2026-05-23 · arXiv · Credibility S
Abstract
High-Performance Computing (HPC) has recently entered the Exascale era, and considerable efforts are being made to fully harness this potential power for large-scale applications, such as cutting-edge generative AI (training and exploitation). The corresponding energy consumption is very high, and forecasts are alarming, making this metric a critical systemic bottleneck. Addressing this issue presents a genuine challenge for the entire cloud-edge-HPC continuum at all scales, from low-power IoT microcontrollers to multi-megawatt data centers. Beyond financial costs, green computing is driven by considerations related to climate change and environmental concerns such as carbon footprint ($CO_2e$), as well as constraints on energy production and supply, leading to a real need to regulate information and communication technology (ICT) activities. This article presents a comprehensive overview of energy-efficient computing, taking into account the most recent and significant contributions. Based on this exploration of the state of the art, we design and describe a holistic taxonomy of the aforementioned publications, structured around various perspectives, including hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling. Particular emphasis is placed on large-scale AI, which receives significant attention due to its considerable resource requirements. We conclude with an analysis of a forward-looking roadmap that considers the main perspectives of sustainable computing.
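Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and AI operations, load forecasting, and facility tuning are becoming key variables in AI computing-center design. Problem: the paper treats the energy consumption of exascale HPC and large-scale AI as a systemic bottleneck across the cloud-edge-HPC continuum. Method: per the abstract, this is a survey that organizes recent work on energy-efficient computing into a taxonomy rather than a new model or algorithm. Result: the taxonomy covers hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling, followed by a forward-looking roadmap; the abstract reports no quantitative results. Significance: for readers of this daily digest, it helps judge whether AI tools can reduce operational complexity and improve availability. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.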
Reference
Roblex Nana Tchakoute, Claude Tadonki. Energy-Aware Computing in the Year 2026[J/OL]. (2026-05-23)[2026-06-04]. http://arxiv.org/abs/2605.24569v1.
Research Article · Chips and Compute
Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito
Published 2026-05-23 · arXiv · Credibility S
Abstract
The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to such paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.
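Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and chips, servers, and high-density compute deployment are becoming key variables in AI computing-center design. Problem: the paper addresses the growing design space that arises when large language model training is distributed across multiple data-center buildings and regions. Method: per the abstract, the authors characterize three design dimensions (parallelism placement, parallelism scheduling, and network layer technologies), propose the ScaleAcross Explorer optimizer, and evaluate it with testbed experiments and simulations. Result: the reported training speedups reach up to 64.62% over the production configuration and up to 37.59% over the state-of-the-art baseline. Significance: for readers of this daily digest, it helps judge how changes in chip roadmaps and server density propagate to facility design. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.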
Reference
Minghao Li, Alicia Golden, Samuel Hsia, et al. ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training[J/OL]. (2026-05-23)[2026-06-04]. http://arxiv.org/abs/2605.24326v1.
Research Article · Thermal Management and Liquid Cooling
Shrenik Jadhav, Zheng Liu
Published 2026-05-15 · arXiv · Credibility S
Abstract
Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow fraction allocation, and the per-timestep co-design optimization of total flow rate and supply temperature subject to per-subloop thermal safety constraints. The Modelica simulation model is built based on the data of Frontier exascale supercomputer at Oak Ridge National Laboratory. By developing a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow control optimization to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18% above the current three-subloop Frontier design at 35.30%. Flow fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.
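The "integer partition of CDUs across subloops" can be made concrete with a short enumeration. The sketch below lists the ways to split a number of identical CDUs into an unordered set of subloop sizes. The limits `max_subloops` and `min_per_subloop` are hypothetical parameters: the abstract does not state which feasibility rules yield its 611 partitions, so this code makes no attempt to reproduce that count.

```python
from typing import Iterator, Optional, Tuple


def cdu_partitions(total: int, max_subloops: int, min_per_subloop: int = 1,
                   largest: Optional[int] = None) -> Iterator[Tuple[int, ...]]:
    """Yield non-increasing tuples of subloop sizes that sum to `total`.

    Each tuple is one way to assign `total` identical CDUs to at most
    `max_subloops` subloops, with at least `min_per_subloop` CDUs in each.
    """
    if largest is None:
        largest = total
    if total == 0:
        yield ()
        return
    if max_subloops == 0:
        return
    # Pick the biggest subloop first, then split the remainder into
    # subloops that are no bigger, which avoids counting reorderings twice.
    for first in range(min(total, largest), min_per_subloop - 1, -1):
        for rest in cdu_partitions(total - first, max_subloops - 1,
                                   min_per_subloop, first):
            yield (first,) + rest


# Example: every split of 5 CDUs into at most 3 subloops.
small_case = list(cdu_partitions(5, 3))

# Example: number of splits of 25 CDUs into at most 3 subloops.
count_25_into_3 = sum(1 for _ in cdu_partitions(25, 3))
```

Each tuple would then be scored by a plant model (in the paper, a reduced-order surrogate of the Modelica simulation); that scoring step is outside the scope of this sketch.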
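Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data-center energy efficiency are becoming key variables in AI computing-center design. Problem: the paper asks how to allocate coolant distribution units (CDUs) across parallel subloops and how to distribute flow among them in an exascale liquid-cooled plant. Method: per the abstract, the authors build a Modelica model from Frontier data, develop a reduced-order surrogate, and apply a three-layer optimization that evaluates all 611 feasible partitions of 25 CDUs over 49,353 timesteps. Result: the optimal two-subloop design reaches 35.48% annual cooling energy savings against 35.30% for the current three-subloop design, and flow fraction optimization reduces design sensitivity by 93%. Significance: for readers of this daily digest, it helps judge liquid-cooling options, thermal-management routes, and the pace of high-density deployment. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.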
Reference
Shrenik Jadhav, Zheng Liu. Co-Design Optimization for Data Center Cooling System via Digital Twin[J/OL]. (2026-05-15)[2026-06-04]. http://arxiv.org/abs/2605.15516v1.
Research Article · Compute-Power Coordination
Xin Lu, Jing Qiu, Jiafeng Lin, Sihai An, Mingyang Sun, Junhua Zhao
Published 2026-05-14 · arXiv · Credibility S
Abstract
Emerging connect-and-manage practices allow new transmission-connected mega-loads to connect while enforcing time-varying admissible power exchange limits at the point of common coupling (PCC) in real time. Hyperscale artificial intelligence data centers (AIDCs), whose demand can reach hundreds of megawatts and whose internal computing-cooling dynamics evolve rapidly, can therefore face frequent conflicts between workload continuity requirements and externally imposed PCC envelopes. This paper proposes a battery-assisted operational framework in which on-site battery energy storage (BESS) serves as a physical buffering interface to reconcile fast internal dynamics with time-varying interconnection limits. A continuity-aware energy-computation model is developed to jointly capture checkpoint-constrained AI training workloads, information technology (IT) computing power-throughput characteristics, and IT-cooling thermal dynamics. A two-stage decision framework is then formulated, consisting of scenario-based day-ahead workload commitment and a real-time receding-horizon delivery assurance controller that enforces battery, thermal, and grid-interaction constraints. Case studies on the IEEE 39-bus system with Australian real data demonstrate that BESS substantially increases credible day-ahead workload commitment and improves real-time delivery robustness under transmission congestion. Sensitivity analyses further reveal a regime-dependent role transition of BESS -- from feasibility-oriented continuity support when PCC limits are binding to economy-driven flexibility provision as transmission constraints are relaxed.
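The buffering role the abstract assigns to BESS can be illustrated with a deliberately simple greedy rule. This is not the paper's two-stage framework, which combines scenario-based day-ahead commitment with a receding-horizon controller. It is a minimal sketch in which the battery discharges whenever facility demand exceeds the PCC limit and recharges from spare headroom. All names, the one-hour step, lossless charging, and the numbers are assumptions.

```python
from typing import List, Tuple


def buffer_with_bess(demand_mw: List[float], pcc_limit_mw: List[float],
                     capacity_mwh: float, max_power_mw: float,
                     soc_mwh: float, dt_h: float = 1.0
                     ) -> Tuple[List[float], List[float], List[float]]:
    """Return (grid draw, unserved demand, state of charge) per time step."""
    grid, unserved, soc_trace = [], [], []
    for load, limit in zip(demand_mw, pcc_limit_mw):
        if load > limit:
            # Discharge to cover the gap, bounded by rating and stored energy.
            discharge = min(load - limit, max_power_mw, soc_mwh / dt_h)
            soc_mwh -= discharge * dt_h
            grid.append(limit)
            unserved.append(load - limit - discharge)
        else:
            # Recharge from spare headroom, bounded by rating and free capacity.
            charge = min(limit - load, max_power_mw,
                         (capacity_mwh - soc_mwh) / dt_h)
            soc_mwh += charge * dt_h
            grid.append(load + charge)
            unserved.append(0.0)
        soc_trace.append(soc_mwh)
    return grid, unserved, soc_trace


# Hypothetical four-hour window with a flat 100 MW PCC limit.
grid, unserved, soc = buffer_with_bess(
    demand_mw=[80.0, 120.0, 130.0, 90.0],
    pcc_limit_mw=[100.0, 100.0, 100.0, 100.0],
    capacity_mwh=40.0, max_power_mw=25.0, soc_mwh=20.0)
```

In this toy run the grid draw never exceeds the limit, and the only unserved demand appears in the third hour, when the battery is empty before the gap is covered. In the paper's setting, that shortfall is what workload commitment and checkpoint-aware scheduling would have to absorb.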
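Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and coordinated dispatch of compute load and grid-side resources is becoming a key variable in AI computing-center design. Problem: under connect-and-manage interconnection, hyperscale AI data centers face conflicts between workload continuity and time-varying power limits at the point of common coupling (PCC). Method: per the abstract, the authors build a continuity-aware energy-computation model and a two-stage framework (scenario-based day-ahead workload commitment plus a real-time receding-horizon controller), with case studies on the IEEE 39-bus system using Australian data. Result: on-site battery storage substantially increases credible day-ahead workload commitment and improves real-time delivery robustness under congestion, and its role shifts from continuity support to economy-driven flexibility as transmission constraints relax. Significance: for readers of this daily digest, it helps judge whether AI computing-center build-out is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.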
Reference
Xin Lu, Jing Qiu, Jiafeng Lin, et al. Battery-Assisted Operation of Hyperscale AI Data Centers under Connect-and-Manage Interconnection Practices[J/OL]. (2026-05-14)[2026-06-04]. http://arxiv.org/abs/2605.14105v1.
Research Article · Thermal Management and Liquid Cooling
Minghao Sun, Zehui Chen, Jinbo Hou, Kezhi Wang, Xiaoli Chu
Published 2026-05-13 · arXiv · Credibility S
Abstract
The rapid growth of foundation model training and large-scale AI services has driven ground data centers toward unprecedented power densities, intensifying challenges in energy supply, cooling, and spatial scalability. Space Data Centers (SDCs) have emerged as a promising paradigm for hosting energy-intensive computing infrastructures in orbit, leveraging continuous solar energy and radiative cooling advantages. However, unlike ground facilities primarily constrained by power and site availability, SDCs are fundamentally limited by communication capability. The gap between petabit-scale internal data exchange in ground data centers and the gigabit-scale capacity of ground-space links forms a critical bottleneck. This article systematically analyzes communication constraints in SDC architectures and explores semantic communication as a key enabling paradigm. By transmitting compact, task-relevant semantic representations instead of raw data, uplink pressure can be substantially reduced. The feasibility of communication-efficient orbital AI infrastructures is demonstrated through the evaluation of a multi-layer heterogeneous SDC framework consisting of relay satellites and orbital computing nodes operating under coupled energy and thermal constraints. The article further outlines open research challenges toward scalable deployment.
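Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data-center energy efficiency are becoming key variables in AI computing-center design. Problem: the paper identifies communication as the limiting factor for Space Data Centers (SDCs), given the gap between petabit-scale internal data exchange on the ground and gigabit-scale ground-space links. Method: per the abstract, the authors analyze communication constraints in SDC architectures, explore semantic communication, and evaluate a multi-layer heterogeneous SDC framework of relay satellites and orbital computing nodes under coupled energy and thermal constraints. Result: the abstract states that transmitting compact semantic representations can substantially reduce uplink pressure and that feasibility is demonstrated, without giving figures. Significance: for readers of this daily digest, it helps judge whether orbital deployment is a credible response to the power, cooling, and space limits of ground facilities. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.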
Reference
Minghao Sun, Zehui Chen, Jinbo Hou, et al. Toward Communication-Efficient Space Data Centers: Bottlenecks, Architectures, and New Paradigms[J/OL]. (2026-05-13)[2026-06-04]. http://arxiv.org/abs/2605.12681v1.
Research Article · Energy-Efficiency Optimization
Xiang Liu, Shimiao Yuan, Zhenheng Tang, Peijie Dong, Kaiyong Zhao, Qiang Wang, Bo Li, Xiaowen Chu
Published 2026-05-12 · arXiv · Credibility S
Abstract
LLM inference is still evaluated mainly as a model or software problem: accuracy, latency, throughput, and hardware utilization. This is incomplete. At deployment scale, the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. We argue that the ML community should treat inference as energy-to-token production. We formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but we use price dispersion only as directional motivation, not as causal evidence of marginal cost. The core physical question is instead: under fixed quality and service targets, when does the binding constraint move from theoretical peak compute toward delivered power, cooling, and operational efficiency? Under this framing, system optimizations -- latent KV-cache compression, sparse or heavily compressed attention, quantization, routing, and difficulty-adaptive reasoning -- are not merely local engineering tricks. They are energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed $(q^{*},s^{*})$. We therefore call for inference papers and benchmarks to report Joules/token, active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency.
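One way to read the bound described in the abstract is as the minimum of two ceilings. The sketch below is an interpretation, not the paper's formula: it assumes token rate ≤ min(effective FLOP/s ÷ FLOPs per token, delivered IT power ÷ joules per token), with IT power taken as facility power ÷ PUE. All parameter names and numbers are hypothetical.

```python
from typing import Tuple


def token_rate_ceiling(effective_flops: float, flops_per_token: float,
                       facility_power_w: float, pue: float,
                       joules_per_token: float) -> Tuple[float, str]:
    """Return (tokens per second, name of the binding constraint).

    Units: FLOP/s divided by FLOP/token gives token/s, and
    W (J/s) divided by J/token gives token/s, so both bounds are comparable.
    """
    compute_bound = effective_flops / flops_per_token
    it_power_w = facility_power_w / pue      # power that reaches IT equipment
    energy_bound = it_power_w / joules_per_token
    if compute_bound <= energy_bound:
        return compute_bound, "compute"
    return energy_bound, "energy"


# Hypothetical deployment: compute could sustain 5000 tokens/s,
# but delivered power only supports 2500 tokens/s.
rate, binding = token_rate_ceiling(
    effective_flops=1e15, flops_per_token=2e11,
    facility_power_w=12_500.0, pue=1.25, joules_per_token=4.0)

# Halving joules/token (for example through quantization) lifts the
# energy ceiling to the compute ceiling in this toy case.
rate_after, binding_after = token_rate_ceiling(
    effective_flops=1e15, flops_per_token=2e11,
    facility_power_w=12_500.0, pue=1.25, joules_per_token=2.0)
```

Reporting which ceiling binds, as the second return value does here, corresponds to the "active binding constraint" the authors ask benchmarks to disclose.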
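Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and PUE/WUE, efficiency metrics, and operating-cost control are becoming key variables in AI computing-center design. Problem: the paper argues that evaluating LLM inference only by accuracy, latency, throughput, and hardware utilization ignores delivered power, cooling capacity, PUE, and utilization. Method: per the abstract, this is a position paper that formalizes a Token Production Function in which token rate is bounded by compute-per-token and energy-per-token ceilings. Result: the authors call for inference papers and benchmarks to report Joules/token, the active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency. Significance: for readers of this daily digest, it helps judge whether different efficiency metrics truly reflect energy savings and cost benefits. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.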
Reference
Xiang Liu, Shimiao Yuan, Zhenheng Tang, et al. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production[J/OL]. (2026-05-12)[2026-06-04]. http://arxiv.org/abs/2605.11733v1.
Research Article · Thermal Management and Liquid Cooling
Viktor Danchev, Alex Dyer, Sebastian Grau, Guillaume Vazeille
Published 2026-05-07 · arXiv · Credibility S
Abstract
The Standard Model of particle Physics has been validated to extraordinarily high precision by the Large Hadron Collider (LHC). Yet it leaves some of the most fundamental questions in Physics unresolved: the nature of dark matter, the hierarchy problem, and the unification of forces. Multiple next-generation terrestrial colliders have been proposed such as the Future Circular Collider (FCC) which will reach centre-of-mass energies of $\approx$100 TeV, yet the energy scales at which hints of Grand Unified Theories (GUTs) and string theory are expected to be observed ($10^{11}-10^{13}$ TeV) remain orders of magnitude beyond the reach of any terrestrial facility. We argue that the path to these energy frontiers inevitably leads to Space. By examining the fundamental scaling law for circular proton colliders, we establish that colliders of radius $10^3-10^5$ km are required to enter the PeV-EeV regime. In addition, Space-based colliders benefit from virtually free ultra-high vacuum ($< 10^{10}$ particles/m$^3$ above 1000 km altitude), passive cryogenic cooling, reduction of geological and political constraints, and perhaps most importantly -- the substantial reduction of the thermodynamic penalty that dominates terrestrial cryogenic power budgets. We survey existing proposals for beyond-Earth colliders, derive order-of-magnitude requirements for an orbital collider constellation, and assess feasibility against current and near-term spacecraft capabilities in formation flying, power generation, and precision attitude control. We conclude that recent developments in orbital infrastructure -- particularly gigawatt-scale orbital power architectures being developed for Space-based data centers -- are converging with the needs of a Space-based mega collider, making serious feasibility studies warranted and promising a more certain path towards the core questions of modern Physics.
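The abstract does not spell out its scaling law. A standard relation for a circular machine is the magnetic rigidity formula, p [GeV/c] ≈ 0.2998 · B [T] · ρ [m] for a singly charged particle, and the sketch below assumes that this is the relation meant. It treats ρ as the bending radius; the 8 T field used for the orbital cases is a hypothetical choice, not a figure from the paper.

```python
def beam_momentum_tev(b_field_t: float, bending_radius_km: float) -> float:
    """Beam momentum in TeV/c from p = 0.2998 * B [T] * rho [km].

    Expressing the radius in km instead of m turns the usual GeV/c
    result into TeV/c. Valid for a particle of unit charge.
    """
    return 0.299792458 * b_field_t * bending_radius_km


# Sanity check against the LHC: 8.33 T dipoles, 2.804 km bending radius.
lhc_beam_tev = beam_momentum_tev(8.33, 2.804)

# Orbital radii quoted in the abstract, with an assumed 8 T field.
beam_1e3_km_tev = beam_momentum_tev(8.0, 1e3)
beam_1e5_km_tev = beam_momentum_tev(8.0, 1e5)

# For two equal head-on beams the centre-of-mass energy is about twice
# the beam energy (ultra-relativistic limit).
cm_1e5_km_pev = 2.0 * beam_1e5_km_tev / 1000.0
```

With these assumptions the LHC check returns roughly 7 TeV per beam, a 10^3 km ring gives a beam of about 2.4 PeV, and a 10^5 km ring gives a centre-of-mass energy of several hundred PeV, which is consistent in order of magnitude with the PeV-EeV regime named in the abstract.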
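Interpretation
Background: AI data-center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data-center energy efficiency are becoming key variables in AI computing-center design. Problem: the paper argues that terrestrial colliders, including the proposed Future Circular Collider at about 100 TeV, cannot reach the energy scales at which hints of Grand Unified Theories are expected. Method: per the abstract, the authors examine the scaling law for circular proton colliders, survey existing beyond-Earth collider proposals, derive order-of-magnitude requirements for an orbital collider constellation, and assess feasibility against spacecraft capabilities. Result: colliders of radius $10^3-10^5$ km are found to be required to enter the PeV-EeV regime, and the authors conclude that serious feasibility studies are warranted. Significance: for readers of this daily digest, the link is indirect: the paper points to passive cryogenic cooling in orbit and to gigawatt-scale orbital power architectures being developed for Space-based data centers. The claims should still be checked against the full paper's experimental conditions, sample scope, and cost assumptions.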
Reference
Viktor Danchev, Alex Dyer, Sebastian Grau, et al. The Case for Space-Based Particle Colliders: Orbital Infrastructure as a Path to Grand Unification Energy Scales[J/OL]. (2026-05-07)[2026-06-04]. http://arxiv.org/abs/2605.08239v1.