AI Computing Center Paper Watch | 2026-07-03

Current Issue

Volume 2026 · Issue 07-03

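Papers in this issue are organized in journal volume-issue-page style. Each entry uses only the traceable public sources already listed in the daily report and adds no unverified facts.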

Research Article · Thermal Management and Liquid Cooling

Toward Next-Generation AI Data Centers: Power Delivery Architecture Shifts, Emerging Technologies, and Challenges

Sangwhee Lee, Rafal P. Wojda, Cheol-Hee Jo, Shuntaro Inoue, Pedro Ribeiro, Gui-Jia Su, Mostak Mohammad, Himel Barua

Published 2026-06-24 · arXiv · Credibility S

Abstract

The rapid growth of AI workloads is driving unprecedented increases in data center power demand, current transients, and thermal stress, exposing fundamental limitations in traditional 48 V rack architectures, low-voltage AC distribution, and line-frequency transformer interfaces. This paper reviews the three stages of architectural shifts required to support next-generation AI data centers and identifies three enabling technological building blocks: high-voltage conversion-ratio DC/DC converters, facility-level low-voltage DC distribution, and medium-voltage solid-state transformers. The advantages, technical challenges, and potential solutions associated with each building block are reviewed. Finally, future research directions and open challenges are discussed.

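Interpretation

Background: AI data center load, power density, and energy constraints are rising together; the abstract cites growing power demand, current transients, and thermal stress. Problem: the paper targets the limits of traditional 48 V rack architectures, low-voltage AC distribution, and line-frequency transformer interfaces. Method: per the abstract, this is a review that walks through three stages of architectural shift and three enabling building blocks: high-voltage conversion-ratio DC/DC converters, facility-level low-voltage DC distribution, and medium-voltage solid-state transformers. Result: the abstract reports no quantitative results; it summarizes the advantages, technical challenges, and candidate solutions for each building block and lists open research directions. Significance: for readers of the daily report, it is a reference for judging how power delivery architecture, alongside thermal management, shapes the pace of high-density deployment. Conclusions should still be checked against the scope and assumptions of the full text.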

Reference

Sangwhee Lee, Rafal P. Wojda, Cheol-Hee Jo, et al. Toward Next-Generation AI Data Centers: Power Delivery Architecture Shifts, Emerging Technologies, and Challenges[J/OL]. (2026-06-24)[2026-07-03]. http://arxiv.org/abs/2606.25095v1.

Research Article · Chips and Compute

Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2

Salvatore Cielo, Elmira Birang, Alexander Pöppl, Sajad Azizi, Plamen Dobrev, Margarita Egelhofer, Ivan Pribec, Gerald Mathias

Published 2026-06-22 · arXiv · Credibility S

Abstract

We present a systematic performance and energy-efficiency characterization of five flagship scientific workloads on SuperMUC-NG phase 2, the 28 PetaFLOPs system at the Leibniz Supercomputing Center (LRZ) equipped with Intel Xeon Platinum 8480+ and Intel Data Center GPU Max 1550 (Ponte Vecchio, PVC) accelerators. The selected codes span molecular dynamics (gromacs, lammps), astrophysics and cosmology (OpenGadget3, AthenaK), and finite-element PDE solvers from the dealii-X Center of Excellence. For each code we measure throughput and energy efficiency expressed as compute-elements per wall-clock second (or per Joule of consumed energy) on a single compute node, comparing CPU-only (SPR) against combined CPU+GPU (SPR+PVC) configurations where available. Energy measurements rely on lightweight code instrumentation with p3em, or the Energy Aware Runtime (EAR) present on the system. Our results show that GPU offload yields 4-12× higher throughput and up to 15× better energy efficiency compared to CPU-only execution, with lammps and AthenaK benefiting most. However, both throughput and energy gains are sensitive to problem granularity: insufficient work per GPU tile erodes the accelerator advantage, as clearly observed in AthenaK at small mesh-block sizes. The power-budget utilization is systematically lower for CPUs than it is for GPUs, indicating that even at peak useful-work rate, most applications running on CPUs leave a significant fraction of the node's thermal envelope unused.

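Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making chips, servers, and high-density compute deployment key variables in computing center design. Problem: the paper asks how much throughput and energy efficiency GPU offload actually delivers for real scientific codes on a single node. Method: per the abstract, the authors measure five flagship workloads on one SuperMUC-NG Phase 2 node, comparing CPU-only against CPU+GPU runs, with energy measured through p3em instrumentation or the Energy Aware Runtime (EAR). Result: GPU offload gives 4-12× higher throughput and up to 15× better energy efficiency, but the gains shrink when there is too little work per GPU tile; CPU runs also use a smaller share of the node power budget than GPU runs. Significance: for readers of the daily report, it helps in judging how chip choices and server density carry through to facility design. The figures are single-node results and should be checked against the experimental conditions and problem sizes in the full text.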
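The two metrics named in the abstract (compute-elements per wall-clock second and per Joule) can be written out directly. The sketch below is illustrative only: the `Run` class and all measurement values are hypothetical placeholders, not figures from the paper. It shows why an energy-efficiency gain can be smaller than the throughput gain when the accelerated run draws more power.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One single-node measurement. All values used below are hypothetical."""
    elements: float  # compute-elements processed (for example particle-steps)
    seconds: float   # wall-clock time of the run
    joules: float    # node energy consumed over the same interval

    @property
    def throughput(self) -> float:
        # compute-elements per wall-clock second
        return self.elements / self.seconds

    @property
    def efficiency(self) -> float:
        # compute-elements per Joule
        return self.elements / self.joules


def gains(cpu: Run, gpu: Run) -> tuple[float, float]:
    """Return (throughput gain, energy-efficiency gain) of gpu over cpu."""
    return gpu.throughput / cpu.throughput, gpu.efficiency / cpu.efficiency


# Hypothetical runs of the same problem: the GPU run is 5x faster
# but draws 1500 W on average instead of 600 W.
cpu = Run(elements=1.0e9, seconds=100.0, joules=60_000.0)
gpu = Run(elements=1.0e9, seconds=20.0, joules=30_000.0)
speedup, eff_gain = gains(cpu, gpu)
print(f"throughput gain {speedup:.1f}x, efficiency gain {eff_gain:.1f}x")
```

For equal work, the efficiency gain equals the throughput gain multiplied by the ratio of average CPU power to average GPU power, which is why the two figures in a study of this kind need not move together.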

Reference

Salvatore Cielo, Elmira Birang, Alexander Pöppl, et al. Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2[J/OL]. (2026-06-22)[2026-07-03]. http://arxiv.org/abs/2606.23265v1.

Research Article · Compute-Power Coordination

Modal Analysis of Spatial Load Correlation in AI Data Center-Dominated Power Systems

Chandan Chaudhary, Michael Murillo, Mohammed Ben-Idris, Joydeep Mitra, Dilip Pandit, Atri Bera

Published 2026-06-12 · arXiv · Credibility S

Abstract

Hyperscale AI data centers induce spatially and temporally correlated load fluctuations that violate classical independence assumptions and are not captured by time-averaged spectral methods. These correlations are episodic and non-stationary, so they demand analysis that resolves transient structure. This paper applies Dynamic Mode Decomposition (DMD) to the temporal evolution of pairwise inter-bus correlation coefficients and forms a low-dimensional state representation that enables modal analysis without a stationarity assumption. The recovered modes distinguish sustained coherence, decaying transients, and intensifying events, and their oscillation timescales map to underlying physical coupling mechanisms. The method is evaluated on an IEEE 39-bus Real-Time Digital Simulator (RTDS) testbed with three converter-interfaced AI data center loads driven by synthetic workload profiles. A global analysis attributes the dominant correlation energy to a slow thermal band, and a sliding-window analysis identifies brief intensification events in a small fraction of windows that align with stochastic workload coincidences. Cross-validation with RTDS voltage coherence confirms elevated coupling during these intervals. The proposed modal growth indicator provides an early-warning signal of correlation intensification, with a lead of about 4 s before pairwise coherence reaches its peak.

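Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making coordinated dispatch of compute load and grid-side resources a key variable in computing center design. Problem: correlated load fluctuations from hyperscale AI data centers break classical independence assumptions and are missed by time-averaged spectral methods. Method: per the abstract, the authors apply Dynamic Mode Decomposition (DMD) to the time evolution of pairwise inter-bus correlation coefficients and evaluate the approach on an IEEE 39-bus RTDS testbed with three converter-interfaced AI data center loads driven by synthetic workload profiles. Result: the dominant correlation energy sits in a slow thermal band, brief intensification events appear in a small fraction of sliding windows, and the modal growth indicator leads the coherence peak by about 4 s. Significance: for readers of the daily report, it helps in judging whether computing center build-out is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The results come from a simulated testbed with synthetic workloads and should be checked against the full text.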
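The input to the modal analysis is a time series of pairwise correlation coefficients between buses. The sketch below covers only that first step, building one state vector per sliding window, and does not implement DMD itself. The function names, bus labels, window length, and load samples are all hypothetical; the abstract does not give the authors' windowing parameters.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_states(loads, window, step=1):
    """Build one state vector per sliding window.

    loads: dict mapping bus name -> list of load samples.
    Returns (pairs, states): the bus pairs in a fixed order, and a list of
    vectors holding the correlation of each pair within each window.
    """
    buses = sorted(loads)
    pairs = list(combinations(buses, 2))
    n = min(len(v) for v in loads.values())
    states = []
    for start in range(0, n - window + 1, step):
        sl = slice(start, start + window)
        states.append([pearson(loads[a][sl], loads[b][sl]) for a, b in pairs])
    return pairs, states


# Hypothetical load samples for three data center buses.
loads = {
    "busA": [1, 2, 3, 4, 5, 6],
    "busB": [2, 4, 6, 8, 10, 12],  # moves with busA
    "busC": [6, 5, 4, 3, 2, 1],    # moves against busA
}
pairs, states = correlation_states(loads, window=4)
for s in states:
    print([round(v, 3) for v in s])
```

A DMD step would then fit a linear operator that maps each state vector to the next one and inspect its eigenvalues for growing, decaying, or oscillating modes.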

Reference

Chandan Chaudhary, Michael Murillo, Mohammed Ben-Idris, et al. Modal Analysis of Spatial Load Correlation in AI Data Center-Dominated Power Systems[J/OL]. (2026-06-12)[2026-07-03]. http://arxiv.org/abs/2606.13847v2.

Research Article · Waste Heat Recovery

Data Center Life Cycle Co-Design Optimization

Shrenik Jadhav, Vidhyashree Nagaraju, Zheng Liu

Published 2026-06-14 · arXiv · Credibility S

Abstract

Liquid cooled supercomputers dissipate tens of megawatts of waste heat through cooling plants organized as parallel subloops that serve coolant distribution units. The number of subloops and the assignment of units to them are design decisions fixed at construction, yet they have not been systematically optimized for facilities at this scale. As electricity grids decarbonize, embodied carbon becomes a larger share of facility life cycle emissions and the cost of an unnecessary subloop becomes harder to justify. We present a framework that integrates operational energy from a validated control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per subloop reliability model. The framework is applied to the Frontier supercomputer, evaluating all 611 ways of partitioning its 25 coolant distribution units into two through six subloops. The life cycle cost and carbon optimum is found at two subloops holding 14 and 11 units, achieving 3,320.7 tonnes of carbon dioxide equivalent and $3.99 million over a seven year horizon, a saving of 50.2 tonnes and $100,000 compared to the built four subloop configuration. The optimum remains on the Pareto front in all 15 scenarios of a one at a time sensitivity sweep. A semi-analytical decision rule generalizes the result, predicting four subloops for Aurora, two for El Capitan, and one for LUMI. When reliability is treated as a hard constraint set by operations policy, the four subloop Frontier deployment is consistent with the constrained optimum.

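Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and as grids decarbonize, embodied carbon takes a larger share of facility life cycle emissions. Problem: the number of cooling subloops and the assignment of coolant distribution units to them are fixed at construction, yet have not been systematically optimized at this scale. Method: per the abstract, the authors build a framework that combines operational energy from a control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per-subloop reliability model, and apply it to all 611 partitions of Frontier's 25 coolant distribution units into two through six subloops. Result: the life cycle cost and carbon optimum is two subloops holding 14 and 11 units, saving 50.2 tonnes of CO2 equivalent and $100,000 over seven years relative to the built four-subloop configuration; when reliability is treated as a hard constraint, the four-subloop deployment is consistent with the constrained optimum. Significance: for readers of the daily report, the paper addresses cooling plant topology and life cycle carbon rather than heat reuse itself, so it informs how cooling loop design trades cost, carbon, and downtime. Conclusions should still be checked against the experimental conditions, sample scope, and cost assumptions in the full text.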
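One reading that reproduces the abstract's count of 611 is that the coolant distribution units are treated as interchangeable, so a configuration is defined only by the subloop sizes. That interpretation is an assumption and is not stated in the abstract. Under it, the count is the number of integer partitions of 25 into two through six parts, which the sketch below computes with the standard recurrence.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def partitions_into_parts(n: int, k: int) -> int:
    """Number of ways to write n as a sum of exactly k positive integers,
    ignoring order (integer partitions of n into k parts)."""
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    # Either the smallest part is 1 (remove it), or every part is at
    # least 2 (subtract 1 from each of the k parts).
    return partitions_into_parts(n - 1, k - 1) + partitions_into_parts(n - k, k)


UNITS = 25  # coolant distribution units, from the abstract
counts = {k: partitions_into_parts(UNITS, k) for k in range(2, 7)}
total = sum(counts.values())
print(counts)  # {2: 12, 3: 52, 4: 120, 5: 192, 6: 235}
print(total)   # 611
```

The reported optimum of 14 and 11 units is one of the 12 two-subloop size splits in this enumeration.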

Reference

Shrenik Jadhav, Vidhyashree Nagaraju, Zheng Liu. Data Center Life Cycle Co-Design Optimization[J/OL]. (2026-06-14)[2026-07-03]. http://arxiv.org/abs/2606.15408v1.

Research Article · AI Operations Optimization

Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers

Zihan Yu, Xianling Zeng, Zhiming Xue, Yalun Qi, Sichen Zhao

Published 2026-06-19 · arXiv · Credibility S

Abstract

The rapid growth of large-scale AI workloads, particularly Large Language Model (LLM) training and inference, is fundamentally reshaping the operational dynamics of hyperscale data centers. Unlike traditional cloud workloads, AI-driven jobs exhibit bursty, high-intensity, and rapidly shifting resource demands, often leading to sudden capacity stress that cannot be effectively handled by reactive threshold-based mechanisms. In this paper, we propose a deployment-oriented, burst-aware early warning framework for proactive capacity stress prediction under AI workload surges. We formulate the problem as a high-recall forecasting task over multivariate telemetry windows, with the explicit goal of enabling operational intervention before system degradation occurs. The proposed framework integrates workload intensity, temporal variation, and system pressure signals, and employs a lightweight tree-based learning model to capture nonlinear interactions in highly imbalanced environments. To evaluate the system under realistic conditions, we introduce an AI workload surge injection methodology that simulates burst-driven demand patterns observed in large-scale AI systems. Our XGBoost-based model achieves an ROC AUC of 0.697 and an AP of 0.670, significantly outperforming baseline methods. Under deployment-oriented threshold selection, the framework achieves a Recall of 0.914, enabling the detection of the majority of stress-prone periods with acceptable false-alarm cost. Beyond predictive performance, we show how the proposed framework can be integrated into operational control loops to support proactive actions such as workload throttling and resource scaling. Our results highlight the practical value of high-recall, learning-based early warning systems in enabling resilient and adaptive data center operations in the era of AI-driven workloads.

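Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making AI operations, load forecasting, and facility tuning key variables in computing center design. Problem: bursty AI jobs cause sudden capacity stress that reactive threshold-based mechanisms handle poorly. Method: per the abstract, the authors frame stress prediction as a high-recall forecasting task over multivariate telemetry windows, train a lightweight tree-based (XGBoost) model on workload intensity, temporal variation, and system pressure signals, and evaluate it with injected AI workload surges. Result: the model reaches an ROC AUC of 0.697 and an AP of 0.670, and a Recall of 0.914 under deployment-oriented threshold selection. Significance: for readers of the daily report, it helps in judging whether AI tooling can reduce operational complexity and improve availability. The surges are injected rather than observed and the abstract does not quantify the false-alarm cost, so the results should be checked against the experimental conditions and sample scope in the full text.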
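The abstract reports Recall under "deployment-oriented threshold selection" without stating the selection rule. One common approach, sketched below as an assumption rather than the authors' method, is to pick the highest score threshold that still meets a target recall, then read off the precision that comes with it. The function name, scores, and labels are hypothetical, and no model is trained here.

```python
def pick_threshold(scores, labels, target_recall):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall is at least target_recall.

    scores: predicted stress probabilities; labels: 1 = stress, 0 = normal.
    A window is flagged when its score is >= threshold.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / positives
        if recall >= target_recall:
            return t, recall, tp / (tp + fp)
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical model scores and ground-truth labels for eight telemetry windows.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
threshold, recall, precision = pick_threshold(scores, labels, target_recall=0.75)
print(threshold, recall, precision)  # 0.6 0.75 0.75
```

Lowering the threshold further would raise recall at the cost of more false alarms, which is the trade-off the abstract refers to as false-alarm cost.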

Reference

Zihan Yu, Xianling Zeng, Zhiming Xue, et al. Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers[J/OL]. (2026-06-19)[2026-07-03]. http://arxiv.org/abs/2606.21130v1.

Research Article · Chips and Compute

Space-CIM: Enabling Compute-In-Memory Accelerators for Thermally-Constrained Space Platforms

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Cheng Wang

Published 2026-06-04 · arXiv · Credibility S

Abstract

The rapid growth in compute demand from artificial intelligence (AI) has driven a massive surge in data center construction, precipitating an energy and sustainability crisis. Motivated by the abundant solar energy in outer space and the recent sharp reduction in space launch costs, orbital data centers are emerging as a potential pathway for the future scaling of AI compute infrastructure. While the cold background in vacuum seems appealing for cooling, computing systems operating in space without convection ultimately rely on radiative cooling, requiring large-area radiators. Such limitations in thermal management pose a significant challenge for deploying the standard liquid/air-cooled computers in space. In this work, we investigate the impact of the thermal constraints in space on both graphics processing units (GPUs) with high-bandwidth memory (HBM) and the emerging compute-in-memory (CIM) accelerators. We develop a radiator-in-the-loop co-design methodology that directly links the permitted system TOPS (tera-operations per second) with the practical radiator cooling capacity in space. Our thermal simulations reveal that the separately located GPU die and HBMs create severe thermal hotspots under limited radiator capacity, necessitating GPU thermal throttling. In contrast, CIM accelerators exhibit a much more uniform heat distribution and consistently outperform GPUs in TOPS/W across a wide range of radiator budgets. We systematically evaluated the performance of CIM and GPU across various AI workloads and demonstrated that CIM has a magnified advantage for deployment in space under realistic thermal constraints.

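Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and orbital data centers are being discussed as one possible path for scaling AI compute. Problem: without convection, computers in orbit depend on radiative cooling through large-area radiators, which constrains standard liquid- or air-cooled hardware. Method: per the abstract, the authors develop a radiator-in-the-loop co-design methodology that links permitted system TOPS to radiator cooling capacity, and run thermal simulations of GPUs with HBM and of compute-in-memory (CIM) accelerators. Result: separately located GPU die and HBM stacks create hotspots that force throttling under limited radiator capacity, while CIM accelerators show more uniform heat distribution and higher TOPS/W across a wide range of radiator budgets. Significance: for readers of the daily report, it illustrates how a hard thermal budget can change which chip architecture is preferable. The setting is orbital rather than terrestrial, so any transfer to facility design should be made with care and checked against the simulation conditions and workload scope in the full text.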
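The link between radiator capacity and permitted compute can be illustrated with the Stefan-Boltzmann law. The sketch below is a first-order estimate under loud assumptions, not the paper's co-design model: a single-sided flat radiator at uniform temperature, no solar or Earth heat load, and all electrical power ending up as heat to be radiated. The function names and every numeric input (area, temperature, emissivity, sink temperature, TOPS/W) are hypothetical.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def radiator_capacity_w(area_m2, temp_k, emissivity=0.9, sink_temp_k=3.0):
    """Net power a radiator can reject by radiation, in watts.

    P = emissivity * SIGMA * area * (T^4 - T_sink^4).
    sink_temp_k is an idealized deep-space value; effective sink temperatures
    in real orbits are warmer because of the Sun and Earth.
    """
    return emissivity * SIGMA * area_m2 * (temp_k ** 4 - sink_temp_k ** 4)


def permitted_tops(area_m2, temp_k, tops_per_watt, **radiator_kwargs):
    """Sustained TOPS if chip power is limited to what the radiator rejects."""
    return tops_per_watt * radiator_capacity_w(area_m2, temp_k, **radiator_kwargs)


# Hypothetical case: 10 m^2 radiator held at 300 K.
capacity = radiator_capacity_w(area_m2=10.0, temp_k=300.0)
print(f"radiator capacity: {capacity:.0f} W")
for name, eff in [("accelerator A", 1.0), ("accelerator B", 2.0)]:
    tops = permitted_tops(10.0, 300.0, tops_per_watt=eff)
    print(f"{name}: {tops:.0f} TOPS")
```

Under a fixed radiator budget, permitted throughput in this estimate scales linearly with TOPS/W; it does not capture the hotspot and throttling effects that the abstract attributes to separately located GPU die and HBM.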

Reference

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Cheng Wang. Space-CIM: Enabling Compute-In-Memory Accelerators for Thermally-Constrained Space Platforms[J/OL]. (2026-06-04)[2026-07-03]. http://arxiv.org/abs/2606.05741v1.

Research Article · Compute-Power Coordination

AI Data Centers and Power System Sustainability: Understanding the Sustainability Implications of AI-Driven Data Centers on Power Systems

Yuhao Huang, Novarun Deb, Hamidreza Zareipour

Published 2026-06-19 · arXiv · Credibility S

Abstract

The rapid expansion of artificial intelligence (AI) has driven unprecedented growth in data center electricity demand. The scale and pace of this load growth carry significant implications for the sustainability of electric power systems. On the one hand, rapid, spatially concentrated data center load growth is outpacing clean energy deployment in several major regions, raising emissions and challenging both grid flexibility and reliability. On the other hand, this fast-developing and capital-intensive sector offers abundant opportunities to advance sustainability through clean energy integration and operational innovations. This article provides an overview of the mechanisms through which data centers affect power system sustainability, underscoring both the risks and the potential. Specifically, this article (i) characterizes AI data center load behavior, categorizes electricity supply configurations by function and sustainability profile, and situates these loads within global and regional electricity demand trends; (ii) analyzes sustainability impacts across short-run operational and long-run planning mechanisms, and evaluates effects on grid carbon emissions and renewable energy utilization as well as the feasibility of offering system flexibility and participating in ancillary services; and (iii) evaluates real-world corporate sustainability pathways, highlighting both the system benefits and feasibility limits of current carbon accounting practices. The goal of this work is to synthesize existing knowledge and technological developments and to guide research and development toward a more sustainable integration of AI data centers and electric power systems.

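Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making coordinated dispatch of compute load and grid-side resources a key variable in computing center design. Problem: spatially concentrated data center load growth is outpacing clean energy deployment in several major regions, raising emissions and straining grid flexibility and reliability. Method: per the abstract, this is an overview article that characterizes AI data center load behavior and electricity supply configurations, analyzes short-run operational and long-run planning impacts, and reviews corporate sustainability pathways and carbon accounting practices. Result: the abstract reports no quantitative findings; it synthesizes both the risks and the opportunities, including clean energy integration, system flexibility, and ancillary service participation. Significance: for readers of the daily report, it helps in judging whether computing center build-out is constrained by grid capacity, load fluctuation, and dispatch mechanisms. Conclusions should still be checked against the regional scope and assumptions of the full text.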

Reference

Yuhao Huang, Novarun Deb, Hamidreza Zareipour. AI Data Centers and Power System Sustainability: Understanding the Sustainability Implications of AI-Driven Data Centers on Power Systems[J/OL]. (2026-06-19)[2026-07-03]. http://arxiv.org/abs/2606.21064v1.

Research Article · Compute-Power Coordination

Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute

Chris Williams, Philip Colangelo, Ayse Coskun, Ethan Levine, Andy Neale, Ciaran Roberts, Shayan Sengupta, Nikhil Shirolkar

Published 2026-06-24 · arXiv · Credibility S

Abstract

The rapid expansion of artificial intelligence (AI) infrastructure is driving unprecedented growth in electricity demand from data centers. Traditional power-system planning treats large computing facilities as inflexible peak loads, leading to costly infrastructure upgrades and long delays in grid interconnection. Recent work has shown that AI clusters can reduce electricity consumption during peak demand through software-based workload orchestration. This article explores how modern GPU-based AI data centers can operate as grid-interactive assets that respond dynamically to power system conditions. We describe an architecture integrating grid signals, workload scheduling, and power telemetry for fine-grained cluster power control. Experimental results from a real-world deployment on a 130 kW GPU cluster demonstrate multiple forms of flexibility, including rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs. We further demonstrate performance-aware load shifting across geographically distributed clusters, enabling workloads to migrate toward regions with lower grid stress. Together, these capabilities transform AI infrastructure from static electricity consumers into flexible resources that support grid reliability, accelerate interconnection, and improve computing sustainability.

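Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making coordinated dispatch of compute load and grid-side resources a key variable in computing center design. Problem: power-system planning treats large computing facilities as inflexible peak loads, which leads to costly infrastructure upgrades and long interconnection delays. Method: per the abstract, the authors describe an architecture that integrates grid signals, workload scheduling, and power telemetry for fine-grained cluster power control, and test it in a real-world deployment on a 130 kW GPU cluster. Result: the deployment demonstrates rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs, along with performance-aware load shifting across geographically distributed clusters. Significance: for readers of the daily report, it helps in judging whether computing center build-out is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The abstract gives no curtailment magnitudes or durations, so the experimental conditions and sample scope in the full text should be checked.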

Reference

Chris Williams, Philip Colangelo, Ayse Coskun, et al. Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute[J/OL]. (2026-06-24)[2026-07-03]. http://arxiv.org/abs/2606.25098v1.
