AI Computing Center Paper Watch | 2026-06-29

Current Issue

Volume 2026 · Issue 06-29

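This issue's papers are organized in journal volume-and-issue style. Each entry draws only on traceable public sources already listed in the daily report; no unverified facts are added.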

Research Article · Compute-Power Coordination

Contextual Robust Optimization for AI Data Center Scheduling with Statistical Guarantees

Yijie Yang, Xi Weng, Yue Chen

Published 2026-06-16 · arXiv · Credibility S

Abstract

The rapid growth of AI workloads is substantially increasing data center electricity demand and carbon emissions, motivating the development of carbon-aware scheduling methods. However, effective scheduling is challenging because renewable generation and AI workloads are subject to forecast errors, while training and inference workloads exhibit heterogeneity in computational characteristics. This paper proposes a contextual robust optimization framework for AI data center operation. The proposed model explicitly captures the heterogeneous computational characteristics of AI training and inference workloads. To deal with renewable generation and workload forecast errors, we develop loss-based uncertainty learning models that directly map contextual features to covariate-dependent uncertainty sets. The resulting contextual joint chance-constrained scheduling problem is reformulated into a tractable robust optimization problem, and a calibration algorithm is developed to provide finite-sample probabilistic feasibility guarantees for multiple joint chance constraints. Numerical experiments based on real-world AI workload traces and renewable generation data show that the proposed method reduces operating costs by an average of 5.57% compared to benchmark methods while maintaining reliable feasibility and strong computational scalability.

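Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making the coordinated scheduling of compute workloads and grid-side resources a key variable in AI computing center design. Problem: carbon-aware scheduling is hard because renewable generation and AI workload forecasts carry errors, and training and inference workloads differ in their computational characteristics. Method: per the abstract, the authors build a contextual robust optimization model, learn covariate-dependent uncertainty sets from contextual features, reformulate the joint chance-constrained scheduling problem as a tractable robust problem, and calibrate it to obtain finite-sample feasibility guarantees. Results: on real-world AI workload traces and renewable generation data, the abstract reports an average 5.57% reduction in operating cost relative to benchmark methods, with reliable feasibility and strong computational scalability. Significance: for readers of the daily report, the paper helps judge whether AI computing center build-out is constrained by grid capacity, load volatility, and scheduling mechanisms. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.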
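The abstract mentions a calibration algorithm with finite-sample guarantees but does not describe it. As a rough illustration of the general idea only, the sketch below uses standard split conformal calibration for a single interval uncertainty set: it picks a radius from held-out forecast errors so that, assuming exchangeable data, a new error falls inside the interval with probability at least 1 - alpha. The function name, the error values, and the use of absolute forecast errors as scores are all assumptions, and this is not the paper's multi-constraint procedure.

```python
import math

def conformal_radius(scores, alpha):
    """Split-conformal radius for miscoverage level alpha.

    scores: nonconformity scores from a held-out calibration set
            (hypothetical choice here: absolute forecast errors).
    Coverage of at least 1 - alpha holds assuming exchangeable data.
    """
    n = len(scores)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

# Hypothetical absolute forecast errors, e.g. MW of renewable output.
calibration_errors = [0.8, 1.5, 0.3, 2.1, 1.1, 0.6, 1.9]
radius = conformal_radius(calibration_errors, alpha=0.25)

# Interval uncertainty set around a hypothetical point forecast.
forecast = 10.0
uncertainty_set = (forecast - radius, forecast + radius)
print(radius, uncertainty_set)
```

With seven calibration points and alpha = 0.25 the rule selects the sixth-smallest error, so a smaller alpha or fewer points pushes the radius toward the largest observed error, and eventually to infinity.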

References

Yijie Yang, Xi Weng, Yue Chen. Contextual Robust Optimization for AI Data Center Scheduling with Statistical Guarantees[J/OL]. (2026-06-16)[2026-06-29]. http://arxiv.org/abs/2606.17466v1.

Research Article · AI Operations Optimization

Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers

Zihan Yu, Xianling Zeng, Zhiming Xue, Yalun Qi, Sichen Zhao

Published 2026-06-19 · arXiv · Credibility S

Abstract

The rapid growth of large-scale AI workloads, particularly Large Language Model (LLM) training and inference, is fundamentally reshaping the operational dynamics of hyperscale data centers. Unlike traditional cloud workloads, AI-driven jobs exhibit bursty, high-intensity, and rapidly shifting resource demands, often leading to sudden capacity stress that cannot be effectively handled by reactive threshold-based mechanisms. In this paper, we propose a deployment-oriented, burst-aware early warning framework for proactive capacity stress prediction under AI workload surges. We formulate the problem as a high-recall forecasting task over multivariate telemetry windows, with the explicit goal of enabling operational intervention before system degradation occurs. The proposed framework integrates workload intensity, temporal variation, and system pressure signals, and employs a lightweight tree-based learning model to capture nonlinear interactions in highly imbalanced environments. To evaluate the system under realistic conditions, we introduce an AI workload surge injection methodology that simulates burst-driven demand patterns observed in large-scale AI systems. Our XGBoost-based model achieves an ROC AUC of 0.697 and an AP of 0.670, significantly outperforming baseline methods. Under deployment-oriented threshold selection, the framework achieves a Recall of 0.914, enabling the detection of the majority of stress-prone periods with acceptable false-alarm cost. Beyond predictive performance, we show how the proposed framework can be integrated into operational control loops to support proactive actions such as workload throttling and resource scaling. Our results highlight the practical value of high-recall, learning-based early warning systems in enabling resilient and adaptive data center operations in the era of AI-driven workloads.

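Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making AI operations, load forecasting, and facility tuning key variables in AI computing center design. Problem: AI jobs produce bursty, high-intensity, rapidly shifting resource demands, and the resulting capacity stress cannot be handled well by reactive threshold-based mechanisms. Method: per the abstract, the authors frame capacity stress prediction as a high-recall forecasting task over multivariate telemetry windows, combine workload intensity, temporal variation, and system pressure signals in a lightweight tree-based (XGBoost) model, and evaluate it with an AI workload surge injection methodology. Results: the abstract reports an ROC AUC of 0.697, an AP of 0.670, and a Recall of 0.914 under deployment-oriented threshold selection. Significance: for readers of the daily report, the paper helps judge whether AI tools can reduce operational complexity and improve availability. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.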
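The abstract reports a Recall of 0.914 "under deployment-oriented threshold selection" without saying how the threshold is chosen. One common way to do this, sketched here as an assumption rather than the authors' procedure, is to pick the highest score threshold that still reaches a target recall on validation data and then read off the false-alarm count. The function names, scores, and labels below are hypothetical.

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold such that flagging score >= threshold meets target recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive (stress) window")
    # Number of stress windows that must be flagged to reach the target.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]

def recall_and_false_alarms(scores, labels, threshold):
    """Recall on stress windows and number of false alarms at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / sum(labels), fp

# Hypothetical validation scores from a classifier; 1 = capacity stress followed.
scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

threshold = threshold_for_recall(scores, labels, target_recall=0.75)
recall, false_alarms = recall_and_false_alarms(scores, labels, threshold)
print(threshold, recall, false_alarms)
```

Raising the target recall lowers the threshold and usually adds false alarms, which is the trade-off the abstract refers to as "acceptable false-alarm cost".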

References

Zihan Yu, Xianling Zeng, Zhiming Xue, et al. Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers[J/OL]. (2026-06-19)[2026-06-29]. http://arxiv.org/abs/2606.21130v1.

Research Article · Waste Heat Recovery

Data Center Life Cycle Co-Design Optimization

Shrenik Jadhav, Vidhyashree Nagaraju, Zheng Liu

Published 2026-06-14 · arXiv · Credibility S

Abstract

Liquid cooled supercomputers dissipate tens of megawatts of waste heat through cooling plants organized as parallel subloops that serve coolant distribution units. The number of subloops and the assignment of units to them are design decisions fixed at construction, yet they have not been systematically optimized for facilities at this scale. As electricity grids decarbonize, embodied carbon becomes a larger share of facility life cycle emissions and the cost of an unnecessary subloop becomes harder to justify. We present a framework that integrates operational energy from a validated control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per subloop reliability model. The framework is applied to the Frontier supercomputer, evaluating all 611 ways of partitioning its 25 coolant distribution units into two through six subloops. The life cycle cost and carbon optimum is found at two subloops holding 14 and 11 units, achieving 3,320.7 tonnes of carbon dioxide equivalent and $3.99 million over a seven year horizon, a saving of 50.2 tonnes and $100,000 compared to the built four subloop configuration. The optimum remains on the Pareto front in all 15 scenarios of a one at a time sensitivity sweep. A semi-analytical decision rule generalizes the result, predicting four subloops for Aurora, two for El Capitan, and one for LUMI. When reliability is treated as a hard constraint set by operations policy, the four subloop Frontier deployment is consistent with the constrained optimum.

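Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and how liquid-cooled facilities handle tens of megawatts of waste heat is becoming a key variable in AI computing center design. Problem: the number of cooling subloops and the assignment of coolant distribution units to them are fixed at construction, yet they have not been systematically optimized at this scale. Method: per the abstract, the authors combine operational energy from a control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per-subloop reliability model, then evaluate all 611 partitions of Frontier's 25 units into two through six subloops. Results: the life cycle optimum is two subloops holding 14 and 11 units, at 3,320.7 tonnes of carbon dioxide equivalent and $3.99 million over seven years, a saving of 50.2 tonnes and $100,000 against the built four-subloop configuration; when reliability is treated as a hard constraint, the four-subloop deployment is consistent with the constrained optimum. Significance: for readers of the daily report, the paper offers a way to weigh cooling-plant topology against life cycle cost, embodied carbon, and reliability; the abstract does not address reuse of the waste heat itself. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.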
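The abstract does not explain how the 611 candidate configurations are counted. One reading that reproduces the figure, offered here as an assumption, is that the coolant distribution units are treated as interchangeable, so a configuration is identified only by its subloop sizes. Under that reading the candidates are the integer partitions of 25 into two through six positive parts, which the sketch below enumerates; the function name is hypothetical.

```python
def partitions_into_parts(total, parts, max_part=None):
    """Yield non-increasing tuples of `parts` positive integers summing to `total`."""
    if max_part is None:
        max_part = total
    if parts == 1:
        if 1 <= total <= max_part:
            yield (total,)
        return
    # The largest part must leave at least one unit for each remaining part.
    for first in range(min(max_part, total - (parts - 1)), 0, -1):
        for rest in partitions_into_parts(total - first, parts - 1, first):
            yield (first,) + rest

UNITS = 25  # coolant distribution units on Frontier, per the abstract

# All subloop-size configurations with two through six subloops.
candidates = [
    sizes
    for subloops in range(2, 7)
    for sizes in partitions_into_parts(UNITS, subloops)
]

print(len(candidates))        # 611
print((14, 11) in candidates) # True
```

The count agreeing with the abstract is a consistency check on this reading, not confirmation of how the authors built their search space. A search space this small can be evaluated exhaustively, which fits the abstract's statement that all configurations were evaluated.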

References

Shrenik Jadhav, Vidhyashree Nagaraju, Zheng Liu. Data Center Life Cycle Co-Design Optimization[J/OL]. (2026-06-14)[2026-06-29]. http://arxiv.org/abs/2606.15408v1.

Research Article · Chips and Compute

Space-CIM: Enabling Compute-In-Memory Accelerators for Thermally-Constrained Space Platforms

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Cheng Wang

Published 2026-06-04 · arXiv · Credibility S

Abstract

The rapid growth in compute demand from artificial intelligence (AI) has driven a massive surge in data center construction, precipitating an energy and sustainability crisis. Motivated by the abundant solar energy in outer space and the recent sharp reduction in space launch costs, orbital data centers are emerging as a potential pathway for the future scaling of AI compute infrastructure. While the cold background in vacuum seems appealing for cooling, computing systems operating in space without convection ultimately rely on radiative cooling, requiring large-area radiators. Such limitations in thermal management pose a significant challenge for deploying the standard liquid/air-cooled computers in space. In this work, we investigate the impact of the thermal constraints in space on both graphics processing units (GPUs) with high-bandwidth memory (HBM) and the emerging compute-in-memory (CIM) accelerators. We develop a radiator-in-the-loop co-design methodology that directly links the permitted system TOPS (tera-operations per second) with the practical radiator cooling capacity in space. Our thermal simulations reveal that the separately located GPU die and HBMs create severe thermal hotspots under limited radiator capacity, necessitating GPU thermal throttling. In contrast, CIM accelerators exhibit a much more uniform heat distribution and consistently outperform GPUs in TOPS/W across a wide range of radiator budgets. We systematically evaluated the performance of CIM and GPU across various AI workloads and demonstrated that CIM has a magnified advantage for deployment in space under realistic thermal constraints.

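Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making chips, servers, and high-density compute deployment key variables in AI computing center design. Problem: computing systems in orbit have no convection and must rely on radiative cooling with large-area radiators, which makes standard liquid-cooled or air-cooled designs hard to deploy. Method: per the abstract, the authors develop a radiator-in-the-loop co-design methodology that links permitted system TOPS to practical radiator cooling capacity, and run thermal simulations comparing GPUs with HBM against compute-in-memory (CIM) accelerators. Results: separately located GPU dies and HBM stacks create severe hotspots under limited radiator capacity and force thermal throttling, while CIM accelerators show more uniform heat distribution and higher TOPS/W across a wide range of radiator budgets. Significance: for readers of the daily report, the paper helps judge how chip architecture choices interact with hard thermal limits; the setting studied is orbital platforms rather than terrestrial machine rooms. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.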

References

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Cheng Wang. Space-CIM: Enabling Compute-In-Memory Accelerators for Thermally-Constrained Space Platforms[J/OL]. (2026-06-04)[2026-06-29]. http://arxiv.org/abs/2606.05741v1.

Research Article · Thermal Management and Liquid Cooling

Maximizing Compute Capacity in AI Data Centers through Cooling, Energy Storage, and Computing Adaptation

Shaolei Ren, Mohammad A. Islam, Adam Wierman

Published 2026-05-30 · arXiv · Credibility S

Abstract

The deployment of artificial intelligence is increasingly constrained by limited site-level power capacity, which must support both compute systems and non-compute systems (primarily cooling) at all times. Cooling power demand, especially in non-evaporative cooling systems, can increase substantially with ambient temperature in the summer, producing recurring periods of elevated cooling power that often last for multiple hours per day. Therefore, maximizing compute capacity under a limited site-level power budget is an important planning and operational challenge. Sizing the compute system conservatively based on peak cooling power can leave part of the site-level power capacity underutilized when the cooling power is below its peak, particularly in cooler months. On the other hand, sizing the compute system aggressively based on low cooling power can cause the total site-level power demand to exceed the site-level power capacity during hot days in the summer. This paper proposes ComputeAmp (Compute Amplifier), a framework that maximizes the compute capacity by jointly and dynamically leveraging cooling, battery energy storage, and computing-based adaptation. We discuss the opportunities and limitations of ComputeAmp and illustrate its potential to significantly expand usable compute capacity within local power and water resource limits. We also present a problem formulation for ComputeAmp and highlight a few algorithmic and operational challenges.

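Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making liquid cooling, thermal management, and data center energy efficiency key variables in AI computing center design. Problem: site-level power capacity must cover both compute and cooling at all times; sizing compute against peak summer cooling power leaves capacity idle in cooler months, while sizing against low cooling power risks exceeding site capacity on hot days. Method: per the abstract, the authors propose ComputeAmp, a framework that jointly and dynamically uses cooling, battery energy storage, and computing-based adaptation, and they present a problem formulation for it. Results: the abstract discusses opportunities and limitations and illustrates the potential to expand usable compute capacity within local power and water limits; it reports no quantitative results. Significance: for readers of the daily report, the paper helps judge liquid cooling options, thermal management routes, and the pace of high-density deployment. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.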
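The sizing trade-off described in the abstract can be made concrete with a small arithmetic sketch. The code below is not ComputeAmp's formulation, which the abstract does not give; it only contrasts conservative sizing against peak cooling power with a larger installed compute capacity whose excess demand in hot hours is covered first by battery discharge and then by throttling compute. All numbers and function names are hypothetical, and battery energy limits and recharging are ignored.

```python
def conservative_compute_kw(site_kw, cooling_kw):
    """Compute capacity that never exceeds the site limit, sized at peak cooling."""
    return site_kw - max(cooling_kw)

def plan_hour(site_kw, compute_kw, cooling_kw, battery_kw):
    """Split one hour's excess demand between battery discharge and throttling.

    Returns (kW drawn from the battery, kW of compute throttled).
    """
    excess = max(0, compute_kw + cooling_kw - site_kw)
    from_battery = min(excess, battery_kw)
    throttled = excess - from_battery
    return from_battery, throttled

# Hypothetical site: 1000 kW capacity, four sample hours of cooling demand.
SITE_KW = 1000
COOLING_KW = [200, 250, 400, 300]
BATTERY_KW = 100            # maximum discharge power (energy limit ignored)
INSTALLED_COMPUTE_KW = 750  # sized above the conservative value

baseline = conservative_compute_kw(SITE_KW, COOLING_KW)
plans = [
    plan_hour(SITE_KW, INSTALLED_COMPUTE_KW, cooling, BATTERY_KW)
    for cooling in COOLING_KW
]

print(baseline)  # 600
print(plans)     # [(0, 0), (0, 0), (100, 50), (50, 0)]
```

In this toy case the site hosts 750 kW of compute instead of 600 kW, at the cost of throttling 50 kW in the hottest hour. A realistic model would also track battery state of charge and any adjustment on the cooling side, which the abstract names as a third lever.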

References

Shaolei Ren, Mohammad A. Islam, Adam Wierman. Maximizing Compute Capacity in AI Data Centers through Cooling, Energy Storage, and Computing Adaptation[J/OL]. (2026-05-30)[2026-06-29]. http://arxiv.org/abs/2606.00457v1.

Research Article · Compute-Power Coordination

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

Published 2026-06-17 · arXiv · Credibility S

Abstract

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.

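Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making the coordinated scheduling of compute workloads and grid-side resources a key variable in AI computing center design. Problem: conventional data center energy management treats LLM inference demand as an aggregate load and so overlooks the flexibility offered by LLM-specific techniques such as model quantization. Method: per the abstract, the authors build a quantization-to-power model that maps each model-quantization configuration to dispatchable parameters, a two-stage demand response model covering model instance switching, request routing, and precision selection, and a multi-campus co-optimization method that incorporates grid-side electricity and carbon signals. Results: case studies show a 34.3% reduction in total data center operating cost without curtailing served token volume. Significance: for readers of the daily report, the paper helps judge whether AI computing center build-out is constrained by grid capacity, load volatility, and scheduling mechanisms. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.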

References

Bojun Du, Xiaoyi Fan, Ershun Du, et al. From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads[J/OL]. (2026-06-17)[2026-06-29]. http://arxiv.org/abs/2606.18851v1.

Research Article · Thermal Management and Liquid Cooling

Hosting Capacity Assessment and Enhancement for Edge Data Centers in Active Distribution Networks

Linhan Fang, Xingpeng Li

Published 2026-05-31 · arXiv · Credibility S

Abstract

With the increasing demand for edge computing and AI

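Interpretation

Background: AI data center load, power density, and energy constraints are rising together. Problem: judging from the title, the paper asks how much edge data center load an active distribution network can host and how that hosting capacity can be raised. Method and results: the abstract excerpt listed here breaks off after its opening words, so the approach and findings cannot be summarized from it. Significance: for readers of the daily report, the topic bears on whether edge data center siting is limited by distribution network capacity. Any conclusions need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.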

References

Linhan Fang, Xingpeng Li. Hosting Capacity Assessment and Enhancement for Edge Data Centers in Active Distribution Networks[J/OL]. (2026-05-31)[2026-06-29]. https://arxiv.org/abs/2606.01407.

Research Article · Compute-Power Coordination

Peer-to-Peer Cloud Service Market for Data Centers Oriented to Computation-Electricity Coordination

Yugui Liu, Yibo Ding, Xudong Li, Jing Qu, Wenyi Zhang, Tong Qian, Wuyou Xiao, Zhengyang Hu

Published 2026-06-03 · arXiv · Credibility S

Abstract

Energy-intensive data centers (DCs) have emerged as substantial and flexible loads in modern power systems, underscoring the critical need for computation-electricity coordination. Harnessing the spatio-temporal flexibility of DC workloads is a promising approach to facilitate this coordination. However, existing studies overlook the collaborative potential of computational resource sharing among geo-distributed DCs, thereby failing to fully unlock this flexibility. In this paper, a bi-level computation-electricity coordination framework is proposed to explicitly capture the bidirectional interactions between DCs and the power grid. Firstly, a peer-to-peer cloud service market (P2P-CSM) for geo-distributed DCs is proposed, which enables bilateral cloud service transactions to leverage regional heterogeneities (e.g., electricity prices, cooling efficiency). Secondly, locational marginal prices are embedded into the framework to reflect network congestion and nodal price disparities. Thirdly, a dual consensus alternating direction method of multipliers (ADMM)-based decentralized algorithm is developed as the P2P market clearing algorithm, and a bisection-assisted iterative algorithm is proposed to ensure rigorous convergence of the framework. Case studies conducted on a modified IEEE 30-bus system validate that the P2P-CSM achieves a win-win computation-electricity coordination: it not only increases total DC operational profit by 22.8%, but also effectively alleviates grid congestion and yields a 3.2% reduction in total energy consumption.

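Interpretation

Background: AI data center load, power density, and energy constraints are rising together, making the coordinated scheduling of compute workloads and grid-side resources a key variable in AI computing center design. Problem: existing studies overlook computational resource sharing among geo-distributed data centers and therefore leave part of their spatio-temporal flexibility unused. Method: per the abstract, the authors propose a bi-level computation-electricity coordination framework with a peer-to-peer cloud service market (P2P-CSM), embed locational marginal prices to reflect congestion, clear the market with a dual consensus ADMM-based decentralized algorithm, and use a bisection-assisted iterative algorithm for convergence. Results: on a modified IEEE 30-bus system, the P2P-CSM raises total data center operational profit by 22.8%, alleviates grid congestion, and cuts total energy consumption by 3.2%. Significance: for readers of the daily report, the paper helps judge whether AI computing center build-out is constrained by grid capacity, load volatility, and scheduling mechanisms. The conclusions still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.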

References

Yugui Liu, Yibo Ding, Xudong Li, et al. Peer-to-Peer Cloud Service Market for Data Centers Oriented to Computation-Electricity Coordination[J/OL]. (2026-06-03)[2026-06-29]. http://arxiv.org/abs/2606.04981v1.
