AI Computing Center Papers Site

AIDC Research Papers

Liquid Cooling · AI Data Center · Power & Thermal Systems
Current Issue

Volume 2026 · Issue 06-27

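This issue's papers are organized in journal volume-and-issue format. Each entry uses only the traceable public sources already listed in the daily report and adds no unverified facts.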

Research Article · Compute-Power Coordination

Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute

Chris Williams, Philip Colangelo, Ayse Coskun, Ethan Levine, Andy Neale, Ciaran Roberts, Shayan Sengupta, Nikhil Shirolkar

Published 2026-06-24 · arXiv · Credibility S


Abstract

The rapid expansion of artificial intelligence (AI) infrastructure is driving unprecedented growth in electricity demand from data centers. Traditional power-system planning treats large computing facilities as inflexible peak loads, leading to costly infrastructure upgrades and long delays in grid interconnection. Recent work has shown that AI clusters can reduce electricity consumption during peak demand through software-based workload orchestration. This article explores how modern GPU-based AI data centers can operate as grid-interactive assets that respond dynamically to power system conditions. We describe an architecture integrating grid signals, workload scheduling, and power telemetry for fine-grained cluster power control. Experimental results from a real-world deployment on a 130 kW GPU cluster demonstrate multiple forms of flexibility, including rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs. We further demonstrate performance-aware load shifting across geographically distributed clusters, enabling workloads to migrate toward regions with lower grid stress. Together, these capabilities transform AI infrastructure from static electricity consumers into flexible resources that support grid reliability, accelerate interconnection, and improve computing sustainability.

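Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI computing center design. Problem: power-system planning treats large computing facilities as inflexible peak loads, which leads to costly infrastructure upgrades and long interconnection delays. Method: per the abstract, the authors describe an architecture that integrates grid signals, workload scheduling, and power telemetry for fine-grained cluster power control, and evaluate it in a real-world deployment on a 130 kW GPU cluster. Results: the deployment demonstrates rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs, along with performance-aware load shifting across geographically distributed clusters. Significance: for readers of the daily report, it helps judge whether AI computing center construction is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The findings still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.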
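The abstract's idea of reducing cluster load on a grid signal while protecting priority jobs can be illustrated with a minimal sketch. This is not the authors' architecture: the `Job` type, the `plan_curtailment` function, the pause-largest-first policy, and all power figures are assumptions made for illustration, and a real system might cap GPU power rather than pause jobs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    power_kw: float   # assumed steady power draw while running
    priority: bool    # priority jobs are never paused in this sketch


def plan_curtailment(jobs: list[Job], power_cap_kw: float) -> tuple[list[str], float]:
    """Return (names of jobs to pause, resulting cluster power in kW).

    Pauses non-priority jobs, largest draw first, until the cluster fits
    under the cap requested by a (hypothetical) grid signal.
    """
    total = sum(j.power_kw for j in jobs)
    paused: list[str] = []
    flexible = sorted((j for j in jobs if not j.priority),
                      key=lambda j: j.power_kw, reverse=True)
    for job in flexible:
        if total <= power_cap_kw:
            break
        paused.append(job.name)
        total -= job.power_kw
    return paused, total


# Hypothetical workload mix; the numbers are not taken from the paper.
jobs = [
    Job("inference-prod", 40.0, priority=True),
    Job("finetune-a", 50.0, priority=False),
    Job("batch-eval", 30.0, priority=False),
]
# A grid event asks the 120 kW running load to drop to 90 kW or less.
paused, new_power = plan_curtailment(jobs, power_cap_kw=90.0)
print(paused, new_power)
```

Pausing the largest flexible job first reaches the cap with the fewest interruptions, but it can overshoot the requested reduction; a finer-grained controller would trade that off against job progress.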

Reference

Chris Williams, Philip Colangelo, Ayse Coskun, et al. Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute[J/OL]. (2026-06-24)[2026-06-27]. http://arxiv.org/abs/2606.25098v1.

Research Article · Thermal Management & Liquid Cooling

Toward Next-Generation AI Data Centers: Power Delivery Architecture Shifts, Emerging Technologies, and Challenges

Sangwhee Lee, Rafal P. Wojda, Cheol-Hee Jo, Shuntaro Inoue, Pedro Ribeiro, Gui-Jia Su, Mostak Mohammad, Himel Barua

Published 2026-06-24 · arXiv · Credibility S


Abstract

The rapid growth of AI workloads is driving unprecedented increases in data center power demand, current transients, and thermal stress, exposing fundamental limitations in traditional 48 V rack architectures, low-voltage AC distribution, and line-frequency transformer interfaces. This paper reviews the three stages of architectural shifts required to support next-generation AI data centers and identifies three enabling technological building blocks: high-voltage conversion-ratio DC/DC converters, facility-level low-voltage DC distribution, and medium-voltage solid-state transformers. The advantages, technical challenges, and potential solutions associated with each building block are reviewed. Finally, future research directions and open challenges are discussed.

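Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data center energy efficiency are becoming key variables in AI computing center design. Problem: rising power demand, current transients, and thermal stress expose the limits of traditional 48 V rack architectures, low-voltage AC distribution, and line-frequency transformer interfaces. Method: per the abstract, this is a review that organizes the required changes into three stages of architectural shift. Results: the review identifies three enabling building blocks (high-voltage conversion-ratio DC/DC converters, facility-level low-voltage DC distribution, and medium-voltage solid-state transformers) and discusses the advantages, challenges, and potential solutions for each. Significance: for readers of the daily report, it can inform judgments about thermal management routes and the pace of high-density deployment, although the paper's own focus is power delivery. The findings still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.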
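One reason 48 V rack distribution becomes limiting at high rack power is basic circuit arithmetic: for a fixed power, current scales as 1/V and resistive loss as 1/V². The sketch below works through that relationship. The 100 kW rack power, the 0.1 mΩ path resistance, and the 800 V comparison voltage are assumptions chosen for illustration; only the 48 V figure comes from the abstract.

```python
def bus_current_a(power_w: float, voltage_v: float) -> float:
    """Current drawn from a DC bus delivering power_w at voltage_v (I = P / V)."""
    return power_w / voltage_v


def conduction_loss_w(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss in the distribution path (P_loss = I^2 * R)."""
    current = bus_current_a(power_w, voltage_v)
    return current ** 2 * resistance_ohm


RACK_POWER_W = 100_000.0       # assumed rack power, for illustration only
PATH_RESISTANCE_OHM = 0.0001   # assumed 0.1 milliohm distribution-path resistance

# 48 V is named in the abstract; 800 V is an assumed higher-voltage alternative.
for volts in (48.0, 800.0):
    amps = bus_current_a(RACK_POWER_W, volts)
    loss = conduction_loss_w(RACK_POWER_W, volts, PATH_RESISTANCE_OHM)
    print(f"{volts:>5.0f} V: {amps:8.1f} A, {loss:8.2f} W lost in the path")
```

Under these assumptions the 48 V bus carries roughly 2,083 A against 125 A at 800 V, and because loss goes with the square of current, the loss ratio equals (800/48)², independent of the resistance chosen.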

Reference

Sangwhee Lee, Rafal P. Wojda, Cheol-Hee Jo, et al. Toward Next-Generation AI Data Centers: Power Delivery Architecture Shifts, Emerging Technologies, and Challenges[J/OL]. (2026-06-24)[2026-06-27]. http://arxiv.org/abs/2606.25095v1.

Research Article · Chips & Compute

Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2

Salvatore Cielo, Elmira Birang, Alexander Pöppl, Sajad Azizi, Plamen Dobrev, Margarita Egelhofer, Ivan Pribec, Gerald Mathias

Published 2026-06-22 · arXiv · Credibility S


Abstract

We present a systematic performance and energy-efficiency characterization of five flagship scientific workloads on SuperMUC-NG phase 2, the 28 PetaFLOPs system at the Leibniz Supercomputing Center (LRZ) equipped with Intel Xeon Platinum 8480+ and Intel Data Center GPU Max 1550 (Ponte Vecchio, PVC) accelerators. The selected codes span molecular dynamics (gromacs, lammps), astrophysics and cosmology (OpenGadget3, AthenaK), and finite-element PDE solvers from the dealii-X Center of Excellence. For each code we measure throughput and energy efficiency expressed as compute-elements per wall-clock second (or per Joule of consumed energy) on a single compute node, comparing CPU-only (SPR) against combined CPU+GPU (SPR+PVC) configurations where available. Energy measurements rely on lightweight code instrumentation with p3em, or the Energy Aware Runtime (EAR) present on the system. Our results show that GPU offload yields $4-12\times$ higher throughput and up to $15\times$ better energy efficiency compared to CPU-only execution, with lammps and AthenaK benefiting most. However, both throughput and energy gains are sensitive to problem granularity: insufficient work per GPU tile erodes the accelerator advantage, as clearly observed in AthenaK at small mesh-block sizes. The power-budget utilization is systematically lower for CPUs than it is for GPUs, indicating that even at peak useful-work rate, most applications running on CPUs leave a significant fraction of the node's thermal envelope unused.

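Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and chips, servers, and high-density compute deployment are becoming key variables in AI computing center design. Problem: the paper asks how much performance and energy efficiency GPU offload delivers for flagship scientific codes at the node level. Method: per the abstract, the authors measure throughput and energy efficiency for five workloads on a single SuperMUC-NG Phase 2 node, comparing CPU-only (SPR) with CPU+GPU (SPR+PVC) configurations, with energy measured through p3em instrumentation or the Energy Aware Runtime (EAR). Results: GPU offload yields 4-12× higher throughput and up to 15× better energy efficiency, but the gains erode when there is too little work per GPU tile, and CPUs use a systematically smaller share of the power budget than GPUs. Significance: for readers of the daily report, it helps judge how chip routes and changes in server density propagate to facility design. The findings still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.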
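The two metrics named in the abstract, compute-elements per wall-clock second and per Joule, reduce to simple ratios. The sketch below shows how a throughput gain and an energy-efficiency gain would be derived from a pair of runs. The run figures are hypothetical placeholders, not measurements from the paper, and the function names are illustrative.

```python
def throughput(elements: float, wall_seconds: float) -> float:
    """Compute-elements processed per wall-clock second."""
    return elements / wall_seconds


def energy_efficiency(elements: float, energy_joules: float) -> float:
    """Compute-elements processed per Joule of consumed energy."""
    return elements / energy_joules


# Hypothetical single-node measurements (NOT values from the paper).
ELEMENTS = 1.0e9
cpu_run = {"seconds": 200.0, "joules": 120_000.0}  # CPU-only configuration
gpu_run = {"seconds": 25.0, "joules": 20_000.0}    # CPU+GPU configuration

speedup = (throughput(ELEMENTS, gpu_run["seconds"])
           / throughput(ELEMENTS, cpu_run["seconds"]))
efficiency_gain = (energy_efficiency(ELEMENTS, gpu_run["joules"])
                   / energy_efficiency(ELEMENTS, cpu_run["joules"]))
print(f"throughput gain: {speedup:.1f}x, energy-efficiency gain: {efficiency_gain:.1f}x")
```

The two gains differ whenever average node power differs between configurations: in this made-up example the GPU run draws more power (800 W against 600 W on average), so its energy-efficiency gain is smaller than its throughput gain.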

Reference

Salvatore Cielo, Elmira Birang, Alexander Pöppl, et al. Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2[J/OL]. (2026-06-22)[2026-06-27]. http://arxiv.org/abs/2606.23265v1.

Research Article · AI Operations Optimization

Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters

Shuyi Chen, Zhengchang Hua, Nikos Tziritas, Georgios Theodoropoulos

Published 2026-06-23 · arXiv · Credibility S


Abstract

Terrestrial AI training faces an unsustainable energy and water crisis, positioning Orbital Data Centers (ODCs) as a "zero operational carbon" alternative. However, the sub-$10\,\mu\text{s}$ communication latency required for distributed Large Language Model (LLM) training forces ODCs into extreme physical density, triggering a critical "Proximity-Thermal Paradox." As these high-density systems scale into Monolithic Structures or Proximity Swarms, they suffer from intense thermal-fluid crosstalk (heat traps in shared cooling loops) and thermal-radiative crosstalk (mutual heating that blocks deep-space cooling radiators). If left unmitigated, this persistent heat stagnation not only triggers severe thermal throttling that degrades training throughput, but also induces severe thermal fatigue, drastically shortening hardware lifespans and generating premature space e-waste. To make orbital AI truly sustainable, this position paper challenges traditional uniform load-sharing. We propose the Thermal-Aware Heterogeneity Thesis, which treats spatial cooling variances as a primary resource management dimension. Building on this, we introduce Thermal-Load Balancing (TLB), a software framework that dynamically migrates LLM workloads to the coolest available units based on instantaneous fluid temperatures or absorbed radiation. Our analysis demonstrates that TLB resolves thermal bottlenecks to restore Model Flops Utilization (MFU), while simultaneously reducing physical thermal stress. Extending the operational lifespan of orbital hardware is crucial to amortize the massive embodied carbon of rocket launches, outlining a necessary pathway to scale orbital AI without accelerating e-waste.

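Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and AI operations, load forecasting, and facility tuning are becoming key variables in AI computing center design. Problem: the low communication latency needed for distributed LLM training forces orbital data centers into extreme physical density, where thermal-fluid and thermal-radiative crosstalk cause throttling and thermal fatigue. Method: per the abstract, this is a position paper that proposes the Thermal-Aware Heterogeneity Thesis and introduces Thermal-Load Balancing (TLB), a software framework that migrates LLM workloads to the coolest available units based on instantaneous fluid temperatures or absorbed radiation. Results: the authors' analysis indicates that TLB resolves thermal bottlenecks to restore Model Flops Utilization (MFU) while reducing physical thermal stress. Significance: for readers of the daily report, it helps judge whether scheduling software can sustain availability and hardware lifespan under thermal constraints. The findings still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.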
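The core decision in a thermal-load-balancing scheme, moving work from an overheated unit to the coolest one, can be sketched in a few lines. This is a simplified illustration, not the TLB framework itself: the `pick_migration` function, the fixed temperature threshold, and the unit names and readings are all assumptions, and it uses only fluid temperature, one of the two signals the abstract mentions.

```python
from typing import Optional


def pick_migration(unit_temps_c: dict[str, float],
                   threshold_c: float) -> Optional[tuple[str, str]]:
    """Return (source, target) for one workload migration, or None.

    The source is the hottest unit if it exceeds threshold_c; the target is
    the coolest unit. Temperatures stand in for instantaneous coolant
    readings (an assumed telemetry input).
    """
    hottest = max(unit_temps_c, key=unit_temps_c.get)
    coolest = min(unit_temps_c, key=unit_temps_c.get)
    if unit_temps_c[hottest] <= threshold_c or hottest == coolest:
        return None
    return hottest, coolest


# Hypothetical coolant temperatures in degrees Celsius.
temps = {"unit-a": 78.0, "unit-b": 61.5, "unit-c": 70.2}
decision = pick_migration(temps, threshold_c=75.0)
print(decision)
```

A production scheduler would also weigh migration cost and communication topology, since moving an LLM shard away from its peers could hurt the latency that motivated the dense layout in the first place.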

Reference

Shuyi Chen, Zhengchang Hua, Nikos Tziritas, et al. Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters[J/OL]. (2026-06-23)[2026-06-27]. http://arxiv.org/abs/2606.26150v1.

Research Article · Compute-Power Coordination

GaN Power Devices and Converter Architectures for AI Data Centers: Efficiency, Reliability, and Deployment Pathways

Donald Intal, Abasifreke Ebong

Published 2026-06-24 · arXiv · Credibility S


Abstract

The growth of artificial-intelligence workloads is increasing the electrical and thermal demands on data-center power-delivery systems, making conversion efficiency, power density, and reliability critical design priorities. This review examines how gallium-nitride (GaN) power devices can be matched to specific stages of the grid-to-load conversion chain, including power-factor correction, isolated DC/DC conversion, 48-V intermediate-bus conversion, and point-of-load regulation. Si, SiC, and GaN are compared using converter-relevant metrics, and lateral, vertical, and specialized GaN architectures are evaluated in terms of voltage scalability, switching behavior, reverse conduction, thermal pathways, gate control, and technology maturity. The analysis shows that GaN provides a stage-dependent rather than universal advantage. Commercial lateral GaN HEMTs are particularly effective in high-frequency, low-to-mid-voltage stages, while specialized and hybrid devices support bidirectional operation, normally-off control, extreme conversion ratios, and integration. Vertical GaN remains an emerging option for higher-voltage and higher-power conversion. A quantitative framework links cascaded converter efficiency to electrical-loss reduction, cooling demand, annual facility energy use, and operational carbon emissions. Broad deployment further requires low-parasitic packaging, disciplined gate-drive and EMI co-design, mission-profile reliability qualification, scalable manufacturing, and supply-chain resilience. GaN is therefore best treated as a stage-specific system lever whose value depends on coordinated device, topology, package, and thermal co-design.

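Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI computing center design. Problem: growing AI workloads make conversion efficiency, power density, and reliability critical design priorities for data-center power delivery. Method: per the abstract, this is a review that compares Si, SiC, and GaN on converter-relevant metrics and evaluates lateral, vertical, and specialized GaN architectures across the grid-to-load conversion chain. Results: GaN's advantage is stage-dependent rather than universal; commercial lateral GaN HEMTs are most effective in high-frequency, low-to-mid-voltage stages, vertical GaN remains an emerging option, and a quantitative framework links cascaded converter efficiency to electrical loss, cooling demand, annual facility energy use, and operational carbon emissions. Significance: for readers of the daily report, it helps judge where GaN devices fit in the power-delivery chain and how conversion efficiency feeds into facility energy constraints. The findings still need to be checked against the full text's experimental conditions, sample scope, and cost assumptions.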
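The chain from cascaded converter efficiency to loss, cooling demand, annual energy, and carbon can be made concrete with a small calculation. The model below is an assumed simplification, not the paper's quantitative framework: overall efficiency is taken as the product of stage efficiencies, cooling is charged as a fixed fraction of all heat removed, and every numeric value (stage efficiencies, IT load, cooling overhead, grid carbon intensity) is a hypothetical placeholder.

```python
from math import prod

HOURS_PER_YEAR = 8760


def chain_efficiency(stage_efficiencies: list[float]) -> float:
    """Overall efficiency of cascaded stages is the product of stage efficiencies."""
    return prod(stage_efficiencies)


def annual_facility_energy_kwh(it_load_kw: float,
                               stage_efficiencies: list[float],
                               cooling_overhead: float) -> float:
    """Annual energy for IT load, conversion loss, and cooling.

    cooling_overhead is the assumed cooling power per unit of heat removed;
    all input power (IT load plus conversion loss) ends up as heat.
    """
    input_kw = it_load_kw / chain_efficiency(stage_efficiencies)
    cooling_kw = cooling_overhead * input_kw
    return (input_kw + cooling_kw) * HOURS_PER_YEAR


# Hypothetical four-stage chains (PFC, isolated DC/DC, bus converter, point of load).
baseline = [0.97, 0.96, 0.96, 0.90]
improved = [0.985, 0.98, 0.98, 0.92]
IT_LOAD_KW = 1000.0         # assumed IT load
COOLING_OVERHEAD = 0.2      # assumed cooling power per unit heat
GRID_KG_CO2_PER_KWH = 0.4   # assumed grid carbon intensity

baseline_kwh = annual_facility_energy_kwh(IT_LOAD_KW, baseline, COOLING_OVERHEAD)
improved_kwh = annual_facility_energy_kwh(IT_LOAD_KW, improved, COOLING_OVERHEAD)
saved_kwh = baseline_kwh - improved_kwh
print(f"energy saved: {saved_kwh:,.0f} kWh/yr, "
      f"carbon avoided: {saved_kwh * GRID_KG_CO2_PER_KWH:,.0f} kg CO2/yr")
```

Because the stages multiply, a one-point gain in each stage compounds, and the cooling term amplifies the saving further since every watt of conversion loss must also be removed as heat.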

Reference

Donald Intal, Abasifreke Ebong. GaN Power Devices and Converter Architectures for AI Data Centers: Efficiency, Reliability, and Deployment Pathways[J/OL]. (2026-06-24)[2026-06-27]. http://arxiv.org/abs/2606.25281v1.

Research Article · Thermal Management & Liquid Cooling

Power Optimization in Data Centres using Artificial Intelligence

N. Nandakumar, Research Scholar, N. Mahibanlindsay, Professor Head, C. Professor

Published 2026-06-03 · Semantic Scholar · Credibility S


Abstract

Semantic Scholar does not provide a displayable abstract for this paper; open the paper link to view the full abstract.

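Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and liquid cooling, thermal management, and data center energy efficiency are becoming key variables in AI computing center design. Problem: judging from the title, the paper addresses power optimization in data centres using artificial intelligence. Method: the abstract is unavailable, so the method cannot be confirmed from this listing. Results: not stated in the available metadata; the likely focus is improvements to cooling efficiency, energy use, or operations strategy. Significance: for readers of the daily report, it may inform judgments about liquid cooling solutions, thermal management routes, and the pace of high-density deployment. Because the abstract is missing, readers should open the original paper first to check the method, data, and boundary conditions.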

Reference

N. Nandakumar, Research Scholar, N. Mahibanlindsay, et al. Power Optimization in Data Centres using Artificial Intelligence[J/OL]. 2026 7th International Conference on Inventive Research in Computing Applications (ICIRCA). (2026-06-03)[2026-06-27]. https://www.semanticscholar.org/paper/5b33a59bd51dfb893024c739ab73e404f2d42f5f.

Research Article · Chips & Compute

AI-on-Chip Systems: A Cross-Layer Review of Architectures, Interconnects, Design Automation, and Embedded Intelligence

Mohamed M. Morsy

Published 2026-06-15 · Semantic Scholar · Credibility S


Abstract

The rapid growth of artificial intelligence (AI) workloads is reshaping semiconductor design across architecture, interconnect, memory hierarchy, packaging, timing, and design automation. Rather than converging on a single hardware solution, the field is expanding into a heterogeneous ecosystem that includes data-center graphics processing units (GPUs), edge neural processing units (NPUs), and application-specific integrated circuits (ASICs), field-programmable gate array (FPGA)-based and hybrid AI system-on-chip (SoC) platforms, chiplet-enabled systems, and emerging beyond-conventional-silicon approaches such as photonic, neuromorphic, and analog in-memory processors. This paper presents a comprehensive review of AI-on-chip systems from a cross-layer perspective. It examines AI chip architectures and hardware platforms, network-on-chip (NoC) designs for AI communication patterns, and algorithm–hardware co-design methods for model acceleration, including compression, quantization, and sparsity-aware optimization. It also reviews clocking, synchronization, and clock-domain-crossing (CDC) challenges in large heterogeneous systems and chiplets, as well as manufacturing, advanced packaging, and reliability issues, including two-and-a-half-dimensional (2.5D) and three-dimensional (3D) integration, thermal and mechanical constraints, assembly quality, and long-term yield considerations. In parallel, the paper surveys the growing role of AI in chip design itself, covering machine-learning-assisted analysis, Bayesian and reinforcement-learning-based optimization, and the emerging use of large language models (LLMs) and AI agents for register-transfer level (RTL) generation, design-space exploration, and autonomous electronic design automation (EDA) workflows. Finally, it discusses beyond-silicon AI chip directions and the broader economic and industry context shaping cloud, on-premises, and edge deployment. 
By integrating these topics into a unified framework, this review highlights the key technological drivers, system-level tradeoffs, and future research directions that will define next-generation scalable, reliable, and energy-efficient AI-on-chip systems.

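Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and chips, servers, and high-density compute deployment are becoming key variables in AI computing center design. Problem: AI hardware is not converging on a single solution but expanding into a heterogeneous ecosystem of GPUs, NPUs, ASICs, FPGA-based and hybrid SoC platforms, chiplet systems, and beyond-conventional-silicon processors. Method: per the abstract, this is a comprehensive cross-layer review. Results: it covers chip architectures, network-on-chip design, algorithm-hardware co-design, clocking and clock-domain crossing, advanced packaging and reliability, the use of AI in chip design itself, and the economic context of cloud, on-premises, and edge deployment. Significance: for readers of the daily report, it helps judge how chip routes and changes in server density propagate to facility design. The findings still need to be checked against the full text's scope and assumptions.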

Reference

Mohamed M. Morsy. AI-on-Chip Systems: A Cross-Layer Review of Architectures, Interconnects, Design Automation, and Embedded Intelligence[J/OL]. Electronics. (2026-06-15)[2026-06-27]. https://www.semanticscholar.org/paper/6559f17a3e4aaa83cbf55ab2f8c0657056399288.

Research Article · Compute-Power Coordination

A Bilevel Framework for Data Center-Grid Coordination with DLMPs in Unbalanced Three-Phase Distribution Systems

Arash Baharvandi, Duong Tung Nguyen

Published 2026-06-25 · arXiv · Credibility S


Abstract

This paper proposes a grid-aware coordination framework between data centers and distribution grids using a DLMP-based bilevel optimization model. The data center aggregator (DCA) determines active power demand in response to distribution locational marginal prices (DLMPs), while the distribution system operator (DSO) solves a network-constrained optimal power flow problem to determine DLMPs in an unbalanced three-phase system. The model incorporates both active and reactive power consumption of data centers to evaluate their impacts on voltage regulation and phase imbalance. To mitigate adverse network effects, two operating cases are analyzed: without reactive power compensation and with static var generator (SVG)-based compensation. The proposed approach is validated on the IEEE 37-bus unbalanced distribution test system. Simulation results show that DLMP-based coordination captures economically efficient data center operation, and phase- and location-dependent network conditions, while SVG-based compensation improves voltage profiles and reduces phase unbalance.

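Interpretation

Background: AI data center load, power density, and energy constraints are rising together, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI computing center design. Problem: data center active and reactive power consumption affects voltage regulation and phase imbalance in distribution grids. Method: per the abstract, the authors build a DLMP-based bilevel optimization model in which the data center aggregator (DCA) sets active power demand in response to distribution locational marginal prices (DLMPs) and the distribution system operator (DSO) solves a network-constrained optimal power flow for an unbalanced three-phase system, validated on the IEEE 37-bus test system. Results: simulations show that DLMP-based coordination captures economically efficient data center operation and phase- and location-dependent network conditions, while SVG-based compensation improves voltage profiles and reduces phase unbalance. Significance: for readers of the daily report, it helps judge whether AI computing center construction is constrained by grid capacity, load fluctuation, and dispatch mechanisms. The findings still need to be checked against the full text's simulation conditions, sample scope, and cost assumptions.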
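The aggregator side of such a model, choosing demand in response to prices, can be illustrated in isolation. The sketch below treats the DLMPs as fixed inputs and places deferrable energy in the cheapest hours; it is not the paper's bilevel formulation, in which prices themselves depend on demand through the DSO's optimal power flow. The function name, the hourly time step, and all numbers are assumptions.

```python
def schedule_flexible_load(prices: list[float],
                           base_kw: list[float],
                           max_kw: float,
                           flexible_kwh: float) -> list[float]:
    """Place deferrable energy in the cheapest hours, given fixed prices.

    Each entry covers one hour, so kW and kWh are numerically equal per step.
    Raises ValueError if the energy does not fit under max_kw.
    """
    demand = list(base_kw)
    remaining = flexible_kwh
    # Visit hours from cheapest to most expensive price.
    for hour in sorted(range(len(prices)), key=lambda h: prices[h]):
        if remaining <= 0:
            break
        added = min(max_kw - demand[hour], remaining)
        demand[hour] += added
        remaining -= added
    if remaining > 1e-9:
        raise ValueError("flexible energy does not fit under the power limit")
    return demand


# Hypothetical hourly prices ($/MWh) and loads (kW); not data from the paper.
prices = [30.0, 55.0, 20.0, 80.0]
base = [100.0, 100.0, 100.0, 100.0]
plan = schedule_flexible_load(prices, base, max_kw=150.0, flexible_kwh=80.0)
print(plan)
```

With a linear cost and simple per-hour limits, filling the cheapest hours first minimizes the energy bill; the bilevel setting is harder precisely because the added load would shift the prices it is responding to.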

Reference

Arash Baharvandi, Duong Tung Nguyen. A Bilevel Framework for Data Center-Grid Coordination with DLMPs in Unbalanced Three-Phase Distribution Systems[J/OL]. (2026-06-25)[2026-06-27]. http://arxiv.org/abs/2606.26328v1.
