Intelligent Computing Center Paper Portal

AIDC Research Papers

Liquid Cooling AI Data Center Power & Thermal Systems
Current Issue

Volume 2026 · Issue 07-05

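The papers in this issue are arranged in journal volume/issue/page format. Each entry uses only traceable public sources already listed in the daily report and adds no unverified facts.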

Research Article · Compute-Grid Coordination

Grid-Interactive Thermal Management of AI Data Centers via Contextual Distributionally Robust Optimization

Jiachen Shen, Jian Shi, Yijie Yang, Chenye Wu, Dan Wang, Ju Bin Song, Zhu Han

Published 2026-07-01 · arXiv · Credibility S


Abstract

Thermal management in AI data centers is increasingly challenged by bursty workloads and uncertain heat generation. To prevent thermal violations, existing cooling strategies either enforce conservative, rigid bounds that severely limit grid responsiveness, or rely on forecast-driven controllers that perform poorly under AI workload uncertainty and distribution shifts. To overcome the above challenges, this paper proposes a Contextual Distributionally Robust Optimization (CDRO) framework for grid-interactive cooling control. Unlike standard DRO with fixed ambiguity sets, the proposed approach dynamically adapts the Wasserstein radius using real-time AI and grid context. This safely shrinks uncertainty bounds during stable regimes, unlocking deep demand-side flexibility. Theoretically, we formulate the control as an infinite-dimensional inf-sup problem, derive an exact tractable reformulation for the Wasserstein worst-case expected-cost term, and then derive a tractable conservative deterministic counterpart for the Distributionally Robust Conditional Value at Risk (DR-CVaR) thermal safety constraint. Solved via a scalable nested Alternating Direction Method of Multipliers (ADMM) algorithm, the CDRO controller achieves near-zero thermal violations under extreme workload spikes in high-fidelity EnergyPlus co-simulations. Simultaneously, it reduces the operational cost premium of robustness by approximately 13.7 percentage points relative to standard Min-Max Model Predictive Control (MPC).
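The DR-CVaR thermal safety idea can be illustrated with a toy calculation. This is a minimal sketch, not the authors' reformulation. It assumes a type-1 Wasserstein ambiguity set around an empirical sample and a monitored temperature that is Lipschitz in the uncertain parameter. Under those assumptions, the worst-case CVaR is bounded above by the empirical CVaR plus radius × Lipschitz constant / α. The `contextual_radius` rule, its `gain` parameter, and the temperature scenarios are hypothetical stand-ins for the paper's context-adaptive radius.

```python
import numpy as np


def empirical_cvar(samples, alpha):
    """Empirical CVaR at tail probability alpha (mean of the worst alpha share of outcomes).

    Uses the Rockafellar-Uryasev form min_t { t + E[(X - t)+] / alpha }; for an
    empirical distribution the minimum is attained at one of the sample points.
    """
    x = np.asarray(samples, dtype=float)
    return min(t + np.mean(np.maximum(x - t, 0.0)) / alpha for t in x)


def dr_cvar_upper_bound(samples, alpha, radius, lipschitz):
    """Conservative bound on the worst-case CVaR over a type-1 Wasserstein ball.

    Assumes the monitored quantity is `lipschitz`-Lipschitz in the uncertain
    parameter; then every distribution within `radius` of the empirical one has
    CVaR_alpha <= empirical CVaR_alpha + radius * lipschitz / alpha.
    """
    return empirical_cvar(samples, alpha) + radius * lipschitz / alpha


def contextual_radius(base_radius, volatility, gain=1.0):
    """Hypothetical context rule: widen the ambiguity set when workload volatility is high."""
    return base_radius * (1.0 + gain * volatility)


if __name__ == "__main__":
    # Hypothetical rack-inlet temperature scenarios (deg C) and a hypothetical 27 deg C limit.
    temps = [24.0, 25.0, 26.0, 27.0]
    limit = 27.0
    for volatility in (0.0, 2.0):  # stable vs. bursty regime
        eps = contextual_radius(0.1, volatility, gain=2.0)
        bound = dr_cvar_upper_bound(temps, alpha=0.5, radius=eps, lipschitz=1.0)
        print(f"volatility={volatility}: radius={eps:.2f}, bound={bound:.2f}, safe={bound <= limit}")
```

With these made-up numbers, the stable regime (radius 0.1) gives a bound of 26.7 °C and passes the 27 °C check. The bursty regime (radius 0.5) gives 27.5 °C and fails. This mirrors the abstract's point that shrinking the ambiguity set during stable regimes unlocks flexibility.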

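Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI data center design. Problem: existing cooling strategies either enforce conservative, rigid thermal bounds that limit grid responsiveness or rely on forecast-driven controllers that degrade under AI workload uncertainty and distribution shift. Method: the authors propose a Contextual Distributionally Robust Optimization (CDRO) framework that adapts the Wasserstein radius to real-time AI and grid context. They derive a tractable reformulation of the worst-case expected-cost term and a conservative counterpart of the DR-CVaR thermal safety constraint, then solve the problem with a nested ADMM algorithm. Result: in EnergyPlus co-simulations, the controller achieves near-zero thermal violations under extreme workload spikes and reduces the cost premium of robustness by about 13.7 percentage points relative to standard Min-Max MPC. Significance: for daily-report readers, it helps assess how much cooling-side flexibility an AI data center can offer the grid, and whether build-out is constrained by grid capacity, load fluctuations, and dispatch mechanisms. Experimental conditions, sample scope, and cost assumptions still need to be checked against the full text.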

Reference

Jiachen Shen, Jian Shi, Yijie Yang, et al. Grid-Interactive Thermal Management of AI Data Centers via Contextual Distributionally Robust Optimization[J/OL]. (2026-07-01)[2026-07-05]. http://arxiv.org/abs/2607.00099v1.

Research Article · Thermal Management & Liquid Cooling

Financing Artificial Intelligence Infrastructure: Mapping AI Infrastructure Investment and Compute Governance Across Africa

Kai-Hsin Hung, Sumaya Nur Adan, Krupa Suchak, Armita Sadeghian Barzoki, Kofi Yeboah, Mohammad Amir Anwar

Published 2026-06-24 · arXiv · Credibility S


Abstract

Artificial intelligence depends on large-scale compute resources and their supporting infrastructure. However, AI governance debates treat compute primarily as a technical input rather than as an outcome of investment, ownership, and financial control. This paper examines AI infrastructure investment flows across Africa through a systematic analysis of 46 publicly announced projects totalling USD $12.7 billion between 2019 and 2025. Using a value chain framework, we analyze who invests in AI-relevant infrastructure and where investments concentrate. Our findings reveal a highly concentrated landscape dominated by global data center operators, hyperscale technology firms, and development finance institutions, clustering in South Africa, Kenya, Nigeria, and Egypt. We introduce asymmetrical interdependence to describe a structural condition in which capital and physical infrastructure account for 73% of total funding while control remains concentrated in the compute layer among a small number of global technology firms. We argue that compute governance must account for capital flows, ownership, and control, not only geographic access, because these dynamics shape AI compute equity. Infrastructure presence is necessary but insufficient for meaningful governance capacity.

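Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and the question of who finances and controls AI compute infrastructure is becoming a key variable in AI data center build-out. Problem: AI governance debates treat compute mainly as a technical input rather than as an outcome of investment, ownership, and financial control. Method: the authors systematically analyze 46 publicly announced African AI infrastructure projects (USD 12.7 billion, 2019 to 2025) using a value chain framework. Result: investment is highly concentrated among global data center operators, hyperscale technology firms, and development finance institutions, and it clusters in South Africa, Kenya, Nigeria, and Egypt. Capital and physical infrastructure account for 73% of total funding, while control of the compute layer remains with a small number of global technology firms, a condition the authors call "asymmetrical interdependence." Significance: for daily-report readers, it helps assess how capital flows and ownership shape where AI data centers are built and who controls them. Data scope, project sample, and funding definitions still need to be checked against the full text.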

Reference

Kai-Hsin Hung, Sumaya Nur Adan, Krupa Suchak, et al. Financing Artificial Intelligence Infrastructure: Mapping AI Infrastructure Investment and Compute Governance Across Africa[J/OL]. (2026-06-24)[2026-07-05]. http://arxiv.org/abs/2606.28404v1.

Research Article · AI Operations Optimization

Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters

Shuyi Chen, Zhengchang Hua, Nikos Tziritas, Georgios Theodoropoulos

Published 2026-06-23 · arXiv · Credibility S


Abstract

Terrestrial AI training faces an unsustainable energy and water crisis, positioning Orbital Data Centers (ODCs) as a "zero operational carbon" alternative. However, the sub-10 μs communication latency required for synchronized scientific workloads, such as distributed Large Language Model (LLM) training, forces ODCs into extreme physical density, triggering a critical "Proximity-Thermal Paradox." As these high-density systems scale into Monolithic Structures or Proximity Swarms, they suffer from intense thermal-fluid crosstalk (heat traps in shared cooling loops) and thermal-radiative crosstalk (mutual heating that blocks deep-space cooling radiators). If left unmitigated, this persistent heat stagnation not only triggers severe thermal throttling that degrades training throughput, but also induces severe thermal fatigue, drastically shortening hardware lifespans and generating premature space e-waste. To make orbital AI truly sustainable, this position paper challenges traditional uniform load-sharing. We propose the Thermal-Aware Heterogeneity Thesis, which treats spatial cooling variances as a primary resource management dimension. Building on this, we introduce Thermal-Load Balancing (TLB), a software framework that dynamically migrates these intensive workloads to the coolest available units based on instantaneous fluid temperatures or absorbed radiation. Our analysis demonstrates that TLB resolves thermal bottlenecks to restore Model Flops Utilization (MFU), while simultaneously reducing physical thermal stress. Extending the operational lifespan of orbital hardware is crucial to amortize the massive embodied carbon of rocket launches, outlining a necessary pathway to scale orbital AI without accelerating e-waste.
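The TLB idea of steering intensive work toward the coolest units can be sketched as a greedy placement. This is a hypothetical illustration, not the paper's framework. The linear temperature-rise model (`temp_per_watt`), the unit and job names, and the largest-job-first ordering are all assumptions. The paper bases its decisions on instantaneous fluid temperatures or absorbed radiation, and its migration policy may differ.

```python
import heapq


def thermal_load_balance(unit_temps, jobs, temp_per_watt):
    """Greedy sketch: place each job on the currently coolest unit.

    unit_temps: dict unit_id -> current loop/radiator temperature (deg C)
    jobs: list of (job_id, heat_watts)
    temp_per_watt: hypothetical linear temperature rise per watt of added heat
    Returns (placement dict job_id -> unit_id, final temperature dict).
    """
    heap = [(temp, unit) for unit, temp in unit_temps.items()]
    heapq.heapify(heap)
    placement = {}
    # Largest heat loads first, so they land on the coolest units.
    for job, watts in sorted(jobs, key=lambda jw: jw[1], reverse=True):
        temp, unit = heapq.heappop(heap)  # coolest unit right now
        placement[job] = unit
        heapq.heappush(heap, (temp + watts * temp_per_watt, unit))
    return placement, {unit: temp for temp, unit in heap}


if __name__ == "__main__":
    units = {"node-a": 20.0, "node-b": 30.0, "node-c": 25.0}
    jobs = [("train-shard-1", 40.0), ("train-shard-2", 30.0), ("eval", 10.0)]
    placement, temps = thermal_load_balance(units, jobs, temp_per_watt=0.5)
    print(placement)
    print(temps)
```

In this toy run, the two training shards land on node-a and node-c, the two coolest units. The small eval job then goes to node-b, which by that point is the coolest.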

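Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and AI operations, load forecasting, and facility tuning are becoming key variables in AI data center design. Problem: the sub-10 μs latency required for synchronized workloads forces orbital data centers into extreme physical density, which causes thermal-fluid and thermal-radiative crosstalk (the "Proximity-Thermal Paradox"), thermal throttling, and shortened hardware lifespans. Method: this position paper proposes the Thermal-Aware Heterogeneity Thesis and a Thermal-Load Balancing (TLB) software framework that migrates intensive workloads to the coolest available units based on instantaneous fluid temperatures or absorbed radiation. Result: the authors' analysis indicates that TLB resolves thermal bottlenecks, restores Model FLOPs Utilization (MFU), and reduces physical thermal stress. Significance: for daily-report readers, it helps judge whether thermal-aware scheduling software can relieve cooling bottlenecks and extend hardware lifespan in high-density AI clusters. Because this is a position paper, its analysis conditions and assumptions still need to be checked against the full text.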

Reference

Shuyi Chen, Zhengchang Hua, Nikos Tziritas, et al. Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters[J/OL]. (2026-06-23)[2026-07-05]. http://arxiv.org/abs/2606.26150v2.

Research Article · Compute-Grid Coordination

A Bilevel Framework for Data Center-Grid Coordination with DLMPs in Unbalanced Three-Phase Distribution Systems

Arash Baharvandi, Duong Tung Nguyen

Published 2026-06-25 · arXiv · Credibility S


Abstract

This paper proposes a grid-aware coordination framework between data centers and distribution grids using a DLMP-based bilevel optimization model. The data center aggregator (DCA) determines active power demand in response to distribution locational marginal prices (DLMPs), while the distribution system operator (DSO) solves a network-constrained optimal power flow problem to determine DLMPs in an unbalanced three-phase system. The model incorporates both active and reactive power consumption of data centers to evaluate their impacts on voltage regulation and phase imbalance. To mitigate adverse network effects, two operating cases are analyzed: without reactive power compensation and with static var generator (SVG)-based compensation. The proposed approach is validated on the IEEE 37-bus unbalanced distribution test system. Simulation results show that DLMP-based coordination captures economically efficient data center operation, and phase- and location-dependent network conditions, while SVG-based compensation improves voltage profiles and reduces phase unbalance.
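The abstract reports that SVG-based compensation reduces phase unbalance, but it does not say which unbalance metric is used. One common metric is the voltage unbalance factor (VUF), the ratio of negative-sequence to positive-sequence voltage magnitude obtained from the symmetrical-component transform. The sketch below computes it for illustrative phasors. Choosing VUF is our assumption, and the phasor values are made up.

```python
import cmath
import math


def voltage_unbalance_factor(va, vb, vc):
    """Voltage unbalance factor |V2| / |V1| from three phase-voltage phasors.

    Uses the symmetrical-component (Fortescue) transform with a = exp(j*2*pi/3).
    """
    a = cmath.exp(2j * math.pi / 3)
    v_pos = (va + a * vb + a * a * vc) / 3  # positive-sequence component
    v_neg = (va + a * a * vb + a * vc) / 3  # negative-sequence component
    return abs(v_neg) / abs(v_pos)


if __name__ == "__main__":
    balanced = [cmath.rect(1.0, math.radians(d)) for d in (0, -120, 120)]
    sagged_b = [
        cmath.rect(1.0, 0.0),
        cmath.rect(0.9, math.radians(-120)),  # 10% magnitude sag on phase b
        cmath.rect(1.0, math.radians(120)),
    ]
    print(f"balanced VUF = {voltage_unbalance_factor(*balanced):.4f}")
    print(f"phase-b sag VUF = {voltage_unbalance_factor(*sagged_b):.4f}")
```

For these phasors, the 10% sag on phase b gives a VUF of exactly 0.1/2.9, about 3.4%. Restoring the phase-b magnitude in this toy brings the VUF back to zero.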

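Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI data center design. Problem: data centers connected to distribution grids affect voltage regulation and phase balance, and their demand needs to be coordinated with network conditions. Method: the authors build a DLMP-based bilevel optimization model. The data center aggregator sets active power demand in response to distribution locational marginal prices, while the distribution system operator solves a network-constrained optimal power flow in an unbalanced three-phase system, with data center reactive power consumption included. Result: on the IEEE 37-bus unbalanced test system, DLMP-based coordination captures economically efficient data center operation as well as phase- and location-dependent network conditions, and SVG-based reactive compensation improves voltage profiles and reduces phase unbalance. Significance: for daily-report readers, it helps judge whether AI data center build-out is constrained by grid capacity, load fluctuations, and dispatch mechanisms. Experimental conditions, sample scope, and cost assumptions still need to be checked against the full text.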

Reference

Arash Baharvandi, Duong Tung Nguyen. A Bilevel Framework for Data Center-Grid Coordination with DLMPs in Unbalanced Three-Phase Distribution Systems[J/OL]. (2026-06-25)[2026-07-05]. http://arxiv.org/abs/2606.26328v1.

Research Article · Thermal Management & Liquid Cooling

AI Data Centers and the Water Use Feedback Loop

Basit A. Akinade, Amobichukwu C. Amanambu, Jonathan M. Frame, Shaolei Ren

Published 2026-06-20 · arXiv · Credibility S


Abstract

AI data centres consume water for cooling, water scarcity constrains siting, and AI tools can improve water system efficiency. These dynamics are studied separately yet form a feedback loop. This review formalises the Water and AI Feedback Loop, introduces the Water Consumption Impact index to quantify community-scale utility burden, and demonstrates across ten US sites that burden spans three orders of magnitude, from 0.2% to 134% of host capacity.
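The abstract does not give the formula for the Water Consumption Impact index. The minimal sketch below assumes, hypothetically, that the index behaves like data-center water use divided by host utility capacity, expressed as a percentage. It also checks that the reported range of 0.2% to 134% spans roughly three orders of magnitude.

```python
import math


def water_burden_pct(dc_water_use, host_capacity):
    """Hypothetical index: data-center water use as a percentage of host utility capacity.

    Both arguments must use the same units (e.g. million litres per day).
    """
    if host_capacity <= 0:
        raise ValueError("host_capacity must be positive")
    return 100.0 * dc_water_use / host_capacity


def orders_of_magnitude(low, high):
    """Number of decades between two positive values."""
    return math.log10(high / low)


if __name__ == "__main__":
    # Span reported in the abstract: 0.2% to 134% of host capacity.
    print(round(orders_of_magnitude(0.2, 134.0), 2))  # 2.83
```

A ratio of 670 between the extremes is about 2.8 decades, which is consistent with the abstract's "three orders of magnitude."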

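Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and liquid cooling, thermal management, and data center efficiency are becoming key variables in AI data center design. Problem: cooling water consumption, water-scarcity constraints on siting, and AI-enabled water system efficiency are usually studied separately, even though together they form a feedback loop. Method: this review formalizes the Water and AI Feedback Loop and introduces a Water Consumption Impact index to quantify community-scale utility burden. Result: across ten US sites, the burden spans three orders of magnitude, from 0.2% to 134% of host capacity. Significance: for daily-report readers, it helps judge how cooling choices and local water availability constrain AI data center siting and high-density deployment. Site selection, data scope, and index assumptions still need to be checked against the full text.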

Reference

Basit A. Akinade, Amobichukwu C. Amanambu, Jonathan M. Frame, et al. AI Data Centers and the Water Use Feedback Loop[J/OL]. (2026-06-20)[2026-07-05]. http://arxiv.org/abs/2606.21760v1.

Research Article · Compute-Grid Coordination

GaN Power Devices and Converter Architectures for AI Data Centers: Efficiency, Reliability, and Deployment Pathways

Donald Intal, Abasifreke Ebong

Published 2026-06-24 · arXiv · Credibility S


Abstract

The growth of artificial-intelligence workloads is increasing the electrical and thermal demands on data-center power-delivery systems, making conversion efficiency, power density, and reliability critical design priorities. This review examines how gallium-nitride (GaN) power devices can be matched to specific stages of the grid-to-load conversion chain, including power-factor correction, isolated DC/DC conversion, 48-V intermediate-bus conversion, and point-of-load regulation. Si, SiC, and GaN are compared using converter-relevant metrics, and lateral, vertical, and specialized GaN architectures are evaluated in terms of voltage scalability, switching behavior, reverse conduction, thermal pathways, gate control, and technology maturity. The analysis shows that GaN provides a stage-dependent rather than universal advantage. Commercial lateral GaN HEMTs are particularly effective in high-frequency, low-to-mid-voltage stages, while specialized and hybrid devices support bidirectional operation, normally-off control, extreme conversion ratios, and integration. Vertical GaN remains an emerging option for higher-voltage and higher-power conversion. A quantitative framework links cascaded converter efficiency to electrical-loss reduction, cooling demand, annual facility energy use, and operational carbon emissions. Broad deployment further requires low-parasitic packaging, disciplined gate-drive and EMI co-design, mission-profile reliability qualification, scalable manufacturing, and supply-chain resilience. GaN is therefore best treated as a stage-specific system lever whose value depends on coordinated device, topology, package, and thermal co-design.
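The abstract describes a quantitative framework that links cascaded converter efficiency to losses, cooling demand, annual energy, and emissions. The first-order sketch below follows that chain of reasoning. Several parts are hypothetical and are not taken from the paper: the constant IT load, the assumption that cooling power scales with conversion loss, and every numeric value (stage efficiencies, cooling ratio, grid carbon intensity).

```python
from math import prod


def chain_efficiency(stage_efficiencies):
    """End-to-end efficiency of cascaded conversion stages is the product of stage efficiencies."""
    return prod(stage_efficiencies)


def annual_overhead(load_kw, stage_efficiencies, cooling_kw_per_loss_kw,
                    grid_kg_co2_per_kwh, hours=8760):
    """Hypothetical first-order model linking chain efficiency to losses, cooling, energy, and CO2.

    Assumes a constant IT load and cooling power proportional to conversion loss.
    """
    eta = chain_efficiency(stage_efficiencies)
    loss_kw = load_kw / eta - load_kw              # electrical conversion loss
    cooling_kw = loss_kw * cooling_kw_per_loss_kw  # extra cooling to remove that heat
    energy_kwh = (loss_kw + cooling_kw) * hours
    return {
        "efficiency": eta,
        "loss_kw": loss_kw,
        "overhead_kwh": energy_kwh,
        "co2_tonnes": energy_kwh * grid_kg_co2_per_kwh / 1000.0,
    }


if __name__ == "__main__":
    # Hypothetical stage efficiencies: PFC, isolated DC/DC, 48-V bus converter, point of load.
    baseline = [0.98, 0.97, 0.96, 0.90]
    improved = [0.99, 0.98, 0.97, 0.92]
    for name, stages in (("baseline", baseline), ("improved", improved)):
        r = annual_overhead(1000.0, stages, cooling_kw_per_loss_kw=0.3, grid_kg_co2_per_kwh=0.4)
        print(name, {k: round(v, 3) for k, v in r.items()})
```

The stages are in series, so end-to-end efficiency is the product of the stage efficiencies. Modest per-stage gains therefore compound into a larger end-to-end reduction in losses and cooling load.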

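Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, making conversion efficiency, power density, and reliability of power-delivery systems critical design priorities. Problem: it is unclear where in the grid-to-load conversion chain GaN devices offer a real advantage over Si and SiC. Method: this review compares Si, SiC, and GaN using converter-relevant metrics and evaluates lateral, vertical, and specialized GaN architectures across power-factor correction, isolated DC/DC, 48-V intermediate-bus, and point-of-load stages. Result: GaN's advantage is stage-dependent rather than universal. Commercial lateral GaN HEMTs are most effective in high-frequency, low-to-mid-voltage stages, while vertical GaN remains an emerging option for higher voltage and power. A quantitative framework links cascaded converter efficiency to electrical losses, cooling demand, annual facility energy use, and operational carbon emissions. Significance: for daily-report readers, it helps judge where GaN is worth adopting in the AI data center power chain and which packaging, gate-drive, reliability, and supply-chain conditions adoption depends on. Experimental conditions, sample scope, and cost assumptions still need to be checked against the full text.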

Reference

Donald Intal, Abasifreke Ebong. GaN Power Devices and Converter Architectures for AI Data Centers: Efficiency, Reliability, and Deployment Pathways[J/OL]. (2026-06-24)[2026-07-05]. http://arxiv.org/abs/2606.25281v1.

Research Article · Compute-Grid Coordination

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

Published 2026-06-17 · arXiv · Credibility S


Abstract

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.
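The quantization-to-power idea can be illustrated with a single-campus toy dispatch. This is a hypothetical sketch, not the paper's two-stage model. The `CONFIGS` table (per-instance power, throughput, and a quality penalty per instance-hour), the prices, and the power cap are invented, and request routing and multi-campus coordination are omitted. The sketch only shows how precision selection becomes a dispatchable lever while token demand is always fully served.

```python
import math

# Hypothetical quantization-to-power table: per-instance power (kW), throughput
# (tokens/s), and a quality penalty (cost units per instance-hour) for lower precision.
CONFIGS = {
    "fp16": {"kw": 10.0, "tps": 1000.0, "penalty": 0.0},
    "int8": {"kw": 8.0, "tps": 1400.0, "penalty": 0.5},
    "int4": {"kw": 7.0, "tps": 1800.0, "penalty": 1.5},
}


def dispatch(intervals, configs=CONFIGS):
    """Pick, per one-hour interval, the config that serves all token demand at least cost.

    Each interval is a dict with 'price' (per kWh), 'demand_tps', and an optional
    'cap_kw' (a demand-response power cap). Served tokens are never curtailed: if
    no config fits under the cap, a ValueError is raised.
    """
    plan = []
    for iv in intervals:
        best = None
        for name, c in configs.items():
            n = math.ceil(iv["demand_tps"] / c["tps"])  # instances needed to serve demand
            kw = n * c["kw"]
            cap = iv.get("cap_kw")
            if cap is not None and kw > cap:
                continue  # violates the DR power cap
            cost = kw * iv["price"] + n * c["penalty"]  # energy cost plus quality penalty
            if best is None or cost < best["cost"]:
                best = {"config": name, "instances": n, "kw": kw, "cost": cost}
        if best is None:
            raise ValueError("no configuration meets demand under the power cap")
        plan.append(best)
    return plan


if __name__ == "__main__":
    hours = [
        {"price": 0.05, "demand_tps": 2000.0},                  # off-peak
        {"price": 2.0, "demand_tps": 2000.0},                   # price spike
        {"price": 0.05, "demand_tps": 2000.0, "cap_kw": 15.0},  # DR power cap
    ]
    for step in dispatch(hours):
        print(step)
```

With these made-up numbers, the off-peak hour keeps fp16. Both the price spike and the demand-response cap push the plan to int4.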

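Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI data center design. Problem: conventional data center energy management treats LLM inference as an aggregate load and overlooks the flexibility of LLM-specific techniques such as model quantization. Method: the authors build a quantization-to-power model that maps each model-quantization configuration to dispatchable parameters. They add a two-stage DR model covering instance switching, request routing, and precision selection, and a multi-campus co-optimization that uses grid-side electricity and carbon signals. Result: in case studies, the framework reduces total data center operating cost by 34.3% without curtailing served token volume. Significance: for daily-report readers, it helps judge whether AI data center build-out is constrained by grid capacity, load fluctuations, and dispatch mechanisms, and how much flexibility inference workloads themselves can provide. Experimental conditions, sample scope, and cost assumptions still need to be checked against the full text.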

Reference

Bojun Du, Xiaoyi Fan, Ershun Du, et al. From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads[J/OL]. (2026-06-17)[2026-07-05]. http://arxiv.org/abs/2606.18851v1.

Research Article · Compute-Grid Coordination

Spatial Load Correlation in AI Data-Center-Dominated Power Systems

Chandan Chaudhary, Alaaeldein Abdelkader, Yansong Pei, Mohammed Benidris, Joydeep Mitra

Published 2026-06-12 · arXiv · Credibility S


Abstract

The proliferation of large-scale data centers introduces spatially correlated demand profiles that challenge the long-standing assumption of statistical independence of loads in power system analysis. This paper examines the emergence of such load correlations and evaluates their impact on data-center-dominated grids. Analytical derivations reveal that correlated load fluctuations amplify aggregate stochastic disturbances, reduce voltage stability margins through weakened reactive power stiffness, and degrade frequency stability margin by erosion of natural load diversity effects. Real-time digital simulation studies confirm that moderate spatial correlation in distributed data centers produces simultaneous frequency deviations and voltage fluctuations across multiple buses. The findings offer transmission system operators a physics-based perspective to interpret emerging oscillatory phenomena and establish stability planning criteria grounded in measurable load-correlation structures rather than traditional diversity assumptions.
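The "erosion of natural load diversity" in the abstract can be made concrete with the textbook variance of a sum of equally correlated loads. This is a generic illustration, not the paper's derivation. The equal-variance, equal-correlation, Gaussian one-factor assumptions are ours, and the Monte Carlo check only confirms the closed form.

```python
import numpy as np


def aggregate_variance(n, sigma, rho):
    """Variance of the sum of n loads with equal std sigma and pairwise correlation rho.

    Var(sum) = n*sigma^2 + n*(n-1)*rho*sigma^2 = n*sigma^2*(1 + (n-1)*rho).
    """
    return n * sigma**2 * (1.0 + (n - 1) * rho)


def simulate_aggregate_variance(n, sigma, rho, draws=200_000, seed=0):
    """Monte Carlo check with a one-factor model: X_i = sigma*(sqrt(rho)*Z + sqrt(1-rho)*E_i)."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((draws, 1))  # shared driver (e.g. synchronized AI jobs)
    idio = rng.standard_normal((draws, n))    # site-specific noise
    loads = sigma * (np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio)
    return loads.sum(axis=1).var()


if __name__ == "__main__":
    for rho in (0.0, 0.3):
        exact = aggregate_variance(10, 1.0, rho)
        approx = simulate_aggregate_variance(10, 1.0, rho)
        print(f"rho={rho}: exact={exact:.2f}, simulated={approx:.2f}, std={exact ** 0.5:.2f}")
```

For 10 loads with σ = 1, the aggregate standard deviation rises from √10 ≈ 3.16 at ρ = 0 to √37 ≈ 6.08 at ρ = 0.3. Even moderate correlation nearly doubles the aggregate disturbance that the grid must absorb.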

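Interpretation

Background: AI data center loads, power density, and energy constraints are rising in tandem, and coordinated scheduling of compute load with grid-side resources is becoming a key variable in AI data center design. Problem: large-scale data centers produce spatially correlated demand that undermines the long-standing assumption that loads are statistically independent in power system analysis. Method: the authors combine analytical derivations with real-time digital simulation studies. Result: correlated load fluctuations amplify aggregate stochastic disturbances, reduce voltage stability margins by weakening reactive power stiffness, and degrade the frequency stability margin by eroding natural load diversity. In simulation, moderate spatial correlation among distributed data centers produces simultaneous frequency deviations and voltage fluctuations across multiple buses. Significance: for daily-report readers, it helps judge whether AI data center build-out is constrained by grid capacity, load fluctuations, and dispatch mechanisms, and why stability planning may need to rest on measured load correlation instead of diversity assumptions. Experimental conditions, sample scope, and cost assumptions still need to be checked against the full text.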

Reference

Chandan Chaudhary, Alaaeldein Abdelkader, Yansong Pei, et al. Spatial Load Correlation in AI Data-Center-Dominated Power Systems[J/OL]. (2026-06-12)[2026-07-05]. http://arxiv.org/abs/2606.13853v1.
