Introduction and Project Summary

Predictive Maintenance for Liquid-Cooled Servers

Testbed Assembly

Cooling is central to data center reliability because it keeps servers up and running. Liquid cooling is becoming more common because it removes heat more efficiently than air. As these systems age, the earliest signs of degradation often appear as small deviations in temperature and power that fixed temperature alarms do not catch. If these changes go unnoticed, they can lead to expensive downtime.

AnomAIy is an ML model that learns how a server behaves under normal operating conditions and looks for subtle shifts that signal developing hardware issues. The project focuses on two types of degradation that affect the reliability of liquid-cooled servers.
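The idea of learning normal behavior and then watching for small shifts can be illustrated with a minimal sketch. This is not the AnomAIy model itself: it swaps in a simple mean and standard deviation baseline, and the function names, the temperature readings, and the threshold of 3.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def fit_baseline(normal_samples):
    """Learn the mean and standard deviation of a signal under normal operation."""
    return mean(normal_samples), stdev(normal_samples)

def drift_score(baseline, recent_samples):
    """Return how many baseline standard deviations the recent average has shifted."""
    mu, sigma = baseline
    return (mean(recent_samples) - mu) / sigma

# Hypothetical CPU temperature readings in degrees Celsius (illustrative values only).
normal_temps = [60.0, 61.0, 59.0, 60.0, 60.0, 61.0, 59.0, 60.0]
recent_temps = [62.0, 62.5, 63.0, 62.5]

baseline = fit_baseline(normal_temps)
score = drift_score(baseline, recent_temps)

# Assumed threshold; a real deployment would tune this on its own data.
is_anomalous = abs(score) > 3.0
print(f"drift score: {score:.2f}, anomalous: {is_anomalous}")
```

In this example the recent average is only 2.5 degrees above the baseline, far below any typical fixed over-temperature alarm, yet it stands out clearly against the learned normal range.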

The first is flow degradation. Over time, minerals and additives in the coolant can form deposits inside the loop. These deposits slowly restrict the flow of coolant, which makes it harder for the system to remove heat.
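The effect of restricted flow can be sketched with the steady-state energy balance Q = m_dot * c_p * dT: for a fixed heat load, the coolant temperature rise across the server is inversely proportional to the mass flow rate. The function below is a rough illustration that assumes water-like coolant properties (density 1.0 kg/L, specific heat 4186 J/(kg K)); the function name, the 300 W load, and the flow rates are hypothetical and not taken from the project.

```python
def coolant_temp_rise(heat_w, flow_l_per_min,
                      density_kg_per_l=1.0, cp_j_per_kg_k=4186.0):
    """Coolant temperature rise in kelvin for a given heat load and flow rate.

    Uses Q = m_dot * c_p * dT with assumed water-like coolant properties.
    """
    if flow_l_per_min <= 0:
        raise ValueError("flow rate must be positive")
    mass_flow_kg_per_s = flow_l_per_min * density_kg_per_l / 60.0
    return heat_w / (mass_flow_kg_per_s * cp_j_per_kg_k)

# Hypothetical 300 W heat load at a healthy and at a restricted flow rate.
healthy_rise = coolant_temp_rise(300.0, 1.0)     # about 4.30 K
restricted_rise = coolant_temp_rise(300.0, 0.5)  # about 8.60 K
print(f"healthy: {healthy_rise:.2f} K, restricted: {restricted_rise:.2f} K")
```

Halving the flow doubles the temperature rise for the same load, which is the kind of gradual shift a baseline model can pick up before an absolute temperature limit is reached.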

The second is voltage regulator module (VRM) efficiency loss. The VRM is the component that supplies power to the CPU. When part of it begins to degrade, it becomes less efficient and dissipates more of its input power as heat. If the problem progresses, the VRM can eventually shut down and cut power to the CPU.
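The link between efficiency and wasted heat follows from the definition of efficiency as output power divided by input power. The sketch below shows how a modest efficiency drop raises the heat dissipated for the same CPU load; the function name, the 200 W load, and the efficiency values are hypothetical and chosen only for illustration.

```python
def vrm_loss_w(cpu_power_w, efficiency):
    """Power dissipated as heat by a regulator delivering cpu_power_w.

    efficiency = output power / input power, so loss = output / efficiency - output.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    input_power_w = cpu_power_w / efficiency
    return input_power_w - cpu_power_w

# Hypothetical 200 W CPU load at a healthy and at a degraded efficiency.
healthy_loss = vrm_loss_w(200.0, 0.90)   # about 22.2 W
degraded_loss = vrm_loss_w(200.0, 0.85)  # about 35.3 W
print(f"healthy: {healthy_loss:.1f} W, degraded: {degraded_loss:.1f} W")
```

Under these assumed numbers, a five-point drop in efficiency increases the wasted power by more than half, so the change shows up both as extra input power and as extra heat near the regulator.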

By identifying these issues early, AnomAIy aims to help prevent server downtime and extend hardware lifespan.