nvidia-smi
-
Script to Save Costs on Idle H100 Instances
Read Full Article: Script to Save Costs on Idle H100 InstancesIn the realm of machine learning research, the cost of running high-performance GPUs like the H100 can quickly add up, especially when instances are left idle. To address this, a simple yet effective daemon script was created to monitor GPU usage using nvidia-smi. The script detects when a training job has finished and, if the GPU remains idle for a configurable period (default is 20 minutes), it automatically shuts down the instance to prevent unnecessary costs. This solution, which is compatible with major cloud providers and open-sourced under the MIT license, offers a practical way to manage expenses by reducing idle time on expensive GPU resources. This matters because it helps researchers and developers save significant amounts of money on cloud computing costs.
