PyTorch jobs time out and hang
We have observed that many PyTorch users hit random timeouts that terminate their jobs but leave the process running on the node. The node then has to be rebooted to clear it.
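As a possible mitigation for the leftover process, the launcher can start the job in its own process group and kill the whole group when a deadline passes. The sketch below uses only the Python standard library on a POSIX system. The function name `run_with_group_timeout` and the timeout value are hypothetical, and this is not how the affected jobs are known to be launched. It only helps while the process can still be killed: a process stuck in an uninterruptible state (for example inside a driver call) ignores SIGKILL and may still require a reboot.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout_s):
    """Run cmd in its own session; on timeout, kill its whole process group.

    Returns the exit code, or None if the deadline passed and the group
    was killed. Hypothetical helper, for illustration only.
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its pid is also the process-group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Signal every process in the group, not just the launcher.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # Stand-in for a hung training job: a child that sleeps past the deadline.
    hung_job = [sys.executable, "-c", "import time; time.sleep(30)"]
    if run_with_group_timeout(hung_job, timeout_s=1) is None:
        print("job timed out; process group killed")
```

Killing the process group rather than the single pid matters because a training launcher typically spawns one worker per device, and signalling only the parent can leave those workers behind.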