Category:
Resolution: Resolved
Many PyTorch users are encountering random timeouts that terminate their jobs but leave the process running on the node. The node then has to be rebooted.
Initially, we believed this problem was associated with the Xid 119 error, since we saw similar error messages on all affected nodes. To address it, we upgraded the NVIDIA driver to version 525.105.17, as mentioned in the release notes. However, the problem has persisted. We are now upgrading to and testing a newer version of the NVIDIA driver.
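
To check whether a node is logging Xid 119 events, something like the following sketch may help. It is not the tooling we used. It assumes the kernel log reports Xid events in the usual `NVRM: Xid (<PCI device>): <code>, ...` shape, and the function name `count_xid_events` is hypothetical.

```python
import re
import sys
from collections import Counter

# Assumption: Xid events appear in the kernel log as
# "NVRM: Xid (<PCI device>): <code>, ...". Adjust the pattern if your
# driver version formats these lines differently.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def count_xid_events(lines, code=119):
    """Count occurrences of the given Xid code per PCI device."""
    counts = Counter()
    for line in lines:
        match = XID_PATTERN.search(line)
        if match and int(match.group("code")) == code:
            counts[match.group("device")] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for device, n in sorted(count_xid_events(sys.stdin).items()):
        print(f"{device}: {n} Xid 119 event(s)")
```

Saved under a name of your choice (for example `xid_scan.py`), it can be fed the kernel log with `dmesg | python xid_scan.py`. Reading the kernel log may require elevated privileges on some nodes.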