If you’ve ever trained a large AI model and had it fail with an error like: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12345, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for…
If you’ve ever trained a large AI model and had it fail with an error like: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12345, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for…