Resume training runs from the exact step, even after mid-epoch failures on large multi-node clusters.
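
To make "exact step" concrete, here is a minimal sketch of one way step-exact resumption can work. It is an illustration, not the mechanism of any particular tool. It assumes the checkpoint stores a global step counter next to the model state, and that the data order is a pure function of a seed and the epoch number, so a restarted job can recompute its position inside a partly finished epoch. All names (`batches_for_epoch`, `save_checkpoint`, `load_checkpoint`, `train`, `fail_at`) are hypothetical, and the single float `weight` stands in for real model and optimizer state.

```python
import json
import os
import random
import tempfile


def batches_for_epoch(num_samples, batch_size, seed, epoch):
    """Shuffle sample indices deterministically for one epoch, then batch them."""
    rng = random.Random(seed * 1_000_003 + epoch)  # order depends only on (seed, epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def save_checkpoint(path, state):
    """Write to a temp file and rename, so a crash never leaves a truncated file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"global_step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, num_samples=10, batch_size=3, seed=0, fail_at=None):
    """Train until total_steps, resuming from whatever step the checkpoint holds."""
    state = load_checkpoint(path)
    steps_per_epoch = len(batches_for_epoch(num_samples, batch_size, seed, 0))
    seen = []
    while state["global_step"] < total_steps:
        # Recover the position inside the current epoch from the step counter alone.
        epoch, offset = divmod(state["global_step"], steps_per_epoch)
        batch = batches_for_epoch(num_samples, batch_size, seed, epoch)[offset]
        if fail_at is not None and state["global_step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.01 * sum(batch)  # stand-in for an optimizer update
        state["global_step"] += 1
        seen.append(batch)
        save_checkpoint(path, state)  # every step here; real jobs save less often
    return state, seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        try:
            train(ckpt, total_steps=7, fail_at=5)  # fails mid-epoch (4 steps per epoch)
        except RuntimeError:
            pass  # a scheduler would restart the job here
        state, _ = train(ckpt, total_steps=7)  # picks up at step 5, not at epoch start
        print(state["global_step"])
```

In this sketch a restarted run consumes the same batches, in the same order, as a run that never failed. On a multi-node cluster the same idea would need each rank to derive its shard from the shared seed, epoch, and step, and the checkpoint would also have to capture optimizer and random-number-generator state; those parts are omitted here.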