Deploy training jobs on SLURM or Kubernetes using the components of your choice, from Kubeflow to Ray. Keep your existing configurations and use Cedana to supercharge your cluster.
System-level checkpoint/restore capabilities ensure that no work is lost during mid-epoch failures, even on large multi-node clusters.
Safely spin training runs up and down without needing to reconstruct state.
Manage training runs across clusters, both on-prem and in the cloud. Resume from system-level checkpoints on GPUs anywhere.