Cedana unifies an entire fleet of GPUs and CPUs into a single logical, shared cluster, avoiding resource fragmentation and static reservation of capacity. It can tap spare capacity anywhere in the world, across cluster, region, and workload boundaries (training vs. inference), minimizing idle resources globally.
All jobs are safely preemptable: we automatically and continuously preserve stateful workloads, regardless of workload type or accelerator. This capability is transparent and integrates seamlessly with existing job schedulers, including Kubernetes and SLURM - unlike traditional application-level checkpoints.
Live migrate GPU workloads before failures happen, while system-level checkpoint/restore ensures no lost work from mid-epoch failures - even on large multi-node clusters.
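To make the contrast with application-level checkpoints concrete, here is a minimal sketch (ordinary PyTorch, not Cedana-specific code) of what "transparent" means in practice: the training loop below contains no checkpoint logic at all, because with system-level checkpoint/restore the process and GPU state are captured externally rather than by torch.save() calls in the application.

```python
# A plain PyTorch training loop with no checkpoint logic of its own.
# With transparent, system-level checkpoint/restore, process state
# (including GPU memory and optimizer state) is captured externally,
# so a preempted or migrated job can resume mid-epoch without any
# torch.save()/torch.load() bookkeeping in the application code.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for step in range(1000):
        x = torch.randn(32, 128, device=device)         # stand-in batch
        y = torch.randint(0, 10, (32,), device=device)  # stand-in labels
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # No torch.save() here: checkpointing happens at the system level.
```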
Assign individual jobs SLAs for reliability, cost, and other criteria - a requirement for efficiently sharing compute across users, groups, and use cases.
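As an illustration of what a per-job SLA might express (the field names below are hypothetical, not Cedana's actual schema), a sketch in Python:

```python
# Hypothetical sketch of a per-job SLA spec; the field names are
# illustrative only, not Cedana's actual schema.
from dataclasses import dataclass

@dataclass
class JobSLA:
    max_interruption_seconds: int  # reliability: tolerable downtime per preemption
    max_hourly_cost_usd: float     # cost ceiling used in placement decisions
    preemptible: bool              # whether the job may be checkpointed and displaced
    priority: int                  # tie-breaker when jobs compete for the same GPUs

# Example: a training run that tolerates brief preemption in exchange
# for access to cheaper spare capacity.
training_sla = JobSLA(
    max_interruption_seconds=120,
    max_hourly_cost_usd=8.0,
    preemptible=True,
    priority=10,
)
```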
Improve security and availability with confidential computing containers and VMs.
Manage training runs across clusters, both on-prem and in the cloud. Resume from system-level checkpoints on GPUs anywhere.
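To show the shape of a cross-cluster resume (checkpoint on one cluster, restore on GPUs in another), here is a purely illustrative sketch: the `CheckpointClient` class, its methods, and the endpoint are hypothetical stand-ins, not Cedana's actual API.

```python
# Hypothetical sketch only: CheckpointClient and its methods are
# illustrative stand-ins, not Cedana's actual API.
class CheckpointClient:
    """Illustrative stand-in for a checkpoint/restore control plane."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def dump(self, job_id: str) -> str:
        # Would trigger a system-level checkpoint and return its location.
        raise NotImplementedError("illustrative only")

    def restore(self, checkpoint: str, target_cluster: str) -> None:
        # Would resume the job from the checkpoint on GPUs in target_cluster.
        raise NotImplementedError("illustrative only")

# Conceptual flow: preserve a training run on-prem, resume it in the cloud.
# client = CheckpointClient("https://control-plane.example.com")
# ckpt = client.dump(job_id="train-run-42")
# client.restore(ckpt, target_cluster="cloud-us-east")
```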
We’ve deployed a test cluster where you can interact and experiment with the system.
Learn more about how Cedana is transforming compute orchestration and how we can help your organization.
From deploying on your cluster, to the marketplace, to GPU checkpointing, learn our system and get started quickly.