Snapshot and migrate your workloads across instances

Command your compute

Supports your cloud-native stack, tools, and containers, for CPU and GPU workloads:
NVIDIA, Docker, Kubernetes, Helm charts, Terraform, Kata Containers, Intel, Podman, AMD

Why Cedana?

Cedana (/ce'dana/) is a save/migrate/resume system for compute. We use low-level knowledge of the Linux kernel (through CRIU and other methods) to checkpoint and restore workloads across instances and vendors.

Reduce compute costs by 20%-80%

Eliminate idle compute. Cedana automatically suspends and resumes your workloads based on activity, and bin-packs containers across instances, freeing up resources at fine granularity.

Never lose work — even if hardware fails

On hardware or out-of-memory (OOM) failure, Cedana automatically resumes the workload on a new instance without losing work.


3x your performance

Accelerate cold starts and time to first token by resuming your CPU/GPU workload from its previous state. Eliminate boot time, initialization, and other startup steps.

"We reduced our cloud cost by 50% by integrating Cedana's Save, Migrate, and Resume capability into our product. If an instance fails, we can continue workloads without losing work, increasing reliability."
Debo Ray
CEO, DevZero

Use Cases

2-10x Faster Time to First Token

Drive customer experience gains and stickier customers
Conquer your cold starts: eliminate library and model initialization and optimization time
Fast image pulls: unlock network and storage efficiency, break registry bottlenecks
Maximize SLAs: seamlessly handle traffic surges and spikes
Reduce costs: reduce over-provisioning by more than 2x

Eliminate training disruptions. Never lose work.

Drive customer experience gains and stickier customers
Relentless: self-healing through GPU failures, with no manual intervention
Designed for large-scale training: automatically synchronizes multi-GPU and multi-node training jobs
Checkpoint autopilot: automatic versioning and intelligent storage and networking
Increase utilization: less downtime, no loss of training progress

Kubernetes redefined for stateful workloads

Scale Smarter. Run Faster. Spend Less.
AI-driven observability: auto-detects idle resources
Optimization that just works: intelligent bin packing and dynamic right-sizing of pods
Seamless live migration: pods, nodes, or entire clusters
Scale elastically and responsively: warm boots enable greater elasticity and faster performance
Higher SLAs, lower costs: automatically resumes pods through catastrophic failures while reducing over-provisioning

High Performance Computing

Accelerate time to insights while reducing TCO
A simple Helm chart provides a complete solution
Improve TCO: increase GPU utilization with self-healing workloads
Efficient resource sharing across groups: faster draining and replenishing of nodes at any time, without losing work
Checkpoint autopilot: automatic versioning and intelligent storage and networking

Maximize your CPU/GPU Utilization

See Cedana in Action
Request API Access

How It Works

Save

Save a process or container using our API. This captures the complete state of the workload, including process and filesystem state, open network connections, in-memory data (RAM and VRAM), namespaces, and everything in between.

Migrate

Migrate the workload onto another instance.

Resume

Resume the workload as a new process or container on another instance, with real-time performance and no service disruption.

Bin Packing

Use Save, Migrate, Resume (SMR) to implement policy-based automation. Cedana automatically suspends and resumes workloads based on activity, enabling fine-grained bin packing of containers. This saves up to 80% of compute costs.
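
To make the bin-packing idea concrete, here is a minimal sketch of one classic heuristic, first-fit decreasing. It is not Cedana's placement policy, which this page does not describe; the `Instance` class, the `first_fit_decreasing` function, and the single CPU dimension are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical instance with a fixed CPU capacity (in cores)."""
    capacity: float
    containers: list = field(default_factory=list)

    @property
    def used(self) -> float:
        # Total CPU demand of the containers placed on this instance.
        return sum(cpu for _, cpu in self.containers)


def first_fit_decreasing(demands: dict, capacity: float) -> list:
    """Pack containers (name -> CPU demand) onto instances of equal capacity."""
    instances = []
    # Place the largest containers first; they are the hardest to fit.
    for name, cpu in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if cpu > capacity:
            raise ValueError(f"{name} needs {cpu} cores, more than one instance provides")
        # Reuse the first instance with enough free capacity, else open a new one.
        target = next((i for i in instances if i.used + cpu <= capacity), None)
        if target is None:
            target = Instance(capacity)
            instances.append(target)
        target.containers.append((name, cpu))
    return instances


if __name__ == "__main__":
    demands = {"api": 2.0, "worker": 3.0, "cache": 1.0, "batch": 2.0}
    for idx, inst in enumerate(first_fit_decreasing(demands, capacity=4.0)):
        print(idx, [name for name, _ in inst.containers], inst.used)
```

In a save/migrate/resume setting, a packing like this would only decide where containers should go; moving a running container to its assigned instance is what the checkpoint and restore steps are for.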

Easy Integration

Use the Cedana REST API to checkpoint your application’s state, transfer it to a new instance, cloud, or resource, and resume operations. No code modifications needed.
# Checkpoint a container. The unquoted heredoc lets the shell expand the
# $VARIABLES, which are assumed to be set in your environment.
curl -X POST -H "Content-Type: application/json" -d @- "http://$CONTROLLER_URL:1324/checkpoint" <<EOF
{
  "checkpoint_data": {
    "container_name": "$CHECKPOINT_CONTAINER",
    "sandbox_name": "$CHECKPOINT_SANDBOX",
    "namespace": "$NAMESPACE",
    "checkpoint_path": "$CHECKPOINT_PATH",
    "root": "$ROOT"
  }
}
EOF

# Restore the container, using the same request body.
curl -X POST -H "Content-Type: application/json" -d @- "http://$CONTROLLER_URL:1324/restore" <<EOF
{
  "checkpoint_data": {
    "container_name": "$CHECKPOINT_CONTAINER",
    "sandbox_name": "$CHECKPOINT_SANDBOX",
    "namespace": "$NAMESPACE",
    "checkpoint_path": "$CHECKPOINT_PATH",
    "root": "$ROOT"
  }
}
EOF
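
The same two calls can be scripted. The sketch below mirrors the curl examples above in Python, using only the standard library. The endpoint paths, port 1324, and payload fields come from those examples; the function names, the idea of sending the restore to a second controller, and returning only the HTTP status codes are assumptions, since the response format is not documented here. Making the checkpoint files reachable from the target instance is out of scope.

```python
import json
import urllib.request


def build_payload(container, sandbox, namespace, checkpoint_path, root):
    """Return the JSON body used by both the /checkpoint and /restore calls."""
    return {
        "checkpoint_data": {
            "container_name": container,
            "sandbox_name": sandbox,
            "namespace": namespace,
            "checkpoint_path": checkpoint_path,
            "root": root,
        }
    }


def build_request(controller_url, action, payload):
    """Build (but do not send) a POST request for /checkpoint or /restore."""
    if action not in ("checkpoint", "restore"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        url=f"http://{controller_url}:1324/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def checkpoint_then_restore(source_url, target_url, payload, timeout=60):
    """Checkpoint via one controller, then restore via another (assumed flow).

    urlopen raises on HTTP error statuses, so the restore is only attempted
    if the checkpoint request succeeded. Returns both HTTP status codes.
    """
    statuses = []
    for url, action in ((source_url, "checkpoint"), (target_url, "restore")):
        request = build_request(url, action, payload)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            statuses.append(response.status)
    return tuple(statuses)
```

Keeping request construction separate from sending makes it possible to inspect or log the exact URL and body before anything touches a running workload.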

2-10x faster cold starts. Increase utilization up to 3x. Automated, stateful reliability.
