Command your compute_

Live Migration for CPU and GPU workloads

Kubernetes & SLURM aware. Runs anywhere your compute does.

Save, migrate and resume workloads.

Save a process, container, or VM with Cedana and migrate the workload onto another instance with real-time performance, zero interruption, and no code modifications.
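
In practice the flow is save, move, resume. A minimal sketch of that flow, assuming illustrative cedana subcommands, paths, and host names (see the API Reference & Guides for the real interface):

    # Illustrative save -> migrate -> resume flow. The cedana
    # subcommands, PID, paths, and host below are assumptions
    # for illustration, not exact syntax.
    import subprocess

    PID = 12345                  # hypothetical process to save
    IMAGE_DIR = "/tmp/ckpt"      # where the checkpoint image lands

    def run(cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Save the running process to a checkpoint image.
    run(["cedana", "dump", "process", str(PID), "--dir", IMAGE_DIR])

    # 2. Move the image to the target instance (any transport works).
    run(["rsync", "-a", IMAGE_DIR + "/", f"target-host:{IMAGE_DIR}/"])

    # 3. Resume on the target exactly where the process left off.
    run(["ssh", "target-host", f"cedana restore process --dir {IMAGE_DIR}"])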

HOW IT WORKS

Magically expand your favorite orchestration platforms

Works with Kueue (for HPC-style workloads), KServe (for inference), Kubeflow (for large-scale training), SLURM, and more.
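
As an example of what "aware" means here, a workload can be opted in directly from Kubernetes. A minimal sketch using the standard Python Kubernetes client; the annotation key is a hypothetical placeholder, and the real integration points for each platform are in the docs:

    # Sketch: opting a running pod into checkpoint/migrate.
    # The annotation key is a made-up placeholder for illustration.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    patch = {"metadata": {"annotations": {
        "cedana.example/checkpoint": "enabled",  # hypothetical key
    }}}
    v1.patch_namespaced_pod(name="trainer-0", namespace="ml", body=patch)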

PERFORMANCE

Real-time compute orchestration.

Scale workloads and clusters up and down with higher performance, better utilization, and faster response times than previously available. Preempt and save workloads quickly to downscale resources without losing progress or performance.
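
Conceptually, downscaling becomes save-then-release instead of kill. A self-contained sketch of that loop, with stand-in classes where Cedana and your scheduler would plug in:

    # Sketch: preempt-and-save before releasing a node, so no
    # progress is lost. All classes here are illustrative stand-ins.
    from dataclasses import dataclass, field

    @dataclass
    class Job:
        name: str
        def checkpoint(self):            # stand-in for a real save
            print(f"checkpointing {self.name}")

    @dataclass
    class Node:
        jobs: list = field(default_factory=list)

    def drain(node: Node, queue: list):
        for job in node.jobs:
            job.checkpoint()             # save full job state first
            queue.append(job)            # requeue to resume elsewhere
        node.jobs.clear()                # node is now safe to release

    queue: list = []
    drain(Node(jobs=[Job("train-7b"), Job("batch-42")]), queue)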

ORCHESTRATION

Increase reliability and availability.

Deliver best-in-class performance. We continuously optimize at the kernel, container, filesystem, network, and interconnect layers, and we use internal testing and simulation to thoroughly measure correctness, reliability, and performance.

RELIABILITY


Use Cases

Maximize value and reliability with automated GPU orchestration.

  • 20-80% increase in utilization with GPU live migrations
  • Automatic workload failover
  • Zero-downtime OS/HW upgrades
  • Dynamically resize workloads onto optimal instances without interruption (see the sketch after this list)
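
A sketch of the trigger behind these wins: watch per-node GPU utilization (the NVML bindings below are real) and consolidate mostly idle nodes; the migration helper is a hypothetical stand-in for a live migration:

    # Sketch: consolidate work off an underutilized GPU node.
    import pynvml

    def gpu_util(index: int = 0) -> int:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

    def migrate_workloads_to(target: str):
        # hypothetical stand-in for a live migration call
        print(f"would live-migrate this node's jobs to {target}")

    pynvml.nvmlInit()
    if gpu_util() < 20:                  # node is mostly idle
        migrate_workloads_to("packed-node-01")   # hypothetical name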

Highest-performance, lowest-cost inferencing.

  • 2-10x faster time-to-first-token
  • Dynamically resize workloads to optimal instances
  • Automatically reduce idle inferencing time
  • Use spot instances without interruption (see the sketch after this list)
  • Faster model hotswapping
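
A sketch of why spot becomes safe: poll the cloud's interruption notice (AWS's spot instance-action metadata endpoint, shown below, is real; other clouds have equivalents) and save state the moment a reclaim is scheduled. The checkpoint step itself is left as a stand-in:

    # Sketch: watch for a spot interruption notice, then save.
    # (IMDSv2 would also need a session token; omitted for brevity.)
    import time
    import urllib.request

    NOTICE = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_pending() -> bool:
        try:
            urllib.request.urlopen(NOTICE, timeout=1)
            return True          # notice present: ~2 minutes of warning
        except OSError:
            return False         # 404 or unreachable: no notice yet

    while not interruption_pending():
        time.sleep(5)
    print("spot reclaim incoming: save workload state here")  # checkpoint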

Increase the throughput, reliability, and speed of advanced large-model training.

  • Real-time checkpoint/restore of multi-node systems (see the sketch after this list)
  • Automatic workload failover that preserves in-progress work down to the mini-batch
  • Fully transparent, no code modifications
  • Fine-grained system-level checkpointing
  • High availability and reliability: swap in GPUs and nodes on failure
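
Because the checkpointing is transparent, protection can live entirely outside the training code. A sketch of an external watchdog; the cedana invocation and job name are illustrative assumptions:

    # Sketch: periodic system-level checkpoints of a training job,
    # with zero changes to the training code itself.
    import subprocess
    import time

    JOB_ID = "llm-pretrain"      # hypothetical job identifier
    INTERVAL = 600               # seconds between checkpoints

    while True:
        time.sleep(INTERVAL)
        subprocess.run(["cedana", "dump", "job", JOB_ID], check=True)
        print(f"saved {JOB_ID}; at most {INTERVAL}s of work at risk")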

Orchestrate agent inferencing and training autonomously. Maximize utilization, reliability, and performance.

  • Increase GPU utilization with efficient hot swapping and bin-packing (see the sketch after this list)
  • Dynamic scaling for:
    • Larger models
    • Increasing task complexity, context windows, and agent counts
    • Variable workload demands
  • Persistent agent state
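
Bin-packing is what turns movable workloads into utilization. A first-fit-decreasing sketch that packs agent jobs onto GPUs by memory footprint; all sizes are illustrative:

    # Sketch: first-fit-decreasing packing of jobs (GB) onto GPUs.
    def pack(jobs_gb: list[float], gpu_gb: float) -> list[list[float]]:
        gpus: list[list[float]] = []
        for job in sorted(jobs_gb, reverse=True):    # biggest first
            for gpu in gpus:
                if sum(gpu) + job <= gpu_gb:         # fits on this GPU
                    gpu.append(job)
                    break
            else:
                gpus.append([job])                   # open a new GPU
        return gpus

    print(pack([30, 10, 24, 8, 40, 12], gpu_gb=80.0))
    # -> [[40, 30, 10], [24, 12, 8]]: two GPUs instead of three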

Improve the performance and reliability of your gaming infrastructure.

  • Reduce latency by migrating workloads to player geographies
  • Load balance workloads to eliminate resource bottlenecks
  • Automated workload failover
  • Zero-downtime OS/HW upgrades

Increase automation, throughput, and reliability of your HPC workloads.

  • Never lose work on long-running workloads in SLURM (see the sketch after this list)
  • Schedule, queue, and prioritize workloads across users and groups dynamically
  • 20-80% lower compute costs
  • Increase workload throughput
  • Automate workflows conditionally based on time and success criteria
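
A sketch of a long SLURM submission that can be preempted and requeued without losing work; sbatch and its flags are standard SLURM, while the cedana wrapper is an illustrative assumption:

    # Sketch: submit a week-long, requeue-safe SLURM job.
    import subprocess

    cmd = [
        "sbatch",
        "--job-name", "md-sim",
        "--time", "7-00:00:00",        # days-hours:minutes:seconds
        "--requeue",                   # allow preemption + requeue
        "--wrap", "cedana run ./simulate.sh",  # hypothetical wrapper
    ]
    subprocess.run(cmd, check=True)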

Get started

Play in the sandbox

We’ve deployed a test cluster where you can interact and experiment with the system.

Sandbox

Get a demo

Learn more about how Cedana is transforming compute orchestration and how we can help your organization.

Connect

API Reference & Guides

From deploying on your cluster, to the market, to GPU Checkpointing, our guides help you learn the system and get started quickly.

VIEW DOCS
Backers / Partners