Command your compute_

Live Migration for CPU and GPU workloads

Kubernetes & SLURM aware. Runs anywhere your compute does.

Save, migrate and resume workloads.

Save a process, container, or VM with Cedana and migrate the workload onto another instance with real-time performance, zero interruption, and no code modifications.
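
In practice the flow is save, move, resume. A minimal sketch of that flow, assuming illustrative cedana subcommands, paths, and host names (see the API Reference & Guides for the real interface):

    # Illustrative save -> migrate -> resume flow. The cedana
    # subcommands, PID, paths, and host below are assumptions
    # for illustration, not exact syntax.
    import subprocess

    PID = 12345                  # hypothetical process to save
    IMAGE_DIR = "/tmp/ckpt"      # where the checkpoint image lands

    def run(cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Save the running process to a checkpoint image.
    run(["cedana", "dump", "process", str(PID), "--dir", IMAGE_DIR])

    # 2. Move the image to the target instance (any transport works).
    run(["rsync", "-a", IMAGE_DIR + "/", f"target-host:{IMAGE_DIR}/"])

    # 3. Resume on the target exactly where the process left off.
    run(["ssh", "target-host", f"cedana restore process --dir {IMAGE_DIR}"])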

HOW IT WORKS

Magically expand your favorite orchestration platforms

Works with Kueue (for HPC-style workloads), KServe (for inference), Kubeflow (for large-scale training), SLURM, and more.
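
As an example of what "aware" means here, a workload can be opted in directly from Kubernetes. A minimal sketch using the standard Python Kubernetes client; the annotation key is a hypothetical placeholder, and the real integration points for each platform are in the docs:

    # Sketch: opting a running pod into checkpoint/migrate.
    # The annotation key is a made-up placeholder for illustration.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    patch = {"metadata": {"annotations": {
        "cedana.example/checkpoint": "enabled",  # hypothetical key
    }}}
    v1.patch_namespaced_pod(name="trainer-0", namespace="ml", body=patch)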

PERFORMANCE

Real-time compute orchestration.

Scale workloads and clusters up and down with higher performance, better utilization, and faster response times than previously available. Preempt and save workloads quickly to downscale resources without losing progress or performance.
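
Conceptually, downscaling becomes save-then-release instead of kill. A self-contained sketch of that loop, with stand-in classes where Cedana and your scheduler would plug in:

    # Sketch: preempt-and-save before releasing a node, so no
    # progress is lost. All classes here are illustrative stand-ins.
    from dataclasses import dataclass, field

    @dataclass
    class Job:
        name: str
        def checkpoint(self):            # stand-in for a real save
            print(f"checkpointing {self.name}")

    @dataclass
    class Node:
        jobs: list = field(default_factory=list)

    def drain(node: Node, queue: list):
        for job in node.jobs:
            job.checkpoint()             # save full job state first
            queue.append(job)            # requeue to resume elsewhere
        node.jobs.clear()                # node is now safe to release

    queue: list = []
    drain(Node(jobs=[Job("train-7b"), Job("batch-42")]), queue)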

ORCHESTRATION

Increase reliability and availability.

Deliver best-in-class performance. We continuously optimize at the kernel, container, filesystem, network, and interconnect layers, and we use internal testing and simulation to thoroughly measure correctness, reliability, and performance.

RELIABILITY


Use Cases

Maximize value and reliability with automated GPU orchestration.

  • 20-80% increase in utilization with GPU live migrations
  • Automatic workload failover
  • Zero-downtime OS/HW upgrades
  • Dynamically resize workloads onto optimal instances without interruption (see the sketch after this list)
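
A sketch of the trigger behind these wins: watch per-node GPU utilization (the NVML bindings below are real) and consolidate mostly idle nodes; the migration helper is a hypothetical stand-in for a live migration:

    # Sketch: consolidate work off an underutilized GPU node.
    import pynvml

    def gpu_util(index: int = 0) -> int:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

    def migrate_workloads_to(target: str):
        # hypothetical stand-in for a live migration call
        print(f"would live-migrate this node's jobs to {target}")

    pynvml.nvmlInit()
    if gpu_util() < 20:                  # node is mostly idle
        migrate_workloads_to("packed-node-01")   # hypothetical name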

Highest-performance, lowest-cost inferencing.

  • 2-10x faster time-to-first-token
  • Dynamically resize workloads to optimal instances
  • Automatically reduce idle inferencing time
  • Use spot instances without interruption (see the sketch after this list)
  • Faster model hotswapping
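
A sketch of why spot becomes safe: poll the cloud's interruption notice (AWS's spot instance-action metadata endpoint, shown below, is real; other clouds have equivalents) and save state the moment a reclaim is scheduled. The checkpoint step itself is left as a stand-in:

    # Sketch: watch for a spot interruption notice, then save.
    # (IMDSv2 would also need a session token; omitted for brevity.)
    import time
    import urllib.request

    NOTICE = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_pending() -> bool:
        try:
            urllib.request.urlopen(NOTICE, timeout=1)
            return True          # notice present: ~2 minutes of warning
        except OSError:
            return False         # 404 or unreachable: no notice yet

    while not interruption_pending():
        time.sleep(5)
    print("spot reclaim incoming: save workload state here")  # checkpoint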

Increase the throughput, reliability, and speed of advanced large-model training.

  • Real-time checkpoint/restore of multi-node systems (see the sketch after this list)
  • Automatic workload failover that preserves in-progress work down to the mini-batch
  • Fully transparent, no code modifications
  • Fine-grained system-level checkpointing
  • High availability and reliability: swap in GPUs and nodes on failure
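
Because the checkpointing is transparent, protection can live entirely outside the training code. A sketch of an external watchdog; the cedana invocation and job name are illustrative assumptions:

    # Sketch: periodic system-level checkpoints of a training job,
    # with zero changes to the training code itself.
    import subprocess
    import time

    JOB_ID = "llm-pretrain"      # hypothetical job identifier
    INTERVAL = 600               # seconds between checkpoints

    while True:
        time.sleep(INTERVAL)
        subprocess.run(["cedana", "dump", "job", JOB_ID], check=True)
        print(f"saved {JOB_ID}; at most {INTERVAL}s of work at risk")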

Orchestrate agent inferencing and training autonomously. Maximize utilization, reliability, and performance.

  • Increase GPU utilization with efficient hot swapping and bin-packing (see the sketch after this list)
  • Dynamic scaling for:
    • Larger models
    • Increasing task complexity, context windows, and agent counts
    • Variable workload demands
  • Persistent agent state
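
Bin-packing is what turns movable workloads into utilization. A first-fit-decreasing sketch that packs agent jobs onto GPUs by memory footprint; all sizes are illustrative:

    # Sketch: first-fit-decreasing packing of jobs (GB) onto GPUs.
    def pack(jobs_gb: list[float], gpu_gb: float) -> list[list[float]]:
        gpus: list[list[float]] = []
        for job in sorted(jobs_gb, reverse=True):    # biggest first
            for gpu in gpus:
                if sum(gpu) + job <= gpu_gb:         # fits on this GPU
                    gpu.append(job)
                    break
            else:
                gpus.append([job])                   # open a new GPU
        return gpus

    print(pack([30, 10, 24, 8, 40, 12], gpu_gb=80.0))
    # -> [[40, 30, 10], [24, 12, 8]]: two GPUs instead of three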

Improve the performance and reliability of your gaming infrastructure.

  • Reduce latency by migrating workloads to player geographies
  • Load balance workloads to eliminate resource bottlenecks
  • Automated workload failover
  • Zero-downtime OS/HW upgrades

Increase automation, throughput, and reliability of your HPC workloads.

  • Never lose work on long-running workloads in SLURM (see the sketch after this list)
  • Schedule, queue, and prioritize workloads across users and groups dynamically
  • 20-80% lower compute costs
  • Increase workload throughput
  • Automate workflows conditionally based on time and success criteria
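
A sketch of a long SLURM submission that can be preempted and requeued without losing work; sbatch and its flags are standard SLURM, while the cedana wrapper is an illustrative assumption:

    # Sketch: submit a week-long, requeue-safe SLURM job.
    import subprocess

    cmd = [
        "sbatch",
        "--job-name", "md-sim",
        "--time", "7-00:00:00",        # days-hours:minutes:seconds
        "--requeue",                   # allow preemption + requeue
        "--wrap", "cedana run ./simulate.sh",  # hypothetical wrapper
    ]
    subprocess.run(cmd, check=True)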

Get started

Play in the sandbox

We’ve deployed a test cluster where you can interact and experiment with the system.

Sandbox

Get a demo

Learn more about how Cedana is transforming compute orchestration and how we can help your organization.

Connect

API Reference & Guides

From deploying on your cluster, to the market, to GPU Checkpointing, our guides help you learn the system and get started quickly.

VIEW DOCS
Backers / Partners