SkyPilot: An Intercloud Execution Layer

Sky Computing proposes a simple idea with large implications: treat the major clouds as interchangeable execution backends, and let a broker place jobs where performance, price, and availability are best at that moment. SkyPilot is one of the first practical systems built around this vision. It provides a job-level abstraction over AWS, GCP, Azure, and on-prem clusters and automates provisioning, failover, and data movement.

This post summarizes SkyPilot’s core design, where it fits, and the strengths and limitations you should understand before adopting it.

What SkyPilot Is

SkyPilot is an intercloud execution broker for batch and ML workloads. Instead of tying a job to a specific cloud, users define a task with resource requirements (e.g., “1× V100 or 1× TPU v3-8”), and SkyPilot decides:

  • where hardware is available
  • which region offers adequate capacity
  • which provider is most cost-effective
  • how to fall back when a preemption or capacity failure happens

SkyPilot provisions clusters, executes tasks, ships data if needed, and tears everything down.
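
A minimal sketch of that lifecycle with SkyPilot's Python API looks roughly like the following; the training command and cluster name are placeholders, and exact SDK signatures vary by release:

import sky

# Declare what the job needs, not where it should run.
task = sky.Task(name="train", run="python train.py")  # placeholder command
task.set_resources(sky.Resources(accelerators="V100:1"))

# SkyPilot finds capacity, provisions a cluster, syncs files, and runs the task.
sky.launch(task, cluster_name="train-v100")

# Tear the cluster down once outputs have been retrieved.
sky.down("train-v100")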

It focuses on coarse-grained, long-running workloads: training, data processing, analytics pipelines, HPC jobs, and large batch inference. It is not a scheduler for microservices or low-latency systems.

Strengths

1. Availability Through Multi-Location Failover

SkyPilot’s provisioner searches for capacity across zones, regions, and clouds, automatically falling back to the next candidate when a launch fails.
This is particularly effective for GPU and spot markets, where a shortage in one location no longer blocks the job outright.

In practice, this tends to mean shorter waits for scarce hardware and fewer preemption-driven restarts than pinning a job to a single cloud or region.
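
The core of the mechanism is a loop over a ranked list of candidate locations. The sketch below is only an illustration of the idea, not SkyPilot's implementation; the candidate list and the try_provision callback are hypothetical:

from typing import Callable, Iterable, Optional, Tuple

def provision_with_failover(
    candidates: Iterable[Tuple[str, str]],      # (cloud, region) pairs, ranked cheapest first
    try_provision: Callable[[str, str], bool],  # returns True if capacity was granted
) -> Optional[Tuple[str, str]]:
    # Walk the ranked candidates until one location grants capacity.
    for cloud, region in candidates:
        if try_provision(cloud, region):
            return cloud, region
    # All candidates failed: report the shortage instead of retrying one zone forever.
    return None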

2. Hardware Arbitrage Across Clouds

SkyPilot can place workloads across heterogeneous accelerators, provided the code supports each target:

  • NVIDIA GPU families
  • AWS Inferentia / Trainium
  • Google TPUs
  • AMD Instinct MI-series GPUs and confidential-compute VMs on Azure

Different clouds excel for different workloads; SkyPilot makes that practical to exploit.
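
In the Python API this amounts to listing per-cloud resource candidates. A hedged sketch is below; the accelerator strings follow SkyPilot's catalog naming, which is worth confirming with the sky show-gpus command before relying on them:

import sky

# Any one of these candidates satisfies the task; SkyPilot chooses among them
# based on availability and price.
candidates = {
    sky.Resources(cloud=sky.GCP(), accelerators="tpu-v3-8"),      # Google TPU
    sky.Resources(cloud=sky.AWS(), accelerators="A100:1"),        # NVIDIA GPU on AWS
    sky.Resources(cloud=sky.AWS(), accelerators="Inferentia:1"),  # AWS Inferentia
}

Passing such a set to a task's set_resources, as in the example shown later, lets the optimizer weigh the options.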

3. Cost Optimization

Compute cost often dwarfs data movement cost for training and HPC workloads.
SkyPilot’s optimizer uses an ILP-based search to choose the cheapest viable placement, even when that means crossing cloud boundaries and paying egress.

This enables meaningful optimizations: for example, training on TPUs and serving on Inferentia can cut costs by 30–80%, depending on the workload.
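
A toy calculation makes the compute-versus-egress tradeoff concrete; the prices below are illustrative placeholders, not quotes from any provider:

# Toy numbers: keep a 500 GPU-hour job where the data lives, or move it for cheaper compute.
gpu_hours           = 500
price_in_cloud      = 4.10    # $/GPU-hour where the dataset already resides (illustrative)
price_cheaper_cloud = 2.50    # $/GPU-hour at an alternative provider (illustrative)
dataset_gb          = 2_000
egress_per_gb       = 0.09    # rough order of magnitude for internet egress pricing

stay = gpu_hours * price_in_cloud
move = gpu_hours * price_cheaper_cloud + dataset_gb * egress_per_gb
print(f"stay: ${stay:,.0f}, move: ${move:,.0f}")  # stay: $2,050, move: $1,430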

4. Unified Job Definition and Resource Specification

Users declare what resources they need, not where to get them. For example:

Task("train").set_resources({
    Resources(accelerator="A100"), 
    Resources(accelerator="tpu-v3-8")
})

SkyPilot decides where those accelerators exist and provisions accordingly.

5. Builds on Existing Compatibility Instead of Forcing Standards

SkyPilot leverages interfaces that already exist universally:

  • containers
  • Ray
  • object stores (S3, GCS, Azure Blob)
  • SSH-based provisioning
  • standard ML frameworks

This avoids the “universal cloud standardization” problem that derailed earlier intercloud proposals.
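
Data movement rides on those same interfaces. A hedged sketch, assuming the Python Task API mirrors the YAML file_mounts field; the bucket path and script are hypothetical:

import sky

task = sky.Task(name="prep", run="python preprocess.py")  # placeholder command
# Inputs are synced from an object store onto the cluster before the task runs;
# outputs can be written back through the same bucket.
task.set_file_mounts({"/data": "s3://my-bucket/training-data"})  # hypothetical bucket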

Weaknesses and Limitations

1. Designed for Batch Workloads Only

SkyPilot is not suitable for:

  • online services
  • latency-critical workloads
  • multi-tenant platforms
  • very fine-grained parallelism

Its model is explicitly batch-oriented.

2. Limited Executor Diversity (Today)

SkyPilot primarily runs jobs using Ray clusters.
Kubernetes and HPC executors exist but are early-stage or incomplete.

Large organizations heavily invested in Kubeflow, Argo, or Flyte will see SkyPilot as complementary, not a replacement.

3. Data Gravity Remains a Constraint

SkyPilot can synchronize inputs and outputs via object stores, but cannot eliminate:

  • large lakehouse replication
  • block storage movement
  • stateful service migration

Very large datasets create natural placement constraints.

4. Quotas and Hardware Shortages Still Apply

SkyPilot retries intelligently, but:

  • account-level quotas
  • cross-cloud hardware shortages
  • region-wide GPU scarcity

are still real limitations.

5. Operational Maturity Is Still Growing

Compared to Terraform, Kubernetes, Ray, or managed workflow systems:

  • documentation is improving but still thinner than for mature tooling
  • logs and debugging output can be verbose
  • diagnosing multi-cloud failures requires cloud-specific knowledge
  • integrations with enterprise IAM and VPC setups are basic

SkyPilot is production-used by research and ML teams, but not yet standard in enterprise platforms.

Current Status (2025)

Across documentation, the NSDI’23 paper, and community activity:

  • Supported clouds: AWS, GCP, Azure
  • Supported on-prem: via local Kubernetes and Ray clusters
  • Primary executor: Ray (stable)
  • Kubernetes executor: experimental
  • Workload types: ML training, hyperparameter sweeps, analytics, ETL, HPC
  • Catalog: tens of thousands of instances and accelerator SKUs tracked
  • Community: active multi-institution research and engineering

SkyPilot is currently one of the most complete implementations of Sky Computing ideas.

Roadmap & Likely Evolution

Near-Term

  • Full-featured Kubernetes executor
  • Better on-prem burst support
  • Improved performance modeling and autoscaling
  • Integration with workflow systems (Airflow, Flyte, Prefect)
  • Stronger data caching and checkpointing strategies

Medium-Term

  • Policy-based placement: cost ceilings, carbon intensity, regions to avoid
  • Cloud-provider–agnostic network and identity primitives
  • Better support for multi-cloud stateful pipelines
  • Specialized execution backends (e.g., HPC schedulers such as Slurm or LSF)

Long-Term

  • Competing intercloud brokers, not just SkyPilot
  • True “compute marketplace” behavior driven by availability + cost
  • Broader adoption in research, education, and enterprise ML stacks

SkyPilot’s modular design makes these trajectories realistic.

Where SkyPilot Fits

Good Fit

  • ML training and research workloads
  • GPU-scarce teams needing alternative capacity
  • HPC-style pipelines
  • Teams with mixed cloud credits
  • Organizations exploring Sky Computing

Not Ideal

  • Stateful applications with data gravity
  • Real-time interactive services
  • Strictly regulated enterprise environments
  • Workloads requiring full reproducibility across clouds

SkyPilot shines when hardware flexibility matters more than strict locality or uniformity.

Summary

SkyPilot is a pragmatic step toward Sky Computing. It abstracts cloud choice at the job level, enabling workloads to run wherever hardware, price, or availability is best. It doesn’t replace workflow engines or schedulers; instead, it complements them by solving a narrower but high-impact problem: multi-cloud placement for compute-heavy batch tasks.

Its strengths are availability, cost optimization, and hardware flexibility. Its limitations lie in areas it intentionally doesn't address: low-latency services, stateful pipelines, or strict enterprise operational constraints.

As cloud hardware diversity continues to grow, especially for accelerators, SkyPilot is well-positioned to become a foundational piece of the emerging intercloud ecosystem.