SkyPilot: An Intercloud Execution Layer

Sky Computing proposes a simple idea with large implications: treat the major clouds as interchangeable execution backends, and let a broker place jobs where performance, price, and availability are best at that moment. SkyPilot is one of the first practical systems built around this vision. It provides a job-level abstraction over AWS, GCP, Azure, and on-prem clusters and automates provisioning, failover, and data movement.

This post summarizes SkyPilot’s core design, where it fits, and the strengths and limitations you should understand before adopting it.

What SkyPilot Is

SkyPilot is an intercloud execution broker for batch and ML workloads. Instead of tying a job to a specific cloud, users define a task with resource requirements (e.g., “1× V100 or 1× TPU v3-8”), and SkyPilot decides:

  • where hardware is available
  • which region offers adequate capacity
  • which provider is most cost-effective
  • how to fall back when a preemption or capacity failure happens

SkyPilot provisions clusters, executes tasks, ships data if needed, and tears everything down.
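
A minimal sketch of that lifecycle with SkyPilot's Python API looks roughly like the following; the training command and cluster name are placeholders, and exact SDK signatures vary by release:

import sky

# Declare what the job needs, not where it should run.
task = sky.Task(name="train", run="python train.py")  # placeholder command
task.set_resources(sky.Resources(accelerators="V100:1"))

# SkyPilot finds capacity, provisions a cluster, syncs files, and runs the task.
sky.launch(task, cluster_name="train-v100")

# Tear the cluster down once outputs have been retrieved.
sky.down("train-v100")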

It focuses on coarse-grained, long-running workloads: training, data processing, analytics pipelines, HPC jobs, and large batch inference. It is not a scheduler for microservices or low-latency systems.

Strengths

1. Availability Through Multi-Location Failover

SkyPilot’s provisioner searches for capacity across zones, regions, and clouds, automatically falling back to the next candidate when a launch fails.
This is particularly effective for GPU and spot markets, where a shortage in one location no longer blocks the job outright.

In practice, this tends to mean shorter waits for scarce hardware and fewer preemption-driven restarts than pinning a job to a single cloud or region.
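
The core of the mechanism is a loop over a ranked list of candidate locations. The sketch below is only an illustration of the idea, not SkyPilot's implementation; the candidate list and the try_provision callback are hypothetical:

from typing import Callable, Iterable, Optional, Tuple

def provision_with_failover(
    candidates: Iterable[Tuple[str, str]],      # (cloud, region) pairs, ranked cheapest first
    try_provision: Callable[[str, str], bool],  # returns True if capacity was granted
) -> Optional[Tuple[str, str]]:
    # Walk the ranked candidates until one location grants capacity.
    for cloud, region in candidates:
        if try_provision(cloud, region):
            return cloud, region
    # All candidates failed: report the shortage instead of retrying one zone forever.
    return None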

2. Hardware Arbitrage Across Clouds

SkyPilot can place workloads across heterogeneous accelerators, provided the code supports each target:

  • NVIDIA GPU families
  • AWS Inferentia / Trainium
  • Google TPUs
  • AMD Instinct MI-series GPUs and confidential-compute VMs on Azure

Different clouds excel for different workloads; SkyPilot makes that practical to exploit.
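
In the Python API this amounts to listing per-cloud resource candidates. A hedged sketch is below; the accelerator strings follow SkyPilot's catalog naming, which is worth confirming with the sky show-gpus command before relying on them:

import sky

# Any one of these candidates satisfies the task; SkyPilot chooses among them
# based on availability and price.
candidates = {
    sky.Resources(cloud=sky.GCP(), accelerators="tpu-v3-8"),      # Google TPU
    sky.Resources(cloud=sky.AWS(), accelerators="A100:1"),        # NVIDIA GPU on AWS
    sky.Resources(cloud=sky.AWS(), accelerators="Inferentia:1"),  # AWS Inferentia
}

Passing such a set to a task's set_resources, as in the example shown later, lets the optimizer weigh the options.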

3. Cost Optimization

Compute cost often dwarfs data movement cost for training and HPC workloads.
SkyPilot’s optimizer uses an ILP-based search to choose the cheapest viable placement, even when that means crossing cloud boundaries and paying egress.

This enables meaningful optimizations: for example, training on TPUs and serving on Inferentia can cut costs by 30–80%, depending on the workload.
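
A toy calculation makes the compute-versus-egress tradeoff concrete; the prices below are illustrative placeholders, not quotes from any provider:

# Toy numbers: keep a 500 GPU-hour job where the data lives, or move it for cheaper compute.
gpu_hours           = 500
price_in_cloud      = 4.10    # $/GPU-hour where the dataset already resides (illustrative)
price_cheaper_cloud = 2.50    # $/GPU-hour at an alternative provider (illustrative)
dataset_gb          = 2_000
egress_per_gb       = 0.09    # rough order of magnitude for internet egress pricing

stay = gpu_hours * price_in_cloud
move = gpu_hours * price_cheaper_cloud + dataset_gb * egress_per_gb
print(f"stay: ${stay:,.0f}, move: ${move:,.0f}")  # stay: $2,050, move: $1,430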

4. Unified Job Definition and Resource Specification

Users declare what resources they need, not where to get them. For example:

Task("train").set_resources({
    Resources(accelerator="A100"), 
    Resources(accelerator="tpu-v3-8")
})

SkyPilot decides where those accelerators exist and provisions accordingly.

5. Builds on Existing Compatibility Instead of Forcing Standards

SkyPilot leverages interfaces that already exist universally:

  • containers
  • Ray
  • object stores (S3, GCS, Azure Blob)
  • SSH-based provisioning
  • standard ML frameworks

This avoids the “universal cloud standardization” problem that derailed earlier intercloud proposals.
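
Data movement rides on those same interfaces. A hedged sketch, assuming the Python Task API mirrors the YAML file_mounts field; the bucket path and script are hypothetical:

import sky

task = sky.Task(name="prep", run="python preprocess.py")  # placeholder command
# Inputs are synced from an object store onto the cluster before the task runs;
# outputs can be written back through the same bucket.
task.set_file_mounts({"/data": "s3://my-bucket/training-data"})  # hypothetical bucket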

Weaknesses and Limitations

1. Designed for Batch Workloads Only

SkyPilot is not suitable for:

  • online services
  • latency-critical workloads
  • multi-tenant platforms
  • very fine-grained parallelism

Its model is explicitly batch-oriented.

2. Limited Executor Diversity (Today)

SkyPilot primarily runs jobs using Ray clusters.
Kubernetes and HPC executors exist but are early-stage or incomplete.

Large organizations heavily invested in Kubeflow, Argo, or Flyte will see SkyPilot as complementary, not a replacement.

3. Data Gravity Remains a Constraint

SkyPilot can synchronize inputs and outputs via object stores, but cannot eliminate:

  • large lakehouse replication
  • block storage movement
  • stateful service migration

Very large datasets create natural placement constraints.

4. Quotas and Hardware Shortages Still Apply

SkyPilot retries intelligently, but:

  • account-level quotas
  • cross-cloud hardware shortages
  • region-wide GPU scarcity

are still real limitations.

5. Operational Maturity Is Still Growing

Compared to Terraform, Kubernetes, Ray, or managed workflow systems:

  • documentation is improving but still thinner than for mature tooling
  • logs and debugging output can be verbose
  • diagnosing multi-cloud failures requires cloud-specific knowledge
  • integrations with enterprise IAM and VPC setups are basic

SkyPilot is production-used by research and ML teams, but not yet standard in enterprise platforms.

Current Status (2025)

Across documentation, the NSDI’23 paper, and community activity:

  • Supported clouds: AWS, GCP, Azure
  • Supported on-prem: via local Kubernetes and Ray clusters
  • Primary executor: Ray (stable)
  • Kubernetes executor: experimental
  • Workload types: ML training, hyperparameter sweeps, analytics, ETL, HPC
  • Catalog: tens of thousands of instances and accelerator SKUs tracked
  • Community: active multi-institution research and engineering

SkyPilot is currently one of the most complete implementations of Sky Computing ideas.

Roadmap & Likely Evolution

Near-Term

  • Full-featured Kubernetes executor
  • Better on-prem burst support
  • Improved performance modeling and autoscaling
  • Integration with workflow systems (Airflow, Flyte, Prefect)
  • Stronger data caching and checkpointing strategies

Medium-Term

  • Policy-based placement: cost ceilings, carbon intensity, regions to avoid
  • Cloud-provider–agnostic network and identity primitives
  • Better support for multi-cloud stateful pipelines
  • Specialized execution backends (e.g., HPC schedulers such as Slurm or LSF)

Long-Term

  • Competing intercloud brokers, not just SkyPilot
  • True “compute marketplace” behavior driven by availability + cost
  • Broader adoption in research, education, and enterprise ML stacks

SkyPilot’s modular design makes these trajectories realistic.

Where SkyPilot Fits

Good Fit

  • ML training and research workloads
  • GPU-scarce teams needing alternative capacity
  • HPC-style pipelines
  • Teams with mixed cloud credits
  • Organizations exploring Sky Computing

Not Ideal

  • Stateful applications with data gravity
  • Real-time interactive services
  • Strictly regulated enterprise environments
  • Workloads requiring full reproducibility across clouds

SkyPilot shines when hardware flexibility matters more than strict locality or uniformity.

Summary

SkyPilot is a pragmatic step toward Sky Computing. It abstracts cloud choice at the job level, enabling workloads to run wherever hardware, price, or availability is best. It doesn’t replace workflow engines or schedulers; instead, it complements them by solving a narrower but high-impact problem: multi-cloud placement for compute-heavy batch tasks.

Its strengths are availability, cost optimization, and hardware flexibility. Its limitations lie in areas it intentionally doesn't address: low-latency services, stateful pipelines, or strict enterprise operational constraints.

As cloud hardware diversity continues to grow, especially for accelerators, SkyPilot is well-positioned to become a foundational piece of the emerging intercloud ecosystem.