SkyPilot v0.5.0

@Michaelvll Michaelvll released this 27 Feb 04:36

SkyPilot v0.5.0: SkyServe, New Provisioner, LLMs, Kubernetes, and More Clouds

We are excited to release SkyPilot v0.5.0, which introduces a significant number of new features and enhancements, including:

  • SkyPilot Serving
  • New provisioner
  • LLM recipes for the latest open models and engines
  • Kubernetes support improvements
  • 4 new clouds (contributed by the cloud providers!)

and more!

Release Highlights

New Features

  • Multiple candidate resources: SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, any_of or ordered in resources), allowing users to significantly enlarge the resource pool and get higher availability.
  • New Provisioner: The provisioner has a new implementation that is 2x faster and more reliable on supported clouds. It supports launching clusters with more than 100 nodes, and the dependency requirements for clouds are significantly reduced.
  • Disk Tier: Introducing the best disk tier, which requests the best-performing disk available so you do not have to pick a tier for each cloud. (#2434)
  • Allow twice as many spot jobs to run concurrently
  • Mount storage back after cluster restart
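
The difference between any_of and ordered candidate resources can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not SkyPilot's optimizer: the dicts only mirror the layout of the resources field, the prices and the pick helper are hypothetical, and it assumes any_of means "take the cheapest available candidate" while ordered means "take the first available candidate in the listed order".

```python
from typing import Optional, Set

# Task specs written as plain dicts that mirror the `resources` layout.
# The accelerator names are examples only.
task_any_of = {
    "resources": {
        "any_of": [
            {"accelerators": "A100:8"},
            {"accelerators": "L4:8"},
            {"accelerators": "A10G:8"},
        ]
    }
}
task_ordered = {"resources": {"ordered": task_any_of["resources"]["any_of"]}}

# Hypothetical hourly prices, invented for this sketch.
PRICE = {"A100:8": 30.0, "L4:8": 8.0, "A10G:8": 16.0}


def pick(task: dict, available: Set[str]) -> Optional[str]:
    """Toy selection: `ordered` returns the first available candidate,
    `any_of` returns the cheapest available one."""
    res = task["resources"]
    if "ordered" in res:
        for cand in res["ordered"]:
            if cand["accelerators"] in available:
                return cand["accelerators"]
        return None
    usable = [c["accelerators"] for c in res["any_of"]
              if c["accelerators"] in available]
    return min(usable, key=PRICE.__getitem__) if usable else None
```

With only A100:8 and A10G:8 available, pick(task_any_of, ...) returns the cheaper A10G:8, while pick(task_ordered, ...) returns A100:8 because it is listed first. Either way, listing more candidates gives the task more chances to find capacity.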

SkyServe

SkyServe is a serving system built on top of SkyPilot that deploys and scales any HTTP service across one or more regions or clouds, with autoscaling, load balancing, and more.

  • Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (#2458)
  • Autoscaler: Request-rate-based autoscaling policy. (#2868, #2878)
  • Autoscaler: Support scaling to 0 when there are no requests (#2938)
  • Rolling update: Support rolling update for existing services (#2935, #3057)
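
The request-rate-based policy and scale-to-zero behavior listed above can be sketched roughly as follows. This is a minimal illustration of the general idea, not SkyServe's actual autoscaler: the function name and its parameters are assumptions made for the example, and a real autoscaler would also smooth the request rate over a time window and delay scale-down decisions.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each replica stays at or below the target rate,
    clamped to the configured [min_replicas, max_replicas] range."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

In this sketch, a minimum of 0 replicas is what permits scaling to zero: with no incoming requests the computed need is 0, and nothing clamps it back up.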

Other Enhancements

New LLM Recipes

Kubernetes

Kubernetes support received a number of new features and enhancements.

  • Multi-node support for Kubernetes (#2609, #3019)
  • Open ports support for Kubernetes (#2588, #2713, #2997, #3200)
  • Support CoreWeave label for GPUs in Kubernetes (CoreWeave support under development) (#2650)
  • Starting a Kubernetes GPU cluster locally with sky local up (#2890)
  • Custom Image Support for Kubernetes Instances (#2729, #3019, #3210)
  • New provisioner for Kubernetes for better performance and robustness (#3019)
  • Support Kubernetes clusters launched with k3s and Rancher (#3148)

Other Enhancements

More Clouds

SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: VMware vSphere, RunPod, Fluidstack, and Cudo Compute.

Clouds

AWS

New Features

  • New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (#1702, #2719, #2792)
  • Support for AWS Trainium accelerator (#2690)
  • Support null for proxy command to filter regions (#2756)
  • Support CUDA 12.1 with default image updates (#2788)
  • Job scheduling on Inferentia and Trainium (#2969, #2798)
  • Allow specifying security_group (#3133)

Enhancements

  • Make public / private subnet selection robust (#2867)
  • Avoid hanging for restarting an instance in STOPPING state (#2998)
  • Remove sunset instance types (#2610)
  • Add docs for custom VPC support (#2776)

Fixes

  • Fix conda installation on AWS default image (#3206)
  • Robustify the custom image support (#3216)
  • Fix subnet selection for AWS and autodown for spot instances (#2921)
  • Fix minimal permission for AWS (#2978)
  • Improve opening ports for AWS (#2716)
  • Autostop with new provisioner (#2719)

GCP

New Features

  • Security: Custom VPC support for GCP. (#2764, #2772, #2854, #2944)
  • Security: Support private IP with proxy jump on GCP. (#2819)
  • New provisioner: Adopt the new provisioner for GCP, with >2x faster and more robust provisioning (#2681, #2719, #2943)
  • Automatically use reserved instances from multiple reserved pools (#2836, #2681)
  • Support L4 accelerator for GCP (#2724)
  • Allow stopping spot clusters on GCP (#2877)

Enhancements

  • Allow stopping VM with local SSD (#2587)
  • Update default runtime version for TPU node (#2601, #2602)
  • Handle transient errors when launching GCP clusters (#2669)
  • Update GCSFuse version to 1.3.0 for GCS storage mount (#2887)
  • Set TPU VM the default option for TPU accelerators (#1758)
  • Ignore missing GCP credentials for latest gcloud and avoid duplicating credentials (#3028, #3172, #3234)

Fixes

  • Fix custom docker image support (#3218)
  • Fix minimal roles required for GCP (#2704)
  • Robustify the catalog fetching (#3141)
  • Fix ports on TPU VM and cluster launched before 0.4.0 (#2641)
  • Fix backward compatibility issue with GCP clusters (#2604)
  • Fix --disk-size for Custom Machine Images (#2718)
  • Update catalog fetcher with more options (#2562)
  • Assign GCP VMs with service account (#2972)
  • Fix machine image support (#3030, #3236)
  • Fix error handling for failed provisioning (#2852)
  • Leave out TPU v5 in catalog as it is not supported (#2656)
  • Fix GCP minimal permission (#2947, #2770, #2761)

Azure

Enhancements

  • Make port opening more robust (#2649, #2891, #3084)
  • Additional arguments for Azure catalog fetcher and support H100 (#2561, #2844, #2847)
  • Support CUDA 12.1 with default image updates (#2468)
  • Support spot instances on Azure (#2871)

Fixes

  • Fix custom docker image support (#3218)
  • UX: Fix Azure disk tier explicitly shown in resources str (#3064)
  • Fix status query for Azure (#3015)

SCP

  • Fix SCP error raised in sky check (#3038)

CLI & Core interfaces

New Features

  • Multi-node jobs fail fast on a single-node failure (#3081)
  • Add configurations for not uploading credentials (#2904)
  • Add sky status --endpoints CLI (#3199)
  • Support more characters in cluster name (#3130)
  • Show all regions and more accurate price in sky show-gpus (#2583, #2892, #2933, #2946, #3083, #3149, #3113)
  • Allow inferring cloud from region or zone (#2632)
  • Add --commit and --version for sky CLI (#2720, #2731, #2733)

Enhancements

  • Robustify runtime initialization on remote cluster (#3132)
  • Better error message for YAML parsing (#3040)
  • Smarter GPU name completion (#3014)
  • Speed up retry until up by not doing exponential backoff (#2821)
  • Add schema validation for config (#2645)
  • Allow --disk-tier none override (#2906)
  • sky check improvement (#3174, #3212, #3160)
  • Better logging for CLIs (#2535, #2691, #2728, #3139, #3175)
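
To see why dropping exponential backoff speeds up retry until up, the following sketch compares the total time spent sleeping under the two strategies. The delays, the cap, and the function name are hypothetical values chosen for illustration and are not taken from SkyPilot's implementation.

```python
def total_wait(attempts: int, base: float, exponential: bool,
               cap: float = 300.0) -> float:
    """Total seconds spent sleeping across `attempts` retries.

    With `exponential=True` the delay doubles after each retry (up to
    `cap`); otherwise every retry waits the same `base` seconds.
    """
    total = 0.0
    for i in range(attempts):
        delay = min(cap, base * (2 ** i)) if exponential else base
        total += delay
    return total
```

For example, with a 5-second base delay and 5 retries, the exponential schedule sleeps 5 + 10 + 20 + 40 + 80 = 155 seconds in total, while the fixed schedule sleeps 25 seconds, so capacity that frees up is noticed much sooner.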

Fixes

  • Fix permission issues for SSH config file on specific Linux distributions (#3151)
  • Fix sky_logs and mounting directory (#2667, #2845)
  • Fix job related commands (#2662, #2767)
  • Fix sky logs with --sync-down (#2660)

Deprecations

  • Deprecate cpunode/gpunode/tpunode, hide admin (#2800)
  • Remove deprecated Local cloud which is now replaced by Kubernetes support (#3037, #3186)

Backend/Provisioner

New Features

  • Support multiple candidate resources (#2498, #2803, #2833, #2886, #3107)
  • Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (#3004, #3005)
  • Support spaces in paths (#2762)
  • Support long local username with special characters (#3105, #3130)

Enhancements

  • Robustify termination of failed clusters during failover (#2990)
  • Improve the ssh check for clusters just provisioned (#2797)
  • Robustify failover to avoid terminating clusters that have user data (#2977)
  • Move ssh config to ~/.sky/generated/ssh instead of directly editing ~/.ssh/config (#2706, #3069)
  • Code refactoring and cleanup (#2541, #2736, #3046, #2633, #2870, #2925, #3087, #3088, #3153)
  • Improve usage collection (#2654, #2672)
  • Better explanation of failover in docs (#2850, #2834)

Fixes

  • Avoid backward compatibility issue with provisioner (#2682)
  • Fix cloud provisioning internal file mount cache (#2715)
  • Fix optimization for DAG when some resources provided are not feasible (#2657)
  • Fix runtime installation on remote VM (#2909, #2912)
  • Fix cluster termination when the cluster is not fully UP (#3025)
  • Fixes for tests (#2651, #2976, #3023, #3166, #3167, #3202)
  • Improve logging (#2594, #2678, #2696, #3003)

Managed spot

New Features

  • Allow twice as many spot jobs to run concurrently (#3191, #3208)

Enhancements

  • Better logging and UX (#2630)
  • Add docs for customizing spot controller (#2753)
  • Add spot pipeline docs (#2936)

Fixes

  • Fix private VPC support for spot jobs (#2874)
  • Fix ~/.sky/config.yaml for spot jobs (#2876)
  • Fix OOM for long running spot jobs (#2675)
  • Fix AWS NoCredentialError caused by credential rotation (#2695)
  • Fix Azure dependency on spot controller (#2875)

Storage

New Features

  • Mount storage back to clusters after restart (#2322, #2804)

Enhancements

  • Clarify the syntax for external and managed storage (#3162, #2804)
  • Confirmation prompt for sky storage delete, and --yes flag to skip it (#2726)
  • Refactor and clean up storage code (#2774, #2986)

Fixes

  • Fix permission issue for S3 mounting on specific images (#3215)
  • Fix spaces in source path for storages (#2835)

Dependencies

  • Recommend nightly build in docs for better performance and robustness (#2984)
  • Automatic build for nightly Docker image (#2229)
  • Avoid ray dependency locally for AWS, GCP, and Kubernetes (#2625, #2943, #3019)
  • Remove AWS dependency by default for better setup time and fewer conflicts (#2841, #2942)
  • Fix GCP dependency by updating google-api-python-client (#2577, #2759)
  • Pin remote dependency for ray job (#2659)
  • Robustify dependencies (#2642, #2679, #3024)

Examples

  • NeMo distributed training for BERT and GPT3 (#2533)
  • Add docker compose example to run multiple containers (#2745)
  • Distributed ray train example (#2828)
  • Benchmark Torch DDP (#2987)
  • Example updates for supported models (#2637, #2825)

Full Changelog: v0.4.0...v0.5.0

Thanks to all contributors!

New contributors: @rtalaricw, @jackyk02, @Vaibhav2001, @rohanvaidya45, @shrinandan, @manishiitg, @amitkumarj441, @tgaddair, @aseriesof-tubes, @changxiaohui, @thams, @kishb87, @PratikKumar125, @mmcclean, @dtran24, @davidwagnerkc, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123

Many thanks to everyone who contributed to this release!

Contributors: @Michaelvll, @concretevitamin, @cblmemo, @romilbhardwaj, @MaoZiming, @landscapepainter, @sunny0826, @suquark, @Vaibhav2001, @infwinston, @hemildesai, @asaiacai, @shrinandan, @kishb87, @rtalaricw, @iojw, @aseriesof-tubes, @manishiitg, @jackyk02, @mmcclean, @thams, @amitkumarj441, @rohanvaidya45, @saihtaungkham, @tgaddair, @davidwagnerkc, @PratikKumar125, @dtran24, @changxiaohui, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123