SkyPilot v0.6.0

@romilbhardwaj romilbhardwaj released this 30 May 23:33
e37a39d

SkyPilot v0.6.0: Jobs API, SkyServe on Kubernetes, Spot + On-demand mixing, Paperspace support and more!

We are excited to release SkyPilot v0.6.0! This release includes a number of new features:

  • Managed Jobs for job execution and recovery
  • SkyServe and Jobs on Kubernetes
  • Mix on-demand and spot instances in SkyServe
  • New cloud: Paperspace

Release Highlights

Managed Jobs

  • The spot controller has been enhanced to support any job on on-demand or spot instances.
    • To use, run sky jobs launch instead of sky spot launch.
  • The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
  • The sky jobs API is identical to the sky spot API, but also supports on-demand instances.
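
Because the sky jobs CLI mirrors the sky spot CLI, existing scripts can usually be migrated by swapping the subcommand. The helper below is a minimal sketch of that rewrite; `migrate_spot_command` is a hypothetical name used for illustration and is not part of SkyPilot.

```python
import shlex


def migrate_spot_command(command: str) -> str:
    """Rewrite a `sky spot ...` command line to `sky jobs ...`.

    Commands that do not start with `sky spot` are returned unchanged
    (apart from normalized quoting).
    """
    tokens = shlex.split(command)
    if len(tokens) >= 2 and tokens[0] == "sky" and tokens[1] == "spot":
        tokens[1] = "jobs"
    return shlex.join(tokens)


if __name__ == "__main__":
    print(migrate_spot_command("sky spot launch -n train task.yaml"))
```

The arguments after the subcommand are left untouched, which relies on the two CLIs accepting the same options as stated above.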

SkyServe and Jobs on Kubernetes

  • SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes
    • This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
  • Simply run sky jobs launch or sky serve up: SkyPilot automatically deploys the controller on your Kubernetes cluster when one is available and runs jobs in the cheapest available location.

Mix on-demand and spot instances in SkyServe

  • SkyServe now supports a new intelligent policy for mixing spot and on-demand instances.
    • Uses on-demand instances to ensure availability and spot instances to save costs.
  • Dynamically falls back to on-demand replicas when spot replicas are not available.
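
To make the mixing behavior concrete, here is a simplified model of how a target replica count could be split between spot and on-demand instances. It is an illustrative sketch, not SkyServe's actual policy code; the function and parameter names (`plan_replicas`, `base_on_demand`, `spot_available`, `dynamic_fallback`) are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPlan:
    spot: int
    on_demand: int


def plan_replicas(target: int, base_on_demand: int, spot_available: int,
                  dynamic_fallback: bool = True) -> ReplicaPlan:
    """Split `target` replicas between spot and on-demand instances.

    A fixed base of on-demand replicas guards availability; the rest are
    requested as spot. If fewer spot instances are obtainable than wanted
    and dynamic fallback is enabled, the shortfall is covered on-demand.
    """
    if min(target, base_on_demand, spot_available) < 0:
        raise ValueError("counts must be non-negative")
    on_demand = min(base_on_demand, target)
    wanted_spot = target - on_demand
    spot = min(wanted_spot, spot_available)
    if dynamic_fallback:
        on_demand += wanted_spot - spot  # cover the spot shortfall
    return ReplicaPlan(spot=spot, on_demand=on_demand)


if __name__ == "__main__":
    print(plan_replicas(target=4, base_on_demand=1, spot_available=3))
    print(plan_replicas(target=4, base_on_demand=1, spot_available=1))
```

With fallback disabled, the same shortfall would simply leave the service below its target until spot capacity returns.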

Paperspace support

  • Newest cloud to join the Sky: Paperspace!
    • Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
  • Simply add your Paperspace API key to ~/.paperspace/config.json and run sky check paperspace to get started.
  • Big thanks to @asaiacai for contributing Paperspace support!
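
The credential file can also be written programmatically. The sketch below assumes the file is a JSON object with a single `api_key` field; that field name is an assumption not stated in these notes, so check the SkyPilot Paperspace documentation before relying on it. `write_paperspace_config` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def write_paperspace_config(api_key: str, path: Optional[Path] = None) -> Path:
    """Write the Paperspace API key to a JSON config file and return its path.

    Defaults to ~/.paperspace/config.json. The "api_key" field name is an
    assumption made for this sketch.
    """
    if path is None:
        path = Path.home() / ".paperspace" / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"api_key": api_key}))
    path.chmod(0o600)  # keep the key readable by the owner only
    return path
```

After writing the file, running sky check paperspace as described above should confirm whether the credentials are picked up.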

More LLMs and Recipes

Deprecation Notes

The following features have been deprecated and will be removed in the next minor release:

  • sky spot CLI: use sky jobs CLI instead.
  • core.spot_xxx APIs: refactored to jobs.xxx.
  • qps_lower_threshold and auto_restart in service: use target_qps_per_replica instead.
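
For the move to target_qps_per_replica, the following sketch shows one plausible way a per-replica QPS target translates into a replica count. It is a simplified illustration, not SkyServe's autoscaler, which may apply additional smoothing or delays; `desired_replicas`, `min_replicas` and `max_replicas` are names assumed for this example.

```python
import math


def desired_replicas(observed_qps: float, target_qps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Return the replica count that keeps per-replica load at or below target.

    The result is clamped to the [min_replicas, max_replicas] range.
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(observed_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 25 QPS at a target of 10 QPS per replica needs 3 replicas.
    print(desired_replicas(25, 10, min_replicas=1, max_replicas=5))
```

A single target replaces separate upper and lower thresholds: the same number drives both scaling up and scaling down.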

Changelog

Managed Jobs

  • Changes made to the local catalog at ~/.sky/catalog are now reflected on the controller (#3289)
  • The name of the spot job is now included in the SKYPILOT_TASK_ID environment variable (#3424)
  • Legacy spot job APIs have been refactored from core.spot_xxx to jobs.xxx (#3417)
  • Cloud for the controller is now chosen based on the resources of the replicas (#3363)
  • Bug fixes (#3302, #3397, #3459, #3468, #3480)

SkyServe

New Features

  • New intelligent policy for mixing spot and on-demand instances in SkyServe (#3194)
  • SkyServe now uses proxy instead of HTTP redirect responses for better performance (#3395)
  • Readiness probe now supports headers: this is useful for authentication or other headers required for readiness checks (#3552)
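
As an illustration of why probe headers matter, the sketch below builds the kind of HTTP request a readiness check with an authentication header might send. It uses only the Python standard library and does not reproduce SkyServe's probing code; `build_probe_request`, the `/health` path and the bearer token are hypothetical.

```python
import urllib.request
from typing import Dict, Optional


def build_probe_request(endpoint: str, path: str = "/health",
                        headers: Optional[Dict[str, str]] = None
                        ) -> urllib.request.Request:
    """Build a GET request for a readiness check, attaching custom headers."""
    url = endpoint.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers=dict(headers or {}), method="GET")


if __name__ == "__main__":
    req = build_probe_request(
        "http://localhost:8080",
        "/health",
        {"Authorization": "Bearer my-token"},  # hypothetical token
    )
    print(req.full_url, req.get_header("Authorization"))
```

A replica that rejects unauthenticated requests would fail a header-less probe even when healthy, which is the case this feature addresses.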

Enhancements

  • Optimizations - replicas are reused when only service section is changed (#3214)
  • Rolling updates are now the default behavior for SkyServe (#3249)
  • Controller cloud is now chosen from replica resources if it is not already up (#3231)
  • Bug fixes and API improvements (#3257, #3299, #3303, #3411, #3546)

Kubernetes

  • Kubernetes clusters can now run SkyServe and Managed Jobs (#3377, #3524, #3521)
  • sky show-gpus now shows realtime availability of GPUs in the cluster (#3499)
  • Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (#3513, #3415)
  • Use Kubernetes service accounts by specifying remote_identity in ~/.sky/config.yaml (#3377, #3527)
  • sky local up now also automatically installs the Nginx Ingress Controller (#3223)
  • Support for specifying custom pod configurations with pod_config (#3244)
    • Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting HTTP_PROXY and more!
  • Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (#3333)
    • Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
  • Support for PodIP mode for exposing ports (#3445)

Enhancements

  • GPU Isolation: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (#3443)
  • Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (#3263, #3373)
  • All SkyPilot pods are now labelled with skypilot-user to identify the owner of the pod (#3576)
  • Special characters in environment variables are now correctly parsed (#3322)
  • GPU labelling is now more robust (#3274)
  • Bug fixes and quality of life improvements (#3266, #3392, #3439, #3509, #3524, #3525, #3532, #3563, #3578, #3374)

CLI & Core interfaces

New Features

  • resources now supports a labels field to set labels on cloud resources (instance tags on AWS, labels on GCP and Kubernetes) (#3464, #3505)
  • sky check now supports checking credentials for specific clouds, e.g. sky check aws gcp (#3229)
    • You can also restrict which clouds are checked by setting allowed_clouds in ~/.sky/config.yaml. (#3556)
  • any_of or ordered fields in resources can now have clouds that are not enabled (#3567)
  • A new environment variable, SKYPILOT_CLUSTER_INFO, containing the cluster name, cloud, region and zone, is now available in all tasks (#3424)
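
A task can read this variable at runtime. The sketch below assumes the value is a JSON object with keys such as `cluster_name`, `cloud`, `region` and `zone`; the encoding and key names are assumptions not stated in these notes, so verify them against the SkyPilot environment variable documentation.

```python
import json
import os
from typing import Mapping, Optional


def read_cluster_info(environ: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Parse SKYPILOT_CLUSTER_INFO, or return None when it is not set.

    Assumes the variable holds a JSON object (an assumption of this sketch).
    """
    raw = environ.get("SKYPILOT_CLUSTER_INFO")
    if raw is None:
        return None
    return json.loads(raw)


if __name__ == "__main__":
    info = read_cluster_info()
    if info is None:
        print("Not running inside a SkyPilot task")
    else:
        print(info.get("cloud"), info.get("region"), info.get("zone"))
```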

Enhancements

  • Optimizer is up to 10x faster when multiple resources are specified (#3567)
  • Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (#3205)
  • GCP GPUs now include DEVICE_MEM in sky show-gpus (#3375)
  • Better sorting for sky show-gpus (#3492)
  • Handling for usernames containing invalid characters (#3528)
  • Null environment variables now raise an error (#3557)

Runtime & Backend

Optimizations

  • Lazy imports for 2x faster import times (#3394, #3463)
  • Faster setup and job submission (#3523, #3484)
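
The general idea behind lazy imports is to defer loading a module until one of its attributes is first used, so that importing the package itself stays cheap. The class below is a generic sketch of that pattern using the standard library; it is not SkyPilot's implementation, and `LazyImport` is a name chosen for this example.

```python
import importlib


class LazyImport:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None

    def load(self):
        # Import once and cache the module object for later accesses.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name: str):
        # Only called for attributes not found on the proxy itself.
        return getattr(self.load(), name)


if __name__ == "__main__":
    json_mod = LazyImport("json")          # nothing is imported yet
    print(json_mod.dumps({"lazy": True}))  # the import happens here
```

The saving comes from heavy dependencies, such as cloud SDKs, that many commands never touch.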

Cloud: GCP

  • H100 GPUs are now supported on GCP (#3279)
  • Support for fine-grained GCP IAM permissions (#3284)

Cloud: Azure

  • Custom images are now supported on Azure. Simply specify image_id in the resources field. (#3362)
  • 8x faster autostop for Azure (#3519)
  • Fix GPUs not being detected in Azure (#3313)
  • Provisioning fixes (#3483)

Cloud: AWS

  • Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (#3488, #3514)
  • SkyPilot can now be run in ECS containers by assuming container-role IAM roles (#3503)
  • SkyPilot will not delete user-specified security groups (#3402)

Cloud: Fluidstack

  • H100 and A100 Nvlink support for Fluidstack (#3467)
  • Opening ports is now supported for Fluidstack (#3294)
  • Bug fixes (#3254, #3265)

Other Clouds

  • Bug fixes for Lambda provisioning and termination (#3409, #3410)
  • Multi-gpu fixes for RunPod (#3291)
  • Cudo: handle missing project errors (#3438)

Thanks to all contributors!

New contributors: @MysteryManav, @JGSweets, @Harthgar, @mjkanji

Many thanks to everyone who contributed to this release!

Contributors: @Michaelvll, @romilbhardwaj, @concretevitamin, @cblmemo, @MaoZiming, @shethhriday29, @asaiacai, @JGSweets, @mjkanji, @MysteryManav, @landscapepainter, @Harthgar, @mjibril, @dtran24, @fozziethebeat, @JungleCatSW

Full Changelog: v0.5.0...v0.6.0