
Releases: zillow/metaflow

zg-2.2

22 Mar 22:51
18ad4df


Full Changelog: zg-2.1...zg-2.2

zg-2.1

15 Feb 18:50
ce30b72


Full Changelog: 1.3.2409+2.5.4...zg-2.1

zg-1.3

01 Nov 17:55
22c3f21

Tagging the last version of 1.3 before the 2.0 major feature/AIP major version update.


Full Changelog: zg-1.2...1.3.2409+2.5.4

zg-1.2

29 Apr 17:43
6a1ffd0

To use this version you need build 1.2.1418+2.5.4 or above.

Main Changes

  • Upstream Merge
  • Features
    • A Flow can now trigger downstream pipelines uploaded to KFP (#150)
    • metaflow.S3 tmproot now defaults to the attached PVC (#169); see the sketch after this list
  • Compatibility Fix
    • Fix compatibility issue with Argo 1.5.0 (#174)
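The effect of the tmproot change is roughly what this sketch does by hand: staging metaflow.S3 downloads on the step's attached volume instead of node-local disk. This is an illustration only; the bucket names are made up, and the explicit tmproot argument is shown just to make the behavior visible, since with #169 the PVC should be used by default.

from metaflow import FlowSpec, S3, resources, step

class S3OnVolumeFlow(FlowSpec):

    @resources(volume="30G")  # attach a PVC so large downloads don't fill node-local disk
    @step
    def start(self):
        # tmproot controls where S3 stages downloaded objects locally.
        with S3(tmproot="/opt/metaflow_volume") as s3:
            objs = s3.get_many(["s3://my-bucket/part-0", "s3://my-bucket/part-1"])
            print([obj.path for obj in objs])
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    S3OnVolumeFlow()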


Full Changelog: zg-1.0...zg-1.2

zg-1.0

29 Apr 17:29
a42dc9e

Main Changes

  • Bug fixes for @s3_sensor, PLEG stability, and node utilization issues (using the high-memory toleration)
  • Support for pytest coverage
  • Switch to the ZG version scheme so that our internal breaking changes are reflected in the version number

What's Changed

  • AIP-4600: Refactor Gitlab CI pipeline to include publishing lib to Artifactory by @alexlatchford in #115
  • @kfp(image=) to support customers who want to specify image per step by @hsezhiyan in #120
  • AIP-4600: Relax pylint version by @alexlatchford in #125
  • AIP-5183 - Fixing regression in @s3_sensor by @hsezhiyan in #126
  • Pod toleration based on CPU and memory by @cloudw in #129
  • AIP-5103: Swap over feature branches to use dev releases versioning scheme by @alexlatchford in #128
  • AIP-5103: Move to leverage the aip-py-cpu base image and remedy Python build errors by @alexlatchford in #132
  • AIP-5283 - Fix @s3_sensor usage with @resources(volume=...) and --notify by @hsezhiyan in #131
  • Support pytest coverage of customer Flows by @talebzeghmi in #127
  • METAFLOW_COVERAGE_OMIT check for None by @talebzeghmi in #135
  • AIP-5330 set default retry policy="Always" (even on PodDeletion) by @talebzeghmi in #134
  • Handle None value of COVERAGE_OMIT by @cloudw in #140
  • AIP-5068 - Reduce PLEG Stability Issues by @hsezhiyan in #137
  • Use "purpose: high-memory" toleration instead of "instance-type: r5.12xlarge" by @cloudw in #143
  • AIP-5333 - @s3_sensor resilient to failures by @hsezhiyan in #146

Full Changelog: 2.3.2+zg2.0...zg-1.0

Workflow SDK Release 2.3.2+zg2.0

15 Sep 23:26
6afc8f9

The Workflow SDK 2.3.2+zg2.0 release is a major release.

Release Summary

Features

Breaking Change - Enforcing Guaranteed Quality of Service for pods in the KFP plugin
Pods whose limits far exceed their requests have been a problem for cluster stability; in extreme cases, hosts have had total burstable resource limits 50 times larger than their available capacity. To resolve this, we now enforce Guaranteed QoS across the board.

In the Workflow SDK, cpu_limit, memory_limit, and local_storage_limit have been removed from the @resources decorator. Users now provide a single value each for cpu, memory, and local_storage, and both requests and limits are set to that value.
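For example, a step now declares a single value per resource and the resulting pod gets identical requests and limits. This is a minimal sketch; the values are illustrative, so use whatever formats your @resources decorator accepts.

from metaflow import FlowSpec, step, resources

class GuaranteedQoSFlow(FlowSpec):

    # Single values only: the KFP plugin sets both the pod's requests and its
    # limits to these amounts, yielding Guaranteed QoS.
    # cpu_limit / memory_limit / local_storage_limit are no longer accepted.
    @resources(cpu="1", memory="4G", local_storage="10G")
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    GuaranteedQoSFlow()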

In the Spark integration (the Spark-related code change is not in this repo), executor CPU is resolved as follows (see the sketch after this list):

  • If the user provides limits.cpu
    • If requests.cpu is also provided, limits.cpu and requests.cpu MUST have the same value, or a ValueError is raised
    • If requests.cpu is NOT provided, limits.cpu is used as requests.cpu as well
  • If the user does not provide limits.cpu
    • If requests.cpu is provided, limits.cpu = requests.cpu
    • Else if the user provides "spark.executor.cores", that value is used for both limits.cpu and requests.cpu
    • Else limits.cpu and requests.cpu are set to the default "spark.executor.cores", which is 1
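The rules above amount to roughly the following resolution logic. This is a standalone sketch for illustration only; the actual Spark-integration code lives outside this repo, and the dict-based config access and the DEFAULT_EXECUTOR_CORES name are assumptions.

DEFAULT_EXECUTOR_CORES = "1"  # Spark's default for spark.executor.cores (assumed name)

def resolve_executor_cpu(conf):
    """Return (requests_cpu, limits_cpu) so that the pod gets Guaranteed QoS."""
    limits = conf.get("limits.cpu")
    requests = conf.get("requests.cpu")

    if limits is not None:
        if requests is not None and requests != limits:
            raise ValueError("limits.cpu and requests.cpu must have the same value")
        # Mirror limits into requests when requests is absent (or equal).
        return limits, limits

    if requests is not None:
        # Only requests provided: mirror it into limits.
        return requests, requests

    # Neither provided: fall back to spark.executor.cores, else Spark's default of 1.
    cores = conf.get("spark.executor.cores", DEFAULT_EXECUTOR_CORES)
    return cores, cores

For instance, resolve_executor_cpu({"limits.cpu": "4"}) yields ("4", "4"), and resolve_executor_cpu({}) yields ("1", "1").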

Stream logging for the KFP plugin
This feature is pulled from upstream Metaflow 2.2.10 and adapted for the KFP plugin. Several changes:

  • Logs are published to the datastore periodically via a sidecar process. Previously, KFP plugin logs became available in the datastore only after the step finished.
  • You can access logs using python flow.py logs <run-id>/<step-name>
    • For retried steps, only logs from the last retry are printed; all logs remain available in the datastore.

Allow sharing an attached volume across split nodes
By specifying @resources(volume_mode="ReadWriteMany", volume=<desired amount>), the attached volume is shared across the split nodes of the same step.
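For instance, a foreach step can write to one shared volume from all of its split tasks. This is a minimal sketch assuming the default /opt/metaflow_volume mount path; the shard values and file names are illustrative.

from metaflow import FlowSpec, step, resources

class SharedVolumeFlow(FlowSpec):

    @step
    def start(self):
        self.shards = ["a", "b", "c"]
        self.next(self.process, foreach="shards")

    # One ReadWriteMany volume is attached and shared across all split nodes
    # of this step.
    @resources(volume="30G", volume_mode="ReadWriteMany")
    @step
    def process(self):
        with open("/opt/metaflow_volume/%s.txt" % self.input, "w") as f:
            f.write(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    SharedVolumeFlow()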

Default pod labels for a more detailed ZGCP cost ledger (#90, #92, #94)
By default, pods are now labeled with their experiment, flow, and step names for more detailed cost tracking.

Changes from upstream

This release also pulls in changes from upstream that are applicable to ZG AI Platform.
For the full change list, please see the upstream release notes from 2.2.5 to 2.3.2.


2.2.5+zg1.1

14 Jun 17:50
6101ef0

Workflow SDK Release 2.2.5+zg1.1

Release Summary:

Support for Persistent Volume Claim (PVC)

To get dedicated disk space, you can now request a persistent volume per step via the @resources decorator. It is as simple as:

@resources(volume="30G")
@step
def my_task():
    ...

By default the volume is mounted at /opt/metaflow_volume and is available only to the decorated step. If @retry is used, the volume is shared across retries of that step, which is handy if you want to pick up from previous progress; otherwise, be sure to clean it up.
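For example, a step can checkpoint progress on the volume so a retry resumes where the previous attempt stopped, then clean up once it succeeds. This is a minimal sketch; the checkpoint file name and the work loop are illustrative.

import os

from metaflow import FlowSpec, step, resources, retry

CHECKPOINT = "/opt/metaflow_volume/progress.txt"  # default mount path

class VolumeCheckpointFlow(FlowSpec):

    @retry(times=3)
    @resources(volume="30G")
    @step
    def start(self):
        # The same volume is reattached on each retry, so earlier progress survives.
        done = 0
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                done = int(f.read())
        for i in range(done, 10):
            # ... do one unit of work, then record progress ...
            with open(CHECKPOINT, "w") as f:
                f.write(str(i + 1))
        os.remove(CHECKPOINT)  # clean up once the step completes
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    VolumeCheckpointFlow()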

You can customize the PVC mount path or make the volume available to all subsequent steps. Two additional attributes, volume_dir and volume_mode, are used:

@resources(volume="30G", volume_mode="ReadWriteMany", volume_dir=<your_preferred_path>)
@step
def my_task():
    ...

Refer to the docstring of the @resources decorator for more details.

PyTorchDistributedDecorator (@pytorch_distributed) is deprecated due to implementation similarity.

P3 GPU Instance Support

We are adding an option to use P3 instances when a more powerful GPU comes in handy: introducing the @accelerator decorator!

@accelerator sets the taints and node labels for your steps. To request a P3 instance:

@accelerator(type="nvidia-tesla-v100")
@resources(...)
@step
def my_task():
    ...

While other instance types can be requested similarly in the future, additional work is needed to support each type. Please let us (the AIP team) know if other, currently unsupported instance types better suit your use cases.

Improve Zodiac integration and cost tracking

Services are now automatically tagged with zodiac_service and zodiac_team. As a result, cost is tracked on each team's Zodiac page based on namespace profile settings. Be sure to update your team's Kubeflow profile to take advantage of this feature.

Improve Datadog integration

Flow name, experiment name, run ID, and step name are automatically added to the K8s pod labels.
Stay tuned for dashboard filters using these attributes.

Metadata reporting fix in CICD

Fixed a bug where the metadata setting was determined at compile time and did not correctly reflect the runtime environment when uploading to the Metaflow service.