This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Releases: microsoft/frameworkcontroller

v1.0.1

23 Feb 10:51
d602cdc

Updates

  • Keep podGracefulDeletionTimeoutSec as Nullable (#74)

Full Commit History Since Previous Release

v1.0.0

24 Jan 03:48
e658916

Updates

  • Switch to go mod (#69)
  • Upgrade CRD to apiextensions.k8s.io/v1 to support k8s >= 1.22 (#72)

Full Commit History Since Previous Release

v0.9.0

19 Oct 13:50
4b5707f

New Feature

  • Expose Task History (#62)

Full Commit History Since Previous Release

v0.8.0

31 Aug 13:25
959722c

New Feature

  • Expose and increase default sync concurrency (#60)
  • Treat invalid Pod caused by network error as PodCreationUnknownError (#61)

Full Commit History Since Previous Release

v0.7.0

11 Aug 12:06
29e1153

v0.6.0

10 Feb 13:09
c4be168

New Feature

  • Enrich PodSpecError to fail the Pod early (#52)

Bug Fix

  • Fix invalid JSON in log caused by fmt (MISSING) (#49)
  • Be aware of UID changes during Update events and Sync (#51)

Full Commit History Since Previous Release

v0.5.0

06 Nov 09:44
b819592

New Feature

  • Support large-scale Frameworks via LargeFrameworkCompression (#44)
  • Add PodNodeName to help track failures on node before PodIP is available (#45)

Full Commit History Since Previous Release

v0.4.0

09 Oct 12:12
77ec4ab

New Feature

  • Add example to leverage HivedScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling (#34)
  • Support exposing Framework and Pod history snapshots to external systems (#31)
  • Support classifying and summarizing Pod failures (#41)
  • Support tuning Framework Consistency vs Availability (#43)
    This helps prevent a Pod from being stuck in deletion forever, for example when its Node is permanently down.
  • Support Stop Framework (#24)
    This helps to stop the Framework without deleting it.
  • Still sync Task after FrameworkAttempt Completing (#27)
    This helps to make sure all Tasks in the Framework are updated to the right completed status when the whole FrameworkAttempt is completed.
  • Support FrameworkCompletedRetainSec (#37)
    This helps to automatically delete the Framework once it has been completed for longer than the configured retention period, to free etcd space.
  • Add FrameworkAttemptPreparing State (#12)
    This helps to distinguish whether at least one Task of the current attempt has ever entered the TaskAttemptRunning state. If not, the attempt is now reported as FrameworkAttemptPreparing instead of FrameworkAttemptRunning.
  • Redefine FrameworkAttemptRunning and Record attempt running start time (#35)
    This helps to measure Framework and Task pure running duration.
  • Support Pod Template Placeholders (#21)
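
Two of the items above, the FrameworkAttemptPreparing state (#12) and FrameworkCompletedRetainSec (#37), describe simple decision rules. The following is a minimal Python sketch of those rules as the notes describe them, not FrameworkController's actual Go implementation. The `Task` class, the `ever_running` field, the shortened state strings, and both function names are hypothetical, and the real controller has further states that are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, shortened state names modeled on the release notes.
ATTEMPT_PREPARING = "FrameworkAttemptPreparing"
ATTEMPT_RUNNING = "FrameworkAttemptRunning"


@dataclass
class Task:
    # Assumed flag: True once this Task's current attempt has ever
    # entered the TaskAttemptRunning state.
    ever_running: bool = False


def framework_attempt_state(tasks: List[Task]) -> str:
    """Report Running once any Task has ever run in this attempt, else Preparing."""
    if any(task.ever_running for task in tasks):
        return ATTEMPT_RUNNING
    return ATTEMPT_PREPARING


def should_delete_completed(
    completion_time: Optional[float], now: float, retain_sec: float
) -> bool:
    """Return True when a completed Framework has outlived its retention period.

    completion_time is None while the Framework has not completed yet.
    All times are in seconds on the same clock.
    """
    if completion_time is None:
        return False
    return now - completion_time >= retain_sec
```

Under these assumptions, a Framework whose Tasks are all still pending stays in the preparing state, which is what allows the pure running duration mentioned in #35 to be measured from the first Task that actually runs.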

Bug Fix

  • Fix TaskCompleted may transition to TaskAttemptCompleted (#10)
  • Fix fExpectedStatusInfos map race condition (#18)

Misc

  • Upgrade to kubernetes-1.14.2 (#16)
  • Remove Internal and External CompletionTypeAttribute (#22 #41)
    This is because FrameworkController does not need to be aware of it, which leaves that freedom to the controller wrapper.
  • Upgrade to golang 1.12.6 (#29)
  • Switch to klog (#30)

Full Commit History Since Previous Release

v0.3.0

17 Jan 10:56
4ba92dc

v0.2.0

23 Nov 08:03
94a1680

Add Distributed TensorFlow Training Example

Feature

  1. Support both GPU and CPU Distributed Training
  2. Automatically clean up PS when the whole FrameworkAttempt is completed
  3. No need to adjust existing TensorFlow image
  4. No need to set up Kubernetes DNS and Kubernetes Service
  5. Common Feature

Prerequisite

  1. Need to set up Kubernetes GPU, if you need GPU Training
  2. Need to set up Kubernetes Cluster-Level Logging, if you need to persist and expose the logs of deleted Pods

Quick Start

  1. Common Quick Start
  2. TensorFlow Example

Support FrameworkBarrier for GangExecution

Feature

It is usually used as the InitContainer to provide a simple way to:

  1. Do Gang Execution without resource deadlock
  2. Start the AppContainers in the Pod only after its PodUID is persisted by FrameworkController
  3. Inject peer-to-peer service discovery information into the AppContainers
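
The barrier idea behind these three points can be sketched in Python as follows. This is an illustration under stated assumptions, not the FrameworkBarrier implementation: the snapshot shape (task role name mapped to a list of Pod IPs, with `None` for a Pod that is not ready yet), the `fetch` callable, the polling parameters, and the `PEER_<ROLE>_IPS` variable names are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical snapshot shape: task role name -> Pod IPs,
# where None marks a Pod whose IP is not allocated yet.
Snapshot = Dict[str, List[Optional[str]]]


def all_ready(snapshot: Snapshot) -> bool:
    """True only when every Pod of every task role has an IP."""
    return all(ip is not None for ips in snapshot.values() for ip in ips)


def wait_for_gang(
    fetch: Callable[[], Snapshot],
    interval_sec: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Snapshot:
    """Poll until the whole gang is ready, then return that snapshot.

    Raises TimeoutError if the gang is still incomplete after max_polls polls.
    """
    for _ in range(max_polls):
        snapshot = fetch()
        if all_ready(snapshot):
            return snapshot
        sleep(interval_sec)
    raise TimeoutError("gang was not ready within the polling budget")


def discovery_env(snapshot: Snapshot) -> Dict[str, str]:
    """Build per-role environment variables listing peer IPs (names are assumed)."""
    return {
        f"PEER_{role.upper()}_IPS": ",".join(ip for ip in ips if ip is not None)
        for role, ips in sorted(snapshot.items())
    }
```

Run as an init step, a loop like `wait_for_gang` keeps the application containers from starting until every peer is known, and the mapping from `discovery_env` is what would then be exported to them for peer-to-peer discovery.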

Quick Start

  1. FrameworkBarrier User Manual