
See FrameworkController DeepDive.pptx

As Framework is actually a Kubernetes CRD, any CRD client can be used to interoperate with it, such as:

1. kubectl

   ```sh
   kubectl create -f {Framework File Path}
   # Note: this is not Foreground Deletion; see the DELETE Framework section
   kubectl delete framework {FrameworkName}
   kubectl get framework {FrameworkName}
   kubectl describe framework {FrameworkName}
   kubectl get frameworks
   kubectl describe frameworks
   # ...
   ```

2. Kubernetes Client Library
3. Any HTTP Client
| API Kind | Operations |
| --- | --- |
| Framework | CREATE, DELETE, GET, LIST, WATCH, WATCH_LIST, PATCH (Start, Stop, Add TaskRole, Delete TaskRole, Add/Delete Task) |
| ConfigMap | All operations except for CREATE, PUT, PATCH |
| Pod | All operations except for CREATE, PUT, PATCH |

Request

POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks

Body: Framework

Type: application/json or application/yaml

Description

Create the specified Framework.

Any ExecutionType can be specified to create the Framework.
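For illustration, a hedged sketch of this call using curl (the proxy address and file name are assumptions, not part of the API):

```sh
# Assumption: `kubectl proxy --port=8001` is already running, exposing the
# Kubernetes API on localhost:8001, and framework.yaml holds a Framework
# manifest like the examples later in this manual.
curl -X POST \
  -H "Content-Type: application/yaml" \
  --data-binary @framework.yaml \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
```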

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| Created(201) | Framework | Return current Framework. |
| Accepted(202) | Framework | Return current Framework. |
| Conflict(409) | Status | The specified Framework already exists. |

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

```json
[
  {
    "op": "replace",
    "path": "/spec/executionType",
    "value": "Start"
  }
]
```

Type: application/json-patch+json

Description

Start the specified Framework, whose ExecutionType should be Create.

Before the Start, the Framework will neither start to run nor complete, but the Framework object itself has already been created; see the Framework PreStart Example.
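An equivalent kubectl sketch of this call ({FrameworkName} is the usual placeholder; the namespace is assumed to be the current one):

```sh
# Start a Framework whose ExecutionType is currently Create.
kubectl patch framework {FrameworkName} --type=json \
  -p '[{"op": "replace", "path": "/spec/executionType", "value": "Start"}]'
```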

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

```json
[
  {
    "op": "replace",
    "path": "/spec/executionType",
    "value": "Stop"
  }
]
```

Type: application/json-patch+json

Description

Stop the specified Framework, whose ExecutionType should be Create or Start.

After the Stop, the Framework will start to complete, but the Framework object will not be deleted; see the Framework PostStop Example.
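An equivalent kubectl sketch of this call (placeholders as before):

```sh
# Stop a Framework whose ExecutionType is currently Create or Start.
kubectl patch framework {FrameworkName} --type=json \
  -p '[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]'
```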

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

Follow the TaskRoleSpec to override the $(TaskRoleSpec) placeholder below.

```json
[
  {
    "op": "add",
    "path": "/spec/taskRoles/-",
    "value": $(TaskRoleSpec)
  }
]
```

Type: application/json-patch+json

Description

Append the specified TaskRole to the specified Framework.

See more in Framework ScaleUp/ScaleDown.
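A kubectl sketch of this call (the $(TaskRoleSpec) placeholder must still be overridden as described above):

```sh
# Append a TaskRole to the end of the taskRoles array.
kubectl patch framework {FrameworkName} --type=json \
  -p '[{"op": "add", "path": "/spec/taskRoles/-", "value": $(TaskRoleSpec)}]'
```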

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

Use the current array index of the $(TaskRoleName) to override the $(TaskRoleIndex) placeholder below.

```json
[
  {
    "op": "test",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/name",
    "value": "$(TaskRoleName)"
  },
  {
    "op": "remove",
    "path": "/spec/taskRoles/$(TaskRoleIndex)"
  }
]
```

Type: application/json-patch+json

Description

Delete the specified TaskRole from the specified Framework.

See more in Framework ScaleUp/ScaleDown.

Notes:

- It is better to first delete all Tasks in the TaskRole, wait until they are all deleted, and only then delete the whole TaskRole. Otherwise, the Tasks in the TaskRole will be deleted according to the last observed PodGracefulDeletionTimeoutSec in the TaskRoleStatus, because the PodGracefulDeletionTimeoutSec in the TaskSpec has already been deleted.
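A kubectl sketch of this call, with the same $(TaskRoleIndex)/$(TaskRoleName) placeholders as the body above:

```sh
# Atomically verify the TaskRole name at the given index, then remove it.
kubectl patch framework {FrameworkName} --type=json -p '[
  {"op": "test", "path": "/spec/taskRoles/$(TaskRoleIndex)/name", "value": "$(TaskRoleName)"},
  {"op": "remove", "path": "/spec/taskRoles/$(TaskRoleIndex)"}
]'
```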

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The specified $(TaskRoleName) does not exist or does not match the specified $(TaskRoleIndex). |

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

Use the current array index of the $(TaskRoleName) to override the $(TaskRoleIndex) placeholder below.

```json
[
  {
    "op": "test",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/name",
    "value": "$(TaskRoleName)"
  },
  {
    "op": "replace",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
    "value": $(TaskNumber)
  }
]
```

Generally, you may also need to adjust the TaskRole's FrameworkAttemptCompletionPolicy according to the new $(TaskNumber). This is safe, thanks to the Framework ScaleUp/ScaleDown Strong Safety Guarantee.

```json
[
  {
    "op": "test",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/name",
    "value": "$(TaskRoleName)"
  },
  {
    "op": "replace",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
    "value": $(TaskNumber)
  },
  {
    "op": "replace",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minSucceededTaskCount",
    "value": $(MinSucceededTaskCount)
  },
  {
    "op": "replace",
    "path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minFailedTaskCount",
    "value": $(MinFailedTaskCount)
  }
]
```

Type: application/json-patch+json

Description

Update the TaskNumber (and the FrameworkAttemptCompletionPolicy) of the specified TaskRole in the specified Framework.

See more in Framework ScaleUp/ScaleDown.
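A kubectl sketch of the first patch above (placeholders as before):

```sh
# Atomically verify the TaskRole name at the given index, then update its taskNumber.
kubectl patch framework {FrameworkName} --type=json -p '[
  {"op": "test", "path": "/spec/taskRoles/$(TaskRoleIndex)/name", "value": "$(TaskRoleName)"},
  {"op": "replace", "path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber", "value": $(TaskNumber)}
]'
```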

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The specified $(TaskRoleName) does not exist or does not match the specified $(TaskRoleIndex). |

Request

DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

application/json:

```json
{
  "propagationPolicy": "Foreground"
}
```

application/yaml:

```yaml
propagationPolicy: Foreground
```

Type: application/json or application/yaml

Description

Delete the specified Framework.
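Hedged sketches of a Foreground deletion (the proxy address is an assumption; the kubectl flag requires kubectl v1.20+):

```sh
# Via the raw API, assuming `kubectl proxy --port=8001` is already running:
curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{"propagationPolicy": "Foreground"}' \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

# Or via kubectl:
kubectl delete framework {FrameworkName} --cascade=foreground
```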

Notes:

- To achieve all the ConsistencyGuarantees, delete the Framework with Foreground PropagationPolicy; see How to achieve ConsistencyGuarantees.
Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | The specified Framework is deleting. Return current Framework. |
| OK(200) | Status | The specified Framework is deleted. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Description

Get the specified Framework.

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Get all Frameworks (in the specified FrameworkNamespace).
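Hedged sketches of this call (the proxy address is an assumption):

```sh
# All Frameworks in a namespace:
kubectl get frameworks -n {FrameworkNamespace}

# All Frameworks across all namespaces, via the raw API,
# assuming `kubectl proxy --port=8001` is already running:
curl http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/frameworks
```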

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | FrameworkList | Return all Frameworks (in the specified FrameworkNamespace). |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of the specified Framework.

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | WatchEvent | Streaming the change events of the specified Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of all Frameworks (in the specified FrameworkNamespace).

Response

| Code | Body | Description |
| --- | --- | --- |
| OK(200) | WatchEvent | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |

Framework ExecutionType can be specified to control the execution of the Framework:

1. You can Create a Framework with the Create ExecutionType, which does not also start it at the same time.
2. You can Stop a Framework, which does not also delete it at the same time.

In this example, you need to run a Framework that depends on a ServiceAccount, but the ServiceAccount in turn depends on the Framework object as its OwnerReference, so you cannot directly Create the Framework with ExecutionType Start.

1. Create the Framework with the Create ExecutionType and a ServiceAccount reference as below; the Framework will then stay AttemptCreationPending:

   ```yaml
   apiVersion: frameworkcontroller.microsoft.com/v1
   kind: Framework
   metadata:
     name: prestart
   spec:
     executionType: Create
     retryPolicy:
       fancyRetryPolicy: false
       maxRetryCount: 0
     taskRoles:
     - name: a
       taskNumber: 4
       frameworkAttemptCompletionPolicy:
         minFailedTaskCount: 4
         minSucceededTaskCount: 1
       task:
         retryPolicy:
           fancyRetryPolicy: false
           maxRetryCount: 0
         podGracefulDeletionTimeoutSec: 600
         pod:
           spec:
             restartPolicy: Never
             serviceAccountName: prestart
             containers:
             - name: ubuntu
               image: ubuntu:trusty
               command: ["sh", "-c", "printenv && sleep infinity"]
   ```

2. Use the above creation response's metadata.uid to override the {{FrameworkUID}} placeholder below, and Create the ServiceAccount with the above Framework reference (see the command sketch after this list):

   ```yaml
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: prestart
     ownerReferences:
     - apiVersion: frameworkcontroller.microsoft.com/v1
       kind: Framework
       name: prestart
       uid: {{FrameworkUID}}
       controller: true
       blockOwnerDeletion: true
   ```

3. Start the Framework; it will then start to run successfully.

4. Delete the Framework; both the Framework and the above ServiceAccount will then be deleted.
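A possible end-to-end sketch of steps 1-3 above (the file names are assumptions; {{FrameworkUID}} is the literal placeholder inside prestart-serviceaccount.yaml):

```sh
# Step 1: create the Framework with ExecutionType Create.
kubectl create -f prestart-framework.yaml

# Step 2: read the Framework UID and create the ServiceAccount owned by it.
FRAMEWORK_UID=$(kubectl get framework prestart -o jsonpath='{.metadata.uid}')
sed "s/{{FrameworkUID}}/${FRAMEWORK_UID}/" prestart-serviceaccount.yaml | kubectl create -f -

# Step 3: start the Framework.
kubectl patch framework prestart --type=json \
  -p '[{"op": "replace", "path": "/spec/executionType", "value": "Start"}]'
```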

In this example, you need to stop a Framework whose final stopped Framework object must be pushed to or pulled by external systems, so you cannot directly Delete the Framework.

1. Create the Framework as below:

   ```yaml
   apiVersion: frameworkcontroller.microsoft.com/v1
   kind: Framework
   metadata:
     name: poststop
   spec:
     executionType: Start
     retryPolicy:
       fancyRetryPolicy: false
       maxRetryCount: 0
     taskRoles:
     - name: a
       taskNumber: 4
       frameworkAttemptCompletionPolicy:
         minFailedTaskCount: 4
         minSucceededTaskCount: 1
       task:
         retryPolicy:
           fancyRetryPolicy: false
           maxRetryCount: 0
         podGracefulDeletionTimeoutSec: 600
         pod:
           spec:
             restartPolicy: Never
             containers:
             - name: ubuntu
               image: ubuntu:trusty
               command: ["sh", "-c", "printenv && sleep infinity"]
   ```

2. Stop the Framework; it will then be stopped, i.e. FrameworkCompleted (see the command sketch after this list).

3. Get the Framework, and archive it into a database first.

4. Delete the Framework; it will then be deleted.
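A possible command sketch of steps 2-4 above (the archive file name is an assumption):

```sh
# Step 2: stop the Framework, then wait until it is FrameworkCompleted.
kubectl patch framework poststop --type=json \
  -p '[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]'

# Step 3: archive the final stopped Framework object.
kubectl get framework poststop -o yaml > poststop-archive.yaml

# Step 4: delete the Framework.
kubectl delete framework poststop
```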

Predefined Container EnvironmentVariable

Framework Example

You can specify how to classify and summarize Pod failures by the PodFailureSpec.

You can also directly leverage the Default PodFailureSpec.

You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of the different PodFailureSpecs that may be configured in different clusters.

CompletionStatus: It is generated from the Predefined CompletionCode or from PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.

TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.

FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.

RetryPolicySpec

Notes:

  1. Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
  2. For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.
| FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
| --- | --- | --- | --- | --- |
| DEFAULT | FancyRetryPolicy = false<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | The default RetryPolicy:<br>Never Retry for any Failed or Succeeded. |
| | | TaskRole-B | FancyRetryPolicy = false<br>MaxRetryCount = 0 | |
| Service | FancyRetryPolicy = false<br>MaxRetryCount = -2 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -2 | Always Retry for any Failed or Succeeded. |
| Blind Batch | FancyRetryPolicy = false<br>MaxRetryCount = -1 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -1 | Always Retry for any Failed.<br>Never Retry for Succeeded. |
| Batch with Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 3 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed or Succeeded.<br>Retry up to 3 times for Unknown Failed. |
| Batch without Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | For the Framework RetryPolicy, same as "Batch with Task Fault Tolerance".<br>For the Task RetryPolicy, because the Task cannot tolerate any failed TaskAttempt (such as being unable to recover from a previous failed TaskAttempt), Never Retry the Task for any Failed or Succeeded. |
| Debug Mode | FancyRetryPolicy = true<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 0 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed, Unknown Failed or Succeeded.<br>This can help to capture an unexpected exit of the user application itself. |
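For concreteness, a minimal Framework spec fragment sketch matching the "Batch with Task Fault Tolerance" row above (the TaskRole name is arbitrary; the field names follow the Framework examples in this manual):

```yaml
spec:
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 3
  taskRoles:
  - name: a
    task:
      retryPolicy:
        fancyRetryPolicy: true
        maxRetryCount: 3
```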

CompletionPolicySpec

Notes:

  1. Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
| FrameworkType | TaskRole | FrameworkAttemptCompletionPolicy | Description |
| --- | --- | --- | --- |
| DEFAULT | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | The default FrameworkAttemptCompletionPolicy:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all Tasks succeeded. |
| | TaskRole-B | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | |
| Service | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Actually, any FrameworkAttemptCompletionPolicy is fine, since a Service's Task will never complete, i.e. its Task's MaxRetryCount is -2; see the RetryPolicy Example. |
| MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | A few failed Tasks are acceptable, but always wait for all Tasks to succeed:<br>Fail the FrameworkAttempt immediately if the failed Tasks exceed the limit.<br>Succeed the FrameworkAttempt only until all Tasks completed and the failed Tasks are within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | |
| Master Dominated: PyTorch Training with explicit master role, TensorFlow Training with explicit chief role | Master/Chief | MinFailedTaskCount = 1<br>MinSucceededTaskCount = 1 | The completion is fully determined by the single-instance master of the user application:<br>Fail the FrameworkAttempt immediately if the master failed.<br>Succeed the FrameworkAttempt immediately if the master succeeded. |
| | Worker | MinFailedTaskCount = -1<br>MinSucceededTaskCount = -1 | |
| All Workers Dominated: PyTorch Training without explicit master role, TensorFlow Training without explicit chief role | ParameterServer/None | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all workers succeeded. |
| | Worker | MinFailedTaskCount = 1<br>MinSucceededTaskCount = {Worker.TaskNumber} | |
| Any Worker Dominated: PyTorch Elastic Training | Worker | MinFailedTaskCount = {Worker.TaskNumber}<br>MinSucceededTaskCount = 1 | Fail the FrameworkAttempt only until all workers failed.<br>Succeed the FrameworkAttempt immediately if any worker succeeded. |
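Similarly, a minimal spec fragment sketch matching the "Master Dominated" row above (the TaskRole names are illustrative):

```yaml
spec:
  taskRoles:
  - name: master
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: 1
  - name: worker
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: -1
      minSucceededTaskCount: -1
```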

Framework ScaleUp/ScaleDown (Rescale) refers to taking any of the below actions for an existing Framework on the fly:

1. Add/Delete a TaskRole without touching other TaskRoles.
2. Add/Delete a Task without touching other Tasks.

Before you start to Rescale a Framework, make sure the application executed by the Framework can tolerate Rescale, such as:

1. Your application should be able to rebalance its workload after Rescale:
   1. A Service application may need to rebalance its serving traffic by a load balancer, and/or re-replicate its state.
   2. A Batch application may need to rebalance its work items by a work queue, and/or adjust its Task membership, like PyTorch Elastic Training.
2. A Batch application had better not rerun after ScaleUp:
   1. A rerun may happen if the application has already completed, but the completion event has not yet been observed by FrameworkController. So, during this period, if you ScaleUp the Framework, the application may rerun.
   2. To mitigate this, the ScaleUp Task can immediately complete itself by leveraging an empty work queue or an existing checkpoint from the previous run.
3. A Batch application had better not succeed too early after ScaleDown:
   1. Premature success may happen if all Tasks succeeded except one Task that is still running, and you ScaleDown the running Task.
   2. To resolve this, make sure it is safe to ScaleDown the running Task, such as by leveraging the Master Dominated or Any Worker Dominated FrameworkType in the FrameworkAttemptCompletionPolicy. For the Master Dominated FrameworkType, a master TaskRole is needed, and you must not ScaleDown the master TaskRole. For the Any Worker Dominated FrameworkType, an exit barrier may be needed to ensure that any worker succeeding means the whole application has already succeeded, like PyTorch Elastic Training.

This example demonstrates the basic usage of Framework Rescale, as well as its Strong Safety Guarantees.

1. Create the Framework as below, and wait until all Tasks are AttemptRunning:

   ```yaml
   apiVersion: frameworkcontroller.microsoft.com/v1
   kind: Framework
   metadata:
     name: rescalebasic
   spec:
     executionType: Start
     retryPolicy:
       fancyRetryPolicy: false
       maxRetryCount: 0
     taskRoles:
     - name: a
       taskNumber: 4
       frameworkAttemptCompletionPolicy:
         minFailedTaskCount: 4
         minSucceededTaskCount: 1
       task:
         retryPolicy:
           fancyRetryPolicy: false
           maxRetryCount: 0
         podGracefulDeletionTimeoutSec: 600
         pod:
           spec:
             restartPolicy: Never
             containers:
             - name: ubuntu
               image: ubuntu:trusty
               command: ["sh", "-c", "printenv && sleep infinity"]
   ```
2. Delete Pods rescalebasic-a-2 and rescalebasic-a-3, and wait until their Tasks are Completed (Failed).

3. ScaleDown the Framework: decrease the taskNumber and minFailedTaskCount from 4 to 2 by the below patch:

   ```json
   [
     {
       "op": "test",
       "path": "/spec/taskRoles/0/name",
       "value": "a"
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/taskNumber",
       "value": 2
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
       "value": 2
     }
   ]
   ```
4. The Completed Tasks rescalebasic-a-2 and rescalebasic-a-3 are immediately deleted, and the Framework stays AttemptRunning.
   - This demonstrates SafetyGuarantee1: otherwise, the old Failed Tasks rescalebasic-a-2 and rescalebasic-a-3 might be wrongly counted against the new FrameworkAttemptCompletionPolicy.minFailedTaskCount (2) and trigger the completion.

5. ScaleUp the Framework: increase the taskNumber and minFailedTaskCount from 2 to 4 by the below patch, and wait until all Tasks are AttemptRunning:

   ```json
   [
     {
       "op": "test",
       "path": "/spec/taskRoles/0/name",
       "value": "a"
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/taskNumber",
       "value": 4
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
       "value": 4
     }
   ]
   ```
6. Delete Pod rescalebasic-a-2, and wait until its Task is Completed (Failed).

7. Redo Step 3, and wait until the rescalebasic-a-2 and rescalebasic-a-3 Tasks are DeletionPending, but before rescalebasic-a-3 is deleted, immediately ScaleUp the Framework: increase the taskNumber and minFailedTaskCount from 2 to 3 by the below patch:

   ```json
   [
     {
       "op": "test",
       "path": "/spec/taskRoles/0/name",
       "value": "a"
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/taskNumber",
       "value": 3
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
       "value": 3
     }
   ]
   ```
8. The Completed Task rescalebasic-a-2 is immediately replaced by a new Task instance, the AttemptRunning Task rescalebasic-a-3 is eventually deleted, and the Framework stays AttemptRunning.
   - This demonstrates SafetyGuarantee2: otherwise, the previously ScaledDown Task rescalebasic-a-2 might be wrongly reused in the later ScaleUp.

9. ScaleUp the Framework: increase the taskNumber and minFailedTaskCount from 3 to 4 by the below patch, and wait until all Tasks are AttemptRunning:

   ```json
   [
     {
       "op": "test",
       "path": "/spec/taskRoles/0/name",
       "value": "a"
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/taskNumber",
       "value": 4
     },
     {
       "op": "replace",
       "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
       "value": 4
     }
   ]
   ```
10. Delete Pod rescalebasic-a-2, and wait until its Task is Completed (Failed).

11. Redo Step 3, and wait until the rescalebasic-a-2 and rescalebasic-a-3 Tasks are DeletionPending, but before rescalebasic-a-3 is deleted, immediately ScaleUp the Framework: increase the taskNumber and minFailedTaskCount from 2 to 5 by the below patch:

    ```json
    [
      {
        "op": "test",
        "path": "/spec/taskRoles/0/name",
        "value": "a"
      },
      {
        "op": "replace",
        "path": "/spec/taskRoles/0/taskNumber",
        "value": 5
      },
      {
        "op": "replace",
        "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
        "value": 5
      }
    ]
    ```
12. The Completed Task rescalebasic-a-2 is immediately replaced by a new Task instance, the AttemptRunning Task rescalebasic-a-3 is eventually replaced by a new Task instance, the new Task rescalebasic-a-4 is immediately added, and the Framework stays AttemptRunning.
    - This demonstrates SafetyGuarantee2: otherwise, the previously ScaledDown Tasks rescalebasic-a-2 and rescalebasic-a-3 might be wrongly reused in the later ScaleUp.

PyTorch Elastic Training On FrameworkController

ScaleUp Pipeline:

1. As soon as FrameworkController observes a ScaleUp request for a Framework that is not Completing/Completed, it will immediately mark the ScaleUp Task as AttemptCreationPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to create its TaskAttempt or evaluating the Task's impact on the Framework, such as for the FrameworkAttemptCompletionPolicy.
2. As soon as the AttemptCreationPending Task is persisted (exposed), the Task will impact its Framework in the future, such as:
   1. The Task will be considered in the FrameworkAttemptCompletionPolicy.
   2. The Task will be deleted in a later ScaleDown.
3. Only then will FrameworkController start to create the AttemptCreationPending TaskAttempt.

ScaleDown Pipeline:

1. As soon as FrameworkController observes a ScaleDown request for a Framework that is not Completing/Completed, it will immediately mark the ScaleDown Task as DeletionPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to delete the Task or evaluating the Task's impact on the Framework, such as for the FrameworkAttemptCompletionPolicy.
2. As soon as the DeletionPending Task is persisted (exposed), the Task will never impact its Framework in the future anymore, except for the Task's graceful deletion period itself, such as:
   1. The Task will never be considered in the FrameworkAttemptCompletionPolicy.
   2. The Task will never be reused in a later ScaleUp.
3. Only then will FrameworkController start to delete the DeletionPending Task.
4. After the Framework is Completing/Completed, it may still contain DeletionPending Tasks, but these Tasks must be Completed.

Notes:

1. Users also need to explicitly ignore DeletionPending Tasks if they do not care about the Task's graceful deletion period, like FrameworkBarrier does.

Besides the general Framework ConsistencyGuarantees, FrameworkController also provides the below Strong Safety Guarantees for Framework Rescale:

- SafetyGuarantee1:
  - A DeletionPending Task (caused by ScaleDown) never impacts its Framework in the future anymore, except for the Task's graceful deletion period itself: it is never considered in the FrameworkAttemptCompletionPolicy and never reused in a later ScaleUp.
- SafetyGuarantee2:
  - Users can always safely Rescale before any previous Rescale has totally finished (including the final deletion of ScaleDown Tasks), as:
    - For a ScaleUp immediately followed by a ScaleDown: if the user observed a new non-DeletionPending Task (caused by the ScaleUp), the later ScaleDown will delete it (i.e. the ScaleUp is committed); otherwise, the later ScaleDown may not delete it, but the previous ScaleUp must not have impacted it, such as by starting to create its TaskAttempt (i.e. the ScaleUp is rolled back).
    - For a ScaleDown immediately followed by a ScaleUp: if the user observed a new DeletionPending Task (caused by the ScaleDown), the later ScaleUp will not reuse it (i.e. the ScaleDown is committed); otherwise, the later ScaleUp may reuse it, but the previous ScaleDown must not have impacted it, such as by starting to delete it (i.e. the ScaleDown is rolled back).

See the Framework Rescale Basic Example for a demonstration of these Strong Safety Guarantees.

To safely run a large-scale Framework, i.e. one whose total Task number in a single Framework exceeds 300, you just need to enable LargeFrameworkCompression. However, you may then also need to decompress the Framework by yourself.

By leveraging the LogObjectSnapshot, external systems, such as Fluentd and ElasticSearch, can collect and process Framework, Task and Pod history snapshots even after they are retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

FrameworkState

TaskState

For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:

  • ConsistencyGuarantee1:

    At most one instance of the Task is running at any point in time.

  • ConsistencyGuarantee2:

    No instance of the Task is running if it is deleted (or does not exist), TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted (or does not exist).

For a specific Framework identified by {FrameworkName}:

  • ConsistencyGuarantee3:

    At most one instance of the Framework is running at any point in time.

  • ConsistencyGuarantee4:

    No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted (or does not exist).

The default behavior achieves all the ConsistencyGuarantees, as long as you do not explicitly violate the below guidelines:

  1. Achieve ConsistencyGuarantee1:

    Do not force delete the managed Pod:

1. Do not set PodGracefulDeletionTimeoutSec to a non-nil value.

      For example, the default PodGracefulDeletionTimeoutSec is acceptable.

    2. Do not delete the managed Pod with 0 GracePeriodSeconds.

      For example, the default Pod deletion is acceptable.

    3. Do not delete the Node which runs the managed Pod.

For example, draining the Node before deleting it is acceptable.

    The Task running instance can be universally located by its TaskAttemptInstanceUID or PodUID.

To avoid a Pod being stuck in deletion forever, such as when its Node is down forever, leverage the same approach as for StatefulSet Pods: delete the Pod only after its termination has been confirmed, manually or by your Cloud Controller Manager.

  2. Achieve ConsistencyGuarantee2, ConsistencyGuarantee3 and ConsistencyGuarantee4:

    1. Achieve ConsistencyGuarantee1.

    2. Must delete the managed ConfigMap with Foreground PropagationPolicy.

      For example, the default ConfigMap deletion is acceptable.

    3. Must delete the Framework with Foreground PropagationPolicy.

      For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for Framework object may be Background.

    4. Do not change the OwnerReferences of the managed ConfigMap and Pods.

    The Framework running instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.

According to the CAP theorem, in the presence of a network partition you cannot achieve both consistency and availability in any distributed system. So you have to make a trade-off between Framework Consistency and Framework Availability.

You can tune this trade-off, such as achieving higher Framework Availability by sacrificing some Framework Consistency:

  1. Decrease Pod TolerationSeconds for TaintBasedEvictions
  2. Increase Node Eviction Rate
  3. Set a small PodGracefulDeletionTimeoutSec
4. Violate other guidelines mentioned in How to achieve ConsistencyGuarantees, such as manually force deleting a problematic Pod.

See more in:

  1. PodGracefulDeletionTimeoutSec
  2. Pod Safety and Consistency Guarantees
  1. Usage
  2. Example: FrameworkBarrier Example, TensorFlow ParameterServer Training Example, etc.
  1. Usage
  2. Example: TensorFlow ParameterServer Training Example, etc.

Best Practice