[@parallel on Kubernetes] support for Jobsets #1804

valayDave · 2024-04-18T00:28:13Z

Implementation originates from [#1744] but now supersedes the PR.

This commit adds support for @parallel when flows are run --with kubernetes Support for Argo workflows will follow in a separate commit.

A user can run a flow with the following:

@step
def start(self):
    self.next(self.parallel_step, num_parallel=3)

@kubernetes(cpu=1, memory=512)
@parallel
@step
def parallel_step(self):
...

metaflow/plugins/kubernetes/kubernetes_job.py

valayDave · 2024-05-06T23:22:59Z

metaflow/plugins/kubernetes/kubernetes_decorator.py

+        num_parallel = None
+        if hasattr(flow, "_parallel_ubf_iter"):
+            num_parallel = flow._parallel_ubf_iter.num_parallel
+
+        if num_parallel and num_parallel >= 1 and ubf_context == UBF_CONTROL:
+            control_task_id, worker_task_ids = TaskIdConstructor.join_step_task_ids(
+                num_parallel
+            )
+            mapper_task_ids = [control_task_id] + worker_task_ids
+            flow._control_mapper_tasks = [
+                "%s/%s/%s" % (run_id, step_name, mapper_task_id)
+                for mapper_task_id in mapper_task_ids
+            ]
+            flow._control_task_is_mapper_zero = True
+
+        if num_parallel and num_parallel > 1:
+            _setup_multinode_environment()


Needed so that Join steps has all the relevant task-ids.

valayDave · 2024-05-06T23:24:04Z

metaflow/plugins/kubernetes/kubernetes_job.py

+    def create_job_spec(self):
+        client = self._client.get()


has to create a sub-function called create_job_spec so we could reuse the jobspec created for K8s jobs and plug that in directly into Jobsets.

metaflow/plugins/kubernetes/kubernetes_job.py

valayDave · 2024-05-06T23:25:15Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+    return overall_status, control_exit_code, control_pod_failed
+
+
+def _construct_jobset_logical_status(jobset, control_pod=None):


Main function which helps compute the logical status of the jobset.

valayDave · 2024-05-06T23:25:46Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+    )
+
+
+class RunningJobSet(object):


Interface similar to RunningJob so that the runtime process can monitor the jobset.

valayDave · 2024-05-06T23:26:10Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+        ).jobset_failed
+
+
+class TaskIdConstructor:


This helps constructing all the task-ids from one place

Excellent! Thanks for this implementation. Otherwise, it could get hard figuring out how/where the task ids get constructed.

valayDave · 2024-05-06T23:27:30Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+    )
+
+
+class KubernetesJobSet(object):


It creates one control and one worker definition. The workers will have replicas set to num_parallel-1. All workers and control will leverage the "jobspec" created by the KubernetesJob interface.

metaflow/plugins/kubernetes/kubernetes_jobsets.py

valayDave · 2024-05-06T23:28:47Z

metaflow/plugins/kubernetes/kubernetes_client.py

@@ -4,7 +4,7 @@

 from metaflow.exception import MetaflowException

-from .kubernetes_job import KubernetesJob
+from .kubernetes_job import KubernetesJob, KubernetesJobSet


Need this import for Kubernetes clients which maybe getting used via extensions.

shrinandj · 2024-05-08T20:55:04Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+)
+
+
+def _basic_validation_for_js(jobset):


metaflow/plugins/kubernetes/kubernetes_job.py

valayDave · 2024-05-09T00:00:36Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+            control_pod_status=None,
+            worker_pods_failed=False,
+            control_pod_failed=False,
+            some_jobs_are_running=False,


some_jobs_are_running helps derive if something is running or not.

valayDave · 2024-05-09T00:02:14Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+        self._group = KUBERNETES_JOBSET_GROUP
+        self._version = KUBERNETES_JOBSET_VERSION


Making this configurable so that we can have code paths that deal with different GROUP /VERSIONS based on what they are set in the config.

valayDave · 2024-05-09T00:02:32Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+            num_parallel=num_parallel,
+            namespace=namespace,
+        )
+        worker_task_id = TaskIdConstructor.jobset_worker_id(task_id)


Single place where we construct the Task-id

valayDave · 2024-05-20T18:17:23Z

metaflow/plugins/kubernetes/kubernetes_cli.py

+    if num_parallel is not None and num_parallel <= 1:
+        raise KubernetesException(
+            "Using @parallel with `num_parallel` <= 1 is not supported with Kubernetes. "
+            "Please set the value of `num_parallel` to be greater than 1."
+        )
+


constraint added because jobset doesn't play nice with replicas = 0; Once kubernetes-sigs/jobset allow this, we can lift this constraint and add version logic to verify if the jobset can be submitted or not. Currently not supported with Jobset CRD version jobset.x-k8s.io/v1alpha2

@Retry

Implementation originates from [Netflix#1744] This commit adds support for @parallel when flows are run `--with kubernetes` Support for Argo workflows will follow in a separate commit. A user can run a flow with the following: @step def start(self): self.next(self.parallel_step, num_parallel=3) @kubernetes(cpu=1, memory=512) @parallel @step def parallel_step(self): ... Some notes about the implementation: - No annotations for task-id in pods since We cannot dynamically construct the task-id during K8s container runtime. - @catch is currently not supported with @parallel on kubernetes - metadata about jobset name exists in the task-metadata - The jobset will contain two job definitions; One for control and one for worker. - The worker will have n-1 replicas created. - We construct the worker task-id determininstically using naming conventions and shell hacking. - Jobset is considered running even if one job amongst all of them are running. - @Retry will work with jobset - num_parallel <=1 will NOT be supported to start with; - One core reason is that jobsets don't allow setting replicas to 0; - jobsets controller will mutate a jobset with replica set to 0 with replicas set to 1. - The implementation accounts for Jobset CRD schema from v0.2.0 - Jobset team changed the schema (just renaming values) after v0.3.0 - The changes were to `replicatedJobsStatus` where certain fields were added and `ReplicatedJobsStatus` was renamed to `replicatedJobsStatus`

valayDave · 2024-05-20T20:40:42Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+def _retrieve_replicated_job_statuses(jobset):
+    # We needed this abstraction because Jobsets changed thier schema
+    # in version v0.3.0 where `ReplicatedJobsStatus` became `replicatedJobsStatus`
+    # So to handle users having an older version of jobsets, we need to account
+    # for both the schemas.
+    if jobset.get("status", {}).get("replicatedJobsStatus", None):
+        return jobset.get("status").get("replicatedJobsStatus")
+    elif jobset.get("status", {}).get("ReplicatedJobsStatus", None):
+        return jobset.get("status").get("ReplicatedJobsStatus")
+    return None


Needed to add a conditional block to swallow the complexity for the user when the jobset controller may be of a version < 0.3.0. This is because jobset's schema had changed from v0.3.0. No logical changes to the schema, only additions and renaming of values like ReplicatedJobsStatus to replicatedJobsStatus

valayDave · 2024-05-20T20:42:00Z

metaflow/plugins/kubernetes/kubernetes_jobsets.py

+            if "suspended" in job_status:
+                # `replicatedJobStatus` didn't have `suspend` field
+                #  until v0.3.0. So we need to account for that.
+                workers_are_suspended = job_status["suspended"] > 0


suspend was also not a supported keyword in Jobset for version v0.2.0; This is why we need a conditional check over here.

shrinandj reviewed Apr 18, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

shrinandj reviewed Apr 18, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

shrinandj reviewed Apr 19, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

shrinandj reviewed Apr 19, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

shrinandj reviewed Apr 19, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

savingoyal reviewed Apr 24, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

savingoyal reviewed Apr 24, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_job.py Outdated Show resolved Hide resolved

valayDave force-pushed the valay/oss-jobsets branch from 302ae6a to 93f8a11 Compare May 6, 2024 23:21

valayDave commented May 6, 2024

View reviewed changes

shrinandj reviewed May 8, 2024

View reviewed changes

metaflow/plugins/kubernetes/kubernetes_jobsets.py

)

def _basic_validation_for_js(jobset):

Copy link

Contributor

shrinandj May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

shrinandj approved these changes May 8, 2024

View reviewed changes

valayDave force-pushed the valay/oss-jobsets branch from 93f8a11 to a25f980 Compare May 8, 2024 23:50

valayDave commented May 9, 2024

View reviewed changes

valayDave force-pushed the valay/oss-jobsets branch 2 times, most recently from 1bc4c17 to 8ca8cd3 Compare May 13, 2024 18:18

valayDave force-pushed the valay/oss-jobsets branch from 8ca8cd3 to 41bf331 Compare May 20, 2024 17:50

valayDave commented May 20, 2024

View reviewed changes

valayDave force-pushed the valay/oss-jobsets branch from 41bf331 to a8207c7 Compare May 20, 2024 20:30

valayDave commented May 20, 2024

View reviewed changes

savingoyal approved these changes May 20, 2024

View reviewed changes

savingoyal merged commit 1a38256 into Netflix:master May 20, 2024
25 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[@parallel on Kubernetes] support for Jobsets #1804

[@parallel on Kubernetes] support for Jobsets #1804

valayDave commented Apr 18, 2024 •

edited

valayDave May 6, 2024

valayDave May 6, 2024

valayDave May 6, 2024

valayDave May 6, 2024

valayDave May 6, 2024

shrinandj May 8, 2024

valayDave May 6, 2024 •

edited

valayDave May 6, 2024

shrinandj May 8, 2024

valayDave May 9, 2024

valayDave May 9, 2024

valayDave May 9, 2024

valayDave May 20, 2024

valayDave May 20, 2024

valayDave May 20, 2024

		return overall_status, control_exit_code, control_pod_failed


		def _construct_jobset_logical_status(jobset, control_pod=None):

		self._group = KUBERNETES_JOBSET_GROUP
		self._version = KUBERNETES_JOBSET_VERSION

		)


		class RunningJobSet(object):

		)


		class KubernetesJobSet(object):

[@parallel on Kubernetes] support for Jobsets #1804

[@parallel on Kubernetes] support for Jobsets #1804

Conversation

valayDave commented Apr 18, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

valayDave May 6, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

valayDave commented Apr 18, 2024 •

edited

valayDave May 6, 2024 •

edited