Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the knowledge about Launcher and Worker to CRD #519

Open
tenzen-y opened this issue Feb 7, 2023 · 3 comments
Open

Set the knowledge about Launcher and Worker to CRD #519

tenzen-y opened this issue Feb 7, 2023 · 3 comments

Comments

@tenzen-y
Copy link
Member

tenzen-y commented Feb 7, 2023

/kind feature

We have API violations related to maps in the following:

// MPIReplicaSpecs contains maps from `MPIReplicaType` to `ReplicaSpec` that
// specify the MPI replicas to run.
MPIReplicaSpecs map[MPIReplicaType]*common.ReplicaSpec `json:"mpiReplicaSpecs"`

// replicaStatuses is map of ReplicaType and ReplicaStatus,
// specifies the status of each replica.
// +optional
ReplicaStatuses map[MPIReplicaType]*ReplicaStatus `json:"replicaStatuses,omitempty"`

So we should use named subobjects instead of maps in the following:

type MPIReplicaSpecs struct {
  Launcher common.ReplicaSpec `json:"launcher"`
  Worker common.ReplicaSpec     `json:"worker"`
}
type MPIReplicaStatuses struct {
  Launcher *ReplicaStatus `json:"launcher"`
  Worker *ReplicaStatus     `json:"worker"`
}

However, we should work on this in the new API version such as v2beta2 or v2 since this change doesn't have backward compatibility.

Original comments by @alculquicondor.

We are losing the knowledge about Launcher and Worker here.
But this is a problem with the API structs themselves.

We shouldn't be using a map https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#lists-of-named-subobjects-preferred-over-maps

Maybe we can fix it in a v2(beta2) API, but it would be nice to fix other training job objects, like so:

type MPIReplicaSpecs struct {
  Launcher common.ReplicaSpec `json:"launcher"`
  Worker common.ReplicaSpec `json:"worker"`
 }

#510 (comment)

another violation of API conventions T_T
But we can only change it in a new API version.

#514 (comment)

@alculquicondor
Copy link
Collaborator

I wouldn't advice using lists in these 2 cases, because the roles are not going to change. They will always be launcher and workers, so we should reflect that in a struct.

@alculquicondor
Copy link
Collaborator

It's also worth changing training-operator at the same time

@tenzen-y
Copy link
Member Author

tenzen-y commented Feb 7, 2023

I wouldn't advice using lists in these 2 cases, because the roles are not going to change. They will always be launcher and workers, so we should reflect that in a struct.

Oh... Yes, that's right. Updated the description.

It's also worth changing training-operator at the same time

I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants