Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Aug 25, 2022

This PR depends on #1114 + #1116 PRs being merged first.

It offers two alternatives for integrating support for installing GPU device plugin to a cluster with fake device support. Either by adding separate gpu_fakedev YAML directory which configures GPU plugin differently from start, or by providing kustomization overlay for existing GPU plugin YAML files.

(Once there's consensus which approach is better, I'll remove the other one.)

I'd prefer adding just an overlay for existing GPU plugin YAMLs, but I have not been able to force kustomize to order init containers as needed. Any advice?

PS. The whole picture and initial review comments are in the RFC PR #1104, from which this is split off.

@eero-t
Copy link
Contributor Author

eero-t commented Aug 26, 2022

Updated the kustomize change to pass CI test. If kustomize cannot be made to work (order init containers as needed) that change (along with first 2 commits) is redundant.

@eero-t
Copy link
Contributor Author

eero-t commented Aug 29, 2022

How the new YAMLs work:

Fake device generator (with suitable fake configuration file) is run as the first init container, and (existing) NFD hook init container produces NFD feature file instead installing hook binary to host. Sysfs and devfs content created by the fake device generator is mounted both to NFD hook init container and GPU plugin container.

Devfs needs to be a real host directory so that container runtime can mount the assigned (fake) device file to a (fake) workload, and workload can then report to test runner which device was assigned to it. Because current GPU plugin + container runtime require host and GPU plugin container paths to match, and container runtime does not allow creating files to regular sysfs/devfs paths, different paths need to be used for the fake ones.

@eero-t
Copy link
Contributor Author

eero-t commented Aug 29, 2022

If kustomize init container order cannot be forced so that fake device generator runs first, init container writing NFD features file could be changed into a regular container running forever (that's what needs to be done eventually any way, when NFD drops hook support).

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

@eero-t
Copy link
Contributor Author

eero-t commented Oct 27, 2022

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

@eero-t
Copy link
Contributor Author

eero-t commented Nov 23, 2022

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

Additional month passed. Any comments?

Note: fakedev-exporter project relies on this, as documented here: https://github.com/intel/fakedev-exporter/blob/main/deployments/README.md

@mythi
Copy link
Contributor

mythi commented Dec 7, 2022

Additional month passed. Any comments?

is it still WIP?

@eero-t
Copy link
Contributor Author

eero-t commented Dec 7, 2022

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

@mythi
Copy link
Contributor

mythi commented Dec 7, 2022

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

Oh OK, I've not looked at it at all since it says WIP :-)

Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the overlay base approach

@@ -0,0 +1,15 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kind: Kustomization
kind: Kustomization
nameSuffix: -fake

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mythi I'm not really sure about adding -fake suffix to all objects: service, serviceAccount, clusterRole, clusterRoleBinding, configMap, daemonSet.

I think it would be rare to run both fake and real GPU plugin at the same time in the same cluster, but even if you would:

  • Only daemonSet would need a different name
  • Generator configMap is unique to fake device plugin deployment, so it does not need a new name
  • Rest are fine to be shared between real and fake GPU plugins

What if it would use a different namespace instead?

E.g. if one would want operator to support also fake plugin, wouldn't different namespace be easier for that, than every object having a different name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the suggestion since the "standalone" daemonset was named with that -fake. I'm OK not to use this suggestion if you think it make more sense as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to change just the deployment name with kustomize?

Even if one would use separate namespace for the whole thing, it may still make sense to have -fake suffix for the deployment name, just to make sure nobody confuses it with the real thing e.g. in kubectl get pods -A output.

In my own fake GPU plugin tests, I've used "validation" namespace for it, but something like "fake-validation" would make it clearer that it's not the real thing...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on that namespace topic, one idea could be to add namespace: somefake into kustomization.yaml

eero-t added a commit to eero-t/intel-device-plugins-for-kubernetes that referenced this pull request Dec 12, 2022
Adding new kustomization (base) directory for this is (a bit simpler)
alternative than kustomizing existing gpu_plugin/ base, which requires
changing current initcontainer to a run-time one, and adding init
container for fake device generator.

See: intel#1118

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Copy link
Contributor Author

eero-t commented Dec 12, 2022

+1 for the overlay base approach

@mythi Ok, I split the changes for a new gpu_fakedev kustomize dir to a separate branch:
https://github.com/eero-t/intel-device-plugins-for-kubernetes/commits/kustomize-gpu_fakedev

Merged the remaining commits to 2 and rebased it to latest main.

Except for the dropped large commit (+ a typo fix), content is same as earlier.

EDIT:

Pros for standalone daemonset:

  • Easy to specify init container order
  • Easy to rename existing things

Pros for kustomize overlay:

  • Automatically tracks GPU plugin deployment changes
  • Easier to take advantage of other GPU plugin overlays

@eero-t
Copy link
Contributor Author

eero-t commented Dec 12, 2022

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

eero-t added a commit to eero-t/intel-device-plugins-for-kubernetes that referenced this pull request Dec 12, 2022
Adding new kustomization (base) directory for this is (a bit simpler)
alternative than kustomizing existing gpu_plugin/ base, which requires
changing current initcontainer to a run-time one, and adding init
container for fake device generator.

See: intel#1118

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@mythi
Copy link
Contributor

mythi commented Dec 12, 2022

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

It's OK like this.

eero-t added a commit to eero-t/intel-device-plugins-for-kubernetes that referenced this pull request Dec 30, 2022
Adding new kustomization (base) directory for this is (a bit simpler)
alternative than kustomizing existing gpu_plugin/ base, which requires
changing current initcontainer to a run-time one, and adding init
container for fake device generator.

See: intel#1118

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Change GPU plugin NFD init container to run-time container:
* To work around kustomize inability to enforce correct init container order
* This is more likely how things will work once NFD drops support for hooks:
  kubernetes-sigs/node-feature-discovery#856

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Copy link
Contributor Author

eero-t commented Feb 13, 2023

Rebased this to main to drop merged #1327.

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

@mythi
Copy link
Contributor

mythi commented Apr 26, 2023

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

do we need to wait for the image?

@eero-t
Copy link
Contributor Author

eero-t commented Apr 26, 2023

do we need to wait for the image?

IMHO there's not much point in merging kustomize files that rely on non-existing image. Users are unlikely to do their own builds and kustomize the kustomize in this PR to point to their own registry...

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

@mythi
Copy link
Contributor

mythi commented Apr 26, 2023

IMHO there's not much point in merging kustomize files that rely on non-existing image.

We have several deployments in the repo without having the images available so this would not be an exception. I don't mind keeping this open but I think it's OK to merge as well.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 2, 2024

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

NFD & GPU plugin have moved from hooks to feature files, so this needs to be updated. I'm not sure when I'll have time for that though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants