WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

eero-t · 2022-08-25T17:58:30Z

This PR depends on #1114 + #1116 PRs being merged first.

It offers two alternatives for integrating support for installing GPU device plugin to a cluster with fake device support. Either by adding separate gpu_fakedev YAML directory which configures GPU plugin differently from start, or by providing kustomization overlay for existing GPU plugin YAML files.

(Once there's consensus which approach is better, I'll remove the other one.)

I'd prefer adding just an overlay for existing GPU plugin YAMLs, but I have not been able to force kustomize to order init containers as needed. Any advice?

PS. The whole picture and initial review comments are in the RFC PR #1104, from which this is split off.

eero-t · 2022-08-26T16:29:04Z

Updated the kustomize change to pass CI test. If kustomize cannot be made to work (order init containers as needed) that change (along with first 2 commits) is redundant.

eero-t · 2022-08-29T08:35:50Z

How the new YAMLs work:

Fake device generator (with suitable fake configuration file) is run as the first init container, and (existing) NFD hook init container produces NFD feature file instead installing hook binary to host. Sysfs and devfs content created by the fake device generator is mounted both to NFD hook init container and GPU plugin container.

Devfs needs to be a real host directory so that container runtime can mount the assigned (fake) device file to a (fake) workload, and workload can then report to test runner which device was assigned to it. Because current GPU plugin + container runtime require host and GPU plugin container paths to match, and container runtime does not allow creating files to regular sysfs/devfs paths, different paths need to be used for the fake ones.

eero-t · 2022-08-29T10:43:22Z

If kustomize init container order cannot be forced so that fake device generator runs first, init container writing NFD features file could be changed into a regular container running forever (that's what needs to be done eventually any way, when NFD drops hook support).

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

eero-t · 2022-10-27T15:31:02Z

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

eero-t · 2022-11-23T15:15:35Z

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

Additional month passed. Any comments?

Note: fakedev-exporter project relies on this, as documented here: https://github.com/intel/fakedev-exporter/blob/main/deployments/README.md

mythi · 2022-12-07T17:05:27Z

Additional month passed. Any comments?

is it still WIP?

eero-t · 2022-12-07T17:20:46Z

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

mythi · 2022-12-07T17:34:07Z

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

Oh OK, I've not looked at it at all since it says WIP :-)

mythi

+1 for the overlay base approach

mythi · 2022-12-08T06:14:41Z

deployments/gpu_plugin/overlays/fake_devices/kustomization.yaml

@@ -0,0 +1,15 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization


Suggested change

kind: Kustomization

kind: Kustomization

nameSuffix: -fake

@mythi I'm not really sure about adding -fake suffix to all objects: service, serviceAccount, clusterRole, clusterRoleBinding, configMap, daemonSet.

I think it would be rare to run both fake and real GPU plugin at the same time in the same cluster, but even if you would:

Only daemonSet would need a different name

Generator configMap is unique to fake device plugin deployment, so it does not need a new name

Rest are fine to be shared between real and fake GPU plugins

What if it would use a different namespace instead?

E.g. if one would want operator to support also fake plugin, wouldn't different namespace be easier for that, than every object having a different name?

I added the suggestion since the "standalone" daemonset was named with that -fake. I'm OK not to use this suggestion if you think it make more sense as it is.

Is there a way to change just the deployment name with kustomize?

Even if one would use separate namespace for the whole thing, it may still make sense to have -fake suffix for the deployment name, just to make sure nobody confuses it with the real thing e.g. in kubectl get pods -A output.

In my own fake GPU plugin tests, I've used "validation" namespace for it, but something like "fake-validation" would make it clearer that it's not the real thing...

on that namespace topic, one idea could be to add namespace: somefake into kustomization.yaml

Adding new kustomization (base) directory for this is (a bit simpler) alternative than kustomizing existing gpu_plugin/ base, which requires changing current initcontainer to a run-time one, and adding init container for fake device generator. See: intel#1118 Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

eero-t · 2022-12-12T14:06:52Z

+1 for the overlay base approach

@mythi Ok, I split the changes for a new gpu_fakedev kustomize dir to a separate branch:
https://github.com/eero-t/intel-device-plugins-for-kubernetes/commits/kustomize-gpu_fakedev

Merged the remaining commits to 2 and rebased it to latest main.

Except for the dropped large commit (+ a typo fix), content is same as earlier.

EDIT:

Pros for standalone daemonset:

Easy to specify init container order
Easy to rename existing things

Pros for kustomize overlay:

Automatically tracks GPU plugin deployment changes
Easier to take advantage of other GPU plugin overlays

eero-t · 2022-12-12T14:09:52Z

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

Adding new kustomization (base) directory for this is (a bit simpler) alternative than kustomizing existing gpu_plugin/ base, which requires changing current initcontainer to a run-time one, and adding init container for fake device generator. See: intel#1118 Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

mythi · 2022-12-12T16:58:03Z

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

It's OK like this.

Adding new kustomization (base) directory for this is (a bit simpler) alternative than kustomizing existing gpu_plugin/ base, which requires changing current initcontainer to a run-time one, and adding init container for fake device generator. See: intel#1118 Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

Change GPU plugin NFD init container to run-time container: * To work around kustomize inability to enforce correct init container order * This is more likely how things will work once NFD drops support for hooks: kubernetes-sigs/node-feature-discovery#856 Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

eero-t · 2023-02-13T14:24:48Z

Rebased this to main to drop merged #1327.

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

mythi · 2023-04-26T12:48:24Z

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

do we need to wait for the image?

eero-t · 2023-04-26T14:04:30Z

do we need to wait for the image?

IMHO there's not much point in merging kustomize files that rely on non-existing image. Users are unlikely to do their own builds and kustomize the kustomize in this PR to point to their own registry...

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

mythi · 2023-04-26T18:11:57Z

IMHO there's not much point in merging kustomize files that rely on non-existing image.

We have several deployments in the repo without having the images available so this would not be an exception. I don't mind keeping this open but I think it's OK to merge as well.

eero-t · 2024-01-02T17:18:57Z

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

NFD & GPU plugin have moved from hooks to feature files, so this needs to be updated. I'm not sure when I'll have time for that though.

eero-t requested review from hj-johannes-lee, bart0sh, mythi and tkatila as code owners August 25, 2022 17:58

eero-t force-pushed the gpu-plugin-fake branch from fa7cfa7 to eeb989d Compare August 29, 2022 13:31

mythi reviewed Dec 8, 2022

View reviewed changes

eero-t force-pushed the gpu-plugin-fake branch from eeb989d to 7e7c54d Compare December 12, 2022 14:00

eero-t mentioned this pull request Feb 8, 2023

Use more generic name for NFD features host directory volume #1327

Merged

eero-t force-pushed the gpu-plugin-fake branch from 7e7c54d to da4fb39 Compare February 13, 2023 14:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

eero-t commented Aug 25, 2022

eero-t commented Aug 26, 2022

eero-t commented Aug 29, 2022 •

edited

eero-t commented Aug 29, 2022

eero-t commented Oct 27, 2022

eero-t commented Nov 23, 2022

mythi commented Dec 7, 2022

eero-t commented Dec 7, 2022 •

edited

mythi commented Dec 7, 2022

mythi left a comment

mythi Dec 8, 2022

eero-t Dec 12, 2022

mythi Dec 12, 2022

eero-t Dec 12, 2022

mythi Dec 12, 2022

eero-t commented Dec 12, 2022 •

edited

eero-t commented Dec 12, 2022

mythi commented Dec 12, 2022

eero-t commented Feb 13, 2023

mythi commented Apr 26, 2023

eero-t commented Apr 26, 2023

mythi commented Apr 26, 2023

eero-t commented Jan 2, 2024

		@@ -0,0 +1,15 @@
		apiVersion: kustomize.config.k8s.io/v1beta1
		kind: Kustomization

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

Are you sure you want to change the base?

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices #1118

Conversation

eero-t commented Aug 25, 2022

eero-t commented Aug 26, 2022

eero-t commented Aug 29, 2022 • edited

eero-t commented Aug 29, 2022

eero-t commented Oct 27, 2022

eero-t commented Nov 23, 2022

mythi commented Dec 7, 2022

eero-t commented Dec 7, 2022 • edited

mythi commented Dec 7, 2022

mythi left a comment

Choose a reason for hiding this comment

mythi Dec 8, 2022

Choose a reason for hiding this comment

eero-t Dec 12, 2022

Choose a reason for hiding this comment

mythi Dec 12, 2022

Choose a reason for hiding this comment

eero-t Dec 12, 2022

Choose a reason for hiding this comment

mythi Dec 12, 2022

Choose a reason for hiding this comment

eero-t commented Dec 12, 2022 • edited

eero-t commented Dec 12, 2022

mythi commented Dec 12, 2022

eero-t commented Feb 13, 2023

mythi commented Apr 26, 2023

eero-t commented Apr 26, 2023

mythi commented Apr 26, 2023

eero-t commented Jan 2, 2024

eero-t commented Aug 29, 2022 •

edited

eero-t commented Dec 7, 2022 •

edited

eero-t commented Dec 12, 2022 •

edited