Bump NVML to 12.4 via 12.1, 12.2, and 12.3 #123

klueska · 2024-05-16T23:18:45Z

No description provided.

elezar

I think we may be missing some System* functions for the 12.3 bump.

A general question is whether we want to split this into separate PRs to better track the version updates. (We need not release each of these, but we could.)

elezar · 2024-05-17T08:08:46Z

pkg/nvml/const.go

@@ -78,18 +78,6 @@ const (
 	VGPU_NAME_BUFFER_SIZE = 64
 	// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h
 	GRID_LICENSE_FEATURE_MAX_COUNT = 3
-	// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h
-	VGPU_SCHEDULER_POLICY_UNKNOWN = 0


Although this is due to the nvml.h change, this is breaking from the point of view of consumers -- especially if we consider that we may be loading an older libnvidia-ml.so version that still has these defined.

NVML only guarantees backwards compatibility within a major release. Are you saying we want to have a stronger guarantee?

elezar · 2024-05-17T08:10:28Z

pkg/nvml/device.go

@@ -2654,48 +2654,6 @@ func (device nvmlDevice) GetVgpuCapabilities(capability DeviceVgpuCapability) (b
 	return (capResult != 0), ret
 }

-// nvml.DeviceGetVgpuSchedulerLog()
-func (l *library) DeviceGetVgpuSchedulerLog(device Device) (VgpuSchedulerLog, Return) {


Do these changes mean that we should do each of the version bumps as separate PRs? At least we then have the option to branch at the 12.1 update and apply fixes there.

Either that or just tag the appropriate commits

I suppose they're still available even after a merge.

elezar · 2024-05-17T08:11:54Z

pkg/nvml/zz_generated.api.go

@@ -184,9 +184,6 @@ var (
 	DeviceGetVgpuCapabilities                       = libnvml.DeviceGetVgpuCapabilities
 	DeviceGetVgpuMetadata                           = libnvml.DeviceGetVgpuMetadata
 	DeviceGetVgpuProcessUtilization                 = libnvml.DeviceGetVgpuProcessUtilization
-	DeviceGetVgpuSchedulerCapabilities              = libnvml.DeviceGetVgpuSchedulerCapabilities


Since we're generating APIs now, could we call out the removal of these functions in a changelog / release notes? (more of a nice-to-have)

They were added back in. There were, however, 2 legitimate removals:

- CcuGetStreamState() (int, Return)
- CcuSetStreamState(int) Return

Not sure if these were intentional...

elezar · 2024-05-17T08:14:03Z

pkg/nvml/const.go

@@ -78,6 +78,24 @@ const (
 	VGPU_NAME_BUFFER_SIZE = 64
 	// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h
 	GRID_LICENSE_FEATURE_MAX_COUNT = 3
+	// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h
+	VGPU_SCHEDULER_POLICY_UNKNOWN = 0


OK. I see, they re-added them here.

Yeah, they were removed in 12.1 for some reason and then re-added in 12.2.

elezar · 2024-05-17T08:15:19Z

pkg/nvml/nvml.go

@@ -275,6 +275,15 @@ func nvmlDeviceGetSerial(nvmlDevice nvmlDevice, Serial *byte, Length uint32) Ret
 	return __v
 }

+// nvmlDeviceGetModuleId function as declared in nvml/nvml.h
+func nvmlDeviceGetModuleId(nvmlDevice nvmlDevice, ModuleId *uint32) Return {


Note to self: Can we update c-for-go to return moduleId here instead of ModuleId?

I'm not sure. We could / should look into it.

elezar · 2024-05-17T08:20:00Z

pkg/nvml/nvml.go

@@ -102,6 +102,25 @@ func nvmlSystemGetProcessName(Pid uint32, Name *byte, Length uint32) Return {
 	return __v
 }

+// nvmlSystemGetHicVersion function as declared in nvml/nvml.h
+func nvmlSystemGetHicVersion(HwbcCount *uint32, HwbcEntries *HwbcEntry) Return {


I don't see a corresponding top-level or interface implementation for this.

It's there already, actually:
https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/system.go#L53

In each of these version bumps a whole bunch of functions were moved around, so there are a bunch of +s and -s on functions that are really just moves within the file.
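For readers unfamiliar with how a generated binding such as nvmlSystemGetHicVersion relates to its top-level wrapper, here is a rough sketch of one common shape: call the binding with a small buffer and grow it while the library reports that the buffer is too small. This is not copied from system.go. The status values, the HwbcEntry field, stubGetHicVersion, and the slice-based signature are all stand-ins assumed for illustration (the generated binding takes a *HwbcEntry).

```go
package main

import "fmt"

// Return is an illustrative status type; these values are stand-ins,
// not the real NVML return codes.
type Return int

const (
	SUCCESS Return = iota
	ERROR_INSUFFICIENT_SIZE
)

// HwbcEntry is a simplified, hypothetical version of the real struct.
type HwbcEntry struct{ HwbcId uint32 }

// available simulates what the library would report.
var available = []HwbcEntry{{1}, {2}, {3}}

// stubGetHicVersion is a hypothetical stand-in for the generated binding.
// It fails if the caller's buffer is too small; otherwise it fills the
// buffer and sets *count to the number of entries written.
func stubGetHicVersion(count *uint32, entries []HwbcEntry) Return {
	if int(*count) < len(available) {
		return ERROR_INSUFFICIENT_SIZE
	}
	copy(entries, available)
	*count = uint32(len(available))
	return SUCCESS
}

// systemGetHicVersion sketches a wrapper: start small, double on
// ERROR_INSUFFICIENT_SIZE, and trim the result to the reported count.
func systemGetHicVersion() ([]HwbcEntry, Return) {
	count := uint32(1)
	for {
		entries := make([]HwbcEntry, count)
		ret := stubGetHicVersion(&count, entries)
		if ret == SUCCESS {
			return entries[:count], ret
		}
		if ret != ERROR_INSUFFICIENT_SIZE {
			return nil, ret
		}
		count *= 2
	}
}

func main() {
	entries, ret := systemGetHicVersion()
	fmt.Println(len(entries), ret == SUCCESS)
}
```

The point is only that the generated function and the public wrapper live in different files, so a diff on nvml.go alone will not show whether the wrapper exists.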

elezar · 2024-05-17T08:21:23Z

pkg/nvml/nvml.go

@@ -351,6 +351,15 @@ func nvmlDeviceClearCpuAffinity(nvmlDevice nvmlDevice) Return {
 	return __v
 }

+// nvmlDeviceGetNumaNodeId function as declared in nvml/nvml.h
+func nvmlDeviceGetNumaNodeId(nvmlDevice nvmlDevice, Node *uint32) Return {


Question: Does this mean we don't need to use the heuristics that we currently use?

I would say so. But we need to be aware that this is only available in later releases.

Yes, we would have to support a fallback.
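A minimal sketch of what such a fallback could look like, assuming the direct query reports a "not supported" or "function not found" style status on older drivers. The numaQuerier interface, the status values, and the heuristic callback are hypothetical names introduced here for illustration; they are not the go-nvml API.

```go
package main

import "fmt"

// Return is an illustrative status type; these values are stand-ins,
// not the real NVML return codes.
type Return int

const (
	SUCCESS Return = iota
	ERROR_NOT_SUPPORTED
	ERROR_FUNCTION_NOT_FOUND
	ERROR_UNKNOWN
)

// numaQuerier is a hypothetical, minimal view of a device that may or
// may not be able to answer a direct NUMA node query.
type numaQuerier interface {
	GetNumaNodeId() (int, Return)
}

// numaNode prefers the direct query and falls back to the supplied
// heuristic only when the driver cannot answer at all.
func numaNode(d numaQuerier, heuristic func() (int, error)) (int, error) {
	node, ret := d.GetNumaNodeId()
	switch ret {
	case SUCCESS:
		return node, nil
	case ERROR_NOT_SUPPORTED, ERROR_FUNCTION_NOT_FOUND:
		// Assumed to be what an older libnvidia-ml.so would report.
		return heuristic()
	default:
		return 0, fmt.Errorf("querying NUMA node: status %d", ret)
	}
}

// fakeDevice simulates a device backed by a newer or older driver.
type fakeDevice struct {
	node int
	ret  Return
}

func (f fakeDevice) GetNumaNodeId() (int, Return) { return f.node, f.ret }

func main() {
	heuristic := func() (int, error) { return 0, nil }

	n, err := numaNode(fakeDevice{node: 1, ret: SUCCESS}, heuristic)
	fmt.Println(n, err)

	n, err = numaNode(fakeDevice{ret: ERROR_FUNCTION_NOT_FOUND}, heuristic)
	fmt.Println(n, err)
}
```

Other errors are surfaced rather than masked by the heuristic, so a genuine failure on a new driver does not silently produce a guessed node.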

klueska · 2024-05-21T16:11:08Z

Do you have further comments here? Or more comments to add to my responses?

elezar · 2024-05-21T17:20:56Z

Do you have further comments here? Or more comments to add to my responses?

No. Looks good. Thanks for the responses.

klueska · 2024-05-21T17:52:51Z

Let's fix the GPM metrics before merging this so that the change appears in all of the versions 12.0-12.4.

This is technically a breaking change, but I can't imagine there being anyone who declares a variable of one of these types. If they do, I imagine they use the `:=` syntax and not the explicit `var` syntax, so they won't be naming the type anyway. I'm OK breaking the 1 person who this might affect.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
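To illustrate the `:=` versus `var` point in the commit message: assuming the break amounts to a type's name changing, only code that spells out the type name is affected. The type gpmSample and the constructor newSample below are hypothetical placeholders, not the actual GPM types.

```go
package main

import "fmt"

// gpmSample is a hypothetical stand-in for a type whose name might
// change between releases.
type gpmSample struct{ id int }

// newSample is a hypothetical constructor returning that type.
func newSample(id int) gpmSample { return gpmSample{id: id} }

func main() {
	// Inferred type: this line never mentions gpmSample, so it would
	// keep compiling if the type were renamed.
	a := newSample(1)

	// Explicit type: this line names gpmSample, so a rename would
	// turn it into a compile error.
	var b gpmSample = newSample(2)

	fmt.Println(a.id, b.id)
}
```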

klueska force-pushed the bump-to-latest-nvml branch from 3e215c6 to e8f227c Compare May 16, 2024 23:24

elezar requested changes May 17, 2024


elezar approved these changes May 21, 2024


klueska added 5 commits May 24, 2024 13:48

Bump to NVML version 12.1

9fc23a0

Signed-off-by: Kevin Klues <kklues@nvidia.com>

Bump to NVML version 12.2

075f45f

Signed-off-by: Kevin Klues <kklues@nvidia.com>

Bump to NVML version 12.3

b57740a

Signed-off-by: Kevin Klues <kklues@nvidia.com>

Bump to NVML version 12.4

f7ceebd

Signed-off-by: Kevin Klues <kklues@nvidia.com>

klueska force-pushed the bump-to-latest-nvml branch from e8f227c to f7ceebd Compare May 24, 2024 13:49

elezar mentioned this pull request May 24, 2024

When is v535 supported #101

Open

klueska merged commit 852755d into NVIDIA:main May 24, 2024
4 checks passed
