Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump NVML to 12.4 via 12.1, 12.2, and 12.3 #123

Merged
merged 5 commits into from
May 24, 2024

Conversation

klueska
Copy link
Contributor

@klueska klueska commented May 16, 2024

No description provided.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may be missing some System* functions for the 12.3 bump.

A general question is whether we want to separate this into separate PRs to better track the version updates? (We need not release each of these, but we could)

@@ -78,18 +78,6 @@ const (
VGPU_NAME_BUFFER_SIZE = 64
// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h
GRID_LICENSE_FEATURE_MAX_COUNT = 3
// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h
VGPU_SCHEDULER_POLICY_UNKNOWN = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is due to the nvml.h change, this is breaking from the point of view of consumers -- especially if we consder that we may be loading and older libnvidia-ml.so version that still has these defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVML only guarantees backwards compatibility within a major release. Are you saying we want to have a stronger guarantee?

@@ -2654,48 +2654,6 @@ func (device nvmlDevice) GetVgpuCapabilities(capability DeviceVgpuCapability) (b
return (capResult != 0), ret
}

// nvml.DeviceGetVgpuSchedulerLog()
func (l *library) DeviceGetVgpuSchedulerLog(device Device) (VgpuSchedulerLog, Return) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these changes mean that we should do each of the verison bumps as separate PRs? At least we then have the option to branch at the 12.1 update and apply fixes there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either that or just tag the appropriate commits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose they're still available even after a merge.

@@ -184,9 +184,6 @@ var (
DeviceGetVgpuCapabilities = libnvml.DeviceGetVgpuCapabilities
DeviceGetVgpuMetadata = libnvml.DeviceGetVgpuMetadata
DeviceGetVgpuProcessUtilization = libnvml.DeviceGetVgpuProcessUtilization
DeviceGetVgpuSchedulerCapabilities = libnvml.DeviceGetVgpuSchedulerCapabilities
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're generating APIs now, could we call out the removal of these functions in a changelog / release notes? (more of a nice-to-have)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were added back in. There were however, 2 legitimate removals:

-   CcuGetStreamState() (int, Return)
-   CcuSetStreamState(int) Return

Not sure if these were intentional...

@@ -78,6 +78,24 @@ const (
VGPU_NAME_BUFFER_SIZE = 64
// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h
GRID_LICENSE_FEATURE_MAX_COUNT = 3
// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h
VGPU_SCHEDULER_POLICY_UNKNOWN = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I see, they re-added them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they were removed in 12.1 for some reason and then readded in 12.2

@@ -275,6 +275,15 @@ func nvmlDeviceGetSerial(nvmlDevice nvmlDevice, Serial *byte, Length uint32) Ret
return __v
}

// nvmlDeviceGetModuleId function as declared in nvml/nvml.h
func nvmlDeviceGetModuleId(nvmlDevice nvmlDevice, ModuleId *uint32) Return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Can we update c-for-go to return moduleId here instead of ModuleId?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. We could / should look into it.

@@ -102,6 +102,25 @@ func nvmlSystemGetProcessName(Pid uint32, Name *byte, Length uint32) Return {
return __v
}

// nvmlSystemGetHicVersion function as declared in nvml/nvml.h
func nvmlSystemGetHicVersion(HwbcCount *uint32, HwbcEntries *HwbcEntry) Return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a corresponding top-level or interface implementation for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its there already, actually:
https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/system.go#L53

In each of these versions bumps a whole bunch of functions were moved around, so there are a bunch of +s and -s on functions that are really just moves within the file.

@@ -351,6 +351,15 @@ func nvmlDeviceClearCpuAffinity(nvmlDevice nvmlDevice) Return {
return __v
}

// nvmlDeviceGetNumaNodeId function as declared in nvml/nvml.h
func nvmlDeviceGetNumaNodeId(nvmlDevice nvmlDevice, Node *uint32) Return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Does this mean we don't need to use the heuristics that we currently use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say so. But we need to be aware that this is only available in later releases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we would have to support a fallback.

@klueska
Copy link
Contributor Author

klueska commented May 21, 2024

Do you have further comments here? Or more comments to add to my responses?

@elezar
Copy link
Member

elezar commented May 21, 2024

Do you have further comments here? Or more comments to add to my responses?

No. Looks good. Thanks for the responses.

@klueska
Copy link
Contributor Author

klueska commented May 21, 2024

Let's fix the GPM metrics before merging this so that the change appears in all of the versions 12.0-12.4

This is technically a breaking change, but I can't imagine there being anyone
who creates a variable to one of these types. If they do, I imagine they use
the `:=` syntax and not the explicit `var` syntax, so they won't be naming the
tspe anyway. I'm OK breaking the 1 person who this might affect.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
@elezar elezar mentioned this pull request May 24, 2024
@klueska klueska merged commit 852755d into NVIDIA:main May 24, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants