New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[collectd 6] gpu_sysman: change GPU metric labels to resource attributes #4300
base: collectd-6.0
Are you sure you want to change the base?
Conversation
Hm. On the two HWs / kernels I just tested, one did not report any FW info, and the other reported 2 of them. This PR reports info for just the last FW, as OTEL spec seems to assume there only being single FW => I'll file bug against OTEL spec. (FW info support is almost half of the changes in PR.) |
Prometheus adds
This is in draft state because resource attributes make querying metrics from Prometheus much more complicated. EDIT: In most cases I think one would be showing data from But if one does wants any of the resource attributes with the queried metrics, one needs to:
I.e. instead of getting all metrics for given One needs to now query each of them separately with a query like this: Which gives items like this (when
|
Looking at the OTEL HW spec: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#common-hardware-attributes It seems that all HW devices should provide also FYI: I've raised these and the earlier noticed issues in following OTEL spec ticket: open-telemetry/semantic-conventions#940 |
Updated code according to feedback given in OTEL bug tracker, which dealt with the issues I raised above. Will keep this still as "Draft" though, until #4299 is merged, the discussion in OTEL ticket has winded down and I've tested this a bit more. |
Only "job" (which is set to "collectd" service name), and "instance" labels can be used to tie resource attributes in "target_info" to the metrics. As metrics do not have any more explicit "pci_bdf" / "dev_file" labels, Prometheus plugin would skip metrics with the same name unless there's "instance" label identifying them having a different origin. => Use "pci_bdf" as "service.instance.id" (the unique device ID). Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Device UUID is provided as "id" and "model" as "name" attribute, as those are required & recommended OTEL attributes for all HW: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/ Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
OTEL "firmware_version" attribute value is a free-form text containing info on all firmwares (returned by L0 API): open-telemetry/semantic-conventions#940 To test this, change enumeration and property function generator macros to support multiple handles for property arrays. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
When file includes both common.c & resource.c, i.e. GCC (13.2.1) sees how latter uses functions from former, there's: ------------------------------------------------------------- utils/common/common.c: In function ‘otel_resource_attributes’: utils/common/common.c:92:3: error: ‘__builtin_strncpy’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation] 92 | strncpy(dest, src != NULL ? src : "", n - 1); | ^ ------------------------------------------------------------- Although the next line from sstrncpy() _does_ set terminator... As this is the only place I've seen this, increase stack array size to keep the otherwise useful GCC warning option. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Questions answered in OTEL ticket and code changed to match => dropped Draft status. Rebased GCC warning WA & FW attribute commits to be last ones, in case they should be dropped from this PR. |
ChangeLog: gpu_sysman: change GPU metric labels to resource attributes
Changes to Sysman plugin:
Changes elsewhere:
resource.c
: WA for GCC's warning bug when buildinggpu_sysman_test.c
with more extensive GCC warningsSplit from #4267.