Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[collectd 6] gpu_sysman: change GPU metric labels to resource attributes #4300

Open
wants to merge 4 commits into
base: collectd-6.0
Choose a base branch
from

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Apr 17, 2024

ChangeLog: gpu_sysman: change GPU metric labels to resource attributes

Changes to Sysman plugin:

  • Move current GPU attributes from metric labels to resource attributes
  • Include additional GPU attributes recommended by the OTEL spec

Changes elsewhere:

  • resource.c: WA for GCC's warning bug when building gpu_sysman_test.c with more extensive GCC warnings

Split from #4267.

@eero-t eero-t added the Feature label Apr 17, 2024
@eero-t eero-t requested review from a team as code owners April 17, 2024 15:08
@collectd-bot collectd-bot added this to the 6.0 milestone Apr 17, 2024
@eero-t eero-t marked this pull request as draft April 17, 2024 15:08
@eero-t
Copy link
Contributor Author

eero-t commented Apr 17, 2024

Hm. On the two HWs / kernels I just tested, one did not report any FW info, and the other reported 2 of them. This PR reports info for just the last FW, as OTEL spec seems to assume there only being single FW => I'll file bug against OTEL spec.

(FW info support is almost half of the changes in PR.)

@eero-t
Copy link
Contributor Author

eero-t commented Apr 17, 2024

Prometheus adds exported_ prefix to job & instance labels, which can be remapped with serviceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: collectd
  namespace: monitoring
spec:
  endpoints:
  - metricRelabelings:
    - sourceLabels: [exported_instance]
      targetLabel: pci_bdf
    - action: labeldrop
      regex: '^exported_.*$'
  selector:
    matchLabels:
      app: collectd

This is in draft state because resource attributes make querying metrics from Prometheus much more complicated.

EDIT: In most cases I think one would be showing data from target_info separately, so maybe that is not so bad. Apparently there's also some activity on Prometheus side to help with this.

But if one does wants any of the resource attributes with the queried metrics, one needs to:

  • specify each of the labels separately in the query
  • for each metric

I.e. instead of getting all metrics for given $NODE from $SERVICE with query like this:
{node="$NODE", service="$SERVICE"}

One needs to now query each of them separately with a query like this:
collectd_gpu_sysman_temperature_celsius{node="$NODE"} * on (exported_instance) group_left($LABELS) target_info

Which gives items like this (when LABELS=model,pci_bdf,pci_dev):

      {
        "metric": {
          "container": "collectd",
          "endpoint": "http",
          "instance": "10.32.130.125:9103",
          "job": "collectd",
          "location": "gpu-max",
          "model": "Intel(R) Iris(R) Xe MAX Graphics",
          "namespace": "monitoring",
          "node": "testnode",
          "pci_bdf": "0000:0a:00.0",
          "pci_dev": "0x4905",
          "pod": "collectd-d9m6z",
          "service": "collectd"
        },
        "value": [
          1713363392.117,
          "30"
        ]
      }

@eero-t
Copy link
Contributor Author

eero-t commented Apr 17, 2024

Looking at the OTEL HW spec: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#common-hardware-attributes

It seems that all HW devices should provide also id and name attrributes, in addition to similar GPU model and serial attributes?

FYI: I've raised these and the earlier noticed issues in following OTEL spec ticket: open-telemetry/semantic-conventions#940

@eero-t
Copy link
Contributor Author

eero-t commented Apr 30, 2024

Updated code according to feedback given in OTEL bug tracker, which dealt with the issues I raised above.

Will keep this still as "Draft" though, until #4299 is merged, the discussion in OTEL ticket has winded down and I've tested this a bit more.

Only "job" (which is set to "collectd" service name), and "instance"
labels can be used to tie resource attributes in "target_info" to the
metrics.

As metrics do not have any more explicit "pci_bdf" / "dev_file"
labels, Prometheus plugin would skip metrics with the same name unless
there's "instance" label identifying them having a different origin.

=> Use "pci_bdf" as "service.instance.id" (the unique device ID).

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Device UUID is provided as "id" and "model" as "name" attribute,
as those are required & recommended OTEL attributes for all HW:
https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
OTEL "firmware_version" attribute value is a free-form text
containing info on all firmwares (returned by L0 API):
open-telemetry/semantic-conventions#940

To test this, change enumeration and property function generator
macros to support multiple handles for property arrays.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
When file includes both common.c & resource.c, i.e. GCC (13.2.1)
sees how latter uses functions from former, there's:
-------------------------------------------------------------
utils/common/common.c: In function ‘otel_resource_attributes’:
utils/common/common.c:92:3: error: ‘__builtin_strncpy’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
   92 |   strncpy(dest, src != NULL ? src : "", n - 1);
      |   ^
-------------------------------------------------------------
Although the next line from sstrncpy() _does_ set terminator...

As this is the only place I've seen this, increase stack array size to
keep the otherwise useful GCC warning option.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t eero-t marked this pull request as ready for review May 20, 2024 15:16
@eero-t
Copy link
Contributor Author

eero-t commented May 20, 2024

Questions answered in OTEL ticket and code changed to match => dropped Draft status.

Rebased GCC warning WA & FW attribute commits to be last ones, in case they should be dropped from this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants