
datadog: discover host inventory tags from environment or metrics stream #29700

Open
ringerc opened this issue Dec 8, 2023 · 9 comments
Labels: enhancement (New feature or request), exporter/datadog (Datadog components), priority:p2 (Medium), Stale


ringerc commented Dec 8, 2023

Component(s)

exporter/datadog

Is your feature request related to a problem? Please describe.

When the datadog exporter sends a host inventory entry to Datadog's backend, it does not have any auto-discovered tags associated with it. Nor could I find clear documentation of the set of tags Datadog expects (like availability-zone), or of the structure/format of the values it accepts for each cloud provider.

This causes all hosts to show up in the host inventory in the "no-availability-zone" group, with no tags, e.g.

(screenshot omitted: host inventory with all hosts grouped under "no-availability-zone" and carrying no tags)

The collector may have pipelines configured with processors to enrich the metrics and logs streams with appropriate resource attributes like cloud.availability_zone. These appear to be ignored by the Datadog host_metadata exporter, and there's no clear/documented means I can find of setting them appropriately.

Describe the solution you'd like

If the datadog.host_metadata.hostname_source option is set to first_resource, OpenTelemetry semantic conventions should be used to map the standard tags on the resource payload to Datadog's internally expected tags for the host metadata. This mapping should be clearly documented - even if it's just a link to the relevant part of the code from the README, not hidden away in some other repo's golang code.

This would "do the right thing" for a daemonset-based collector that uses the cloud discovery resource processor and/or k8s attributes processor.
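For illustration, the kind of mapping I have in mind would be roughly the following (the Datadog tag names here are purely illustrative, not what the exporter currently does):

    cloud.availability_zone  ->  availability-zone
    cloud.region             ->  region
    cloud.provider           ->  cloud_provider
    k8s.cluster.name         ->  kube_cluster_name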

Ideally, it should be possible to specify criteria for which metric/resource these values are taken from, rather than just picking whatever comes first. That would ensure more stable and reliable node tags.

The set of host tags that the datadog backend ascribes special meanings to should be clearly documented.

The interaction of any explicit list of tags set as datadog.host_metadata.tags with auto discovered tags should be documented.

Describe alternatives you've considered

It could be possible to inject the tags manually by setting datadog.host_metadata.tags using env-vars injected into the collector's DaemonSet workload via external means.

But this is difficult and impractical. Not all of that information is necessarily known by whatever is deploying the workload, or available in the format that Datadog expects. It's also hard to know which tags DD actually expects to have values, and what "spelling" of those values it expects for e.g. cloud provider zone names.
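For illustration, such a manual workaround might look roughly like this (the env-var names are hypothetical and would have to be populated by whatever deploys the collector):

    exporters:
      datadog:
        host_metadata:
          # values must come from outside the collector, e.g. Helm values or CI
          tags:
            - "availability-zone:${env:NODE_ZONE}"
            - "region:${env:NODE_REGION}"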

The kube downward-api is not suitable for this because most of the desirable information is present as labels on the kube Node, but not easily injected into the workload definition (DaemonSet etc) or Pod. The downward API does not provide a means of injecting labels from the containing node into a workload. So while Node labels like topology.kubernetes.io/zone, topology.kubernetes.io/region and node.kubernetes.io/instance-type are present, they are not easily mapped to env-vars that can be interpolated into the tags values.

The plugin doesn't appear to support using the kube API to discover and map node metadata. Nor should it, really; it'd be better to delegate this to the resource processors.

The kube downward-api doesn't support mapping node labels to pod workloads: kubernetes/kubernetes#40610. Even if it did, that'd be verbose and unnecessary configuration when the collector should be able to query the kube apiserver for this info, or read it via a processor.

Available workarounds are very ugly, see e.g. https://gmaslowski.com/kubernetes-node-label-to-pod/
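For reference, the processor-based enrichment alluded to above might look roughly like this (a sketch only; the detectors and the attributes they populate depend on the environment):

    processors:
      resourcedetection/cloud:
        # populates cloud.provider, cloud.region, cloud.availability_zone, host.name, ...
        # where the chosen detector supports them
        detectors: [azure, gcp, ec2]
        timeout: 2s
        override: false
      k8sattributes:
        extract:
          metadata:
            - k8s.node.name
            - k8s.namespace.name
            - k8s.pod.name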

Additional context

(screenshots omitted: sample metrics showing their resource attributes, and the corresponding host entry in Datadog)

See how a sample set of metrics has sensible resource attributes, but these aren't reflected in the host tags or mapped to Datadog's "standard" tag names?

Also note the hostname is the internal cloud provider ID of the node, even though I actually set "datadog.hostname" in the config to the k8s node name.

ringerc added the enhancement and needs triage labels on Dec 8, 2023
github-actions bot added the exporter/datadog label on Dec 8, 2023

github-actions bot commented Dec 8, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


ringerc commented Dec 13, 2023

Related issue with host metadata exporter not seeming to respect configured or discovered host name, using internal cloud-provider host-id instead: #29866

ringerc closed this as completed on Dec 18, 2023
ringerc reopened this on Dec 18, 2023

ringerc commented Dec 18, 2023

This issue is closely related to #29866 (comment)

It looks like the exporter has some incomplete code for tag discovery on GCP as part of opentelemetry-mapping-go. It has some hacks for AWS tagging in opentelemetry-collector-contrib's exporter/datadogexporter/internal/hostmetadata. But nothing consistent or clear.


mx-psi commented Jan 12, 2024

Hey, thanks for this issue. In general I agree this is not supported, but let me try to reply to some of the parts of your message to clarify and expand on what that means:

If the datadog.host_metadata.hostname_source option is set to first_resource, OpenTelemetry semantic conventions should be used to map the standard tags on the resource payload to Datadog's internally expected tags for the host metadata.

I mentioned this briefly on #29741 (comment), we are working on something like this. It's not going to work like this exactly (as I said on #29866 (comment) this is not how first_resource works).

This mapping should be clearly documented - even if it's just a link to the relevant part of the code from the README, [...]

We are working on improving our docs as well, both on the hostname part as well as the host metadata work I mentioned above. Stay tuned!

[...] not hidden away in some other repo's golang code.

We have a common repository for the mapping of OpenTelemetry since we reuse it also in the Datadog Agent's OTLP ingest implementation, that's why it's on a separate repository. No matter how we do it the implementation is going to be 'hidden' from one of the two repositories (Agent and Collector). If you think there is a way to improve visibility around this, I am happy to take any specific feedback you have.

The set of host tags that the datadog backend ascribes special meanings to should be clearly documented.

Agreed, this is not something I can personally help with since it's a general Datadog aspect, but I can relay the feedback.

See how a sample set of metrics has sensible resource attributes, but these aren't reflected in the host tags or mapped to Datadog's "standard" tag names?

We do map some of them (https://github.com/DataDog/opentelemetry-mapping-go/blob/a7afc4a370f8df1ada06e2af22fde3ee1d0dd84e/pkg/otlp/attributes/attributes.go#L28-L95) and have other users relying on this mapping; if this isn't working for you, it might be a bug on our end or a misconfiguration on yours.


ringerc commented Jan 31, 2024

@mx-psi The mappings you list at the end of your comment work fine for telemetry. The issue is that there's no promotion of important ones like cloud.region to the host metadata in the DD host inventory.

It appears that your recent changes in #30680 may offer a workaround for this, once suitably documented, going by the changelog entry for https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.93.0:

datadogexporter: Add support for setting host tags via host metadata. (#30680)
When the datadog.host.use_as_metadata resource attribute is set to true:

  • Nonempty string-value resource attributes starting with datadog.host.tag. will be added as host tags for the host associated with the resource.
  • deployment.environment and k8s.cluster.name are mapped to Datadog names and added as host tags for the host associated with the resource.

If I understand correctly, this allows the telemetry stream to mark a resource as a source for DD host tags, and to copy any resource attributes it wants to appear as host tags by giving them the datadog.host.tag. prefix.
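If that reading is right, a resource carrying attributes along these lines (values illustrative) should yield host tags such as region:australiaeast for the associated host:

    # illustrative resource attributes on incoming telemetry
    datadog.host.use_as_metadata: true
    datadog.host.tag.region: australiaeast
    datadog.host.tag.availability_zone: australiaeast-1
    k8s.cluster.name: my-cluster        # mapped to a Datadog host tag per the changelog entry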


ringerc commented Jan 31, 2024

@mx-psi I've tested the new functionality added to the DD exporter and can indeed see host tags delivered, but

a. it seems to break the node configuration info; and
b. it takes a long time for the tags to show up, about 30 minutes from first deployment


Very delayed tag sending

However, it seems to have a bug. If there is only one resource with datadog.host.use_as_metadata set, the tags take about 30 minutes to be sent. It only appears to log one Sending host metadata payload event just after startup.

At a guess, if the telemetry payload with datadog.host.use_as_metadata set has not yet been seen by the time the first host metadata is sent, the exporter doesn't invalidate its cache and re-send the metadata.

If there are two different resources with datadog.host.use_as_metadata set, and they have different tag sets, then the exporter will log Host metadata changed for host after payload and Sending host metadata payload, with the desired set of tags. But in this case it will repeatedly send the metadata over and over, which is probably not appreciated by the DD backend.

With only one resource having datadog.host.use_as_metadata: true:

{"level":"debug","ts":1706740894.1891978,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"aks-d9a2706c0-25710132-vmss000000","socket-hostname":"upm-telemetry-forwarder-node-agent-fn4r7"},"internalHostname":"aks-d9a2706c0-25710132-vmss000000","otel_version":"0.93.0","agent-flavor":"otelcol-contrib","host-tags":{},"gohai":"{\"cpu\":{\"cache_size\":\"36608 KB\",\"cpu_cores\":\"1\",\"cpu_logical_processors\":\"2\",\"family\":\"6\",\"mhz\":\"2593.904\",\"model\":\"85\",\"model_name\":\"Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\",\"stepping\":\"7\",\"vendor_id\":\"GenuineIntel\"},\"filesystem\":null,\"memory\":{\"swap_total\":\"0kB\",\"total\":\"8129884kB\"},\"network\":{\"interfaces\":[{\"ipv4\":[\"10.240.0.211\"],\"ipv4-network\":\"10.240.0.0/15\",\"ipv6\":[\"fe80::286d:e9ff:febc:cb75\"],\"ipv6-network\":\"fe80::/64\",\"macaddress\":\"2a:6d:e9:bc:cb:75\",\"name\":\"eth0\"}],\"ipaddress\":\"10.240.0.211\",\"ipaddressv6\":\"fe80::286d:e9ff:febc:cb75\",\"macaddress\":\"2a:6d:e9:bc:cb:75\"},\"platform\":null}","resources":{"processes":{"snaps":[[1706740894,[]]]},"meta":{"host":"aks-d9a2706c0-25710132-vmss000000"}}}}

(then no further payloads for ~30 minutes)

With two or more resources having datadog.host.use_as_metadata and different tag sets:

{"level":"debug","ts":1706739383.160596,"caller":"inframetadata@v0.13.1/reporter.go:139","msg":"Host metadata changed for host after payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","host":"8ef1d832-31b4-41c5-acb0-58848995c875","attributes":{}}
{"level":"debug","ts":1706739383.1606739,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"8ef1d832-31b4-41c5-acb0-58848995c875"},"internalHostname":"8ef1d832-31b4-41c5-acb0-58848995c875","otel_version":"","agent-flavor":"otelcol-contrib","host-tags":{"otel":["biganimal_cluster:p-x2d8kah40r","biganimal_instance:p-x2d8kah40r-1","biganimal_instance_role:primary","cloud_platform:azure_aks","region:australiaeast"]},"gohai":"{\"cpu\":{},\"filesystem\":[],\"memory\":{},\"network\":{},\"platform\":{\"hostname\":\"8ef1d832-31b4-41c5-acb0-58848995c875\"}}","resources":null}}
{"level":"debug","ts":1706739383.167567,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"aks-d9a2706c0-25710132-vmss000000","socket-hostname":"upm-telemetry-forwarder-node-agent-qxvtp"},"internalHostname":"aks-d9a2706c0-25710132-vmss000000","otel_version":"0.93.0","agent-flavor":"otelcol-contrib","host-tags":{},"gohai":"{\"cpu\":{\"cache_size\":\"36608 KB\",\"cpu_cores\":\"1\",\"cpu_logical_processors\":\"2\",\"family\":\"6\",\"mhz\":\"2593.904\",\"model\":\"85\",\"model_name\":\"Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\",\"stepping\":\"7\",\"vendor_id\":\"GenuineIntel\"},\"filesystem\":null,\"memory\":{\"swap_total\":\"0kB\",\"total\":\"8129884kB\"},\"network\":{\"interfaces\":[{\"ipv4\":[\"10.240.0.23\"],\"ipv4-network\":\"10.240.0.0/15\",\"ipv6\":[\"fe80::7438:6eff:fe2c:1cc2\"],\"ipv6-network\":\"fe80::/64\",\"macaddress\":\"76:38:6e:2c:1c:c2\",\"name\":\"eth0\"}],\"ipaddress\":\"10.240.0.23\",\"ipaddressv6\":\"fe80::7438:6eff:fe2c:1cc2\",\"macaddress\":\"76:38:6e:2c:1c:c2\"},\"platform\":null}","resources":{"processes":{"snaps":[[1706739383,[]]]},"meta":{"host":"aks-d9a2706c0-25710132-vmss000000"}}}}
{"level":"debug","ts":1706739403.9563246,"caller":"inframetadata@v0.13.1/reporter.go:139","msg":"Host metadata changed for host after payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","host":"8ef1d832-31b4-41c5-acb0-58848995c875","attributes":{}}
{"level":"debug","ts":1706739403.95637,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"8ef1d832-31b4-41c5-acb0-58848995c875"},"internalHostname":"8ef1d832-31b4-41c5-acb0-58848995c875","otel_version":"","agent-flavor":"otelcol-contrib","host-tags":{"otel":["cloud_platform:azure_aks","region:australiaeast"]},"gohai":"{\"cpu\":{},\"filesystem\":[],\"memory\":{},\"network\":{},\"platform\":{\"hostname\":\"8ef1d832-31b4-41c5-acb0-58848995c875\"}}","resources":null}}
...

Confirmed that if I wait long enough, DD will eventually send the extra tags; there's just a long delay.

So it looks likely there's a missed cache invalidation in there, where it doesn't send host metadata when discovering dynamic tags for the first time.

Host info lost when tags enabled

Interestingly, it seems to send the tags separately from the rest of the metadata, and omits other host info when tags are sent.

The payload that has non-empty host-tags has "gohai":"{\"cpu\":{},\"filesystem\":[],\"memory\":{},\"network\":{},\"platform\":{\"hostname\":\"8ef1d832-31b4-41c5-acb0-58848995c875\"}}","resources":null}

The host info payloads with non-empty "gohai": continue to have empty host-tags.

{"level":"debug","ts":1706742684.1585107,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"aks-d9a2706c0-25710132-vmss000000","socket-hostname":"upm-telemetry-forwarder-node-agent-ngwzx"},"internalHostname":"aks-d9a2706c0-25710132-vmss000000","otel_version":"0.93.0","agent-flavor":"otelcol-contrib","host-tags":{},"gohai":"{\"cpu\":{\"cache_size\":\"36608 KB\",\"cpu_cores\":\"1\",\"cpu_logical_processors\":\"2\",\"family\":\"6\",\"mhz\":\"2593.904\",\"model\":\"85\",\"model_name\":\"Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\",\"stepping\":\"7\",\"vendor_id\":\"GenuineIntel\"},\"filesystem\":null,\"memory\":{\"swap_total\":\"0kB\",\"total\":\"8129884kB\"},\"network\":{\"interfaces\":[{\"ipv4\":[\"10.240.0.233\"],\"ipv4-network\":\"10.240.0.0/15\",\"ipv6\":[\"fe80::4865:6ff:fed3:2cf4\"],\"ipv6-network\":\"fe80::/64\",\"macaddress\":\"4a:65:06:d3:2c:f4\",\"name\":\"eth0\"}],\"ipaddress\":\"10.240.0.233\",\"ipaddressv6\":\"fe80::4865:6ff:fed3:2cf4\",\"macaddress\":\"4a:65:06:d3:2c:f4\"},\"platform\":null}","resources":{"processes":{"snaps":[[1706742684,[]]]},"meta":{"host":"aks-d9a2706c0-25710132-vmss000000"}}}}
{"level":"debug","ts":1706744474.1501598,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"8ef1d832-31b4-41c5-acb0-58848995c875"},"internalHostname":"8ef1d832-31b4-41c5-acb0-58848995c875","otel_version":"","agent-flavor":"otelcol-contrib","host-tags":{"otel":["biganimal_cluster:p-x2d8kah40r","biganimal_instance:p-x2d8kah40r-1","biganimal_instance_role:primary","cloud_platform:azure_aks","region:australiaeast"]},"gohai":"{\"cpu\":{},\"filesystem\":[],\"memory\":{},\"network\":{},\"platform\":{\"hostname\":\"8ef1d832-31b4-41c5-acb0-58848995c875\"}}","resources":null}}
{"level":"debug","ts":1706744475.1112905,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"aks-d9a2706c0-25710132-vmss000000","socket-hostname":"upm-telemetry-forwarder-node-agent-ngwzx"},"internalHostname":"aks-d9a2706c0-25710132-vmss000000","otel_version":"0.93.0","agent-flavor":"otelcol-contrib","host-tags":{},"gohai":"{\"cpu\":{\"cache_size\":\"36608 KB\",\"cpu_cores\":\"1\",\"cpu_logical_processors\":\"2\",\"family\":\"6\",\"mhz\":\"2593.904\",\"model\":\"85\",\"model_name\":\"Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\",\"stepping\":\"7\",\"vendor_id\":\"GenuineIntel\"},\"filesystem\":null,\"memory\":{\"swap_total\":\"0kB\",\"total\":\"8129884kB\"},\"network\":{\"interfaces\":[{\"ipv4\":[\"10.240.0.233\"],\"ipv4-network\":\"10.240.0.0/15\",\"ipv6\":[\"fe80::4865:6ff:fed3:2cf4\"],\"ipv6-network\":\"fe80::/64\",\"macaddress\":\"4a:65:06:d3:2c:f4\",\"name\":\"eth0\"}],\"ipaddress\":\"10.240.0.233\",\"ipaddressv6\":\"fe80::4865:6ff:fed3:2cf4\",\"macaddress\":\"4a:65:06:d3:2c:f4\"},\"platform\":null}","resources":{"processes":{"snaps":[[1706742684,[]]]},"meta":{"host":"aks-d9a2706c0-25710132-vmss000000"}}}}
{"level":"debug","ts":1706744484.1544993,"caller":"hostmetadata/metadata.go:123","msg":"Sending host metadata payload","kind":"exporter","data_type":"metrics","name":"datadog/datadog","payload":{"meta":{"hostname":"aks-d9a2706c0-25710132-vmss000000","socket-hostname":"upm-telemetry-forwarder-node-agent-ngwzx"},"internalHostname":"aks-d9a2706c0-25710132-vmss000000","otel_version":"0.93.0","agent-flavor":"otelcol-contrib","host-tags":{},"gohai":"{\"cpu\":{\"cache_size\":\"36608 KB\",\"cpu_cores\":\"1\",\"cpu_logical_processors\":\"2\",\"family\":\"6\",\"mhz\":\"2593.904\",\"model\":\"85\",\"model_name\":\"Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\",\"stepping\":\"7\",\"vendor_id\":\"GenuineIntel\"},\"filesystem\":null,\"memory\":{\"swap_total\":\"0kB\",\"total\":\"8129884kB\"},\"network\":{\"interfaces\":[{\"ipv4\":[\"10.240.0.233\"],\"ipv4-network\":\"10.240.0.0/15\",\"ipv6\":[\"fe80::4865:6ff:fed3:2cf4\"],\"ipv6-network\":\"fe80::/64\",\"macaddress\":\"4a:65:06:d3:2c:f4\",\"name\":\"eth0\"}],\"ipaddress\":\"10.240.0.233\",\"ipaddressv6\":\"fe80::4865:6ff:fed3:2cf4\",\"macaddress\":\"4a:65:06:d3:2c:f4\"},\"platform\":null}","resources":{"processes":{"snaps":[[1706742684,[]]]},"meta":{"host":"aks-d9a2706c0-25710132-vmss000000"}}}}

Also noteworthy, the hostname sent is different for the two sets of host metadata. The tags one sends internalHostname and hostname both set to the low-level cloud provider host-id 8ef1d832-31b4-41c5-acb0-58848995c875. The non-tags gohai one sends both hostname and internalHostname set to aks-d9a2706c0-25710132-vmss000000, the k8s node name. (I'm using config_or_system, and setting an explicit datadog.hostname from kube downward-api, in case it matters).
I wonder if this is the cause of #29866?
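(For reference, the hostname configuration described above is roughly the following; K8S_NODE_NAME is whatever env-var name the DaemonSet defines via the downward API:)

    exporters:
      datadog:
        hostname: ${env:K8S_NODE_NAME}   # injected via fieldRef: spec.nodeName on the DaemonSet pod
        host_metadata:
          hostname_source: config_or_system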

Sending the tags payload also appears to clobber the host system info sent to DD, which is normally populated (per the screenshot above).

So the tags feature looks to introduce a regression too.

Configuration

The config snippet I'm using to set the attributes is, in part:

processors:
  transform/afterdiscovery-vendor:
    metric_statements:
      - context: resource
        statements:
          - |
            set(attributes["datadog.host.use_as_metadata"], true)
            where attributes["service.name"] == "my-collector-name-here"
          # DD host tag "availability_zone"
          - |
            set(attributes["datadog.host.tag.availability_zone"], attributes["cloud.availability_zone"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["cloud.availability_zone"] != nil
          # DD host tag "region"
          - |
            set(attributes["datadog.host.tag.region"], attributes["cloud.region"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["cloud.region"] != nil
          # DD host tag "cloud_provider" (not standard DD host tag in "common keys")
          - |
            set(attributes["datadog.host.tag.cloud_provider"], attributes["cloud.provider"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["cloud.provider"] != nil
          # DD host tag "cloud_platform" (not standard DD host tag in "common keys")
          - |
            set(attributes["datadog.host.tag.cloud_platform"], attributes["cloud.platform"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["cloud.platform"] != nil
          # DD host tag "kube_cluster"
          - |
            set(attributes["datadog.host.tag.kube_cluster"], attributes["k8s.cluster.name"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["k8s.cluster.name"] != nil
          # DD host kube_node_name
          - |
            set(attributes["datadog.host.tag.kube_node_name"], attributes["k8s.node.name"])
            where attributes["datadog.host.use_as_metadata"] == true
              and attributes["k8s.node.name"] != nil


ringerc commented Feb 21, 2024

@mx-psi Not sure if you saw the outcome of testing the datadog.host.tag feature above?


mx-psi commented Feb 29, 2024

@ringerc I was on vacation, thanks for the ping. It will take me some time to go through the resulting backlog, but I will try to have a look at this by the end of next week.

github-actions bot commented Apr 30, 2024
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Apr 30, 2024