Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log warning when failing to parse openmetrics response #17514

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Nevon
Copy link
Contributor

@Nevon Nevon commented May 3, 2024

What does this PR do?

Log a warning when the plain text prometheus metric parsing fails, including the line that it failed to parse. For example:

2024-05-15 14:58:11 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:131 in LogMessage) | openmetrics:c2c_prometheus:21fa576d88f0d1f8 | (mixins.py:490) | Failed to parse metric response 'histogram_bucket{le="0.01"💣} 0': Invalid metric name: histogram_bucket{le="0.01"💣}

Motivation

This issue

Currently the text_fd_to_metric_families just raises generic errors that don't provide enough context to know what it was that couldn't be parsed, which is problematic when deployed into environments where a single agent is parsing metrics from many different sources via autodiscovery.

Additional Notes

I'm very much not a Python developer, so let me know if what I've done is very unidiomatic. I considered wrapping the exception in my own ParserException, but from reading the other checks that didn't seem like a common thing to do. I also wasn't able to actually run the integration tests locally, because they seem to only work with Docker and not docker-compatible clients like nerdctl, but I'm assuming the CI pipeline will.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Changelog entries must be created for modifications to shipped code
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Copy link

codecov bot commented May 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.72%. Comparing base (2a6ec26) to head (52e6dd0).
Report is 103 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
aerospike 87.17% <ø> (+0.32%) ⬆️
airflow ?
cassandra_nodetool 93.16% <ø> (?)
ceph 91.07% <ø> (ø)
clickhouse 94.18% <ø> (-1.47%) ⬇️
couch ?
couchbase ?
disk 82.30% <ø> (?)
dns_check 90.97% <ø> (-2.36%) ⬇️
dotnetclr ?
elastic 93.36% <ø> (ø)
exchange_server ?
gearmand ?
gitlab 91.95% <ø> (?)
glusterfs 80.09% <ø> (+0.90%) ⬆️
haproxy 95.13% <ø> (+10.53%) ⬆️
harbor ?
ibm_was ?
iis 93.98% <ø> (+39.04%) ⬆️
marklogic 96.09% <ø> (ø)
mongo 95.59% <ø> (?)
mysql 87.83% <ø> (?)
oracle 88.33% <ø> (ø)
pdh_check 97.36% <ø> (+1.71%) ⬆️
postgres ?
rabbitmq 95.73% <ø> (+20.95%) ⬆️
redisdb ?
sap_hana ?
scylla ?
snmp 95.76% <ø> (+3.81%) ⬆️
sonarqube 98.24% <ø> (ø)
sqlserver 88.25% <ø> (+0.04%) ⬆️
ssh_check ?
tcp_check 88.16% <ø> (-3.42%) ⬇️
vault ?
vertica 98.34% <ø> (+12.56%) ⬆️
voltdb ?
vsphere ?
windows_service 93.86% <ø> (?)
zk ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link

github-actions bot commented May 3, 2024

Test Results

 1 046 files   1 046 suites   6h 38m 45s ⏱️
 7 053 tests  6 976 ✅    75 💤 1 ❌ 1 🔥
32 084 runs  26 588 ✅ 5 494 💤 1 ❌ 1 🔥

For more details on these failures and errors, see this check.

Results for commit 9663a00.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@iliakur iliakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nevon thanks a bunch for your contribution!!

Apologies for the delay in response. Here are my thoughts.

Let's bake the logging functionality into text_fd_to_metric_families.
We'd need to make the following changes.
Move the parsing logic as-is to some private function in the same module as text_fd_to_metric_families, eg _parse_payload.

Then change text_fd_to_metric_families to be something along the following lines:

from itertools import tee
...
def text_fd_to_metric_families(fd):
    raw_lines, input_lines = tee(fd, n=2)
    try:
        for raw_line, metric_family in zip(raw_lines, _parse_payloads(input_lines):
            yield metric_family
    except Exception as e:
        raise ValueError("Failed to parse the metric response {}: {}".format(raw_line, e))

A few notes:

  • using itertools.tee is less conceputal overhead imo than a custom class
  • Let's just raise an exception with the relevant info. This is a crash, logging can be taken care of.

@Nevon
Copy link
Contributor Author

Nevon commented May 17, 2024

Let's bake the logging functionality into text_fd_to_metric_families.

Are you sure? The reason I didn't do that in there, which would have been simpler to do, is because it's a straight up copy from the Python prometheus client, with some minor change to avoid a backwards incompatible change made in the upstream (I didn't look much into it). If I change more things in there it'll become harder to get rid of this technical debt and use the upstream directly.

I'm fine with making those changes, but I just want confirmation that you're aware of this before I do.

@iliakur
Copy link
Contributor

iliakur commented May 20, 2024

@Nevon good point! I still think we can proceed with the "baked in" approach:

  1. It's been 4 years since that code was committed. Not sure we're ever getting rid of that debt.
  2. Note how in the sketch I posted we don't modify the parsing logic but rather wrap it? This is on purpose, I too don't want to mix the book-keeping with the parsing. We can put a nice fat comment next to or inside the _parse function to make it explicit that we don't want to touch this parsing logic.

@Nevon Nevon force-pushed the log-openmetrics-parsing-error branch 3 times, most recently from b5f58c3 to e891e6e Compare May 23, 2024 13:53
@Nevon Nevon force-pushed the log-openmetrics-parsing-error branch from e891e6e to 3ad3d0a Compare May 23, 2024 14:49
def text_fd_to_metric_families(fd):
raw_lines, input_lines = tee(fd, 2)
try:
for raw_line, metric_family in zip(raw_lines, _parse_payload(input_lines)): # noqa: B007
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to disable the "loop control variable not used in body" rule for this line. Technically I could also have worked around it like:

try:
    for _, metric_family in zip(raw_lines, _parse_payload(input_lines)):
        yield metric_family
except Exception as e:
    raise ValueError("Failed to parse the metric response '{}': {}".format(_, e))

But that's making the code worse to please the linter, in my eyes.

@Nevon
Copy link
Contributor Author

Nevon commented May 23, 2024

Went ahead and implemented the changes we talked about. Thanks for teaching me about tee. As a non-python developer I'm not particularly familiar with the standard library, and my googling didn't lead me to any better way to get a reference to the current element of an iterator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants