Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

processes: do not post metrics from processes that are not running. #4121

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

zsfz
Copy link
Contributor

@zsfz zsfz commented Jun 12, 2023

ChangeLog: processes plugin: do not post metrics from processes that are not running.

Currently, processes plugin will emit metrics even if the a process listed in ProcessMatch is not running. This is unnecessary and causes wasteful network/disk activity.
For process not running all the metrics are 0 but they still submit data. This will lead to a waste of resources.

If a process listed on ProcessMatch runs after collectd is started, those process will start sending metrics.

Please help review. Thanks a lot!

Signed-off-by: sujankhadka <sujankhadka@didiglobal.com>
Signed-off-by: tiozhang <zyhtheonly@yeah.net>
Signed-off-by: luffysong <zsfz_one@163.com>
@zsfz zsfz requested a review from a team as a code owner June 12, 2023 10:29
@eero-t
Copy link
Contributor

eero-t commented Aug 18, 2023

Zombie (exited, but non-reaped) processes can still keep some system resources, until they are reaped by their parent (or init if parent has disappeared). And their presence is also an indication of some potential system issue, so skipping them should IMHO be at least a user-controllable option.

@mrunge
Copy link
Member

mrunge commented Aug 28, 2023

I do agree with @eero-t , could you please add a switch to skip zombies and make it not-skip by default? That way, it would also be fully backwards-compatible.

@eero-t
Copy link
Contributor

eero-t commented Aug 28, 2023

If one sees large number of zombies in process list, that's the issue that should be fixed, not what collectd reports.

I also think that this change would be constantly spamming collectd log with warnings. All logging that could happen constantly, should be rate-limited, as filling disk can have much more dire consequences (lost data, even boot fails).

zsfz and others added 2 commits August 28, 2023 18:02
Signed-off-by: sujankhadka <sujankhadka@didiglobal.com>
Signed-off-by: tiozhang <zyhtheonly@yeah.net>
Signed-off-by: luffysong <zsfz_one@163.com>

Co-authored-by: tiozhang <zyhtheonly@yeah.net>
@zsfz
Copy link
Contributor Author

zsfz commented Aug 28, 2023

I replaced WARNING with DEBUG. How about that?

@eero-t
Copy link
Contributor

eero-t commented Aug 28, 2023

I replaced WARNING with DEBUG. How about that?

The PR itself looks fine now, but I just wonder in what case you see so many zombies that it has become an issue...

What process it is, what is its parent, and why that is not reaping its zombie children?

(Feels like I'm asking character motivations a horror movie...)

@zsfz
Copy link
Contributor Author

zsfz commented Aug 28, 2023

The specific issue I'm facing is unrelated to zombie processes. My main concern is that I have configured a significant number of processes to monitor in the configuration file (without certainty of their runtime). However, even though certain processes specified in the configuration file have not been run since the start of the collectd service until its stop, the 'processes' plugin code still creates 'procstat' structures for these unexecuted processes and submits meaningless metrics with values of 0.
This is unnecessary and causes wasteful network/disk activity.

@eero-t
Copy link
Contributor

eero-t commented Aug 28, 2023

The specific issue I'm facing is unrelated to zombie processes. My main concern is that I have configured a significant number of processes to monitor in the configuration file (without certainty of their runtime). However, even though certain processes specified in the configuration file have not been run since the start of the collectd service until its stop, the 'processes' plugin code still creates 'procstat' structures for these unexecuted processes and submits meaningless metrics with values of 0. This is unnecessary and causes wasteful network/disk activity.

Your check returns if process has no threads. AFAIK that means that they're zombies (exited, but not reaped by their parent). See e.g. the plugin source you're modifying: https://github.com/collectd/collectd/blob/main/src/processes.c#L1390

What state they are in, if not zombies (GNU ps column 6 is NLWP)?
ps -Lf x | awk '{if ($6 == 0) print}'

(And if they really are not zombies, on which OS that is?)

@zsfz
Copy link
Contributor Author

zsfz commented Aug 28, 2023

The processes with 'num_proc = 0' include both zombie processes and processes that have never been executed. My pull request skips submitting metrics for these two types of processes. Check it out here: https://github.com/collectd/collectd/blob/main/src/processes.c#L613
I acknowledge that this pull request might be somewhat imprecise. Perhaps there's a better approach to handling these two scenarios separately. The issue I've encountered is that a process, which should be monitored, has never been executed, yet it still submits meaningless data with values of 0.

@eero-t
Copy link
Contributor

eero-t commented Aug 28, 2023

Ah, this was about processes that do not exist at all, not just ones that are not currently running.

Now it makes more sense, I did not realize plugin to report metrics although there were no matching processes (I've never used it myself). In some situation that behavior is desirable, in others not, so it's good to have an option for it.

@zsfz
Copy link
Contributor Author

zsfz commented Aug 28, 2023

Ah, this was about processes that do not exist at all, not just ones that are not currently running.

Now it makes more sense, I did not realize plugin to report metrics although there were no matching processes (I've never used it myself). In some situation that behavior is desirable, in others not, so it's good to have an option for it.

Yeah. For those processes that do not exist at all, the processes plugin resets all metrics to 0 and submits them. This leads to unnecessary disk and network wastage.

@eero-t
Copy link
Contributor

eero-t commented Sep 20, 2023

Your new commit adds NVML / CUDA calls. That stuff does not belong to this PR, please file a separate PR for it.

@zsfz
Copy link
Contributor Author

zsfz commented Sep 20, 2023

Your new commit adds NVML / CUDA calls. That stuff does not belong to this PR, please file a separate PR for it.

Oh sorry, I forgot about this PR. I realize that I didn't switch to the correct branch and it has been a while since the pull request was opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants