
Reduce risk of running fgprof in production #12

Open
felixge opened this issue Dec 30, 2020 · 5 comments


felixge commented Dec 30, 2020

The goal for this issue is to figure out how to reduce the risk of running fgprof in production. People should feel safe turning it on, knowing that in the worst case it will just bail out or reduce the amount of data being gathered rather than having a negative impact on the application.

There has been some initial discussion here: #11 (comment)


felixge commented Dec 30, 2020

In reply to @sumerc's comment here: #11 (comment)

> I was too busy lately and could not find time to look into or think about this too much.

No worries, me too 🙈.

> But I thought about the problem a bit more, and I agree that your approach of trying to bound profiler wall time to 1% seems both simple to implement and powerful enough, like you suggested.

> A couple of issues came to my mind. Maybe a bit of overthinking for the current phase, but I'm curious about your thoughts on these:

> While I have not tested it yet: I assume that when STW occurs, it might have some adverse effects on the system, as it will hurt the cache locality of some processors.

Yes, I think that's a valid concern. But it's probably an issue for later. There is plenty of low-hanging fruit that can be picked while completely ignoring this issue.

> One test that we could do either manually or automatically is to have a large number of CPU-intensive goroutines executing and do some work with them. Then we measure that work both with and without the profiler and see whether the output changes by about 1% or so. It would be nice to test this, as there might be other kinds of adverse effects as well.

Yeah, that'd be cool! But I suspect we can get away with a simpler approach for the first iteration.
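
For later reference, a rough sketch of what such a comparison could look like (everything below is illustrative and made up for this comment, not fgprof's actual test code): run a fixed number of CPU-bound goroutines for a fixed wall time, once with a background goroutine sampling stacks at ~99Hz the way a wallclock profiler would, and once without, then compare how much work got done.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// measure runs `workers` CPU-bound goroutines for duration d and returns the
// total number of loop iterations they completed. If sample is true, a
// background goroutine calls runtime.GoroutineProfile ~99 times per second,
// roughly mimicking what a wallclock profiler does.
func measure(d time.Duration, workers int, sample bool) uint64 {
	stopSampler := func() {}
	if sample {
		quit := make(chan struct{})
		go func() {
			stacks := make([]runtime.StackRecord, 1024)
			ticker := time.NewTicker(time.Second / 99)
			defer ticker.Stop()
			for {
				select {
				case <-quit:
					return
				case <-ticker.C:
					runtime.GoroutineProfile(stacks)
				}
			}
		}()
		stopSampler = func() { close(quit) }
	}

	var total uint64
	var wg sync.WaitGroup
	deadline := time.Now().Add(d)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var n uint64
			for time.Now().Before(deadline) {
				n++ // stand-in for real CPU work
			}
			atomic.AddUint64(&total, n)
		}()
	}
	wg.Wait()
	stopSampler()
	return total
}

func main() {
	without := measure(2*time.Second, 100, false)
	with := measure(2*time.Second, 100, true)
	fmt.Printf("without profiler: %d iterations, with profiler: %d iterations (%.2f%% less work)\n",
		without, with, 100*(1-float64(with)/float64(without)))
}
```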

> I think this is a more major problem: what happens if GoroutineProfile itself actually takes more than 1%? Sure, we could simply skip some samples on the next iterations, but that might actually hurt accuracy. What if it took 30ms? 50ms? 100ms? Those numbers are just (maybe?) too big for the application. Again, I have not tested it myself, but I think this is not too uncommon if you have many goroutines: golang/go#33250. One idea might be to estimate how long a GoroutineProfile call will take based on previous data, but again, I don't know what we would do even if that is the case. Another idea: get some help from the runtime, e.g. what if GoroutineProfile returned some data even if len(p) <= n? Even when we don't have all the goroutine data, at least we would have something (maybe some kind of randomness could be applied in the runtime to return different goroutines every time?). So instead of returning nothing, we would at least have something to profile. Otherwise it is completely binary, right? Any idea on this?

IMO this is a strong argument for stopping the profiling with an error if we detect that it's not going well. I'd rather produce no data than risk having a very negative impact on the application and/or producing useless data.

One potential approach: we accumulate the time fgprof spends in GoroutineProfile() and exit the processing as soon as we exceed the "budget". So if we have a very slow GoroutineProfile() call, it might cause fgprof to stop with an error right away, only minimally impacting the application.

If we're worried that even a single GoroutineProfile() call could have too much of an impact, we could try to use runtime.NumGoroutine() (which should be much faster) to estimate the cost beforehand. But initially we could try without this complication.
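
To make the budget idea concrete, here's a minimal sketch (the names, the structure, and the 1% constant are just illustrative, not fgprof's actual internals). It accumulates the time spent inside runtime.GoroutineProfile and stops with an error once that exceeds the budget; the comment marks where the optional runtime.NumGoroutine()-based pre-estimate could slot in later.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

var errOverBudget = errors.New("profiling overhead exceeded budget, stopping")

// profileWithBudget samples goroutine stacks hz times per second for duration d.
// It accumulates the time spent inside runtime.GoroutineProfile and returns an
// error as soon as that exceeds maxOverhead (e.g. 0.01 for 1%) of the elapsed
// wall time, so a pathologically slow GoroutineProfile call stops the profiler
// rather than slowing down the application.
func profileWithBudget(d time.Duration, hz int, maxOverhead float64) error {
	stacks := make([]runtime.StackRecord, 128)
	start := time.Now()
	ticker := time.NewTicker(time.Second / time.Duration(hz))
	defer ticker.Stop()

	var spent time.Duration
	for now := range ticker.C {
		if now.Sub(start) >= d {
			return nil // profiling window is over
		}

		// Optional refinement discussed above: estimate the cost of the next
		// call from runtime.NumGoroutine() and a calibrated per-goroutine
		// cost, and bail out before making an expensive call at all.

		t0 := time.Now()
		n, ok := runtime.GoroutineProfile(stacks)
		spent += time.Since(t0)
		if !ok {
			stacks = make([]runtime.StackRecord, n*2) // buffer too small: grow, retry next tick
			continue
		}
		_ = stacks[:n] // ... aggregate the sampled stacks into the profile here ...

		if float64(spent) > maxOverhead*float64(time.Since(start)) {
			return errOverBudget
		}
	}
	return nil
}

func main() {
	fmt.Println(profileWithBudget(time.Second, 99, 0.01))
}
```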

Ah one last thing:

> While I was re-reading your previous comments, I could not understand this one:

> > Of course there is plenty of room to get fancy. E.g. using exponentially weighted averages for tracking the overhead or PID controllers for adapting the frequency.

> Can you elaborate a bit on what this means?

What I meant is that instead of just trying to stay below our "profiling time budget", we could try to use as much of it as possible, i.e. dynamically increase our sampling frequency if we notice we could do more samples per second. PID controllers are a good way to adjust an input variable (in our case the sampling frequency) in response to a measured process variable (in our case "STW duration per second") to keep it near a given setpoint (in our case our STW duration per second budget).

Does that make sense?

No worries if not. PIDs might be complete overkill for this; we'd probably only use the proportional term, which is trivial and doesn't really deserve a fancy description like "PID controller" :p.
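
For what it's worth, a tiny sketch of the proportional-only version (all constants and names are invented for the example): measure what fraction of wall time the last window spent inside GoroutineProfile, compare it to the target overhead, and nudge the sampling frequency up or down accordingly.

```go
package main

import (
	"fmt"
	"math"
)

const (
	targetOverhead = 0.01 // aim to spend ~1% of wall time inside GoroutineProfile
	gain           = 0.5  // proportional gain: how aggressively to react to the error
	minHz, maxHz   = 1.0, 1000.0
)

// nextHz returns an adjusted sampling frequency given the current frequency and
// the overhead measured over the last window (fraction of wall time spent
// profiling). Under budget the frequency goes up, over budget it goes down;
// the gain controls how quickly it converges.
func nextHz(currentHz, measuredOverhead float64) float64 {
	err := (targetOverhead - measuredOverhead) / targetOverhead // relative error
	hz := currentHz * (1 + gain*err)
	return math.Max(minHz, math.Min(maxHz, hz))
}

func main() {
	hz := 99.0
	// Pretend every GoroutineProfile call costs 200µs, so overhead is hz * 0.0002.
	for i := 0; i < 5; i++ {
		overhead := hz * 0.0002
		hz = nextHz(hz, overhead)
		fmt.Printf("window %d: overhead=%.2f%% -> next hz=%.1f\n", i, overhead*100, hz)
	}
}
```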


sumerc commented Jan 2, 2021

> IMO this is a strong argument for stopping the profiling with an error if we detect that it's not going well.

Agreed. My concern was how to detect situations like these. I also think that stopping the profiler might well be the best option in these situations.

> If we're worried that even a single GoroutineProfile() call could have too much of an impact, we could try to use runtime.NumGoroutine() (which should be much faster) to estimate the cost beforehand.

Yes, this came to my mind as well. Agreed that it is still too early, like you suggested. BTW, since the operations done in GoroutineProfile are usually the same, I assume we could get very accurate estimations. Anyway...

> Does that make sense?

Yes. When you said PID, I thought we would be reading some external process-related counters or something like that :) :)

Anyway, I will try working on this whenever I find time; I don't expect it to be too complicated AFAIU. However, the tricky part is finding the time :). Is there any related work or issue on your end waiting on this item? Knowing beforehand would help me better utilize my time.


felixge commented Jan 2, 2021

> Anyway, I will try working on this whenever I find time; I don't expect it to be too complicated AFAIU. However, the tricky part is finding the time :). Is there any related work or issue on your end waiting on this item? Knowing beforehand would help me better utilize my time.

Sounds good!

About related work: I'm starting a new job on Monday (joining Datadog's profiler team), and there is a decent chance I'll work on related stuff a lot going forward.

The details are still fuzzy, but one thing I was considering was to propose a patch to the Go core team that would allow GoroutineProfile() to take a pprof.LabelSet argument and only return the goroutines that match. This might be a good way to do wallclock profiling for programs with a large number of goroutines.
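
Just to illustrate the shape of the idea (no such API exists in the Go runtime today, and any real proposal could end up looking very different): today labels can be attached to goroutines via runtime/pprof, but GoroutineProfile can't filter on them.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Today you can already attach pprof labels to the work a goroutine does:
	pprof.Do(context.Background(), pprof.Labels("endpoint", "/slow"), func(ctx context.Context) {
		// ... handler work ...
	})

	// ...but runtime.GoroutineProfile has no way to filter on those labels: it
	// either fills p with every goroutine's stack, or reports ok=false (plus
	// the required size) when the buffer is too small. All or nothing.
	stacks := make([]runtime.StackRecord, 64)
	n, ok := runtime.GoroutineProfile(stacks)
	fmt.Println(n, ok)

	// The patched variant hinted at above might look roughly like this
	// (hypothetical, not real Go):
	//
	//	n, ok := runtime.GoroutineProfile(stacks, pprof.Labels("endpoint", "/slow"))
}
```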

Anyway, not sure yet whether I'll get a chance to work on this; I'll keep you posted!


sumerc commented Jan 2, 2021

> About related work: I'm starting a new job on Monday (joining Datadog's profiler team), and there is a decent chance I'll work on related stuff a lot going forward.

Congrats! That is awesome :)

> The details are still fuzzy, but one thing I was considering was to propose a patch to the Go core team that would allow GoroutineProfile() to take a pprof.LabelSet argument and only return the goroutines that match. This might be a good way to do wallclock profiling for programs with a large number of goroutines.

Yes, it is a good idea. In any case, GoroutineProfile seems to be in need of some kind of refactor for situations like these. The current behavior of returning all or nothing is a bit too strict.


felixge commented Jan 4, 2021

> Yes, it is a good idea. In any case, GoroutineProfile seems to be in need of some kind of refactor for situations like these. The current behavior of returning all or nothing is a bit too strict.

Yes. The API change will probably be the most controversial aspect of proposing this to the Go core team. But let's hope they see the value in wallclock profiling.
