
Reduce risk of running fgprof in production #12

Open
felixge opened this issue Dec 30, 2020 · 5 comments


felixge commented Dec 30, 2020

The goal for this issue is to figure out how to reduce the risk of running fgprof in production. People should feel safe turning it on, knowing that in the worst case it will just bail out or reduce the amount of data being gathered rather than having a negative impact on the application.

There has been some initial discussion here: #11 (comment)


felixge commented Dec 30, 2020

In reply to @sumerc's comment here: #11 (comment)

> I was too busy lately and could not find time to look into or think about this too much.

No worries, me too 🙈.

> But I thought about the problem a bit more, and I agree that your approach of trying to bound profiler wall time to 1% seems both simple to implement and powerful enough, like you suggested.

> A couple of issues came to my mind. Maybe a bit of overthinking for the current phase, but I'm curious about your thoughts on these:

> While I have not tested it yet: I assume that when STW occurs, it might have some adverse effects on the system, as it will hurt the cache locality of some processors.

Yes, I think that's a valid concern. But it's probably an issue for later. There is plenty of low-hanging fruit that can be picked while completely ignoring this issue.

> One test that we could do either manually or automatically is to have a large number of CPU-intensive goroutines executing and do some work with them. Then we measure that work both with and without the profiler and see whether the output changes by about 1% or so. It would be nice to test this, as there might be other kinds of adverse effects as well.

Yeah, that'd be cool! But I suspect we can get away with a simpler approach for the first iteration.
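
For later reference, a rough sketch of what such a comparison could look like (everything below is illustrative and made up for this comment, not fgprof's actual test code): run a fixed number of CPU-bound goroutines for a fixed wall time, once with a background goroutine sampling stacks at ~99Hz the way a wallclock profiler would, and once without, then compare how much work got done.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// measure runs `workers` CPU-bound goroutines for duration d and returns the
// total number of loop iterations they completed. If sample is true, a
// background goroutine calls runtime.GoroutineProfile ~99 times per second,
// roughly mimicking what a wallclock profiler does.
func measure(d time.Duration, workers int, sample bool) uint64 {
	stopSampler := func() {}
	if sample {
		quit := make(chan struct{})
		go func() {
			stacks := make([]runtime.StackRecord, 1024)
			ticker := time.NewTicker(time.Second / 99)
			defer ticker.Stop()
			for {
				select {
				case <-quit:
					return
				case <-ticker.C:
					runtime.GoroutineProfile(stacks)
				}
			}
		}()
		stopSampler = func() { close(quit) }
	}

	var total uint64
	var wg sync.WaitGroup
	deadline := time.Now().Add(d)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var n uint64
			for time.Now().Before(deadline) {
				n++ // stand-in for real CPU work
			}
			atomic.AddUint64(&total, n)
		}()
	}
	wg.Wait()
	stopSampler()
	return total
}

func main() {
	without := measure(2*time.Second, 100, false)
	with := measure(2*time.Second, 100, true)
	fmt.Printf("without profiler: %d iterations, with profiler: %d iterations (%.2f%% less work)\n",
		without, with, 100*(1-float64(with)/float64(without)))
}
```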

> I think this is a more major problem: what happens if GoroutineProfile itself actually takes more than 1%? Sure, we could simply skip some samples on the next iterations, but that might actually hurt accuracy. What if it took 30ms? 50ms? 100ms? Those numbers are just (maybe?) too big for the application. Again, I have not tested it myself, but I think this is not too uncommon if you have many goroutines: golang/go#33250. One idea might be to estimate how long a GoroutineProfile call will take based on previous data, but again, I don't know what we would do even if that is the case. Another idea: get some help from the runtime, e.g. what if GoroutineProfile returned some data even if len(p) <= n? Even when we don't have all the goroutine data, at least we would have something (maybe some kind of randomness could be applied in the runtime to return different goroutines every time?). So instead of returning nothing, we would at least have something to profile. Otherwise it is completely binary, right? Any idea on this?

IMO this is a strong argument for stopping the profiling with an error if we detect that it's not going well. I'd rather produce no data than risk having a very negative impact on the application and/or producing useless data.

One potential approach: we accumulate the time fgprof spends in GoroutineProfile() and exit the processing as soon as we exceed the "budget". So if we have a very slow GoroutineProfile() call, it might cause fgprof to stop with an error right away, only minimally impacting the application.

If we're worried that even a single GoroutineProfile() call could have too much of an impact, we could try to use runtime.NumGoroutine() (which should be much faster) to estimate the cost beforehand. But initially we could try without this complication.
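
To make the budget idea concrete, here's a minimal sketch (the names, the structure, and the 1% constant are just illustrative, not fgprof's actual internals). It accumulates the time spent inside runtime.GoroutineProfile and stops with an error once that exceeds the budget; the comment marks where the optional runtime.NumGoroutine()-based pre-estimate could slot in later.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

var errOverBudget = errors.New("profiling overhead exceeded budget, stopping")

// profileWithBudget samples goroutine stacks hz times per second for duration d.
// It accumulates the time spent inside runtime.GoroutineProfile and returns an
// error as soon as that exceeds maxOverhead (e.g. 0.01 for 1%) of the elapsed
// wall time, so a pathologically slow GoroutineProfile call stops the profiler
// rather than slowing down the application.
func profileWithBudget(d time.Duration, hz int, maxOverhead float64) error {
	stacks := make([]runtime.StackRecord, 128)
	start := time.Now()
	ticker := time.NewTicker(time.Second / time.Duration(hz))
	defer ticker.Stop()

	var spent time.Duration
	for now := range ticker.C {
		if now.Sub(start) >= d {
			return nil // profiling window is over
		}

		// Optional refinement discussed above: estimate the cost of the next
		// call from runtime.NumGoroutine() and a calibrated per-goroutine
		// cost, and bail out before making an expensive call at all.

		t0 := time.Now()
		n, ok := runtime.GoroutineProfile(stacks)
		spent += time.Since(t0)
		if !ok {
			stacks = make([]runtime.StackRecord, n*2) // buffer too small: grow, retry next tick
			continue
		}
		_ = stacks[:n] // ... aggregate the sampled stacks into the profile here ...

		if float64(spent) > maxOverhead*float64(time.Since(start)) {
			return errOverBudget
		}
	}
	return nil
}

func main() {
	fmt.Println(profileWithBudget(time.Second, 99, 0.01))
}
```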

Ah one last thing:

> While I was re-reading your previous comments, I could not understand this one:

> > Of course there is plenty of room to get fancy. E.g. using exponentially weighted averages for tracking the overhead or PID controllers for adapting the frequency.

> Can you elaborate a bit on what this means?

What I meant is that instead of just trying to stay below our "profiling time budget", we could try to use as much of it as possible, i.e. dynamically increase our sampling frequency if we notice we could do more samples per second. PID controllers are a good way to adjust an input variable (in our case the sampling frequency) in response to a measured process variable (in our case "STW duration per second") to keep it near a given setpoint (in our case our STW duration per second budget).

Does that make sense?

No worries if not. PIDs might be complete overkill for this; we'd probably only use the proportional term, which is trivial and doesn't really deserve a fancy description like "PID controller" :p.
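
For what it's worth, a tiny sketch of the proportional-only version (all constants and names are invented for the example): measure what fraction of wall time the last window spent inside GoroutineProfile, compare it to the target overhead, and nudge the sampling frequency up or down accordingly.

```go
package main

import (
	"fmt"
	"math"
)

const (
	targetOverhead = 0.01 // aim to spend ~1% of wall time inside GoroutineProfile
	gain           = 0.5  // proportional gain: how aggressively to react to the error
	minHz, maxHz   = 1.0, 1000.0
)

// nextHz returns an adjusted sampling frequency given the current frequency and
// the overhead measured over the last window (fraction of wall time spent
// profiling). Under budget the frequency goes up, over budget it goes down;
// the gain controls how quickly it converges.
func nextHz(currentHz, measuredOverhead float64) float64 {
	err := (targetOverhead - measuredOverhead) / targetOverhead // relative error
	hz := currentHz * (1 + gain*err)
	return math.Max(minHz, math.Min(maxHz, hz))
}

func main() {
	hz := 99.0
	// Pretend every GoroutineProfile call costs 200µs, so overhead is hz * 0.0002.
	for i := 0; i < 5; i++ {
		overhead := hz * 0.0002
		hz = nextHz(hz, overhead)
		fmt.Printf("window %d: overhead=%.2f%% -> next hz=%.1f\n", i, overhead*100, hz)
	}
}
```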


sumerc commented Jan 2, 2021

> IMO this is a strong argument for stopping the profiling with an error if we detect that it's not going well.

Agreed. My concern was how to detect situations like these. I also think that stopping the profiler might well be the best option in these situations.

> If we're worried that even a single GoroutineProfile() call could have too much of an impact, we could try to use runtime.NumGoroutine() (which should be much faster) to estimate the cost beforehand.

Yes, this came to my mind as well. Agreed that it is still too early, like you suggested. BTW, since the operations done in GoroutineProfile are usually the same, I assume we could get very accurate estimations. Anyway...

> Does that make sense?

Yes. When you said PID, I thought we would be reading some external process-related counters or something like that :) :)

Anyway, I will try working on this whenever I find time; I don't expect it to be too complicated AFAIU. However, the tricky part is finding the time :). Is there any related work or issue on your end waiting on this item? Knowing beforehand would help me better utilize my time.


felixge commented Jan 2, 2021

> Anyway, I will try working on this whenever I find time; I don't expect it to be too complicated AFAIU. However, the tricky part is finding the time :). Is there any related work or issue on your end waiting on this item? Knowing beforehand would help me better utilize my time.

Sounds good!

About related work: I'm starting a new job on Monday (joining Datadog's profiler team), and there is a decent chance I'll work on related stuff a lot going forward.

The details are still fuzzy, but one thing I was considering was to propose a patch to the Go core team that would allow GoroutineProfile() to take a pprof.LabelSet argument and only return the goroutines that match. This might be a good way to do wallclock profiling for programs with a large number of goroutines.
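
Just to illustrate the shape of the idea (no such API exists in the Go runtime today, and any real proposal could end up looking very different): today labels can be attached to goroutines via runtime/pprof, but GoroutineProfile can't filter on them.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Today you can already attach pprof labels to the work a goroutine does:
	pprof.Do(context.Background(), pprof.Labels("endpoint", "/slow"), func(ctx context.Context) {
		// ... handler work ...
	})

	// ...but runtime.GoroutineProfile has no way to filter on those labels: it
	// either fills p with every goroutine's stack, or reports ok=false (plus
	// the required size) when the buffer is too small. All or nothing.
	stacks := make([]runtime.StackRecord, 64)
	n, ok := runtime.GoroutineProfile(stacks)
	fmt.Println(n, ok)

	// The patched variant hinted at above might look roughly like this
	// (hypothetical, not real Go):
	//
	//	n, ok := runtime.GoroutineProfile(stacks, pprof.Labels("endpoint", "/slow"))
}
```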

Anyway, not sure yet whether I'll get a chance to work on this; I'll keep you posted!


sumerc commented Jan 2, 2021

> About related work: I'm starting a new job on Monday (joining Datadog's profiler team), and there is a decent chance I'll work on related stuff a lot going forward.

Congrats! That is awesome :)

> The details are still fuzzy, but one thing I was considering was to propose a patch to the Go core team that would allow GoroutineProfile() to take a pprof.LabelSet argument and only return the goroutines that match. This might be a good way to do wallclock profiling for programs with a large number of goroutines.

Yes, it is a good idea. In any case, GoroutineProfile seems to be in need of some kind of refactor for situations like these. The current behavior of returning all or nothing is a bit too strict.


felixge commented Jan 4, 2021

> Yes, it is a good idea. In any case, GoroutineProfile seems to be in need of some kind of refactor for situations like these. The current behavior of returning all or nothing is a bit too strict.

Yes. The API change will probably be the most controversial aspect of proposing this to the Go core team. But let's hope they see the value in wallclock profiling.
