Callback duration metric #439

benbierens · 2023-08-08T08:15:49Z

All right... this doesn't currently work and I need some help.

Here's the goal:
As a chronos user, I want to be able to measure callbacks durations (when using debug builds), so that I can be informed about long callbacks that will cause my app to block/not respond.

I've added a compile-flag and some code to asyncloop.nim to try and get this going, only to realize that chronos doesn't use nim-metrics. (And can't because nim-metrics uses chronos?) So I'm writing this message to ask help from you, experienced chronos developers! Do you see a use for a feature of this sort? How can this best be accomplished?

arnetheduck · 2023-08-09T14:32:16Z

importing metrics here would create a circular dep that would need to be broken - but other than that, metrics aren't really appropriate for the task: async calls generally get broken up into multiple callbacks while gauges in metrics only record the latest value, and only the time it takes - if you have one request taking 1s and a 1ms request just after, the gauge will say 1ms and you will find out nothing - or it will say 1s but you won't know which callback it was because chronos typically processes thousands of them, specially when building up combined future flows.

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

dryajov · 2023-08-09T19:33:50Z

Agree, metrics aren't the right way to measure this. I like @arnetheduck idea about pre/post callbacks, makes sense.

elcritch · 2023-08-10T07:51:44Z

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

I was thinking about that. It seems like it'd be possible to allow generic pre/post calls. It's not quite clear if it'd be more useful to track all callbacks or Future's or closure-iterators.

I took a stab at it using a more concrete timing for each closure iterator for each async proc: #442

That approach doesn't allow for getting stacktraces, or only recording long running ones.

cheatfate · 2023-08-10T11:59:27Z

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

If there no re-entrant calls to poll callstack will be always the same and looks like this:

NimMain
...
poll()
callback()

And ... is some starting code your binary executes before it starts executing poll() in loop.

Calculating callbacks duration is pretty uninformative and intrusive for chronos scheduler. If you {.async.} procedure is slow - it is slow, if you need to get time of execution of this procedure - establish time measurement around this procedure. Metrics you trying to meter will help you measure parts of your procedures. For example:

proc someFoo() {.async.} =
  # Callback 1 measurement start
  ...
  # Callback 1 measurement finish
  await someOtherFooCall()
  # Callback 2 measurement start
  ...
  # Callback 2 measurement finish
  await someNewFooCall()
  # Callback 3 measurement start
  ...
  # Callback 3 measurement finish

So you got 3 measurement of code between calls to await. Is it what you need? I dont see any reasons why chronos should have facility to make time measurements of this places, and this become even more confusing if you have something like that:

proc someFoo() {.async.} =
  for item in iter():
    # Callback time measurement ends
    await someNewFooCall()
    # Callback time measurement starts

arnetheduck · 2023-08-10T12:36:53Z

I took a stab at it using a more concrete timing for each closure iterator for each async proc: #442

I think this is being approached from the wrong end: ie I'd recommend first building a useful end-to-end solution for something concrete - ie something that provides value - if, in the end, it so happens that there exists a feature of sufficient genericity that it indeed would be easier to create a generic / shared solution in chronos, we can look at it from the perspective of a specific useful thing - like this, we're just throwing mud at a wall to see if something will stick and what shape it'll have.

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

dryajov · 2023-08-10T15:50:27Z

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

I think your original proposal with pre/post hooks makes more sense. It allows building this types of features on top of chronos without bundling them with chronos unless they prove useful...

Of course, even the simple callback approach has to be carefully thought out - do we want it as a generic future that or only something that sits behind a flag and only allowed in debug builds?

elcritch · 2023-08-10T21:36:10Z

So you got 3 measurement of code between calls to await. Is it what you need? I dont see any reasons why chronos should have facility to make time measurements of this places, and this become even more confusing if you have something like that:

That's a useful question, and one I'm trying to understand.

Measuring the non-async sections can be useful since these sections are often the parts that disrupt overall tail latencies in a project. It's often useful to tracking down incorrect usage of blocking calls (disk IO, compute, blocking network calls, unbroken for-loops, etc).

Meanwhile, the time of the async portions can be somewhat gauged by proxy using total request time.

If I understand it, the callback approach described above wouldn't provide the granularity to measure the actual time it takes to execute the async proc time, only the total duration.

elcritch · 2023-08-10T21:57:04Z

I think this is being approached from the wrong end: ie I'd recommend first building a useful end-to-end solution for something concrete - ie something that provides value -

I disagree; this is a useful concrete solution to measure timings which would otherwise not be readily feasible to us. Also, the feedback from @cheatfate was helpful.

if, in the end, it so happens that there exists a feature of sufficient genericity that it indeed would be easier to create a generic / shared solution in chronos, we can look at it from the perspective of a specific useful thing - like this, we're just throwing mud at a wall to see if something will stick and what shape it'll have.

I'm onboard with creating a more generic solution if possible, but wanted to spitball something concrete and see how well it worked. These sort of timings could provide our project value, and possibly others, since non-async portions can easily cause issues.

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

Isn't that what a draft PR is for? ;) Still that was my first approach, but I don't have access to create a branch and things get lost on a fork. I am also new to Chronos as a project and unsure the best way to get useful feedback. A draft PR gives a chance for a broader audience to chime in.

elcritch · 2023-08-11T03:18:37Z

I took my PR from yesterday and used it to make a more generic solution based on callbacks in FutureBase. It's not ready for upstreaming so I left it on my fork, but if you want to see it: elcritch#2

I'm not 100% sure if it's where you were thinking of adding the callbacks @arnetheduck. Future's made the most sense for a generic API as it could cover both iterators and callbacks. It'd need more testing to see how useful it'd be.

Note, I also needed a third callback onFuturePause to enable collecting execution time and not just the total elapsed time from once the Future starts. As @cheatfate mentioned that wouldn't include "child futures". That might be possible by combining it with the FutureTracking feature. All the data should in theory be able to be collected downstream.

I'd say this PR could probably be closed?

benbierens added 2 commits August 8, 2023 09:20

Sets up the callback duration metric

e07f0dc

Applies chronos conventions

ac2fabe

elcritch mentioned this pull request Aug 10, 2023

Per callback duration stats #442

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Callback duration metric #439

Callback duration metric #439

benbierens commented Aug 8, 2023 •

edited

arnetheduck commented Aug 9, 2023

dryajov commented Aug 9, 2023

elcritch commented Aug 10, 2023

cheatfate commented Aug 10, 2023

arnetheduck commented Aug 10, 2023 •

edited

dryajov commented Aug 10, 2023

elcritch commented Aug 10, 2023

elcritch commented Aug 10, 2023

elcritch commented Aug 11, 2023 •

edited

Callback duration metric #439

Are you sure you want to change the base?

Callback duration metric #439

Conversation

benbierens commented Aug 8, 2023 • edited

arnetheduck commented Aug 9, 2023

dryajov commented Aug 9, 2023

elcritch commented Aug 10, 2023

cheatfate commented Aug 10, 2023

arnetheduck commented Aug 10, 2023 • edited

dryajov commented Aug 10, 2023

elcritch commented Aug 10, 2023

elcritch commented Aug 10, 2023

elcritch commented Aug 11, 2023 • edited

benbierens commented Aug 8, 2023 •

edited

arnetheduck commented Aug 10, 2023 •

edited

elcritch commented Aug 11, 2023 •

edited