
feat(metrics): add pipeline average time metrics #3845

Closed

Conversation

NDStrahilevitz (Collaborator)

1. Explain what the PR does

4aed8fc feat(metrics): add pipeline average time metrics

    Add two Prometheus gauges measuring the following metrics (a rough sketch of the wiring appears after this description):
    1. Average time spent from kernel to decoding
    2. Average time spent from kernel to publishing

2. Explain how to test it

  1. Run tracee with the --metrics flag
  2. Open localhost:3366/metrics
  3. Check the pipeline metrics

3. Other comments

Resolves #3844
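For readers unfamiliar with the approach, here is a minimal, self-contained sketch of the wiring described above: a running-average counter exposed through prometheus.NewGaugeFunc. The avgCounter type, the metric name, and the sample values are illustrative stand-ins, not tracee's actual counter.Average implementation.

package main

import (
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// avgCounter is an illustrative stand-in for tracee's counter.Average:
// it keeps a running sum and count and reports their ratio.
type avgCounter struct {
	mu    sync.Mutex
	sum   uint64
	count uint64
}

func (a *avgCounter) Add(v uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.sum += v
	a.count++
}

// Average is read lazily by Prometheus on each scrape via GaugeFunc below.
func (a *avgCounter) Average() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 {
		return 0
	}
	return float64(a.sum) / float64(a.count)
}

func main() {
	avgTimeInPipeline := &avgCounter{}

	// Expose the running average as a gauge.
	prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Namespace: "tracee_ebpf",
		Name:      "avg_time_in_pipeline_ns", // illustrative metric name
		Help:      "Average time an event spends from kernel to publishing (ns)",
	}, avgTimeInPipeline.Average))

	// Fake samples; in the PR these come from per-event timestamp deltas.
	avgTimeInPipeline.Add(1200)
	avgTimeInPipeline.Add(1800)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":3366", nil)
}

With this sketch running, scraping localhost:3366/metrics as in the test steps above would show the (illustrative) tracee_ebpf_avg_time_in_pipeline_ns gauge.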

Resolved review threads: pkg/utils/time.go, pkg/ebpf/events_pipeline.go (outdated), pkg/metrics/stats.go
geyslan (Member) commented Feb 21, 2024

We have this update: #3875

Please rebase your PR against main to make use of the new workflow setup.

geyslan (Member) commented Apr 3, 2024

@NDStrahilevitz I'm not able to run the GitHub Actions on this. If it's ready for review, could you rebase it again? Thanks.

if err := ebpfMsgDecoder.DecodeContext(&eCtx); err != nil {
	t.handleError(err)
	continue
}
// eCtx.Ts is the timestamp the kernel recorded when it submitted the event;
// endTimeKernel was captured in userspace before decoding began.
startTimeKernel := eCtx.Ts
_ = t.stats.AvgTimeInKernel.Add(endTimeKernel - startTimeKernel)
Collaborator:
This value is actually the time in the kernel + submit time + the time it took tracee to read the buffer, isn't it?

Collaborator (Author):
It would be kernel time + submit time + time blocked in the channel (is that what you meant by read time?). Note that the endpoint timestamp is taken before we decode the buffer.
This time blocked in the channel is actually critical and I hadn't considered it; this measurement should be rethought.
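To make the point concrete, here is a small, self-contained toy (not tracee code) showing that when the end timestamp is taken at the moment the raw event is read from the channel, the measured delta absorbs any time the event spent queued, on top of kernel and submit time.

package main

import (
	"fmt"
	"time"
)

func main() {
	// ts plays the role of eCtx.Ts: a timestamp set when the event is "submitted".
	type rawEvent struct{ ts time.Time }

	events := make(chan rawEvent, 16)

	// Producer: stands in for the kernel side submitting an event.
	go func() {
		events <- rawEvent{ts: time.Now()}
	}()

	// Consumer is busy elsewhere, so the event sits in the channel for ~50ms.
	time.Sleep(50 * time.Millisecond)

	e := <-events
	end := time.Now() // endpoint taken before any decoding, as in the PR

	// The measured interval includes the ~50ms of channel wait,
	// not just the time spent in the kernel and the submit path.
	fmt.Printf("measured interval: %v\n", end.Sub(e.ts))
}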

t.streamsManager.Publish(ctx, *event)
_ = t.stats.EventCount.Increment()
// startTime is the event's original kernel timestamp, so this interval
// covers kernel + submit + the whole userspace pipeline up to publishing.
endTime := uint64(utils.GetMonotonicTime())
_ = t.stats.AvgTimeInPipeline.Add(endTime - startTime)
Collaborator:
Isn't this actually the time in the kernel+pipeline?
If so, consider renaming this to something else, e.g. AvgEventProcessingTime

Collaborator (Author):
Yep, this should be renamed. I think I originally included a subtraction of the former kernel time here, but it didn't work out. Anyway, that is why the original name was left over.

@@ -631,8 +634,10 @@ func (t *Tracee) sinkEvents(ctx context.Context, in <-chan *trace.Event) <-chan
case <-ctx.Done():
return
default:
startTime := uint64(t.getOrigEvtTimestamp(event)) // convert back to monotonic
Collaborator:
The usage of this getOrigEvtTimestamp is discouraged since future fixes to the timestamp normalization may cause it to break. Also see here: #3820 (comment)

Collaborator (Author):
I agree it's not ideal, but I'm not sure there's any better option until #3820 is resolved.

Comment on lines +21 to +22
AvgTimeInPipeline counter.Average
AvgTimeInKernel   counter.Average
Collaborator:
I think we should consider making these stats per-event type.
Different events have different behavior and processing time, and it would be much more informative to know about the average time of the different events.
WDYT?

Collaborator (Author):
That's how I originally wanted to do it, but I couldn't find a good way to represent it in Prometheus (ideally a histogram, but at the time I couldn't figure out how to implement it with their SDK). If you find it critical, this PR should probably be closed and reintroduced with that implementation in mind.
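For reference, a rough sketch of the per-event-type idea using the Prometheus Go client's HistogramVec; the metric name, label, and bucket layout here are made up for illustration and would need tuning against real event latencies.

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// eventProcessingTime records pipeline latency per event type as a histogram,
// rather than a single running average across all events.
var eventProcessingTime = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "tracee_ebpf",
		Name:      "event_processing_seconds", // illustrative name
		Help:      "Time an event spends in the userspace pipeline, by event type",
		Buckets:   prometheus.ExponentialBuckets(1e-6, 4, 10), // ~1µs to ~260ms
	},
	[]string{"event_type"},
)

func init() {
	prometheus.MustRegister(eventProcessingTime)
}

// observe would be called once per event as it leaves the pipeline.
func observe(eventType string, d time.Duration) {
	eventProcessingTime.WithLabelValues(eventType).Observe(d.Seconds())
}

This trades the simplicity of a single gauge for both the per-event-type breakdown and the latency distribution discussed here.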

Comment on lines +149 to +153
	if err != nil {
		return errfmt.WrapError(err)
	}

	return nil
Collaborator:
Suggested change: replace

	if err != nil {
		return errfmt.WrapError(err)
	}

	return nil

with

	return errfmt.WrapError(err)


err = prometheus.Register(prometheus.NewGaugeFunc(
	prometheus.GaugeOpts{
		Namespace: "tracee_ebpf",
Collaborator:
Not directly related to this PR, but we should consider renaming this namespace

NDStrahilevitz (Collaborator, Author) commented May 20, 2024

I'm thinking this PR should be redone at a later date, with a better measurement for kernel time, and split per event type into a histogram (@yanivagman, the time submission method you shared with me would work well for this).

Successfully merging this pull request may close these issues:
metrics: event time in userspace pipeline
3 participants