
feat(metrics): add pipeline average time metrics #3845

Closed

Conversation

NDStrahilevitz (Collaborator)

1. Explain what the PR does

4aed8fc feat(metrics): add pipeline average time metrics

    Add two Prometheus gauges measuring the following metrics (a rough sketch of the wiring appears after this description):
    1. Average time spent from kernel to decoding
    2. Average time spent from kernel to publishing

2. Explain how to test it

  1. Run tracee with the --metrics flag
  2. Open localhost:3366/metrics
  3. Check the pipeline metrics

3. Other comments

Resolves #3844
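For readers unfamiliar with the approach, here is a minimal, self-contained sketch of the wiring described above: a running-average counter exposed through prometheus.NewGaugeFunc. The avgCounter type, the metric name, and the sample values are illustrative stand-ins, not tracee's actual counter.Average implementation.

package main

import (
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// avgCounter is an illustrative stand-in for tracee's counter.Average:
// it keeps a running sum and count and reports their ratio.
type avgCounter struct {
	mu    sync.Mutex
	sum   uint64
	count uint64
}

func (a *avgCounter) Add(v uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.sum += v
	a.count++
}

// Average is read lazily by Prometheus on each scrape via GaugeFunc below.
func (a *avgCounter) Average() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 {
		return 0
	}
	return float64(a.sum) / float64(a.count)
}

func main() {
	avgTimeInPipeline := &avgCounter{}

	// Expose the running average as a gauge.
	prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Namespace: "tracee_ebpf",
		Name:      "avg_time_in_pipeline_ns", // illustrative metric name
		Help:      "Average time an event spends from kernel to publishing (ns)",
	}, avgTimeInPipeline.Average))

	// Fake samples; in the PR these come from per-event timestamp deltas.
	avgTimeInPipeline.Add(1200)
	avgTimeInPipeline.Add(1800)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":3366", nil)
}

With this sketch running, scraping localhost:3366/metrics as in the test steps above would show the (illustrative) tracee_ebpf_avg_time_in_pipeline_ns gauge.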

Resolved review threads: pkg/utils/time.go, pkg/ebpf/events_pipeline.go (outdated), pkg/metrics/stats.go
geyslan (Member) commented Feb 21, 2024

We have this update: #3875

Please rebase your PR against main to make use of the new workflow setup.

geyslan (Member) commented Apr 3, 2024

@NDStrahilevitz I'm not able to run the GitHub Actions on this. If it's ready for review, could you rebase it again? Thanks.

if err := ebpfMsgDecoder.DecodeContext(&eCtx); err != nil {
	t.handleError(err)
	continue
}
// eCtx.Ts is the timestamp the kernel recorded when it submitted the event;
// endTimeKernel was captured in userspace before decoding began.
startTimeKernel := eCtx.Ts
_ = t.stats.AvgTimeInKernel.Add(endTimeKernel - startTimeKernel)
Collaborator:
This value is actually the time in the kernel + submit time + the time it took tracee to read the buffer, isn't it?

Collaborator (Author):
It would be kernel time + submit time + time blocked in the channel (is that what you meant by read time?). Note that the endpoint timestamp is taken before we decode the buffer.
This time blocked in the channel is actually critical and I hadn't considered it; this measurement should be rethought.
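To make the point concrete, here is a small, self-contained toy (not tracee code) showing that when the end timestamp is taken at the moment the raw event is read from the channel, the measured delta absorbs any time the event spent queued, on top of kernel and submit time.

package main

import (
	"fmt"
	"time"
)

func main() {
	// ts plays the role of eCtx.Ts: a timestamp set when the event is "submitted".
	type rawEvent struct{ ts time.Time }

	events := make(chan rawEvent, 16)

	// Producer: stands in for the kernel side submitting an event.
	go func() {
		events <- rawEvent{ts: time.Now()}
	}()

	// Consumer is busy elsewhere, so the event sits in the channel for ~50ms.
	time.Sleep(50 * time.Millisecond)

	e := <-events
	end := time.Now() // endpoint taken before any decoding, as in the PR

	// The measured interval includes the ~50ms of channel wait,
	// not just the time spent in the kernel and the submit path.
	fmt.Printf("measured interval: %v\n", end.Sub(e.ts))
}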

t.streamsManager.Publish(ctx, *event)
_ = t.stats.EventCount.Increment()
// startTime is the event's original kernel timestamp, so this interval
// covers kernel + submit + the whole userspace pipeline up to publishing.
endTime := uint64(utils.GetMonotonicTime())
_ = t.stats.AvgTimeInPipeline.Add(endTime - startTime)
Collaborator:
Isn't this actually the time in the kernel+pipeline?
If so, consider renaming this to something else, e.g. AvgEventProcessingTime

Collaborator (Author):
Yep, this should be renamed. I think I originally included a subtraction of the former kernel time here, but it didn't work out. Anyway, that is why the original name was left over.

@@ -631,8 +634,10 @@ func (t *Tracee) sinkEvents(ctx context.Context, in <-chan *trace.Event) <-chan
case <-ctx.Done():
return
default:
startTime := uint64(t.getOrigEvtTimestamp(event)) // convert back to monotonic
Collaborator:
The usage of this getOrigEvtTimestamp is discouraged since future fixes to the timestamp normalization may cause it to break. Also see here: #3820 (comment)

Collaborator (Author):
I agree it's not ideal, but I'm not sure there's any better option until #3820 is resolved.

Comment on lines +21 to +22
AvgTimeInPipeline counter.Average
AvgTimeInKernel   counter.Average
Collaborator:
I think we should consider making these stats per-event type.
Different events have different behavior and processing time, and it would be much more informative to know about the average time of the different events.
WDYT?

Collaborator (Author):
That's how I originally wanted to do it, but I couldn't find a good way to represent it in Prometheus (ideally a histogram, but at the time I couldn't figure out how to implement it with their SDK). If you find it critical, this PR should probably be closed and reintroduced with that implementation in mind.
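For reference, a rough sketch of the per-event-type idea using the Prometheus Go client's HistogramVec; the metric name, label, and bucket layout here are made up for illustration and would need tuning against real event latencies.

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// eventProcessingTime records pipeline latency per event type as a histogram,
// rather than a single running average across all events.
var eventProcessingTime = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "tracee_ebpf",
		Name:      "event_processing_seconds", // illustrative name
		Help:      "Time an event spends in the userspace pipeline, by event type",
		Buckets:   prometheus.ExponentialBuckets(1e-6, 4, 10), // ~1µs to ~260ms
	},
	[]string{"event_type"},
)

func init() {
	prometheus.MustRegister(eventProcessingTime)
}

// observe would be called once per event as it leaves the pipeline.
func observe(eventType string, d time.Duration) {
	eventProcessingTime.WithLabelValues(eventType).Observe(d.Seconds())
}

This trades the simplicity of a single gauge for both the per-event-type breakdown and the latency distribution discussed here.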

Comment on lines +149 to +153
	if err != nil {
		return errfmt.WrapError(err)
	}

	return nil
Collaborator:
Suggested change: replace

	if err != nil {
		return errfmt.WrapError(err)
	}

	return nil

with

	return errfmt.WrapError(err)


err = prometheus.Register(prometheus.NewGaugeFunc(
	prometheus.GaugeOpts{
		Namespace: "tracee_ebpf",
Collaborator:
Not directly related to this PR, but we should consider renaming this namespace

NDStrahilevitz (Collaborator, Author) commented May 20, 2024

I'm thinking this PR should be redone at a later date, with a better measurement for kernel time, and split per event type into a histogram (@yanivagman, the time submission method you shared with me would work well for this).

Successfully merging this pull request may close these issues:
metrics: event time in userspace pipeline
3 participants