Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Fixed eBPF Feature Detection #1443

Merged
merged 1 commit into from
May 18, 2024

Conversation

dave-tucker
Copy link
Collaborator

@dave-tucker dave-tucker commented May 16, 2024

On working to migrate to cilium/ebpf I started to hit issues with
nil pointer dereferences that indicated that something odd was happening
between eBPF feature detection and metrics export.

I discovered that there were additional global variables in many of the
packages whose values are set before eBPF feature detection has
happened.

It appears that in these cases - through the various layers of
indirection - we may try and read/write metrics that do not exist
causing a nil pointer dereference.

There appear to be several places where we're accessing map keys
without checking they are nil - that will need to be fixed in a
follow up. This PR attempts to at least ensure that from the BPF
side that the keys we expect to exist, do actually exist in the
maps that matter.

eBPF feature detection how produces a bpf.SupportedMetrics
struct, which is now consumed in various places in the codebase
to ensure that the metrics we're exporting match those supported
by the eBPF exporter.

@dave-tucker dave-tucker marked this pull request as draft May 16, 2024 15:13
@dave-tucker dave-tucker force-pushed the hcmetrics branch 2 times, most recently from e5a2371 to e37a3c2 Compare May 16, 2024 16:49
@dave-tucker dave-tucker changed the title chore: Remove IsHCMetricsEnabled feat: Fixed eBPF Feature Detection May 16, 2024
@vimalk78
Copy link
Collaborator

pls share the NPE, and steps to reproduce

@dave-tucker
Copy link
Collaborator Author

NPE:

goroutine 71 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).SumAllAggrValues(0x16d0[40](https://github.com/sustainable-computing-io/kepler/actions/runs/9114066283/job/25057276989?pr=1438#step:11:41)0?)
	/workspace/pkg/collector/stats/types/types.go:144 +0x16
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtil(0x16d3460?, {0x179f7e0?, 0xc0000e81c0?}, {0x18f5588, 0xd}, {0x1b67f28, 0xc000066368})
	/workspace/pkg/metrics/utils/utils.go:1[41](https://github.com/sustainable-computing-io/kepler/actions/runs/9114066283/job/25057276989?pr=1438#step:11:42) +0x6ed
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtilizationMetrics(0x1?, {0x179f7e0, 0xc0000e81c0}, 0xc0029d8180?, 0x0)
	/workspace/pkg/metrics/utils/utils.go:48 +0xc5
github.com/sustainable-computing-io/kepler/pkg/metrics/container.(*collector).Collect(0xc0025f8ed0, 0xc000061f60?)
	/workspace/pkg/metrics/container/metrics.go:105 +0xd8
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:[45](https://github.com/sustainable-computing-io/kepler/actions/runs/9114066283/job/25057276989?pr=1438#step:11:46)5 +0x102
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 1
	/workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:5[47](https://github.com/sustainable-computing-io/kepler/actions/runs/9114066283/job/25057276989?pr=1438#step:11:48) +0xbab

utils.CollectResUtil is the offending function, where it's assuming that a hashmap entry exists - which it may not.

As far as I can tell the NPE was caused by the following chain of events:

  • CollectResUtilizationMetrics gets called via the prometheus collector
  • It iterates over several arrays of constants e.g consts.SCMetricNames and consts.IRQMetricNames and then calls down into utils.CollectResUtil
    • NOTE: config.IsIRQCounterMetricsEnabled() gets checked before it tries to do anything with the IRQ metrics
  • CollectResUtil takes an interface {} argument (in this case operating on a ContainerStats object)
  • It indexes directly into ContainerStats.ResourceUsage["bpf_net_tx_irq"] without checking it exists

In which circumstances would "bpf_net_tx_irq" not exist then?

  • We can trace NewContainerStats -> NewStats and see that the ResourceUsage map is populated from a global variable called AvailableBPFSWCounters
  • InitAvailableParamAndMetrics sets the global ☝️ after the bpf programs are loaded - it was empty beforehand

So the only way this can occur is when config.IsIRQCounterMetricsEnabled() == true BUT we don't have any of the IRQ keys in the array of supported eBPF software counters. I'll continue to debug here to see why that's happening.

I'm not sure what I managed to do with the cilium/ebpf port and with this patch to make that issue worse... but there is definitely something fishy happening 🐟 Still trying to untangle this to find out what's going on.

@vimalk78
Copy link
Collaborator

NPE:

on which commit? does main + this PR reproduce it? or main + cillium PR ?

@vimalk78
Copy link
Collaborator

NPE:

on which commit? does main + this PR reproduce it? or main + cillium PR ?

got it. main + cillium PR

@dave-tucker
Copy link
Collaborator Author

main + cillium PR ?

That will reproduce it (for now). I'm only able to reproduce it in CI so using this PR to help debug it.

@dave-tucker dave-tucker force-pushed the hcmetrics branch 3 times, most recently from 859120e to a0ebf8c Compare May 16, 2024 19:35
@dave-tucker dave-tucker marked this pull request as ready for review May 16, 2024 19:36
@dave-tucker
Copy link
Collaborator Author

dave-tucker commented May 16, 2024

This PR is ready to go.

As for the root cause, I ended up hooking up a debugger and I found out that the issue was that in #1413 I'd forgotten to attach one of the perf_events - a software event called (config.TaskClock). I've fixed this and I'll push into CI tomorrow to see if it resolves the NPE.

EDIT: It still doesn't explain why the bpf_irq_net_tx counter would NPE.

Either way the issue is basically because we have 2 competing definitions of "what's supported by ebpf". This PR fixes that.

pkg/bpf/exporter.go Outdated Show resolved Hide resolved
pkg/bpf/exporter.go Outdated Show resolved Hide resolved
pkg/bpf/exporter.go Outdated Show resolved Hide resolved
pkg/bpf/exporter.go Outdated Show resolved Hide resolved
@dave-tucker dave-tucker force-pushed the hcmetrics branch 3 times, most recently from 645ee20 to b79cf26 Compare May 17, 2024 10:21
On working to migrate to cilium/ebpf I started to hit issues with
nil pointer dereferences that indicated that something odd was happening
between eBPF feature detection and metrics export.

I discovered that there were additional global variables in many of the
packages whose values are set before eBPF feature detection has
happened.

It appears that in these cases - through the various layers of
indirection - we may try and read/write metrics that do not exist
causing a nil pointer dereference.

There appear to be several places where we're accessing map keys
without checking they are nil - that will need to be fixed in a
follow up. This PR attempts to at least ensure that from the BPF
side that the keys we expect to exist, do actually exist in the
maps that matter.

eBPF feature detection how produces a bpf.SupportedMetrics
struct, which is now consumed in various places in the codebase
to ensure that the metrics we're exporting match those supported
by the eBPF exporter.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
}
}

if !config.ExposeHardwareCounterMetrics {
klog.Infof("Hardware counter metrics are disabled")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about we Warn?

Copy link
Collaborator

@sthaha sthaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rootfs rootfs merged commit 0baec47 into sustainable-computing-io:main May 18, 2024
20 checks passed
@dave-tucker dave-tucker deleted the hcmetrics branch May 22, 2024 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants