chore(pkg/bpf): Replace libbpfgo with cilium/ebpf #1438
Conversation
Force-pushed from 5fa556a to 42303d8.
@rootfs @dave-tucker which release is this targeted for?
I don't have a target in mind. But I doubt it will be ready for review until Friday at the earliest.
Force-pushed from 08d0a91 to d031ddd.
I thought I was close, but I think I found an issue in #1443 that causes failures when perf events can't be opened for some reason. I've cherry-picked that here to see if it makes CI green; it applies to the libbpfgo code too, so it's another PR.
lgtm!
@dave-tucker we may want to look at this crash before merging.
I get the same using the
@sthaha I just force pushed to fix a small bug in the cleanup logic (forgetting to clear the …). I've tried locally on Fedora 40, both running the binary directly and via compose, and can't reproduce. I've also tried on RHEL 9.4 and RHEL 9.2 and still can't reproduce. I'm also not seeing anything in the logs stating that any of the eBPF probes couldn't be loaded 🤔 TL;DR I'm a bit stuck on fixing this as I can't reproduce the issue.
I tried with the latest commit, 6d8539b, and it runs fine.
Looks good to me!
Will let @vimalk78, @marceloamaral, or @rootfs take another look before merging.
Sorry for sharing comments a bit late.
pkg/bpf/kepler_bpfel.go
Outdated
Everything is the same in kepler_bpfeb.go and kepler_bpfel.go; the only differences are the embedded bpf.o object file and the Go build tags. Not sure if the generator allows it, but can we keep only var _KeplerBytes []byte in these files?
No, these are generated files and we're not allowed to touch them.
pkg/bpf/exporter.go
Outdated
for _, m := range specs.Maps {
    // Only resize maps that are of type PerfEventArray
    if m.Type == ebpf.PerfEventArray {
        m.MaxEntries = uint32(numCPU)
    }
}
Earlier we were resizing all CPU-related arrays. I compared using bpftool.
From main:
97: perf_event_array name cpu_instruction flags 0x0
key 4B value 4B max_entries 16 memlock 448B
pids kepler-main(445082)
98: array name cpu_instruction flags 0x0
key 4B value 8B max_entries 16 memlock 448B
btf_id 348
pids kepler-main(445082)
From this branch:
75: perf_event_array name cpu_instruction flags 0x0
key 4B value 4B max_entries 16 memlock 448B
pids kepler(444822)
76: array name cpu_instruction flags 0x0
key 4B value 8B max_entries 128 memlock 1344B
btf_id 311
pids kepler(444822)
There are similar differences in the other array sizes as well.
The name of the .bss map seems to have changed, and the size is also different.
main branch:
106: array name kepler.bss flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 348
pids kepler-main(445082)
this branch:
72: array name .bss flags 0x0
key 4B value 4B max_entries 1 memlock 328B
btf_id 307
The .bss map size doesn't matter, but let me take a look at the other one and make sure it gets resized correctly.
klog.Infof("Successfully load eBPF module from libbpf object")
This log line was useful for checking whether eBPF was loaded successfully.
Also, if possible, can we get some logs from cilium? Earlier there used to be logs from libbpf; they could be useful for debugging any issues.
I'll add the log line back.
I don't think we want to log everything from cilium/ebpf and the eBPF verifier (as libbpfgo did) as it's quite noisy, and not really useful for general operation (i.e. when there are no errors).
We do still get useful log messages when there's been a failure though - I've used that a few times due to mistakes in this PR 😄
unixClosePerfEvents(e.cpuCyclesPerfEvents)
e.cpuCyclesPerfEvents = nil
unixClosePerfEvents(e.cpuInstructionsPerfEvents)
e.cpuInstructionsPerfEvents = nil
unixClosePerfEvents(e.cacheMissPerfEvents)
e.cacheMissPerfEvents = nil
We also need to close taskClockPerfEvents here, right?
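A minimal sketch of cleanup that covers every perf-event group, including taskClockPerfEvents. The field names are taken from the snippet above plus the missing group; unixClosePerfEvents is modeled here with syscall.Close, while the real code may close fds differently (e.g. via golang.org/x/sys/unix).

```go
package main

import (
	"fmt"
	"syscall"
)

// exporter models just the perf-event fd slices from pkg/bpf/exporter.go
// (field names assumed from the review snippet plus taskClockPerfEvents).
type exporter struct {
	cpuCyclesPerfEvents       []int
	cpuInstructionsPerfEvents []int
	cacheMissPerfEvents       []int
	taskClockPerfEvents       []int
}

// unixClosePerfEvents closes each raw perf-event fd; cleanup is
// best-effort, so errors are deliberately ignored.
func unixClosePerfEvents(fds []int) {
	for _, fd := range fds {
		_ = syscall.Close(fd)
	}
}

// Detach closes every perf-event group and clears each slice so that a
// repeated Detach is a harmless no-op.
func (e *exporter) Detach() {
	for _, group := range []*[]int{
		&e.cpuCyclesPerfEvents,
		&e.cpuInstructionsPerfEvents,
		&e.cacheMissPerfEvents,
		&e.taskClockPerfEvents, // the group the review found missing
	} {
		unixClosePerfEvents(*group)
		*group = nil
	}
}

func main() {
	e := &exporter{taskClockPerfEvents: []int{-1}} // dummy fd; Close(-1) just errors
	e.Detach()
	fmt.Println(e.taskClockPerfEvents == nil)
}
```

Iterating over slice pointers keeps the close-then-clear pairing in one place, which is what made the original six-line version easy to leave incomplete.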
func (e *exporter) CollectCPUFreq() (map[int32]uint64, error) {
    return make(map[int32]uint64), nil
}
We can discuss this with the community and delete the feature altogether.
- containerID, err := cgroup.GetContainerID(ct.CGroupID, ct.PID, config.EnabledEBPFCgroupID)
+ containerID, err := cgroup.GetContainerID(ct.CgroupId, ct.Tgid, config.EnabledEBPFCgroupID)
this is a significant change, can we keep it in a separate commit?
I could, but I'd have to put in a commit beforehand that fixes the (deliberate?) field mixup in the current struct. Maybe I'll do that anyway (outside of this PR), or we can do it in #1481.
As I commented in PR #1481, we should use PID as the kernel TGID so that we can remove the TGID from user space.
cilium/ebpf is a pure Go eBPF package and is used in a number of popular cloud-native projects. The benefits to Kepler are:
1. Bytecode is built using bpf2go and the C and Go structs are kept in sync automatically
2. There is no requirement for Cgo anymore and therefore no requirement to have libbpf or libelf installed to compile and/or to be dynamically linked at runtime
3. Simplified packaging as the correct bytecode is contained within the kepler binary
Overall I'm happy with this change, but there is only one thing that bugs me. We have to check in the bytecode object files (e.g. kepler.bpfel.o) or the Go tooling (go lint/go vet) complains about the missing files. I couldn't reliably get `go generate ./...` to work to compile these files in CI. This is something which should be relatively easy to fix in the Makefile/CI environment before we cut a release.
Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
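For readers unfamiliar with bpf2go: it is typically wired up through a go:generate directive in the package, which compiles the C source with clang and emits the little-endian/big-endian Go files plus the embedded bytecode objects the commit message mentions. The directive below is illustrative only; the exact flags, type names, and include paths in this PR may differ.

```go
// gen.go — illustrative config fragment, not necessarily what this PR uses.
// Running `go generate ./pkg/bpf/...` compiles kepler.bpf.c and emits
// kepler_bpfel.go / kepler_bpfeb.go together with the checked-in .o files.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang kepler kepler.bpf.c -- -I/usr/include/bpf
package bpf
```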
This is ready to go. I'm holding off doing a rebase though until #1481 has merged, so marking this as a draft until then.
klog.Infof("Number of CPUs: %d", numCPU)
for _, m := range specs.Maps {
    // Only resize maps that have a MaxEntries of NUM_CPUS constant
    if m.MaxEntries == 128 {
Can we check the condition by map name (or name prefix) if not by exact map name? The size comparison does not convey why we are resizing the map to the number of CPUs.
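One way to make that intent explicit is to key the resize on an allowlist of map names instead of the magic MaxEntries value of 128. A minimal sketch: mapSpec is a hypothetical stand-in for cilium/ebpf's ebpf.MapSpec, and apart from cpu_instruction (seen in the bpftool output earlier in this thread) the map names are assumptions, not necessarily Kepler's real ones.

```go
package main

import "fmt"

// mapSpec stands in for cilium/ebpf's ebpf.MapSpec; only the fields
// relevant to resizing are modeled here.
type mapSpec struct {
	Name       string
	MaxEntries uint32
}

// perCPUArrays names the maps sized per CPU. Apart from cpu_instruction,
// these names are illustrative assumptions.
var perCPUArrays = map[string]bool{
	"cpu_instruction": true,
	"cpu_cycles":      true,
	"cache_miss":      true,
}

// resizePerCPUMaps sizes only the named maps to the CPU count, so the
// reason for the resize is visible in the code rather than implied by a
// sentinel MaxEntries value.
func resizePerCPUMaps(specs map[string]*mapSpec, numCPU uint32) {
	for name, m := range specs {
		if perCPUArrays[name] {
			m.MaxEntries = numCPU
		}
	}
}

func main() {
	specs := map[string]*mapSpec{
		"cpu_instruction": {Name: "cpu_instruction", MaxEntries: 128},
		"processes":       {Name: "processes", MaxEntries: 32768},
	}
	resizePerCPUMaps(specs, 16)
	fmt.Println(specs["cpu_instruction"].MaxEntries, specs["processes"].MaxEntries)
}
```

With the real library the same loop would iterate over spec.Maps from ebpf.LoadCollectionSpec before the collection is loaded into the kernel.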
Depends-On: #1435
Fixes: #1424