Bazel 7 Skymeld Regression #22233

Open
JohnRambo opened this issue May 3, 2024 · 6 comments
Labels: more data needed; P2 (We'll consider working on this in future; assignee optional); team-Performance (issues for Performance teams); type: bug

Comments

JohnRambo commented May 3, 2024

Description of the bug:

Executing bazel test //... --keep_going on our repository when upgrading from Bazel 6.5.0 to 7.0.2 (or 7.1.1) causes extreme slowdown.

Appending --experimental_merged_skyframe_analysis_execution=false seems to fix the issue, which suggests the problem is in Skymeld. Note that we had Skymeld enabled on Bazel 6.5.0 without problems.

The issue manifests as very slow, one-at-a-time execution, with "checking cached actions" showing up on the CLI.

The Bazel profile is full of these:

[image]

[image]

and traces like:

[image]

Looking at the JFR, it seems the time is spent here, with a lot of map/string operations:

[image]

[image]

The problem seems to be stemming from this semaphore here.

We have a relatively large repo: Analyzing: 175831 targets (74369 packages loaded, 764940 targets configured). I had to fix a bunch of things to move us to Bazel 7, and because those fixes are not backward compatible, bisecting is not working well. I'm hoping this is enough for someone to know what's going on, but I'm happy to devote more time to identifying the root cause.

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I do not have a reproduction right now. Our repository is large and private, I'm happy to try and create something but would like to get some feedback on the issue first.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.0.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

It's a private repository

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Yes, 6.5.0 seems to work correctly whereas 7.0.2 fails, but I was unable to bisect cleanly.

Have you found anything relevant by searching the web?

There seem to be a bunch of reports around "checking cached actions", but they seem unrelated (like #21712).

I tried to search for Skymeld-specific issues but only found the coverage issue here, which would also block us from upgrading but is unrelated.

Any other information, logs, or outputs that you want to share?

No response

joeleba (Member) commented May 6, 2024

Thanks for filing the bug. Could you please try out Bazel at HEAD to see if this is still an issue? We recently made some changes to this part of the code.

> The issue manifests as very slow, one-at-a-time execution, with "checking cached actions" showing up on the CLI.

Hmm, part of Skymeld's action conflict check is sequential for correctness reasons, but this should not affect action execution. What's your --jobs value?

The "acquiring-semaphore" part is interesting. It should come from one of these two places: SkyframeActionExecutor or RemoteExecutionService. Potentially related: #19924

P/S: How large is your local action cache size?
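
For readers following along, here is a generic Python sketch of how a semaphore can produce the one-at-a-time symptom discussed above even with many worker threads. It is not Bazel's code: `run_with_permits`, the thread pool, and the sleep standing in for real work are all hypothetical. It only shows that peak concurrency is bounded by the permit count rather than by the number of workers.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_permits(num_jobs, permits, num_tasks=16, work_s=0.005):
    """Run tasks on num_jobs threads gated by a semaphore; return peak concurrency."""
    gate = threading.BoundedSemaphore(permits)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def task(_):
        # Threads queue here; in a trace this wait would show up as time
        # spent acquiring the semaphore.
        with gate:
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(work_s)  # stand-in for the real work (hypothetical)
            with lock:
                state["active"] -= 1

    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        list(pool.map(task, range(num_tasks)))
    return state["peak"]


if __name__ == "__main__":
    # Many workers, one permit: work proceeds strictly one at a time.
    print(run_with_permits(num_jobs=64, permits=1))
```

If something similar were happening in a build, raising the number of jobs would not help, since the extra threads would simply wait on the gate.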

JohnRambo (Author) commented May 6, 2024

> Could you please try out Bazel at HEAD to see if this is still an issue?

I built Bazel from source at commit 073e70188d3e09325c6531f50629abcc522ac425 on mainline. I am still observing similar behavior in the CLI, but let me know if you'd like me to collect the same or additional info on the HEAD run.

> What's your --jobs value?

When executing on RBE (which this does) we set it to --jobs=1024

> How large is your local action cache size?

I'm not sure I understand the question, so correct me if I'm wrong, but I think you are talking about the disk cache? I hadn't mentioned it, but I tried quite a few things to narrow this down to Skymeld, one of which was disabling the local disk cache with --disk_cache=, and that did not help.

joeleba (Member) commented May 7, 2024

Could you please provide additional info on your machine (OS, CPU type, memory, number of cores, ...), as well as the Bazel JSON trace profile that you showed above?

> I'm not sure I understand the question, so correct me if I'm wrong, but I think you are talking about the disk cache? I hadn't mentioned it, but I tried quite a few things to narrow this down to Skymeld, one of which was disabling the local disk cache with --disk_cache=, and that did not help.

The terminology is a bit conflated, but I meant the action cache, i.e. --[no]use_action_cache.

Another issue possibly related to the chain of "acquiring semaphore" is #20478. Can you please try your build again with --noexperimental_throttle_remote_action_building?

JohnRambo (Author) commented

> Could you please provide additional info on your machine? (OS, CPU type, memory, number of cores, ...)

CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 6

OS
GNU/Linux

> as well as the bazel JSON trace profile that you showed above

I want to be careful about that because it contains some private information like project names. Let me see if I can redact it in some way.
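
One possible approach to that redaction, sketched in Python under several assumptions: the profile has been loaded as JSON, the private names appear as Bazel-style labels (for example `//some/package:target`) inside string values, and replacing each label with a short stable hash is acceptable. The regular expression and the `redact` helper are hypothetical and purely heuristic. They may over-redact (for example the `//host/path` part of a URL) or miss names that do not look like labels, so the output would still need a manual review before sharing.

```python
import hashlib
import re

# Heuristic pattern for Bazel-style labels such as @repo//pkg/path:target.
LABEL_RE = re.compile(r"(?:@[\w.\-]*)?//[\w./\-]*(?::[\w./\-+]+)?")


def _mask(match):
    """Replace a label with a short, stable hash so repeated names stay comparable."""
    digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:10]
    return f"//redacted:{digest}"


def redact(value):
    """Recursively mask label-like substrings in every string of a JSON value."""
    if isinstance(value, str):
        return LABEL_RE.sub(_mask, value)
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, dict):
        # Keys are left untouched; only values are scanned.
        return {k: redact(v) for k, v in value.items()}
    return value  # numbers, booleans, None pass through unchanged


if __name__ == "__main__":
    event = {"name": "Compiling //secret/project:lib", "dur": 5}
    print(redact(event))
```

Because the same label always maps to the same hash, timing patterns per target remain visible in the redacted trace.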

> The terminology is a bit conflated, but I meant the action cache, i.e. --[no]use_action_cache.

It still reproduces with --nouse_action_cache

> Another issue possibly related to the chain of "acquiring semaphore" is #20478. Can you please try your build again with --noexperimental_throttle_remote_action_building?

Same here: it still reproduces.

joeleba (Member) commented May 8, 2024

Could you also provide a diff of the time spent on BuildDriverFunction.checkActionConflict between 6.5.0 and 7.0.2? Just to confirm that we indeed spend more time there and the extent of the regression.
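
One way such a comparison could be produced is sketched below in Python. It assumes each run wrote a JSON trace profile (for example via --profile), that the file is in Chrome trace event format (a `traceEvents` list, or a bare list, whose events carry a `name` and a `dur` in microseconds), and that the relevant spans have a name containing `checkActionConflict`. That span name and the file names in the usage comment are assumptions, not something confirmed in this thread.

```python
import gzip
import json


def load_events(path):
    """Load trace events from a JSON trace profile, gzipped or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Chrome trace format is either {"traceEvents": [...]} or a bare list.
    return data["traceEvents"] if isinstance(data, dict) else data


def total_duration_us(events, needle):
    """Sum 'dur' (microseconds) over events whose name contains needle."""
    return sum(
        e.get("dur", 0)
        for e in events
        if isinstance(e, dict) and needle in str(e.get("name", ""))
    )


def compare(old_events, new_events, needle):
    """Return total seconds spent in matching events for both runs, plus the delta."""
    old_us = total_duration_us(old_events, needle)
    new_us = total_duration_us(new_events, needle)
    return {
        "old_s": old_us / 1e6,
        "new_s": new_us / 1e6,
        "delta_s": (new_us - old_us) / 1e6,
    }


# Usage (hypothetical file names):
#   old = load_events("profile-6.5.0.json.gz")
#   new = load_events("profile-7.0.2.json.gz")
#   print(compare(old, new, "checkActionConflict"))
```

Note that summing `dur` across threads gives total time spent in those spans, not wall-clock time, so it is best read as a relative indicator between the two versions.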

joeleba (Member) commented May 13, 2024

Another question: did you see any action conflicts reported in your build? You can either look in the terminal output or do a run with --nokeep_going.

meisterT added the "more data needed" and "P2" labels and removed the "untriaged" label on May 14, 2024.