
Share work across configurations by setting working directory for actions then canonicalizing that working directory in RE #611

Open
thetimmorland opened this issue Mar 28, 2024 · 5 comments

Comments

@thetimmorland

Buck2 struggles to share work between configurations because the configuration hash appears in output artifact paths. Since an output artifact always appears in the command line, two configurations cannot share an action, even if the inputs and other command-line arguments of that action are identical; the action digests sent to remote execution (RE) will always differ, sometimes solely because of the output artifact path.

To increase sharing across configurations one approach would be to:

  1. Set up each action with the current working directory set to buck-out/v2/gen/root/$HASH and relativize all artifact paths appropriately.
  2. Enable canonicalization of the working directory in RE (e.g. buck-out/v2/gen/root/$HASH becomes buck-out/v2/gen/root/00000000000000 regardless of the config).

To make things more concrete, the action PWD=. cc main.c -o buck-out/v2/gen/root/200212f73efcd57d/__main__/main would become PWD=buck-out/v2/gen/root/200212f73efcd57d cc ../../../../../main.c -o __main__/main, but under RE it would actually run as PWD=buck-out/v2/gen/root/00000000000000 cc ../../../../../main.c -o __main__/main.
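The rewrite above can be sketched in a few lines of Python. This is only an illustration of the idea, not Buck2 or reclient code: `relativize_action`, `canonicalize_cwd`, and `CANONICAL_HASH` are hypothetical names, and the sketch assumes path arguments are already tagged (here as `PurePosixPath`) so they can be told apart from plain flags.

```python
import posixpath
from pathlib import PurePosixPath

# Placeholder hash taken from the example in this issue (an assumption, not a Buck2 constant).
CANONICAL_HASH = "0" * 14


def relativize_action(argv, config_dir):
    """Rewrite every path argument so it is relative to config_dir.

    Paths are project-relative; anchoring both sides at "/" keeps the
    computation independent of the process's real working directory.
    """
    start = posixpath.join("/", config_dir)
    rewritten = []
    for arg in argv:
        if isinstance(arg, PurePosixPath):
            rewritten.append(posixpath.relpath(posixpath.join("/", str(arg)), start))
        else:
            rewritten.append(arg)  # flags and other non-path arguments pass through
    return rewritten


def canonicalize_cwd(config_dir):
    """Replace the final component (the configuration hash) with the placeholder."""
    return posixpath.join(posixpath.dirname(config_dir), CANONICAL_HASH)


config_dir = "buck-out/v2/gen/root/200212f73efcd57d"
argv = [
    "cc",
    PurePosixPath("main.c"),
    "-o",
    PurePosixPath(config_dir + "/__main__/main"),
]
print(canonicalize_cwd(config_dir), relativize_action(argv, config_dir))
```

The tagging of path arguments is the part a real implementation would get from the build system, since a string heuristic cannot reliably tell a path from a flag value.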

With these changes the configuration hash should not affect RE action digests, so if two configurations happen to perform the same action, duplicate work is avoided. Since the compilation occurred with relative paths it should be safe to materialize it into a different directory than RE wrote to without breaking debug info or other path information which may leak into the output artifact.
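To illustrate why the digest stops depending on the configuration, here is a toy comparison. It is a sketch under loud assumptions: `toy_digest` hashes a JSON blob of the working directory and arguments, which is a stand-in for a real RE action digest (which also covers the input tree and other fields), and the second hash value is made up for the example.

```python
import hashlib
import json
import posixpath

CANONICAL_HASH = "0" * 14  # placeholder from the example in this issue


def toy_digest(cwd, argv):
    """Stand-in for an RE action digest: a hash over the working dir and arguments."""
    payload = json.dumps({"cwd": cwd, "argv": argv}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def digest_today(config_hash):
    """Action as it is sent today: the hash leaks in through the output path."""
    out = f"buck-out/v2/gen/root/{config_hash}/__main__/main"
    return toy_digest(".", ["cc", "main.c", "-o", out])


def digest_canonicalized(config_hash):
    """Action with a relative command line and a canonicalized working directory."""
    real_cwd = f"buck-out/v2/gen/root/{config_hash}"
    canonical_cwd = posixpath.join(posixpath.dirname(real_cwd), CANONICAL_HASH)
    return toy_digest(canonical_cwd, ["cc", "../../../../../main.c", "-o", "__main__/main"])


a, b = "200212f73efcd57d", "aaaaaaaaaaaaaaaa"  # second hash is hypothetical
print(digest_today(a) == digest_today(b))                  # digests differ today
print(digest_canonicalized(a) == digest_canonicalized(b))  # digests match after the rewrite
```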

The inspiration for this approach comes from https://github.com/bazelbuild/reclient which supports a -canonicalize_working_dir flag.

I just wanted to share the idea. If it is a smallish change maybe I could take a stab at it.

Thanks so much for all your hard work open sourcing buck2!

@cjhopman
Contributor

I think a challenge here is that basically every action has inputs that are produced by other actions. You would need to find a way to canonicalize those as well, and that's really difficult. Bazel had an interesting approach they were experimenting with described here: https://docs.google.com/document/d/17snvmic26-QdGuwVw55Gl0oOufw9sCVuOAvHqGZJFr4/edit#heading=h.5mcn15i0e1ch. I'm not sure what the status of that is today.

We're planning to experiment with another approach for this sometime this year. There are a lot of complications (that Bazel doc covers a few, and its comments touch on some others).

@thetimmorland
Author

> I think a challenge here is that basically every action has inputs that are produced by other actions. You would need to find a way to canonicalize those as well, and that's really difficult.

Assuming the build does not use transitions, wouldn't generated inputs already be canonicalized?

PWD=buck-out/v2/gen/root/00000000000000 ../../../../../gen_src.py -o __gen_src__/generated.c
PWD=buck-out/v2/gen/root/00000000000000 cc __gen_src__/generated.c -o __main__/main

@cjhopman
Contributor

I had assumed you meant a model where the action sees its output at the 00000000000 path but the output then gets rewritten to the real path, because keeping the 00000000 path simply doesn't work. Different actions need to produce different output paths, and that needs to be a deterministic mapping in the context of any possible build.

@zjturner
Contributor

This sounds very similar to the problem I've discussed a bunch of times in the past, where it's very difficult to apply different configurations per target, so you're forced into a global configuration that affects every target's hash even when it's unnecessary.

I've had some success with transition rules to strip out unnecessary constraints, and they've discussed implementing a feature called "configuration trimming" to automatically strip unnecessary constraints, but there is no guidance yet on if or when that will actually happen.

@thetimmorland
Author

> I had assumed you meant a model where the action sees its output at the 00000000000 path but the output then gets rewritten to the real path, because keeping the 00000000 path simply doesn't work. Different actions need to produce different output paths, and that needs to be a deterministic mapping in the context of any possible build.

From buck2's perspective, each action still has a unique output path. However, by applying the transformation described above before sending an action to RE and reversing it when you receive your response (there is some bookkeeping required for this), you are able to share cache hits between configurations by hiding the configuration hash from RE.
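A rough sketch of the reverse half of that bookkeeping, assuming RE reports each output as a path relative to the working directory together with a content digest. `restore_outputs` and the digest string are hypothetical; this is not how Buck2 represents RE responses.

```python
import posixpath


def restore_outputs(real_cwd, re_outputs):
    """Map outputs reported by RE back to the paths Buck2 expects.

    re_outputs maps a path relative to the (canonicalized) working directory
    to a content digest. Because the paths are relative, re-anchoring them at
    the real, configuration-specific directory is a plain join.
    """
    return {
        posixpath.normpath(posixpath.join(real_cwd, rel)): digest
        for rel, digest in re_outputs.items()
    }


# One cached RE response (the digest value is made up) serves two configurations.
response = {"__main__/main": "sha256:abc123"}
print(restore_outputs("buck-out/v2/gen/root/200212f73efcd57d", response))
print(restore_outputs("buck-out/v2/gen/root/aaaaaaaaaaaaaaaa", response))
```

Each configuration still materializes the artifact at its own unique path; only the view presented to RE is shared.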

> I've had some success with transition rules to strip out unnecessary constraints, and they've discussed implementing a feature called "configuration trimming" to automatically strip unnecessary constraints, but there is no guidance yet on if or when that will actually happen.

Yes, this definitely achieves a similar purpose to configuration trimming, either manual or automatic. I've read your previous threads and they've been very helpful :)
