Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(trim-paths): trim SO and DW_AT_comp_dir symbols for root DI node #118518

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

weihanglo
Copy link
Member

@weihanglo weihanglo commented Dec 1, 2023

This is one way to fix #117652.

The issue

When --remap-path-scope=object is specified, user expect that there is
no local path embedded in final executables. Under object scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
split-debuginfo=packed|unpacked is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work.

  • On Linux, the root DW_AT_comp_dir of a compile unit seems to go into the binary executables.
  • On macOS, SO symbols are embedded in binary executables and libraries regardless a split-debuginfo file is built. Each SO symbol contains a path to the root source file of a debug info compile unit.

The approach

Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of unsplit-debuginfo
is present, as if it is kinda a "unsplit" debuginfo.

Unresolved issue

  • Not sure where to add more test to consolidate it.
  • Haven't investigate if we should apply the same logic to cranelift here.
  • Not sure if there is any other consequence doing this, but AFAIK debugging still works on macOS and Linux with profile.dev.trim-paths="object" fairly (with some extra tweaks in cargo).
  • This might be related?
    from https://github.com/llvm/llvm-project/blob/847d8457d16a7334ba39bdd35c70faa1b295304d/clang/lib/CodeGen/CGDebugInfo.cpp#L623-L631
    The DIFile used by the CU is distinct from the main source
    file. Its directory part specifies what becomes the
    DW_AT_comp_dir (the compilation directory), even if the source
    file was specified with an absolute path.
    

@rustbot
Copy link
Collaborator

rustbot commented Dec 1, 2023

r? @b-naber

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 1, 2023
@weihanglo
Copy link
Member Author

cc @Urgau if you're interested.

@weihanglo weihanglo added O-macos Operating system: macOS A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) F-trim-paths Feature: trim-paths labels Dec 1, 2023
@weihanglo weihanglo marked this pull request as draft December 4, 2023 19:51
@weihanglo
Copy link
Member Author

Not the right fix…

@weihanglo
Copy link
Member Author

r? @pnkfelix

as you're one member of debugging wg
Thanks!

@rustbot rustbot assigned pnkfelix and unassigned b-naber Dec 5, 2023
@weihanglo weihanglo marked this pull request as ready for review December 5, 2023 14:48
@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols when unsplit-debuginfo specified fix(trim-paths): remove SO symbols when object scope specified Dec 5, 2023
@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols when object scope specified fix(trim-paths): remove SO symbols for object scope Dec 5, 2023
@weihanglo weihanglo marked this pull request as draft December 7, 2023 23:41
@weihanglo
Copy link
Member Author

Find a way that can fix issue on both Linux and Darwin.

@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols for object scope fix(trim-paths): trim SO and DW_AT_comp_dir symbols for root DI node Dec 8, 2023
@weihanglo
Copy link
Member Author

@bors try

@bors
Copy link
Contributor

bors commented Dec 8, 2023

⌛ Trying commit 4fa22f1 with merge 4311f2e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 8, 2023
fix(trim-paths): trim `SO` and `DW_AT_comp_dir` symbols for root DI node

This is one way to fix <rust-lang#117652>.

## The issue

When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work.

* On Linux, the root `DW_AT_comp_dir` of a compile unit seems to go into the binary executables.
* On macOS, `SO` symbols are embedded in binary executables and libraries regardless a split-debuginfo file is built. Each `SO` symbol contains a path to the root source file of a debug info compile unit.

## The approach

Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of `unsplit-debuginfo`
is present, as if it is kinda a "unsplit" debuginfo.

## Unresolved issue

* Not sure where to add more test to consolidate it.
* Haven't investigate if we should apply the same logic to cranelift [here](https://github.com/rust-lang/rust/blob/64d7e0d0b61c460fbc882ae37c0f236756dd9c39/compiler/rustc_codegen_cranelift/src/debuginfo/mod.rs#L68-L80).
* Not sure if there is any other consequence doing this, but AFAIK debugging still works on macOS and Linux  with `profile.dev.trim-paths="object"` fairly (with some extra tweaks in cargo).
@bors
Copy link
Contributor

bors commented Dec 8, 2023

☀️ Try build successful - checks-actions
Build commit: 4311f2e (4311f2ea0060c6b2a4a6abaaf3199c9e4eed9f7a)

@weihanglo weihanglo marked this pull request as ready for review December 8, 2023 21:02
@weihanglo
Copy link
Member Author

This is ready for review again :)

@bors
Copy link
Contributor

bors commented Jan 12, 2024

☀️ Try build successful - checks-actions
Build commit: d2d2599 (d2d2599faf05c1fde0679d2fba8ffb44d7a97468)

Copy link
Member

@michaelwoerister michaelwoerister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the PR, @weihanglo!

It looks to me, like it does the correct thing. I've added a few suggestions on how to make the tests more robust.

I'm not quite clear yet on the situation on macOS. Do you have a link to some documentation on the various SO symbols?


// Path of working directory in the root DI node seems to be embedded in
// executables. Hence, we trim them when the scope of `unsplit-debuginfo`
// is present, as if it is kinda a "unsplit" debuginfo.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been able to verify that this is indeed the case for DWARF dwps (i.e. the DW_AT_comp_dir is only contained in the binary, not in the dwp). However, I'm not sure that this behavior is guaranteed, so I think we should test that the dwp and dwo files don't contain the compiler's working directory at all.

dsymutil -s $(TMPDIR)/foo | grep $(TMPDIR) || exit 1 # expected: `grep $(TMPDIR)` to exit 1
dsymutil -s $(TMPDIR)/foo | (! grep $(HERE)) || exit 1
rm -rf $(TMPDIR)/foo.dSYM
rm $(TMPDIR)/$(call BIN,foo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest changing all of the remapping tests in this file to do a sanity check:

  • remap paths to something that is unlikely to be contained in the binary by accident (e.g. /my-remapped-src and /my-remapped-cwd instead of /a and /b)
  • check that /my-remapped-src and /my-remapped-cwd are contained in the various files, so that we make sure the tests would actually fail if something is wrong. These run-make tests are very prone to bugs where the tests silently pass without testing anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, but I don't think it is in a good state to merge.

If you look closer at the changes in tests in the last commit, you can see it is like a whack-a-mole game.

  • Some symbols in split debuginfo file got overly remapped in object scope, and some should be remapped but didn't.
  • So do unpacked debuginfo in .o files.
  • .debug-line doesn't seem to be handled well.
  • When unpacked .o object files contain debuginfo, do we consider them split debuginfo files?

That leads to a question: Does llvm provide such a granular API for rustc to implement RFC 3127?

(or maybe this is not a task for me 😞)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't taken an in-depth look but DIBuilder::createFile does not seem to offer anything.

I think it's time to re-evaluate this particular aspect of RFC 3127. Maybe we just need to remove the split-/unsplit-debuginfo distinction and just have a single debuginfo scope. That still seems practical to me. I've created a todo for me to post something to the tracking issue. Let's mark the PR as blocked for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, Michael!

What I am worried about is that scope object and debug never actually work as defined in RFC 3127 unless we have some post-processing approaches. Reading through this post and other tools like clang and gcc, seems none of them offers such an option to deal with split-debuginfo files and binaries separately like RFC 3127 suggesting.

@@ -141,9 +205,12 @@ packed-remapped-single:
# - `.dwp` present
packed-remapped-scope:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a test with scope split-debuginfo,split-debuginfo-path that greps the entire .dwp file for remapped paths (i.e. without going through objdump or readelf).

@michaelwoerister
Copy link
Member

By the way, it looks to my like the RemapFileNameExt::for_scope implementations should all use .intersects() instead of .contains(), right?

@Urgau
Copy link
Contributor

Urgau commented Jan 15, 2024

By the way, it looks to my like the RemapFileNameExt::for_scope implementations should all use .intersects() instead of .contains(), right?

I don't think so, the goal the for_scope function is to return a local or remapped filename if the scope passed was enabled not if some part of the scope was enabled [1].

Otherwise only enabling the MACRO scope and checking for the presence of the OBJECT scope (MACRO | UNSPLIT_DEBUGINFO | SPLIT_DEBUGINFO_PATH [2]) would return a remapped filename, which is not what we want.

@michaelwoerister
Copy link
Member

don't think so, the goal the for_scope function is to return a local or remapped filename if the scope passed was enabled not if some part of the scope was enabled.

Here is my reasoning for why I think this should be .intersect():

  1. We have the following method:

    fn for_scope(&self, sess: &Session, scopes: RemapPathScopeComponents) -> Self::Output<'_> {
          if sess.opts.unstable_opts.remap_path_scope.contains(scopes) {
              self.prefer_remapped_unconditionaly()
          } else {
              self.prefer_local()
          }
    }

    where sess.opts.unstable_opts.remap_path_scope denotes which scopes should be remapped and scopes denotes all the scopes a given filepath belongs too. For example, if the path will be emitted to split-debuginfo, then scopes will be SPLIT_DEBUGINFO, if the path will end up in both in split-debuginfo and unsplit-debuginfo (for example because LLVM will just emit that single string into both artifacts), then scopes will be SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO.

  2. If the given path will only end up in a single scope (which is usually the case), then .contains() and .intersects() will yield the same result, so it does not matter.

  3. If the path ends up in multiple scopes, some of which are enabled and some of which aren't, then we have to decide how to deal with that. We can either only apply the remapping if all scopes are enabled (= .contains()) or apply the remapping if any of the scopes are enabled (= .intersects()). In both cases we don't have a perfect match. However, I would argue that we should definitely err on the side of applying the remapping too much rather than too little, because remapping is used for privacy reasons. Otherwise we could end up with private information being embedded even though a seemingly correct remapping is specified.

Alternatively, we could assert that scopes only has a single bit set. I'm not sure if there are cases where a given path ends up in two different places (I thought that would be the case for the DW_AT_comp_dir but that turned out to be wrong).

@Urgau
Copy link
Contributor

Urgau commented Jan 16, 2024

I'm not sure I follow your logic for the for_scope function. In particular this part:

scopes denotes all the scopes a given filepath belongs too.

Does this mean that you associate a filepath with scopes? Because that not how I see things.

When I implemented the RFC in #115214 I just made all the places that remapped paths to be conditional on the scope(s) they applied to (when emitting a diagnostic it's DIAGNOSTICS, for macro it's MACROS, ...). A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

For example, if the path will be emitted to split-debuginfo, then scopes will be SPLIT_DEBUGINFO

The variable scopes is not user-defined, it's defined in the code for each specific location. Only sess.opts.unstable_opts.remap_path_scope is user-defined.

if the path will end up in both in split-debuginfo and unsplit-debuginfo (for example because LLVM will just emit that single string into both artifacts), then scopes will be SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO.

scopes cannot currently ever be that combination. (well technically yes, as nothing prevents another dev to test for it but it's not the case currently)

Alternatively, we could assert that scopes only has a single bit set.

Well that's currently the case, it's only ever called with the MACRO and DIAGNOSTICS scopes. So we could assert this in the function.

@michaelwoerister
Copy link
Member

Does this mean that you associate a filepath with scopes? Because that not how I see things.

When I implemented the RFC in #115214 I just made all the places that remapped paths to be conditional on the scope(s) they applied to (when emitting a diagnostic it's DIAGNOSTICS, for macro it's MACROS, ...). A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

I think I might have formulated things in a confusing way. With "filepath" I just mean the path that is being remapped, that is, the thing that implements RemapFileNameExt.

A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

I'm not sure I agree with that (although the distinction is subtle). In my mind, we use the concept of "scopes" to model where a given path will be emitted to. For example, if we are about to emit some path as part of a diagnostic message, the we use .for_scope(DIAGNOSTICS). It's not DIAGNOSTICS because it is being emitted as part of some diagnostics machinery -- it is DIAGNOSTICS because it will show up in diagnostic output.

By extension, if we are storing a path to some place (like debuginfo in LLVM IR) that might then end up in multiple output artifacts and these output artifacts are in different scopes (e.g. LLVM might write it both to the object file and to the .dwo), then the corresponding .for_scope() needs to take the union of all scopes that the path will end up in. This is the reason why something like SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO could validly occur.

@Urgau
Copy link
Contributor

Urgau commented Jan 16, 2024

I think I might have formulated things in a confusing way. With "filepath" I just mean the path that is being remapped, that is, the thing that implements RemapFileNameExt.

I see, using the same word for a (subtle) different meaning. Thanks for clearing that up.

It's not DIAGNOSTICS because it is being emitted as part of some diagnostics machinery -- it is DIAGNOSTICS because it will show up in diagnostic output.

👍 Agree.

By extension, if we are storing a path to some place (like debuginfo in LLVM IR) that might then end up in multiple output artifacts and these output artifacts are in different scopes (e.g. LLVM might write it both to the object file and to the .dwo), then the corresponding .for_scope() needs to take the union of all scopes that the path will end up in. This is the reason why something like SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO could validly occur.

I think we already have the logic for handling this in should_prefer_remapped_for_codegen() called by RemapFileNameExt::for_codegen.

I also wouldn't expect the need to call for_scope with the SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO scopes, but I see the principle. I'm not sure we'll ever run into it, but if we do, we can change the contains to intersect.

# - Debuginfo in binary file
# - `.o` deleted
# - `.dSYM` present
# - in binary, paths from `N_SO` (source files) and `N_OSO` (object files) shouldn be remapped
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @michaelwoerister. I've move #118518 (review) to this thread so we can follow the discussion better.

I'm not quite clear yet on the situation on macOS. Do you have a link to some documentation on the various SO symbols?

It's also unclear to me. I have no idea if there is an official doc for Mach-O debug symbols. Here is what I've found:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to read up on those links by the end of the week. If we can't get to a clear conclusion, I think we should merge the PR nonetheless, for the improved behavior it provides on Linux.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, from reading through those links it sounds to me like:

  • SO symbols we can handle fine by remapping the paths generated by debuginfo.
  • OSO symbols are added by the linker. They are not needed after dsymutil has built the dSYM bundle. So it would be fine to remove them after building the dSYM bundle.

I think the best option would be to use the -oso_prefix linker option. My guess is that will not be a problem even on older macOS versions. But I don't know if we have a minimum supported XCode version (in addition to a minimum support macOS version).

Other than that, rewriting/removing the OSO symbols in a postprocessing step seems to be our only option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weihanglo, do you know how -Cstrip=debuginfo behaves on macOS? Does that remove the OSO symbols?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It runs strip -S on macOS and removes OSO symbols successfully, and the .dSYM bundle can successfully find sources for a simple hello-world.rs example.

Should we always pass -Cstrip=debuginfo to deal with OSO symbols? I am not sure…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like -Cstrip=debuginfo is now the default unless debuginfo is requested. We could also do a similar thing for when trim-paths is enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good, but only when Cargo is involved. OSO is still there if using rustc directly.

(granted, people invoking rustc directly should be able to set -Cstript=debuginfo)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or do you mean that rustc should implicitly strip debuginfo under some trim-paths circumstances?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking about the Cargo case. For direct rustc invocations, the defaults are less important, I think, because that's more of an advanced use case. The test cases here could simulate what we expect Cargo to do.

@michaelwoerister
Copy link
Member

Well that's currently the case, it's only ever called with the MACRO and DIAGNOSTICS scopes. So we could assert this in the function.

@Urgau, if you are interested in opening a PR that asserts that only a single scope is passed to for_scope, and that also adds a docstring saying that only a single scope is supported, you can r? me. Since for_scope currently accepts a set of scopes, I think it is a bit confusing at the moment. But a bit of documentation and an assertion, would be an easy fix for that.

@rustbot
Copy link
Collaborator

rustbot commented Jan 22, 2024

Failed to set assignee to me: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work on macOS. On macOS, `SO` symbols are embedded
in binary executables and libraries regardless a split-debuginfo file is
built. Each `SO` symbol contains a path to the root source file of a
debug info compile unit.

This commit demonstrates the case, and hope there is a fix soon.
When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work on Linux. On Linux, the root `DW_AT_comp_dir`
of a compile unit seems to go into the binary binary executables.

This commit demonstrates the case, and hope there is a fix soon.
Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of `unsplit-debuginfo`
is present, as if it is kinda a "unsplit" debuginfo.
@michaelwoerister michaelwoerister added S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 24, 2024
@michaelwoerister
Copy link
Member

Marking the PR as blocked until we've discussed whether the RFC can be implemented at all in its current form. See #118518 (comment)

fmease added a commit to fmease/rust that referenced this pull request Jan 24, 2024
…chaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? `@michaelwoerister`
@weihanglo weihanglo marked this pull request as draft January 24, 2024 15:36
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jan 25, 2024
…chaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? ``@michaelwoerister``
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 25, 2024
Rollup merge of rust-lang#120230 - Urgau:for_scope-single-scope, r=michaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? ``@michaelwoerister``
@bors
Copy link
Contributor

bors commented Mar 29, 2024

☔ The latest upstream changes (presumably #122450) made this pull request unmergeable. Please resolve the merge conflicts.

@weihanglo
Copy link
Member Author

I am a bit lost. @michaelwoerister, this is still worth pursuing, or at least I should verify the current situation again, right?

@Urgau
Copy link
Contributor

Urgau commented Mar 29, 2024

Some of the test changes look worth landing, even if the logic change is no longer necessary.

Better have more tests than not enough.

@michaelwoerister
Copy link
Member

It's definitely worth having tests for the trim-paths related functionality. Maybe its better to create a number of smaller test cases. It might also be possible to use regular UI tests for this. I recently saw something pretty clever in another PR where a UI test looks for its own PDB file:
https://github.com/rust-lang/rust/blob/95adb6d7b520230971522c8806cc6ab518e419c5/tests/ui/debuginfo/msvc-strip-debuginfo.rs

We could do the same, i.e. just load binaries and separate debuginfo files into memory and then treat them as byte slices where we look for the text we want to find (or don't want to find). That should be a lot more readable than run-make tests. If compiletest allows it, we can also put some shared code into a common aux-crate.

Regarding what to test: I suggest picking the most relevant configurations from the table in the tracking issue and then have a test for linux, macOS, and MSVC. For each useful (platform, trim-paths, split-debuginfo) combination, we compile with the commandline options that we expect Cargo to pass for that combination and then make sure the binaries and separate debuginfo files are either sanitized or not sanitized. I don't think we need to test the less useful combinations.

It might be good to split this into a number of smaller PRs, starting with a simple case just for Linux or macOS. Feel free to assign to me for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) F-trim-paths Feature: trim-paths O-macos Operating system: macOS S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only with all or split-debuginfo can -Zremap-path-scope remap SO symbols on macOS
7 participants