[Do not merge] CI test, cmake option to enable address santitizer. #2890

bernhardu · 2021-06-10T03:49:40Z

This are the patches of my recent address santizer work.

With these all tests pass except two.
But at least some take much longer, the whole run around 24 minutes instead of 9.

The tests nested_detach-no-syscallbuf and nested_detach-32-no-syscallbuf do not finish even with 980 seconds timeout, and given the tests nested_detach and nested_detach-32 finish also with asan in a few seconds, I guess they show a hanging test.

What do you think? Is there something already suitable for merging?

bernhardu · 2021-06-10T11:28:31Z

From #2883: We would also need to be able to run rr with the ASAN library dynamically linked, but NOT pass that library name through to rr's spawned children.

Here "dynamically linked" means this?

$ ldd bin/rr
        libasan.so.6 => /usr/lib/x86_64-linux-gnu/libasan.so.6 (0x00007f730ea2e000)
...

Is this "NOT pass that library name" just related to LD_PRELOAD? Could you please give some more details?

rocallahan · 2021-06-10T11:39:52Z

Is this "NOT pass that library name" just related to LD_PRELOAD? Could you please give some more details?

Ah, I was thinking that ASAN uses LD_PRELOAD, but I guess it doesn't the way you're building it.

bernhardu · 2021-06-10T11:49:49Z

Ah, I was thinking that ASAN uses LD_PRELOAD, but I guess it doesn't the way you're building it.

That would be at least needed when inspecting a library built with ASAN with an executable built without.
But yes, while working at the strip_outer_ld_preload change, I got the impression that ASAN prepends itself to LD_PRELOAD when spawning childs to make sure it is the first library. But I could not trace any negative consequence back to it, when running the tests.

Bob131 · 2021-06-11T00:23:21Z

Dang, you beat me to it! I've also been working on this, but it seems I spent too much time messing with hacks ;)

I have some comments (and maybe some patches) that might help.

Bob131 · 2021-06-11T00:25:28Z

CMakeLists.txt

@@ -522,6 +522,12 @@ if (x86ish)
  set(RR_SOURCES ${RR_SOURCES} src/test/x86/cpuid_loop.S)
 endif()

+option(asan "Build with address santitizer enabled.")
+if (asan)
+  set(ASAN_FLAGS "-fno-omit-frame-pointer -fsanitize=address")


Is -fno-omit-frame-pointer necessary here? If this was because of less-than-useful stacktraces from leaks, there's ASAN_OPTIONS=fast_unwind_on_malloc=false. I'm wondering if -fno-omit-frame-pointer might be causing some of the test timeout issues?

Is -fno-omit-frame-pointer necessary here?

I am just using it because the usage example I looked at had it too. Have to test if that makes a difference.
At least even with this the stack e.g. in 7214d96 was not complete.

Bob131 · 2021-06-11T00:27:35Z

CMakeLists.txt

@@ -1539,7 +1539,7 @@ if(BUILD_TESTS)
  # reduced single-core performance.
  exec_program(cat ARGS "/proc/cpuinfo" OUTPUT_VARIABLE CPUINFO)
  string(REGEX MATCH "^.*(Xeon Phi).*$" CPU_MODEL_PHI ${CPUINFO})
-  if(NOT "${CPU_MODEL_PHI}" STREQUAL "")
+  if(NOT "${CPU_MODEL_PHI}" STREQUAL "" OR asan)


I wasn't seeing this kind of issue running with an ASan build. Would you mind checking if you're still getting timeouts without this commit and without -fno-omit-frame-pointer? (Sorry, I know the test suite speed is kinda painful)

I have removed -fno-omit-frame-pointer, but timeouts seem unchanged:

1265/2625 Test #869: conditional_breakpoint_offload-no-syscallbuf ...................... Passed 261.44 sec 2620/2625 Test #2358: x86/string_instructions_async_signals_shared-32 ................... Passed 264.50 sec 1280/2625 Test #868: conditional_breakpoint_offload .................................... Passed 274.21 sec 2586/2625 Test #2180: conditional_breakpoint_offload-32 ................................. Passed 329.61 sec 2/3 Test #1217: record_replay-no-syscallbuf ...... Passed 533.47 sec 2624/2625 Test #2528: record_replay-32 .................................................. Passed 574.54 sec 2625/2625 Test #2529: record_replay-32-no-syscallbuf .................................... Passed 608.18 sec 2621/2625 Test #1216: record_replay ..................................................... Passed 741.66 sec

With these tests still failing:

1203 - nested_detach-no-syscallbuf (Failed) 2515 - nested_detach-32-no-syscallbuf (Failed)

So from my point of view the -fno-omit-frame-pointer does not change much.

Just to avoid misunderstanding - I run the test suite also in parallel (ctest --verbose -j$(nproc)).
Therefore the it depends much of the related other tests that run at that time.

Bob131

This doesn't really solve the issue, and it actually masks the fact that support for running ASan binaries in rr is currently broken (I'll open a PR to fix that in a bit). This is actually what I've been stuck on for a few days.

I'm thinking maybe the right approach is to detect ASan-using binaries before they exec() and fix-up LD_PRELOAD there.

EDIT: Woops, this was regarding 23fafc1 (" Run child process with ASAN_OPTIONS=verify_asan_link_order=0. ")

Bob131 · 2021-06-11T00:36:46Z

src/Task.cc

@@ -3671,6 +3671,12 @@ void Task::dup_from(Task *other) {
      create_mapping(this, remote_this, km);
      LOG(debug) << "Copying mapping into " << tid;
      if (!(km.flags() & MAP_SHARED)) {
+#if defined(__x86_64__) && defined(__SANITIZE_ADDRESS__)


I think possibly masks a real issue that maybe should be investigated? I haven't looked into it myself yet.

Yes this is just a workaround for asan mapping a really huge area, which cannot be allocated by plain malloc.
Maybe rr has to reproduce this mapping and copy just the "used" memory. Have not looked any further into this.

Today I tried to modify the copy_mem_mapping to not try to allocate the whole chunk in one, instead to copy by e.g. 256 MB blocks. Unfortunately this did just start to fill RAM.
I guess if we really try to duplicate such a 20 TB "MAP_NORESERVE" mapping, a way is needed to tell pages which got accessed once and therefore are "in use" apart from these pages, which never got touched, and therefore the kernel did never allocate a hardware page for it. (If I understand this situation right?)

One thing you can do here is read /proc/<pid>/pagemap and check if the page frame number for each anonymous page is zero. If it is, the page is all zeroes.

Thanks for the hint.
I found the page frame number (PFN), bits 0-54 were all 0.
By testing the bit 63 "page present" seems to better show if something is in use.
Unfortunately testing the whole 20 TB increases the test time from 3 seconds to nearly 400 seconds.
And all in all there seem just around 126 pages of it in use.
So I guess this is not the final solution.
88792e9 contains my attempt to use /proc/pid/pagemap.

That program processes 1M pages per read (so 4GB instead of 1GB per read), and uses pread() to halve the number of syscalls, but I don't think that would explain why my program is 20x faster than your code.

Your code calls page_size(), i.e. sysconf(_SC_PAGE_SIZE), twice per chunk (and twice per page present!) ... maybe that's really slow? Or maybe building rr non-optimized, the std::vector operations (size() and operator[]) per page, are really slow.

Thanks for your sample code.
I tried to do some modifications:

4m55: before

5m18: Assigned page_size() to a local variable (tested just once, seems I did something in parallel)

2m10: replaced the vector by a c-array

Then I changed the loop to your version (I hope it is ok to use it?).
And then I noticed, that I commented the && km.size() >= 0x100000000 and therefore used
copy_mem_mapping_just_used for all MAP_NORESERVE mappings,
which might not use holes and therefore get copied slowly page by page.
If I just use it for I got it down to 35 seconds.

Did you use pread instead of lseek/read? You should.

copy_mem_mapping_just_used for all MAP_NORESERVE mappings,
which might not use holes and therefore get copied slowly page by page.

Seems like the right thing to do is to identify runs of consecutive pages that are present, and handle each run with a single copy_mem_mapping call. I think you should use copy_mem_mapping_just_used for all anonymous mappings (which you're not checking for currently!), not just MAP_NORESERVE, and not just for big mappings.

If I remember correctly I first found that just the first lseek with read was needed, so moved lseek before the loop.
But I could not see an improvement in time. Similar for removing lseek and changing read to pread.
Then I ripped out my version and used your lines.

I tried to add this identification of consecutive pages and call it for all sizes of ANONYMOUS|NORESERVE mappings.

bernhardu · 2021-06-11T11:21:24Z

This doesn't really solve the issue, and it actually masks the fact that support for running ASan binaries in rr is currently broken (I'll open a PR to fix that in a bit). This is actually what I've been stuck on for a few days.

I'm thinking maybe the right approach is to detect ASan-using binaries before they exec() and fix-up LD_PRELOAD there.

EDIT: Woops, this was regarding 23fafc1 (" Run child process with ASAN_OPTIONS=verify_asan_link_order=0. ")

Because I was specifically interested in running just the tests against an asan enabled rr (where I intentionally did not build the tests with asan), I took this "shortcut".
This also gets just activated when building rr with asan. Running a asan enabled binary with a regular rr should still show the issue.

bernhardu · 2021-06-14T09:47:20Z

Rebased to current head, because a few cheap patches got merged separately. Remaining patches are unchanged.

Bob131 · 2021-06-14T15:31:25Z

src/RecordSession.cc

@@ -2230,6 +2230,10 @@ static string lookup_by_path(const string& name) {
    env.push_back("SYSTEMD_RDRAND=0");
  }

+#if defined(__SANITIZE_ADDRESS__)
+  env.push_back("ASAN_OPTIONS=verify_asan_link_order=0");


I've been thinking about this, and I'd suggest that this should be fine as a testsuite-specific hack provided there's a comment mentioning that

This is needed to run the test suite with an ASan-instrumented rr because of rr's lack of support for children execing ASan binaries, and should be removed if/when rr gains such support.

This is specific to the case of running the test suite and isn't supposed to help with general usage.

Please feel free to word it slightly less awkwardly ;)

Going to add a comment.

Bob131 · 2021-06-14T15:32:09Z

src/test/ignore_nested.c

@@ -5,11 +5,17 @@
 // environment. The runner uses this to test
 // --nested=ignore in the absence of RR_UNDER_RR
 // environment variable.
-int main(int argc, char *argv[]) {
+int main(int argc, char *argv[], char *envp[]) {


Ditto above re comment.

This whole change to the test seems not needed in a tree that contains your vdso soname patch, so I remove this from my draft.

bernhardu · 2021-06-22T20:50:54Z

@Bob131 I hope it is ok to integrate your patches to this draft?

bernhardu · 2021-06-24T12:20:26Z

src/RecordCommand.cc

@@ -629,7 +629,7 @@ static WaitStatus record(const vector<string>& args, const RecordFlags& flags) {
      flags.use_syscall_buffer, flags.syscallbuf_desched_sig,
      flags.bind_cpu, flags.output_trace_dir,
      flags.trace_id.get(),
-      flags.stap_sdt, flags.unmap_vdso);
+      flags.stap_sdt, flags.unmap_vdso, flags.asan);


@khuey While working at this draft branch I found that the asan flag is currently set but never used, as far as I see.
And I am guessing right, that this flag is intended to go into RecordSession::create?
Should I put this into a separate PR?

bernhardu · 2023-03-28T20:39:42Z

Just rebased the patches from 2021 to current git tip.
Half of them are still just for better control of running the tests.

Keno · 2023-03-28T22:32:57Z

Can we split this up into a few separate PRs?

Just the CMake option to turn on ASAN?
The sparse copy stuff
The preload changes

and then if we still can't figure out the test failures, the extra changes to disable tests, but I'd prefer to just go through and debug why it's not working. I think it'll make it easier to review and faster to merge in isolated pieces.

…force32bit to 55 minutes.

Error message: [FATAL src/MemoryRange.h:20:MemoryRange()] start_ <= end_

CI showed timeouts with test: madvise_fracture_flags (exceeded 120s, aarch64) nested_release (exceeded 120s) record_replay (exceeded 600s)

Remove those tests for now. No idea how that could be avoided? Error message: [FATAL src/PerfCounters.cc:785:read_ticks()] 1 (speculatively) executed strex instructions detected. Example: https://buildkite.com/julialang/rr/builds/1141 BUILDKITE_AGENT_NAME="default-armageddon.3"

Avoids error message: ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD. Affected tests: nested_detach nested_detach_kill nested_detach_kill_stuck nested_detach_stop nested_detach_wait

Bob131 reviewed Jun 11, 2021

View reviewed changes

Bob131 mentioned this pull request Jun 11, 2021

Some ASan-related fixes #2892

Merged

bernhardu force-pushed the cmake-option-asan branch from c32472e to 1ed97e6 Compare June 14, 2021 09:44

Bob131 reviewed Jun 14, 2021

View reviewed changes

bernhardu force-pushed the cmake-option-asan branch from 1ed97e6 to 17bae24 Compare June 22, 2021 20:49

bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from ab793c9 to 30e13b8 Compare June 24, 2021 12:10

bernhardu commented Jun 24, 2021

View reviewed changes

bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from 60271f8 to 6655052 Compare June 25, 2021 13:36

bernhardu force-pushed the cmake-option-asan branch from 6655052 to 51f997e Compare July 5, 2021 14:57

bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from c87b21b to 1934fe4 Compare August 18, 2021 14:15

bernhardu force-pushed the cmake-option-asan branch from 1934fe4 to 1d0adda Compare March 28, 2023 20:22

bernhardu mentioned this pull request Mar 28, 2023

Use after free below update_syscallbuf_fds_disabled #3475

Closed

bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from 9444af3 to b2e9c5e Compare April 1, 2023 09:56

bernhardu force-pushed the cmake-option-asan branch 6 times, most recently from 1c9362c to 66f32e2 Compare April 22, 2023 15:05

bernhardu changed the title ~~RFC: Add cmake option to enable address santitizer.~~ [Do not merge] CI test, cmake option to enable address santitizer. Apr 22, 2023

bernhardu force-pushed the cmake-option-asan branch 14 times, most recently from 6645058 to 5c1c898 Compare April 29, 2023 11:47

bernhardu force-pushed the cmake-option-asan branch from 5c1c898 to b7601ac Compare July 10, 2023 15:08

bernhardu force-pushed the cmake-option-asan branch from b7601ac to 737068c Compare August 13, 2023 13:54

bernhardu added 6 commits September 10, 2023 15:14

buildkite: Add asan and force32bit variants to CI tests.

0b1a774

buildkite: Add per variant timeout value and increase aarch asan and …

ea7cb3c

…force32bit to 55 minutes.

Attempt to avoid assert in force32bit.

ac0227e

Error message: [FATAL src/MemoryRange.h:20:MemoryRange()] start_ <= end_

ASan: Increase timeout value.

930ce5b

CI showed timeouts with test: madvise_fracture_flags (exceeded 120s, aarch64) nested_release (exceeded 120s) record_replay (exceeded 600s)

bernhardu force-pushed the cmake-option-asan branch from 737068c to cc0d479 Compare September 10, 2023 13:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Do not merge] CI test, cmake option to enable address santitizer. #2890

[Do not merge] CI test, cmake option to enable address santitizer. #2890

bernhardu commented Jun 10, 2021

bernhardu commented Jun 10, 2021 •

edited

rocallahan commented Jun 10, 2021

bernhardu commented Jun 10, 2021 •

edited

Bob131 commented Jun 11, 2021

Bob131 Jun 11, 2021

bernhardu Jun 11, 2021

Bob131 Jun 11, 2021

bernhardu Jun 11, 2021

bernhardu Jun 22, 2021

Bob131 left a comment •

edited

Bob131 Jun 11, 2021

bernhardu Jun 11, 2021

bernhardu Jun 22, 2021

rocallahan Jun 22, 2021

bernhardu Jun 23, 2021

rocallahan Jun 24, 2021

rocallahan Jun 24, 2021

bernhardu Jun 25, 2021

rocallahan Jun 25, 2021

bernhardu Jun 25, 2021

bernhardu commented Jun 11, 2021

bernhardu commented Jun 14, 2021

Bob131 Jun 14, 2021

bernhardu Jun 22, 2021

Bob131 Jun 14, 2021

bernhardu Jun 22, 2021

bernhardu commented Jun 22, 2021

bernhardu Jun 24, 2021

bernhardu commented Mar 28, 2023

Keno commented Mar 28, 2023

[Do not merge] CI test, cmake option to enable address santitizer. #2890

Are you sure you want to change the base?

[Do not merge] CI test, cmake option to enable address santitizer. #2890

Conversation

bernhardu commented Jun 10, 2021

bernhardu commented Jun 10, 2021 • edited

rocallahan commented Jun 10, 2021

bernhardu commented Jun 10, 2021 • edited

Bob131 commented Jun 11, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bob131 left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bernhardu commented Jun 11, 2021

bernhardu commented Jun 14, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bernhardu commented Jun 22, 2021

Choose a reason for hiding this comment

bernhardu commented Mar 28, 2023

Keno commented Mar 28, 2023

bernhardu commented Jun 10, 2021 •

edited

bernhardu commented Jun 10, 2021 •

edited

Bob131 left a comment •

edited