Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race in conformance/interfaces/pthread_mutex_init/1-2.c #9

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sbc100
Copy link
Contributor

@sbc100 sbc100 commented Nov 9, 2021

This test was running deadlk_issue several times and expecting the
returned global to be zero if pthread_mutex_lock has not returned,
however the resetting on returned to zero was happening only after the
sem_post. The main thread is calling sched_yeild which on most
platforms is enough to cover up this issue, but not with WebAssembly.

Some alternatives to landing this change:

  1. We could just disable this test.
  2. We could make sched_yield actualy do something rather than be
    a no-op.

This test was running deadlk_issue several times and expecting the
returned global to be zero if `pthread_mutex_lock` has not returned,
however the resetting on `returned` to zero was happening only after the
`sem_post`.  The main thread is calling `sched_yeild` which on most
platforms is enough to cover up this issue, but not with WebAssembly.

Some alternatives to landing this change:

1. We could just disable this test.
2. We could make sched_yield actualy do something rather than be
   a no-op.
@kripken
Copy link
Member

kripken commented Nov 9, 2021

What would our options be in 2 - what could we make sched_yield do?

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 9, 2021

I was thinking one option might be emscripten_thread_sleep(0)... however, no matter what we do it doesn't eliminate the race here.. the race is real even if sched_yeild hides it most of the time.

@kripken
Copy link
Member

kripken commented Nov 9, 2021

I see, thanks.

I'm not too familiar with the pthreads API and race semantics etc., so I'd need to spend a significant amount of time to check this in depth. But unless you do want a detailed review from me, this lgtm to land, and is better than the other options.

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 9, 2021

Lets see if @kleisauke agrees with my assessment that this is bug in the test code.

I'm temporarily disabled this test as part of emscripten-core/emscripten#13006 because that change seem to make this test more flaky (although I don't know why exactly).

Copy link
Collaborator

@kleisauke kleisauke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Somehow I couldn't reproduce this race when adding -DVERBOSE=2 here (due to proxying to the main thread?) or when testing this with PR emscripten-core/emscripten#13007 (perhaps moving pthread_{cancel,join} from JS to musl resolved this?).

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 10, 2021

Yes, this race is very sensitive to things like printf and tracing or any kind of syscall made in the main thread.

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 10, 2021

Do you agree that there is an inadvertent race here? i.e. the setting of returned = 0; in this location was not done for some deliberate reason right?

@kleisauke
Copy link
Collaborator

I'm not sure if this is an inadvertent race, debugging these types of flaky races is difficult. I also tried to debug this on Alpine (which also uses musl), but with or without this change it always freezes on joining the first deadlk_issue thread.

See for details:

Details
$ cd tests/third_party/posixtestsuite
$ docker run --cap-add=SYS_PTRACE -v $(pwd):/app -w /app -it alpine:latest sh
/app # apk add build-base gdb
/app # make build-tests CFLAGS='-DVERBOSE=2' LDFLAGS='-pthread' POSIX_TARGET='conformance/interfaces/pthread_mutex_init'
/app # gdb ./conformance/interfaces/pthread_mutex_init/1-2.test
(gdb) run
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test 
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
 mutex 1 unlocking returned 0
 mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 174]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 174 exited]
[New LWP 175]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
 mutex 1 returned 0
 mutex 2 returned 0
Creating thread (deadlk)...
[LWP 175 exited]
[New LWP 176]
Thread releases the semaphore...
Cancel thread...
Thread canceled...

Thread 4 "1-2.test" received signal SIG33, Real-time event 33.
[Switching to LWP 176]
0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x0000000000000008 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) handle SIG33 nostop noprint pass
Signal        Stop	Print	Pass to program	Description
SIG33         No	No	Yes		Real-time event 33
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test 
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
 mutex 1 unlocking returned 0
 mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 178]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 178 exited]
[New LWP 179]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
 mutex 1 returned 0
 mutex 2 returned 0
Creating thread (deadlk)...
[LWP 179 exited]
[New LWP 180]
Thread releases the semaphore...
Cancel thread...
Thread canceled...

# freeze
^Z
Thread 1 "1-2.test" received signal SIGTSTP, Stopped (user).
0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00007fa2688fc6c7 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fa268940b84 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x0000000000000000 in ?? ()

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 10, 2021

I guess another way of putting it: Give what the test is measuring can you see any reason not to set returned to zero early on? To me it looked like it was intended that returned should be zero from the beginning and the fact that it was initialized to zero late is not intentional. returned being true is supposed to signify that the second pthread_mutex_lock has returned and it being set to 1 in a previous run was not supposed to effect the next run.

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 10, 2021

If you set VERBOSE=2 does that help remove the deadlock with alpine?

Great news that this change fixes the deadlock on alpine too.. I'm even more convinced now that this is the correct fix.

@kleisauke
Copy link
Collaborator

I guess another way of putting it: Give what the test is measuring can you see any reason not to set returned to zero early on? To me it looked like it was intended that returned should be zero from the beginning and the fact that it was initialized to zero late is not intentional. returned being true is supposed to signify that the second pthread_mutex_lock has returned and it being set to 1 in a previous run was not supposed to effect the next run.

Ah yes, you're right. It makes more sense to "reset" the returned var earlier to avoid any interference between the two threads.

If you set VERBOSE=2 does that help remove the deadlock with alpine?

I initially tested with -DVERBOSE=2, but it seems that it still freezes after the Thread canceled... print. Removing that definition doesn't work either.

Great news that this change fixes the deadlock on alpine too.. I'm even more convinced now that this is the correct fix.

Note that I was unable to resolve this deadlock on Alpine. I've tried this change-set, older versions, enabling/disabling verbose output to no avail. Perhaps it makes more sense to just disable these two tests?

We can also try to report this upstream in musl's mailing list, the segfault in posixtest.test_pthread_key_create_2_1 (see here) and posixtest.test_pthread_detach_1_2 (see reproducer instructions) should probably be reported as well.

@sbc100
Copy link
Contributor Author

sbc100 commented Nov 11, 2021

For now these two tests are disabled as "flaky" in emscripten .. so maybe I'll just leave it at that. I'll leave this PR open for posterity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants