Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rare case of infinite loop on Android #630

Open
delguoqing opened this issue Apr 1, 2024 · 9 comments
Open

rare case of infinite loop on Android #630

delguoqing opened this issue Apr 1, 2024 · 9 comments

Comments

@delguoqing
Copy link

Hi, maintainer,
I have encountered a problem of GC stuck in resend_lost_signals. Because it is a bit hard to reproduce this problem.
I will describe my findings as following:

1) assume that all threads are suspended in the GC_suspend_handler and haven't wake up from sigsuspend call.
2) Then GC then finished and want to restart all threads. It will begin by updating GC_stop_count to GC_stop_count + 1. Before it has a chance to raise signals to other threads, however, other threads might be wake up by other signals than the restart signal and find that GC_stop_count != my_stop_count, and then get out of the sigsuspend while loop. It will also set the last_stop_count of its thread to be GC_stop_count | 1.
3) After that, the GC thread will execute GC_restart_all, and decide that the thread metioned in 2) has already started, so it will not raise signal to it. And that thread is not counted into n_live_threads.
4) n_live_threads will be used in resend_lost_signals_retry, so the calling thread will call sem_timedwait "Nthread -1" times while GC_suspend_ack_sem will have a value of Nthread. That is when things start going wrong.
5) So if everything goes well, GCs after this point, can start before other threads have completely stopped. ( because it has 1 extra ack from previous rounds) That already causes trouble. The infinite loop case is just one of them. If, some signal comes in which takes a lot of time to execute, sem_timedwait will timeout, and go into the resend_lost_signal function. At that case, we might have ack_count > n_live_threads. I know we have handled the case of newly_sent < n_live_threads - ack_count due to lost of some threads. But if the opposite happens, the thread will be trapped infinitely, even if we have been successfully restart all threads, we still have ack_count greater than n_live_threads, and the loop will continue.

When Android is creating a bug report. It will send a SIGQUIT signal to "Signal Catcher" thread, the signal catcher thread will in turn send SIGRT_1 to all threads, to stop them and grab the ucontext to do stack unwind. That signal might wake up threads in sigsuspend, thus causing what I have described above.

Hope you can take some time, and give some comments. I'll try to provide any more information you need.
@ivmai
Copy link
Owner

ivmai commented Apr 1, 2024

Might be related: #396

@delguoqing
Copy link
Author

Similar to #396 , If GC_ASSERTIONS is enabled in my project, the process will abort at that

  sem_getvalue(&GC_suspend_ack_sem, &i);
  GC_ASSERT(0 == i);

#396 seems to be related to fork, this is not the case in my project though.

I have a fix for this problem and want to contribute, How I can do that?

@ivmai
Copy link
Owner

ivmai commented Apr 3, 2024

I have a fix for this problem and want to contribute

Cool. Just make a PR and reference it here.

@ivmai
Copy link
Owner

ivmai commented Apr 3, 2024

Probably the fix could be (not sure, I've not looked at it deeply):
} while (ao_load_acquire_async(&GC_stop_count) == my_stop_count
->
} while ((ao_load_acquire_async(&GC_stop_count) & ~(AO_t)THREAD_RESTARTED) == my_stop_count

Probably some other fix exist, let's see your version.

@delguoqing
Copy link
Author

Hello, I have created a pull request. #631
I have tried to make my best and described this problem clearer. Can you take a look and make some comments?

@ivmai
Copy link
Owner

ivmai commented Apr 7, 2024

Good, I will review it within a week.
Could you please use your real name in commit (--author)?
(Re-pushing the PR should also run the tests on Travis CI.)

@delguoqing
Copy link
Author

Hi, I have changed user.name and re-open a PR #633
Although I have commited using my real name, Github displays my github account name anyway.

@delguoqing
Copy link
Author

@ivmai anything I can do to move forward?

@ivmai
Copy link
Owner

ivmai commented May 9, 2024

No-no, sorry about delay((
(I'm doing big refactoring of the code base these days, it takes me longer than planned, and I just don't want to switch to deeper review of your patch. I hope I will complete it within May.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants