Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FW/Unix: fix handling of signals in wait_for_children() #430

Conversation

thiagomacieira
Copy link
Contributor

The rewrite in commit 37cb3e5 for wait_for_children() accidentally broke the handling of interruptions by signals on Unix systems, which exists to ensure all the processes get interrupted if the parent process is.

Handling of SIGHUP and SIGTERM was easy, since single_wait() was already returning the correct signal number, but wait_for_all_children() wasn't acting on it. Handling of SIGINT required remembering we had been interrupted once.

Fixes #429.

@thiagomacieira thiagomacieira force-pushed the FW_Unix_fix_handling_of_signals_in_wait_for_children_ branch 2 times, most recently from 83044d6 to fd92fe3 Compare October 19, 2023 22:35
jposwiata
jposwiata previously approved these changes Oct 23, 2023
Copy link
Contributor

@jposwiata jposwiata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 'quick fix' to handle 'immediate' signals. Fix is valid.
As there's no SIGINT handling ("overall"), more discussion/work is required (best would be another tasks).

@dawid-sabat
Copy link
Contributor

I verified this fix on my end as well.

kill -$signal $pid
if wait $!; then
# Confirm it was the correct signal
(( $? = 128 | $(kill -l $signal)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this needs to be ==

@thiagomacieira thiagomacieira force-pushed the FW_Unix_fix_handling_of_signals_in_wait_for_children_ branch 2 times, most recently from 5dae0d5 to 0ad43c0 Compare October 26, 2023 05:36
@dawid-sabat
Copy link
Contributor

dawid-sabat commented Oct 26, 2023

I see a problem when we run opendcdiag with --service parameter. In such case the framework doesn't quit gracefully.

It looks like to reproduce it we have to wait until --service ends running its batch of the tests and goes to sleep.

We might have found this issue #157

Comment on lines 1546 to 1551
if wait $!; then
# Confirm it was the correct signal
(( $? == (128 | $(kill -l $signal))))
else
echo >&2 'Test did not exit!'
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this valid?
wait waits forever (while process is active). It returns the status of background command, in our case it is expected that "signal" (128 + sig) is returned.
I don't understand :1548 purpose, is -e assumed to fail the script?
I rather expect something like:

wait $pid
stat=$?
if [ $stat != $((128 | $(kill -l $signal))) ]; then
    echo "Process ${pid} exited with ${stat}" 1>&2
    exit 1
fi

Is wait with timeout required here, or does .bats handling cover timeouts with upper layers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indeed looks wrong. We're probably running the else code all the time, but because this is passing, I didn't notice it. Weirdly, I can't debug this because if I add a false after this if so it will fail, bats fails completely:

1..3
# bats warning: Executed 0 instead of expected 3 tests

Changing to fd 3 does show that we're running on the wrong side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bats-core/bats-core#364 - I need to remove the trap

framework/sandstone.cpp Show resolved Hide resolved
The rewrite in commit 37cb3e5 for
wait_for_children() accidentally broke the handling of interruptions by
signals on Unix systems, which exists to ensure all the processes get
interrupted if the parent process is.

Handling of SIGHUP and SIGTERM was easy, since single_wait() was already
returning the correct signal number, but wait_for_all_children() wasn't
acting on it. Handling of SIGINT required remembering we had been
interrupted once.

Fixes opendcdiag#429.

[ChangeLog][Framework] Fixed a regression in handling of Ctrl+C or Unix
signals sent to the main process, which caused the tool to attempt to
continue running instead of exiting immediately.

Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
They're used now (have been since the `wait_for_children()` rewrite in
commit 37cb3e5).

Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
@thiagomacieira thiagomacieira force-pushed the FW_Unix_fix_handling_of_signals_in_wait_for_children_ branch from 0ad43c0 to f15bf3a Compare October 26, 2023 15:56
@jposwiata
Copy link
Contributor

I see a problem when we run opendcdiag with --service parameter. In such case the framework doesn't quit gracefully.

This is related to signal handling in waiting. Fixed with #439

Copy link
Contributor

@jposwiata jposwiata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thiagomacieira thiagomacieira merged commit b767456 into opendcdiag:main Oct 27, 2023
5 checks passed
@thiagomacieira thiagomacieira deleted the FW_Unix_fix_handling_of_signals_in_wait_for_children_ branch October 27, 2023 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Framework restarts the test after a signal is received
3 participants