
Fix test flakes #275

Open
ferglor wants to merge 3 commits into main
Conversation

ferglor (Collaborator) commented Sep 14, 2023

In this PR, we're trying to fix the test flakes in the recoverer.

From testing the old code, I wasn't sure of the expected behaviour of the recoverer, specifically:

  • When the underlying service wrapped by the recoverer errors, should we recover and restart it, or should we log the error and stop attempting to re-run it?
  • When the underlying service wrapped by the recoverer completes successfully (returns a nil error), presumably we want to stop the recoverer at this point?

In the previous implementation of the recoverer, if the wrapped service completed execution without error, the recoverer would block until it was explicitly closed by a client. In this version, the recoverer shuts down once the wrapped service completes without error.

Also in the previous implementation of the recoverer, if the wrapped service errored, the recoverer would do nothing and block until it was explicitly closed by a client. In this version, the recoverer tries to recover the wrapped service when the wrapped service errors.
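
To make the new behaviour concrete, here's a minimal sketch (in Go) of the run loop described above. The names here (recoverer, start, restartWait) are illustrative assumptions, not the identifiers used in the actual implementation:

package recoverer

import (
	"context"
	"log"
	"time"
)

// recoverer is an illustrative stand-in for the wrapper discussed in this PR;
// the names and fields are assumptions rather than the real implementation.
type recoverer struct {
	start       func(ctx context.Context) error // runs the wrapped service to completion
	restartWait time.Duration                   // pause before restarting after a failure
	log         *log.Logger
}

// run loops until the wrapped service completes without error or the context
// is cancelled; a non-nil error from the wrapped service triggers a restart.
func (r *recoverer) run(ctx context.Context) {
	for {
		err := r.start(ctx)
		if err == nil {
			// The wrapped service finished cleanly, so shut the recoverer down.
			return
		}
		r.log.Printf("wrapped service errored, restarting: %v", err)

		select {
		case <-ctx.Done():
			return
		case <-time.After(r.restartWait):
			// Brief back-off before the next restart attempt.
		}
	}
}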

The reason I wasn't clear on the expected behaviour is that the old tests seemed to imply similar handling for the panic and error scenarios, but the behaviour defined in the implementation didn't really align with that.

Either way, it's easy to have this recover on panics and errors, or on just one or the other. I've run these new tests thousands of times locally and I'm not seeing the deadlock anymore; previously, I saw it in roughly 1 in 500 runs.
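
For the panic case, here's a sketch of how a panic in the wrapped service could be folded into the same restart path, again with illustrative names only: the helper below converts a panic into an ordinary error, which the loop above could then handle exactly like any other failure. Dropping the recover() gives the fail-hard alternative instead.

package recoverer

import (
	"context"
	"fmt"
)

// runRecovering performs a single attempt of the wrapped service and turns a
// panic into an ordinary error, so a restart loop can treat panics and errors
// the same way. Illustrative sketch only, not the code in this PR.
func runRecovering(ctx context.Context, start func(context.Context) error) (err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("wrapped service panicked: %v", rec)
		}
	}()
	return start(ctx)
}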


ferglor marked this pull request as ready for review September 15, 2023 11:05
)

const (
	panicRestartWait = 10 * time.Second
Collaborator

Can we discuss whether we really want to recover from panics instead of failing hard? Is this a pattern used in other places in the core node?
