Investigate test_basebackup_with_high_slru_count failure #7586

Open
jcsp opened this issue May 2, 2024 · 5 comments
Assignees
Labels
a/test Area: related to testing c/storage/pageserver Component: storage: pageserver

Comments

@jcsp
Contributor

jcsp commented May 2, 2024

No description provided.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing labels May 2, 2024
@VladLazar
Contributor

I looked at the failure Joonas linked. It's a failure to start up after a SIGQUIT stop for both the pageserver (ps) and the safekeeper (sk); we do that stop as part of the test setup.

Upon starting, both the ps and sk seem to be stuck waiting for their PID file flocks (based on the logs). The fd for the pidfile should have been dropped and, in any case, we shouldn't block on the lock since we use LOCK_NB. I must be missing something.
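For reference, here is a minimal Python sketch of the behaviour expected from a non-blocking flock on a pidfile. It is not the pageserver or safekeeper code; the helper name `try_lock_pidfile` and the temporary path are made up for illustration. With LOCK_NB, a contended lock should fail immediately rather than wait, and closing the descriptor should release it.

```python
import fcntl
import os
import tempfile


def try_lock_pidfile(path):
    """Try to take an exclusive, non-blocking flock on `path` (hypothetical helper).

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail right away instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")
    first = try_lock_pidfile(path)    # takes the lock
    second = try_lock_pidfile(path)   # separate open(): contends with `first`
    os.close(first)                   # closing the only descriptor releases the lock
    third = try_lock_pidfile(path)    # succeeds again
    os.close(third)
    return first is not None, second is None, third is not None
```

One detail that may or may not matter here: a flock belongs to the open file description, so it is released only once every descriptor referring to it is closed, including copies inherited by child processes.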

@VladLazar
Contributor

VladLazar commented May 3, 2024

@jcsp
Contributor Author

jcsp commented May 13, 2024

This week:

  • eyeball code again
  • add more logs

@jcsp
Contributor Author

jcsp commented May 23, 2024

Seen in this benchmark run: https://neon-github-public-dev.s3.amazonaws.com/reports/main/9204710983/index.html

There are two possibly-related failures:

  • test_basebackup_with_high_slru_count is timing out on pageserver startup, where the pageserver log prints the version and then nothing else: this may mean it is stuck on its flock call for a pidfile.
  • test_download_churn is timing out on pageserver shutdown, where the pageserver prints that it has handled the signal, but neon_local does not see the process disappear.
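As a point of comparison for the shutdown case, the sketch below shows one common way a supervisor can poll for a process to disappear. This is an assumption for illustration, not a description of how neon_local does it; `pid_exists` and `wait_for_exit` are hypothetical names. It also shows that a child which has exited but has not yet been reaped by its parent still counts as existing under a signal-0 probe.

```python
import os
import subprocess
import sys
import time


def pid_exists(pid):
    """Return True if `pid` still refers to a process (including a zombie)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def wait_for_exit(pid, timeout_s, poll_s=0.05):
    """Poll until `pid` is gone or the timeout elapses; True means it is gone."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return True
        time.sleep(poll_s)
    return not pid_exists(pid)


def demo():
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Block until the child has exited, but leave it un-reaped (WNOWAIT).
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    seen_before_reap = pid_exists(child.pid)  # the zombie is still visible
    child.wait()                              # reap it
    gone_after_reap = wait_for_exit(child.pid, timeout_s=1.0)
    return seen_before_reap, gone_after_reap
```

Whether anything like this is involved in the test_download_churn timeout is not established by the logs quoted above.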


3 participants