Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

satellite/overlay decrease NodeCheckInWaitPeriod #6795

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

littleskunk
Copy link
Member

What: Decrease NodeCheckInWaitPeriod from 2h down to 1h10m

Why: NodeCheckInWaitPeriod is designed to suppress unnecessary DB commits. At the same time, we want to allow the storage node to update its last_contact_success once every 2 hours. It turns out as long as the value is set to 2 hours the storage node has a high chance to get last_contact_success updates once every 3 hours because the second check-in is almost perfect on the 2 hour mark. This is a bit unfair. It means with just 1 hour of downtime the storage node might end up with an last_contact_success that is more than 4 hours old. At that point repair kicks in and moves pieces away from the node.

Reducing NodeCheckInWaitPeriod to 1h10m will make sure the node is allowed to checkin every 2 hours and even accounts for a possible restart after 1h30m or so. It would still commit the node checkin.

Please describe the tests:

  • Test 1:
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • NEW: Are there any Satellite database migrations? Are they forwards and backwards compatible?
  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

@cla-bot cla-bot bot added the cla-signed label Feb 21, 2024
@onionjake
Copy link

Does it already have jitter as well? Should probably have some random jitter to prevent thundering herds?

@littleskunk
Copy link
Member Author

Does it already have jitter as well? Should probably have some random jitter to prevent thundering herds?

Partially yes and partially no but in any case that is not connected to my pull request. I don't change any of that code. If too many nodes checkin all at the same time that might cause problems no matter what the config value is set to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants