Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix head stats and hooks when replaying a corrupted snapshot #14079

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

alanprot
Copy link
Contributor

@alanprot alanprot commented May 10, 2024

When loading a snapshot and encountering a corrupted chunk, we discard previously loaded series from the snapshot and resort to replaying the wall. In such cases, we were not resetting the number of series in the head, leading to double counting them.

Additionally, we did not invoke the PostDeletion hook when resetting the memory - this needs to be called as the PostCreation was called for the series which we were able to replay from the snapshot but were subsequently discarded.

@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 3 times, most recently from 8b7f8e5 to f8d34ad Compare May 10, 2024 23:19
Signed-off-by: alanprot <alanprot@gmail.com>
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from f8d34ad to 6ee4190 Compare May 10, 2024 23:23
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks make sense to me.

In Cortex, we rely on series lifecycle callback to keep track of active series. We hit this series double counting bug when getting a corrupted chunk and active series reached limit because post delete hook was not called.

@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 2 times, most recently from 15e7114 to cb8c835 Compare May 13, 2024 17:43
Signed-off-by: alanprot <alanprot@gmail.com>
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from cb8c835 to 61bf6b6 Compare May 13, 2024 18:19
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this; looks fairly good but some comments from the bug-scrub meeting.

totalSeries += len(deletedForCallback)
deletedFromPrevStripe = len(deletedForCallback)
}
s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a no-op because it is overwritten on line 319?

Copy link
Contributor Author

@alanprot alanprot May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. it is.. i just wanted to do that for correctness of the "reset" method - reset seems that we are resetting the struct to the initial state (empty) and could be reused. WDYT?

I can no do that and rename the method to something else (maybe flush? clean?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code:

  • Renamed the method from reset to flush
  • Removed the extra logic to clean the series

tsdb/head.go Outdated
func (s *stripeSeries) reset() int {
deletedFromPrevStripe := 0
totalSeries := 0
for i := 0; i < s.size; i++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this code be shared with gc() around line 1960?
I.e. via some refactoring. To avoid diverging bugs going forward.

Copy link
Contributor Author

@alanprot alanprot May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. I created a s.iter method that is now shared btw both methods.

@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) {
require.NoError(t, err)
f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0)
require.NoError(t, err)
_, err = f.WriteAt([]byte{0b11111111}, 18)
// lets corrupt middle of the snapshot, so we can replay some entries
_, err = f.WriteAt([]byte{0b11111111}, 300)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change from 18 to 300 significant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. it is..

If we corrupt the byte 18, we will not be able to restore any series (and so we cannot see the problem).

Corrupting the byte 300, we are able to restore 2 timeseries before reaching the corrupted position, and so, highlight the problem.

Signed-off-by: alanprot <alanprot@gmail.com>
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from 0b2ec4b to 70a26fd Compare May 14, 2024 16:24
@alanprot
Copy link
Contributor Author

@bboreham PTAL?

@alanprot alanprot requested a review from bboreham May 21, 2024 17:11
@jeromeinsf
Copy link

@codesome is this something you could help review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants