New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix head stats and hooks when replaying a corrupted snapshot #14079
base: main
Are you sure you want to change the base?
Fix head stats and hooks when replaying a corrupted snapshot #14079
Conversation
8b7f8e5
to
f8d34ad
Compare
Signed-off-by: alanprot <alanprot@gmail.com>
f8d34ad
to
6ee4190
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks make sense to me.
In Cortex, we rely on series lifecycle callback to keep track of active series. We hit this series double counting bug when getting a corrupted chunk and active series reached limit because post delete hook was not called.
15e7114
to
cb8c835
Compare
Signed-off-by: alanprot <alanprot@gmail.com>
cb8c835
to
61bf6b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this; looks fairly good but some comments from the bug-scrub meeting.
totalSeries += len(deletedForCallback) | ||
deletedFromPrevStripe = len(deletedForCallback) | ||
} | ||
s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a no-op because it is overwritten on line 319?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.. it is.. i just wanted to do that for correctness of the "reset" method - reset seems that we are resetting the struct to the initial state (empty) and could be reused. WDYT?
I can no do that and rename the method to something else (maybe flush? clean?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code:
- Renamed the method from reset to flush
- Removed the extra logic to clean the series
tsdb/head.go
Outdated
func (s *stripeSeries) reset() int { | ||
deletedFromPrevStripe := 0 | ||
totalSeries := 0 | ||
for i := 0; i < s.size; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this code be shared with gc()
around line 1960?
I.e. via some refactoring. To avoid diverging bugs going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense. I created a s.iter
method that is now shared btw both methods.
@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) { | |||
require.NoError(t, err) | |||
f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0) | |||
require.NoError(t, err) | |||
_, err = f.WriteAt([]byte{0b11111111}, 18) | |||
// lets corrupt middle of the snapshot, so we can replay some entries | |||
_, err = f.WriteAt([]byte{0b11111111}, 300) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the change from 18 to 300 significant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.. it is..
If we corrupt the byte 18, we will not be able to restore any series (and so we cannot see the problem).
Corrupting the byte 300, we are able to restore 2 timeseries before reaching the corrupted position, and so, highlight the problem.
Signed-off-by: alanprot <alanprot@gmail.com>
0b2ec4b
to
70a26fd
Compare
@bboreham PTAL? |
@codesome is this something you could help review? |
When loading a snapshot and encountering a corrupted chunk, we discard previously loaded series from the snapshot and resort to replaying the wall. In such cases, we were not resetting the number of series in the head, leading to double counting them.
Additionally, we did not invoke the PostDeletion hook when resetting the memory - this needs to be called as the PostCreation was called for the series which we were able to replay from the snapshot but were subsequently discarded.