Add new osiris_log:tail(Directory) function. #146

Open
kjnilsson opened this issue Oct 9, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@kjnilsson
Contributor

kjnilsson commented Oct 9, 2023

Currently the stream coordinator uses the osiris_log:overview/1 function to get the tail of each member during the writer selection process. This function does unnecessary index scanning work that the coordinator does not make use of. Instead, osiris_log:tail/1 would only get the last valid {Epoch, ChunkId}, which should be much more efficient.

In addition, it will return a "dirty" indicator to signal whether it could detect that the server node the member was running on was shut down uncleanly such that the unwritten part of the page cache was lost. This indicator can be used by the stream coordinator to adjust the selectable set to avoid electing members that do not have all the

This indicator can never be 100% accurate, but it should be able to catch the most common page cache loss scenarios by performing the following checks.

If the last index record has no trailing data itself and points to a valid chunk in the segment (CRC passes), and there is at most 1 trailing (but valid) chunk, the osiris log is considered ok. During normal operations the chunk is written before the index entry, and the RabbitMQ process could crash in between these two events, which is why a single valid trailing chunk in the segment is not indicative of page cache loss. Two or more trailing chunks would, however, indicate that blocks pointing to the index file were never flushed correctly.
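The trailing-chunk rule above can be modelled as a small decision function. This is only a sketch of the reasoning, written in Python for brevity, and not the proposed osiris_log:tail/1 implementation. The function name and its three inputs are hypothetical, and treating every case not described as ok as "dirty" is an assumption.

```python
def trailing_chunk_check(index_has_trailing_data: bool,
                         last_chunk_crc_ok: bool,
                         valid_trailing_chunks: int) -> str:
    """Classify a log as "ok" or "dirty" from the state of its last index record.

    All three parameters are hypothetical inputs that a real implementation
    would derive by reading the index and segment files.
    """
    # Partial data after the last index record: the index was not written cleanly.
    if index_has_trailing_data:
        return "dirty"
    # The last index record must point to a chunk whose CRC passes.
    if not last_chunk_crc_ok:
        return "dirty"
    # One valid chunk after the last indexed one is expected if the process
    # crashed between writing the chunk and writing its index entry.
    # Two or more suggest that index writes were lost.
    if valid_trailing_chunks > 1:
        return "dirty"
    return "ok"


if __name__ == "__main__":
    print(trailing_chunk_check(False, True, 1))
    print(trailing_chunk_check(False, True, 2))
```

The single-chunk allowance is the important design point: it keeps an ordinary process crash from being reported as page cache loss.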

An empty index where there is no corresponding segment file is also considered ok because, when a segment fills and we need to open a new one, the index file is written first. If there is a segment file (empty or not) and the index does not even have its index header, we consider this indicative of page cache loss.
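The index/segment presence rule can be sketched in the same way. Again this is an illustrative Python model rather than osiris code. The parameter names are hypothetical, "empty index" is assumed here to mean an index with no records, and returning None for cases this rule does not decide is an assumption.

```python
from typing import Optional


def index_segment_check(index_has_header: bool,
                        index_record_count: int,
                        segment_exists: bool) -> Optional[str]:
    """Apply the index/segment presence rule.

    Returns "ok" or "dirty" when this rule alone decides the outcome,
    or None when other checks are needed.
    """
    # A segment file exists (empty or not) but the index lacks even its
    # header: taken as indicative of page cache loss.
    if segment_exists and not index_has_header:
        return "dirty"
    # The index file is written first when rolling over to a new segment,
    # so an empty index without a segment file is a normal state.
    if index_record_count == 0 and not segment_exists:
        return "ok"
    # Not decided by this rule.
    return None


if __name__ == "__main__":
    print(index_segment_check(True, 0, False))
    print(index_segment_check(False, 0, True))
```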

The most common scenario indicating page cache loss is most likely an index record pointing to missing segment data, but this check may well need to be refined and evolve over time, based on actual testing with different real file systems and storage types.
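As a rough illustration of how a coordinator might use the indicator to adjust the selectable set, the sketch below drops dirty members before choosing a candidate. The data shape, the function name, and the rule of picking the highest (epoch, chunk id) are all assumptions made for the example; the actual writer selection logic is not described in this issue.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shape: member name -> (tail, dirty), where tail is an
# (epoch, chunk_id) pair or None for an empty log.
Tail = Optional[Tuple[int, int]]
Tails = Dict[str, Tuple[Tail, bool]]


def select_writer(tails: Tails) -> Optional[str]:
    """Pick the clean member with the highest (epoch, chunk_id) tail."""
    # Remove members whose log was flagged as dirty.
    clean = {name: tail for name, (tail, dirty) in tails.items() if not dirty}
    if not clean:
        return None
    # An empty log (None) sorts below any real tail.
    return max(clean, key=lambda name: clean[name] or (-1, -1))


if __name__ == "__main__":
    members = {
        "a": ((2, 100), False),
        "b": ((2, 180), True),   # furthest ahead, but flagged dirty
        "c": ((2, 150), False),
    }
    print(select_writer(members))
```

In this example member "b" is excluded despite having the highest tail, so "c" is chosen.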

@kjnilsson kjnilsson added the enhancement New feature or request label Oct 9, 2023
@kjnilsson
Contributor Author

kjnilsson commented Oct 9, 2023

It's important to note that this check only works once. After an osiris log has been initialised against the directory, the files will have any dangling / trailing entries truncated, and next time it will return ok even if the member may not have caught up with where it was before the page cache was lost.

Still, it adds more steps to recreate failure scenarios, pushing the boat farther out.
