sheep: cancel recovery if failed to fetch object list #371
At the first stage of recovery, each sheep node sends a GET_OBJ_LIST request to all other nodes in the cluster to prepare a list of the objects that should be recovered. The list can be incomplete if any of the requests fails. In such a case, the node should not send a COMPLETE_RECOVERY notification to the cluster; otherwise the cluster can lose some objects once all the nodes have sent that notification.
This commit resolves the issue by "canceling" recovery, instead of "finishing" it, when any of the GET_OBJ_LIST requests fails. Once recovery on a sheep is cancelled, that sheep never sends COMPLETE_RECOVERY until another epoch-lifting recovery is started and then completed. It also sends a DISABLE_RECOVER operation to the cluster to pause ongoing recovery on other nodes.
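
The cancel-versus-finish rule described above can be illustrated with a minimal sketch. This is not the code from this commit: `struct recovery_state`, `recovery_start`, `recovery_object_list_done`, and `may_send_complete_recovery` are hypothetical names chosen for illustration, and the real sheep implementation may track this state quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node recovery state; not sheepdog's real structure. */
struct recovery_state {
    unsigned int epoch;  /* epoch of the most recently started recovery */
    bool cancelled;      /* true once an object-list fetch has failed */
};

/* Starting a recovery for a newer epoch clears an earlier cancellation. */
void recovery_start(struct recovery_state *st, unsigned int new_epoch)
{
    assert(st != NULL);
    if (new_epoch > st->epoch) {
        st->epoch = new_epoch;
        st->cancelled = false;
    }
}

/*
 * Called when the object-list stage ends. fetch_ok[i] is true when the
 * GET_OBJ_LIST request to node i succeeded. A single failure means the
 * list may be incomplete, so the recovery is cancelled, not finished.
 */
void recovery_object_list_done(struct recovery_state *st,
                               const bool *fetch_ok, size_t nr_nodes)
{
    assert(st != NULL);
    assert(fetch_ok != NULL || nr_nodes == 0);
    for (size_t i = 0; i < nr_nodes; i++) {
        if (!fetch_ok[i]) {
            st->cancelled = true;
            return;
        }
    }
}

/* A cancelled node must not send COMPLETE_RECOVERY. */
bool may_send_complete_recovery(const struct recovery_state *st)
{
    assert(st != NULL);
    return !st->cancelled;
}
```

In this sketch, a node that sees `may_send_complete_recovery()` return false would send DISABLE_RECOVER in place of COMPLETE_RECOVERY, and only a later call to `recovery_start()` with a higher epoch would make it eligible to complete again.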
This commit also fixes #363.
Signed-off-by: Takashi Menjo <menjo.takashi@lab.ntt.co.jp>