sheep: cancel recovery if failed to fetch object list #371

Open
wants to merge 1 commit into
base: master
from

Conversation

tmenjo
Contributor

@tmenjo tmenjo commented Feb 27, 2017

At the first stage of recovery, each sheep node sends a GET_OBJ_LIST request to all other nodes in the cluster to build a list of the objects that should be recovered. The list can be incomplete if any of the requests fails. In that case, the node should not send the COMPLETE_RECOVERY notification to the cluster; otherwise the cluster can lose some objects once all the nodes have sent that notification.

This commit resolves the issue by "canceling" recovery, instead of "finishing" it, when any of the GET_OBJ_LIST requests fails. Once recovery on a sheep is cancelled, that sheep never sends COMPLETE_RECOVERY until another epoch-lifting recovery is started and then completed. It also sends a DISABLE_RECOVER operation to the cluster to pause ongoing recovery on other nodes.
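
To make the intended behaviour concrete, here is a minimal sketch of the cancel-instead-of-finish logic described above. It is not the code in this patch: every name (`recovery_state`, `prepare_object_list`, `finish_recovery`, `start_recovery`) is hypothetical, and the network operations are modelled as boolean flags rather than real GET_OBJ_LIST, DISABLE_RECOVER, or COMPLETE_RECOVERY messages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is an illustration, not sheepdog code. */

enum recovery_result {
    RECOVERY_READY,     /* every peer answered: the object list is complete */
    RECOVERY_CANCELLED  /* a GET_OBJ_LIST failed: the list may be incomplete */
};

struct recovery_state {
    bool cancelled;     /* set once the object list could not be completed */
    bool disable_sent;  /* stands in for sending DISABLE_RECOVER */
    bool complete_sent; /* stands in for sending COMPLETE_RECOVERY */
};

/* A new epoch-lifting recovery starts from a clean state. */
static void start_recovery(struct recovery_state *st)
{
    st->cancelled = false;
    st->disable_sent = false;
    st->complete_sent = false;
}

/*
 * fetch_ok[i] is true if the GET_OBJ_LIST request to peer i succeeded.
 * On the first failure, cancel recovery and ask the cluster to pause.
 */
static enum recovery_result prepare_object_list(struct recovery_state *st,
                                                const bool *fetch_ok,
                                                size_t nr_peers)
{
    for (size_t i = 0; i < nr_peers; i++) {
        if (!fetch_ok[i]) {
            st->cancelled = true;
            st->disable_sent = true;
            return RECOVERY_CANCELLED;
        }
    }
    return RECOVERY_READY;
}

/* A cancelled recovery must never announce completion. */
static void finish_recovery(struct recovery_state *st)
{
    if (st->cancelled)
        return;
    st->complete_sent = true;
}
```

In this sketch, a node whose peers all answered ends with `complete_sent` set, while a node with one failed fetch ends with `cancelled` and `disable_sent` set and `complete_sent` clear. Only calling `start_recovery()` again, which models the next epoch-lifting recovery, allows completion to be announced.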

This commit also fixes #363.

Signed-off-by: Takashi Menjo <menjo.takashi@lab.ntt.co.jp>

@tmenjo tmenjo self-assigned this Feb 27, 2017
@tmenjo
Contributor Author

tmenjo commented Feb 27, 2017

Please don't merge this PR yet; it has not been tested. Comments on the log message are very welcome, though.

@tmenjo
Contributor Author

tmenjo commented Feb 27, 2017

Some functional tests failed with a diff like the one below:

-Cluster status: running, auto-recovery enabled
+Cluster status: running, auto-recovery disabled

I suspect that this patch causes the failure. I will fix it soon.

@vtolstov
Contributor

Any news on fixing this?

Development

Successfully merging this pull request may close these issues.

Object loss after recovery cancelled on hetero-disk, auto-vnodes and avoiding-diskfull cluster