sheep: cancel recovery if failed to fetch object list #371

tmenjo · 2017-02-27T09:03:36Z

In the first stage of recovery, each sheep node sends a GET_OBJ_LIST request to every other node in the cluster to build the list of objects that should be recovered. The list can be incomplete if any of those requests fails. In that case, the node should not send a COMPLETE_RECOVERY notification to the cluster; otherwise the cluster can lose some objects once all the nodes have sent that notification.

This commit resolves the issue by "canceling" recovery, instead of "finishing" it, when any of the GET_OBJ_LIST requests fails. Once recovery on a sheep is canceled, that sheep never sends COMPLETE_RECOVERY until another epoch-lifting recovery is started and then completed. It also sends a DISABLE_RECOVER operation to the cluster to pause ongoing recovery on the other nodes.
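
The cancel-versus-finish decision described above can be modeled roughly as follows. This is only an illustrative Python sketch, not the sheep code: `fetch`, `prepare_object_list`, `run_recovery`, `Outcome`, and `RecoveryResult` are hypothetical names, and the notifications are represented as plain strings rather than real cluster messages.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, List, Optional, Set, Tuple


class Outcome(Enum):
    FINISHED = "finished"    # object list was complete
    CANCELLED = "cancelled"  # at least one GET_OBJ_LIST request failed


@dataclass
class RecoveryResult:
    outcome: Outcome
    objects: Set[int] = field(default_factory=set)
    notifications: List[str] = field(default_factory=list)


# Hypothetical stand-in for a GET_OBJ_LIST request: returns the peer's
# object IDs, or None when the request failed.
Fetch = Callable[[str], Optional[Iterable[int]]]


def prepare_object_list(peers: Iterable[str], fetch: Fetch) -> Tuple[Set[int], bool]:
    """Merge the object lists of all peers; report whether every request succeeded."""
    objects: Set[int] = set()
    complete = True
    for peer in peers:
        reply = fetch(peer)
        if reply is None:
            complete = False  # the merged list may now be missing objects
            continue
        objects.update(reply)
    return objects, complete


def run_recovery(peers: Iterable[str], fetch: Fetch) -> RecoveryResult:
    objects, complete = prepare_object_list(peers, fetch)
    if not complete:
        # Cancel: never announce completion, and ask the cluster to pause recovery.
        return RecoveryResult(Outcome.CANCELLED, objects, ["DISABLE_RECOVER"])
    # (Actual object recovery would happen here.)
    return RecoveryResult(Outcome.FINISHED, objects, ["COMPLETE_RECOVERY"])
```

With a lookup table standing in for the peers, `run_recovery(["a", "b"], {"a": [1, 2], "b": [3]}.get)` finishes and records COMPLETE_RECOVERY, while adding an unreachable peer `"c"` yields a canceled result that records only DISABLE_RECOVER.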

This commit also fixes #363.

Signed-off-by: Takashi Menjo <menjo.takashi@lab.ntt.co.jp>

tmenjo · 2017-02-27T09:04:55Z

Please don't merge this PR yet. It has not been tested. However, comments on the log message are very welcome.

tmenjo · 2017-02-27T10:20:23Z

Some functional tests failed with a diff like the one below:

-Cluster status: running, auto-recovery enabled
+Cluster status: running, auto-recovery disabled

I suspect that this patch causes the failure. I will fix it soon.

vtolstov · 2017-12-19T12:57:45Z

Any news on fixing this?

tmenjo self-assigned this Feb 27, 2017

furkanmustafa mentioned this pull request Feb 11, 2018: Many problems during recovery #425 (Open)
