
The shuffle manager should restore the previously managed workers upon re-election #8

Open
jlon opened this issue Dec 6, 2021 · 6 comments


jlon commented Dec 6, 2021

The shuffle manager should restore the previously managed workers when the master is re-elected. Otherwise, in the next heartbeat cycle, no worker will be available when the job requests one, causing the job to fail. We should minimize the impact of shuffle manager failures on running jobs.
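
A minimal sketch of the recovery hook I have in mind (all names here are hypothetical, not from the current code base): when a new ShuffleManager instance gains leadership, it first re-learns the worker set instead of starting empty, so the next heartbeat cycle does not fail worker requests.

```java
import java.util.List;

/** Hypothetical hook: re-learn the worker set when leadership is gained. */
interface WorkerRecoveryStrategy {
    /** Workers the previous leader was managing, e.g. read from persistent
     *  storage or queried from the deployment environment (K8s, YARN). */
    List<String> recoverWorkerAddresses() throws Exception;
}

class ShuffleManagerLeaderHook {
    private final WorkerRecoveryStrategy recovery;

    ShuffleManagerLeaderHook(WorkerRecoveryStrategy recovery) {
        this.recovery = recovery;
    }

    /** Called when this instance is (re-)elected as the master. */
    void onGrantLeadership() throws Exception {
        for (String workerAddress : recovery.recoverWorkerAddresses()) {
            // Proactively probe each previously known worker so it
            // re-registers before the next heartbeat cycle.
            requestWorkerHeartbeat(workerAddress);
        }
    }

    private void requestWorkerHeartbeat(String workerAddress) {
        // Placeholder: send a registration/heartbeat request to the worker.
    }
}
```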

wsry (Collaborator) commented Dec 6, 2021

@jlon Do you mean that the ShuffleManager should persist its state, like all the ShuffleWorkers do, and recover that state when it restarts? Does that mean we need to depend on reliable external storage?
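
For illustration, persisting the managed-worker list could look roughly like this, assuming ZooKeeper (via Apache Curator) as the reliable external store; the path and payload below are made up for this sketch:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch: persist the managed-worker list in ZooKeeper so a restarted
 *  ShuffleManager can recover it. Paths/payloads are illustrative only. */
class ZkWorkerStateStore {
    private static final String WORKERS_PATH = "/remote-shuffle/workers";
    private final CuratorFramework client;

    ZkWorkerStateStore(String zkConnect) {
        client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    /** Record (or update) a worker's address under its id. */
    void saveWorker(String workerId, String address) throws Exception {
        client.create().orSetData().creatingParentsIfNeeded().forPath(
                WORKERS_PATH + "/" + workerId,
                address.getBytes(StandardCharsets.UTF_8));
    }

    /** Read back all worker addresses after a restart. */
    List<String> recoverWorkerAddresses() throws Exception {
        List<String> addresses = new ArrayList<>();
        if (client.checkExists().forPath(WORKERS_PATH) == null) {
            return addresses; // nothing persisted yet
        }
        for (String workerId : client.getChildren().forPath(WORKERS_PATH)) {
            byte[] data = client.getData().forPath(WORKERS_PATH + "/" + workerId);
            addresses.add(new String(data, StandardCharsets.UTF_8));
        }
        return addresses;
    }
}
```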

jlon (Author) commented Dec 8, 2021

@wsry I would like to contribute this feature. May I?

wsry (Collaborator) commented Dec 8, 2021

@jlon I am not sure if I understand your concern correctly, but I have some reservations about persisting and recovering the ShuffleManager state, because it may introduce extra complexity (a dependency on external storage?). I wonder if a ShuffleManager standby solution would be better. A standby solution can also enhance standalone deployment, which means we would not always rely on an external system (YARN, K8s) to start up a new ShuffleManager instance.
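
To make the standby idea concrete, here is a rough sketch of the election part, assuming ZooKeeper-based leader election via Curator's LeaderLatch (the path and class names are illustrative):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

/** Sketch: several ShuffleManager instances run concurrently; only the
 *  elected leader serves requests, the rest stay warm as standbys. */
class ShuffleManagerElection {
    private final LeaderLatch latch;

    ShuffleManagerElection(CuratorFramework client) {
        latch = new LeaderLatch(client, "/remote-shuffle/manager-leader");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // This standby just became the active manager: start serving
                // and re-learn the worker set (see the recovery sketch above).
            }

            @Override
            public void notLeader() {
                // Lost leadership: stop serving and fall back to standby mode.
            }
        });
    }

    void start() throws Exception {
        latch.start();
    }
}
```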

wsry (Collaborator) commented Dec 8, 2021

@jlon BTW, I have sent you a friend request on DingTalk; we can also discuss it offline.

gaoyunhaii (Collaborator) commented

> In the next heartbeat cycle, the job will not be available when the worker is requested, causing the job to fail.

One more point to add: I think we may have to rely on retrying to solve this issue, unless we can ensure that an online shuffle manager is available at all times, which might not be guaranteed even with persistent storage.
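
As a rough sketch of the retry idea (purely illustrative, not existing code), the requesting side could back off and retry instead of failing the job on the first unanswered request:

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Sketch: retry a manager request with exponential backoff so a short
 *  leaderless window does not immediately fail the job. */
final class RetryUtil {
    /** maxAttempts must be >= 1. */
    static <T> T retryWithBackoff(Supplier<T> request, int maxAttempts)
            throws InterruptedException {
        Duration backoff = Duration.ofMillis(500);
        RuntimeException lastError = new IllegalStateException("no attempts made");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get(); // e.g. ask the manager for a worker
            } catch (RuntimeException e) {
                lastError = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff.toMillis());
                    backoff = backoff.multipliedBy(2); // exponential backoff
                }
            }
        }
        throw lastError;
    }
}
```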

jlon (Author) commented Dec 10, 2021

@gaoyunhaii In k8s mode, when the ShuffleManager is relaunched, we can query the list of worker pods under a fixed label through the Kubernetes API server. We can also obtain the IP of each worker pod, so the manager can actively contact every worker in that list and request a heartbeat. In this way, the previously managed workers can be restored in time. However, I have not yet figured out how to query the previous containers in the YARN environment.
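
For example, with the fabric8 Kubernetes client (the namespace and label below are placeholders), the relaunched ShuffleManager could rediscover the worker pods roughly like this:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.ArrayList;
import java.util.List;

/** Sketch: query the API server for worker pods by a fixed label and collect
 *  their pod IPs so the manager can proactively probe each worker. */
class K8sWorkerDiscovery {
    List<String> discoverWorkerIps() {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            List<String> ips = new ArrayList<>();
            for (Pod pod : client.pods()
                    .inNamespace("remote-shuffle")       // placeholder namespace
                    .withLabel("app", "shuffle-worker")  // placeholder label
                    .list()
                    .getItems()) {
                ips.add(pod.getStatus().getPodIP());
            }
            return ips;
        }
    }
}
```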
