
Why is time.sleep(2) needed in EpochBasedRunner? When will the deadlock happen? #1640

Open · Bilibilee opened this issue Jan 3, 2022 · 5 comments

Bilibilee (Contributor) commented Jan 3, 2022

Why is time.sleep(2) needed at https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/epoch_based_runner.py#L46?

The comment there says: "Prevent possible deadlock during epoch transition."

When would this possible deadlock happen? More importantly, time.sleep(2) is neither an elegant nor a flexible operation: it wastes time on small datasets. I think it needs to be redesigned.
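For context, a minimal sketch of the surrounding code (heavily simplified; only the time.sleep(2) line and its comment come from the linked file, everything else is illustrative):

```python
import time

class EpochBasedRunnerSketch:
    """Illustrative stand-in for mmcv's EpochBasedRunner, not the real class."""

    def __init__(self, model, data_loader):
        self.model = model
        self.data_loader = data_loader

    def train(self):
        self.model.train()
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        for i, data_batch in enumerate(self.data_loader):
            pass  # forward/backward/optimizer step happen here in the real runner
```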

Bilibilee changed the title from "hi,I have no idea about the time.sleep(2) in EpochBasedRunner" to "why need time.sleep(2) in EpochBasedRunner ? when the deadlock will happen ?" on Jan 3, 2022
zhouzaida (Member) commented Jan 4, 2022

Hi, this is a workaround to resolve a possible deadlock in the dataloader. More details can be found at pytorch/pytorch#1355 (comment). We will look for a more elegant way to resolve the problem.

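As a concrete illustration of the pattern that workaround targets, here is a minimal, self-contained sketch (an assumption based on the linked PyTorch thread, not code from mmcv): each new epoch's iteration over a num_workers > 0 DataLoader spawns fresh worker processes while the previous epoch's workers may still be shutting down, and the sleep gives them time to exit.

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(3)

if __name__ == '__main__':
    loader = DataLoader(ToyDataset(), batch_size=8, num_workers=4)

    for epoch in range(3):
        # Starting a new iteration over the loader spawns fresh worker
        # processes while the previous epoch's workers may still be
        # tearing down; the sleep gives them time to exit cleanly,
        # which is the race the mmcv comment refers to.
        time.sleep(2)
        for batch in loader:
            pass  # the training step would go here
```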

JihwanEom commented Jul 8, 2022

Hi @zhouzaida,
Can the deadlock also occur with single-GPU training? I'm wondering about the root cause of inserting time.sleep(2) in mmcv.

zhouzaida (Member) commented

Hi @JihwanEom, in most cases the deadlock will not happen, so this line can be removed from your local copy of mmcv, which will speed up your training.
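A hypothetical way to try this without editing the installed package (runner, data_loaders, and cfg.workflow are placeholder names from a typical mmcv training script, not taken from this thread) is to no-op time.sleep around the run call:

```python
from unittest import mock

# Caution: this no-ops *every* time.sleep call made while the context is
# active, not just the one in EpochBasedRunner, so only use it if nothing
# else in your training loop depends on sleeping.
with mock.patch('time.sleep'):
    runner.run(data_loaders, cfg.workflow)
```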

JihwanEom commented

Okay, I got it. But could you explain the situation in which the deadlock is expected?
I want to resolve this by analyzing the root cause; I can't imagine when removing time.sleep(2) could risk a deadlock.

supersoob commented Nov 8, 2022

Hi, @zhouzaida. As @JihwanEom said, I wonder in which cases the deadlock can occur when time.sleep(2) is removed. Even when I ran multi-GPU training with time.sleep(2) disabled, I never hit the deadlock and everything worked well. I'm trying to get the benefits of multi-GPU training, but sleep(2) looks like a bottleneck for overall training time. Could you explain the specific scenario (e.g., dataset/pipeline) in which a deadlock may happen if the sleep is removed, and also why this exact line, time.sleep(2), resolves it?
