Failed to converge #60

Open
iyu-Fang opened this issue Oct 1, 2020 · 9 comments

iyu-Fang commented Oct 1, 2020

Hi,
Thank you for your work.

When I tried to train the model with the parameters you provide in configs.yaml, I could not reproduce your results on the Market dataset. When I use visual_tools to show the rainbow image, the generated images have wrong colors (the colors are even different from the input images). I then checked the losses in TensorBoard and found that the total loss, as well as the id loss, surged to very high values at around 30k iterations and never converged afterwards. However, when I downloaded the best model you provide and tested it in the same way, it works well.

Please give me some advice.

layumi (Contributor) commented Oct 4, 2020

Hi @iyu-Fang
At 30k iterations we start to apply the teacher loss (gradually):

if iteration > hyperparameters['warm_teacher_iter']:

  1. Did you change any parameters in configs, such as the batch size or lrRate?
  2. You may try tuning down the lrRate.
  3. Check the teacher model. Does it work well?
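
For context, here is a minimal sketch of how a gradual teacher-loss warm-up around `warm_teacher_iter` could look. The linear ramp and the key name `max_teacher_w` are assumptions based on the config keys mentioned in this thread; the actual DG-Net schedule may differ.

```python
# Hypothetical sketch of a gradual teacher-loss warm-up, based on the
# warm_teacher_iter check quoted above. The linear ramp and the key name
# max_teacher_w are assumptions; the real DG-Net schedule may differ.
def teacher_weight(iteration, hyperparameters):
    warm_iter = hyperparameters['warm_teacher_iter']
    max_w = hyperparameters['max_teacher_w']
    if iteration <= warm_iter:
        return 0.0  # teacher loss is off before the warm-up point
    # ramp linearly from 0 up to max_w over another warm_iter iterations
    ramp = min(1.0, (iteration - warm_iter) / float(max(warm_iter, 1)))
    return max_w * ramp

# total_loss = other_losses + teacher_weight(iteration, hyperparameters) * teacher_loss
```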

iyu-Fang (Author) commented Oct 9, 2020

Hi @layumi
Thank you for your advice.

Since I did not modify the batch size, I tried to check the performance of the teacher model. Testing the (best) model you provide, I got 0.81 Rank@1 and 0.54 mAP. However, even if I retrain a new teacher model that works well, DG-Net still cannot converge. By the way, I have also set max_teacher_w to 0.2, and it still performs badly so far.

I would appreciate it if you could give me some further suggestions.
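
For anyone double-checking their setup, a small illustrative snippet for inspecting the values discussed in this thread. The file name and key names (lr, batchsize, max_teacher_w, warm_teacher_iter) are assumptions and may not match the actual configs.yaml layout.

```python
# Illustrative only: print the config values discussed in this thread.
# Key names below are assumptions and may differ from the real configs.yaml.
import yaml

with open('configs.yaml') as f:
    config = yaml.safe_load(f)

for key in ('lr', 'batchsize', 'max_teacher_w', 'warm_teacher_iter'):
    print(key, config.get(key, '<missing>'))
```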

layumi (Contributor) commented Oct 9, 2020

Hi @iyu-Fang
That teacher model performance is not right. Please check your numpy version:

https://github.com/layumi/Person_reID_baseline_pytorch#prerequisites
Some reports found that updating numpy can achieve the right accuracy. If you only get 50~80 Top-1 accuracy, just try it. We have successfully run the code with numpy 1.12.1 and 1.13.1.

iyu-Fang (Author) commented:

I don't think that's the problem. My numpy version is 1.19.1. Could you tell me the exact versions of your environment (numpy, pytorch, etc.) when you ran your experiments?

layumi (Contributor) commented Oct 10, 2020

Hi @iyu-Fang
Could you try to run https://github.com/layumi/Person_reID_baseline_pytorch and check the result?

iyu-Fang (Author) commented:

@layumi
Actually, that's exactly how I tested. The best model you provide got 0.810 Rank@1 and 0.543 mAP, while a re-trained model (ResNet-50, all tricks) achieved 0.914 Rank@1 and 0.778 mAP. But even when I use the re-trained model as the teacher model, DG-Net still cannot converge.
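
For reference, a generic sketch of how Rank@1 and mAP are typically computed for re-ID from a query-gallery distance matrix. This is not the exact evaluation script in Person_reID_baseline_pytorch, and the usual same-camera/junk filtering is omitted.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    # dist: (num_query, num_gallery) distance matrix, smaller = more similar
    rank1_hits, aps = 0, []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                    # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[i]   # True where identity matches
        if matches[0]:
            rank1_hits += 1
        hit_ranks = np.where(matches)[0]
        if len(hit_ranks) == 0:
            continue
        # average precision over the ranked list (no interpolation)
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())
    return rank1_hits / dist.shape[0], float(np.mean(aps))
```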

layumi (Contributor) commented Oct 12, 2020

@iyu-Fang Did you run the model on Market-1501 or on another dataset?
Did you load the model config correctly?

iyu-Fang (Author) commented:

@layumi Thank you for your quick response. Yes, I ran the model on the Market-1501 dataset. As for the config, the one for the best model you provide does not contain the use_NAS parameter, so I added it to the config and set it to false. Nothing else was changed.
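
A small sketch of the kind of change described above, i.e. supplying a default for a missing use_NAS key after loading the provided config. The loading code here is illustrative, not the actual DG-Net test script.

```python
import yaml

with open('configs.yaml') as f:   # config shipped with the best model
    config = yaml.safe_load(f)

# The provided config may predate the use_NAS option, so fall back to
# False when the key is absent, as described in the comment above.
config.setdefault('use_NAS', False)
```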

layumi (Contributor) commented Oct 12, 2020

@iyu-Fang
The teacher model should achieve about 89.6% Rank@1 and 74.5% mAP. I am not sure whether there are any other differences.
