
Error with main.py #1

Open · chituma110 opened this issue Mar 19, 2019 · 11 comments

@chituma110

main.py --train --exp lr7e-3 --epochs 50 --base_lr 0.007

raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size [1, 256, 1, 1]

@chenxi116
Owner

This seems to happen when a particular data batch has size 1; BN then breaks down.

Which dataset are you training on? And what is your batch size?
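If a stray size-1 batch is the cause, one common workaround is to drop the last incomplete batch in the data loader. This is just a sketch, not the loader code in this repo; `train_dataset` and the worker count are placeholders:

```python
from torch.utils.data import DataLoader

# Dropping the last incomplete batch guarantees every training batch has
# more than one sample, so BatchNorm never sees a size-1 batch.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                          num_workers=4, drop_last=True)
```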

@chituma110
Author

> This seems to happen when a particular data batch has size 1; BN then breaks down.
>
> Which dataset are you training on? And what is your batch size?

Dataset: PASCAL VOC 2012
Batch size: 16

@chenxi116
Owner

Interesting. I was using this code very recently, but didn't encounter this problem.

Does your problem occur for the first batch? Or the last batch of the epoch?

Also, can you confirm that len(dataset) == 10582?

@chituma110
Author

The error occurred at the last iteration of the first epoch.

@chenxi116
Owner

You did not answer my last question, which is about dataset length.

You need to measure the last batch's batch size, which may not equal 16. My guess is that it is somehow 1, which causes the error.

@chituma110
Author

[screenshot: QQ截图20190320145244]
I just ran the code; len(dataset) == 10582.

@chituma110
Author

Interesting! When I changed CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 to CUDA_VISIBLE_DEVICES=4,5,6,7, the code ran successfully.
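A rough check of the arithmetic seems consistent with this; the sketch below assumes 10582 training images, batch size 16, and that nn.DataParallel splits a batch into chunks of ceil(batch / num_gpus) along dim 0:

```python
import math

dataset_len, batch_size = 10582, 16
last_batch = dataset_len % batch_size          # 10582 % 16 == 6 samples
for num_gpus in (8, 4):
    per_gpu = math.ceil(last_batch / num_gpus)
    print(f"{num_gpus} GPUs -> per-GPU chunk of {per_gpu}")
# With 8 GPUs the last batch is scattered as single-sample chunks, which
# BatchNorm rejects in training mode; with 4 GPUs each chunk has 2 samples.
```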

@chituma110
Author

[screenshot: QQ截图20190320223533]

Interesting. After running main.py with 4 GPUs, the mean IoU is only 75.75%, not 77.14%.

@chenxi116
Owner

I recommend training with one GPU, because "DataParallel" in PyTorch does not synchronize BN statistics across devices. By using 4 GPUs with batch size 16, you are effectively computing BN statistics with a per-GPU batch size of 16 / 4 = 4, and BN statistics generally get better as the batch size grows.

I have used this code recently, and if you use one GPU, this number should be at least 76.50%.
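If you do want multi-GPU training with synchronized BN statistics, more recent PyTorch versions offer SyncBatchNorm together with DistributedDataParallel instead of DataParallel. This is only a minimal sketch, not part of this repo's code:

```python
import torch.nn as nn

def wrap_with_sync_bn(model: nn.Module, local_rank: int) -> nn.Module:
    # Assumes torch.distributed.init_process_group("nccl") has already been
    # called (e.g. via the distributed launcher) and that this process owns
    # GPU `local_rank`.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```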

@XUYUNYUN666

> [screenshot: QQ截图20190320223533]
>
> Interesting. After running main.py with 4 GPUs, the mean IoU is only 75.75%, not 77.14%.

Could I have your QQ number? I really want your help. Thank you very much.

@ShristiDasBiswas

Hi, could you tell me the hyperparameters you used for training?
