Only return 1 class when training COVIDNet with COVIDx4 dataset #182

Open
gwang-kim opened this issue Jul 21, 2021 · 5 comments

Comments

@gwang-kim

Description

The model returns only one class when fine-tuning COVIDNet, and the loss explodes.

Steps to Reproduce

I downloaded the COVIDx4 dataset and tried training COVIDNet on it.
However, at inference the model returned only one class and performance was poor.
I logged the loss at every step during training, and it exploded to several thousand after 1 epoch.
I tried both training from scratch and fine-tuning.
How can I train your model stably?
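
One way to confirm the single-class collapse described above is to count the predicted classes on the test set. This is only a minimal sketch with NumPy: `probs` stands in for the model's softmax outputs, and `predicted_class_counts` and `is_collapsed` are hypothetical helpers, not part of the COVID-Net code.

```python
import numpy as np


def predicted_class_counts(probs, n_classes):
    """Count how many samples are assigned to each class by argmax."""
    preds = np.argmax(probs, axis=1)
    return np.bincount(preds, minlength=n_classes)


def is_collapsed(probs, n_classes):
    """Return True when every sample is assigned to the same class."""
    counts = predicted_class_counts(probs, n_classes)
    return int(np.count_nonzero(counts)) == 1


# Made-up softmax outputs for four samples and two classes (illustrative only).
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
print(predicted_class_counts(probs, 2))  # [4 0]
print(is_collapsed(probs, 2))  # True
```

If the counts are all in one bin for both the training and test sets, the problem is in the trained weights rather than in the evaluation data.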

Expected behavior

The model trains stably.

Actual behavior

The model returned only one class and performance was poor.

Environment

Ubuntu 18.04
tensorflow-gpu 1.15
I also followed requirements.txt.

@sabuj7177

Hi @gwang-kim, have you found a solution to this issue? I am facing a similar issue.

This is the training log for the first few epochs:

Output: ./output/COVIDNet-lr0.0002
Dataset length
15952
13794 2158
Saved baseline checkpoint
Baseline eval:
[[194. 6.]
[ 9. 191.]]
Sens Negative: 0.970, Positive: 0.955
PPV Negative: 0.956, Positive: 0.970
Training started
1725/1725 [==============================] - 2445s 1s/step
Epoch: 0001 Minibatch loss= 370.629089355
[[200. 0.]
[200. 0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 1
1725/1725 [==============================] - 4858s 3s/step
Epoch: 0002 Minibatch loss= 3805.902343750
[[199. 1.]
[200. 0.]]
Sens Negative: 0.995, Positive: 0.000
PPV Negative: 0.499, Positive: 0.000
Saving checkpoint at epoch 2
1725/1725 [==============================] - 7348s 4s/step
Epoch: 0003 Minibatch loss= 12214.270507812
[[195. 5.]
[199. 1.]]
Sens Negative: 0.975, Positive: 0.005
PPV Negative: 0.495, Positive: 0.167
Saving checkpoint at epoch 3
1725/1725 [==============================] - 9727s 6s/step
Epoch: 0004 Minibatch loss= 28461.550781250
[[200. 0.]
[200. 0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 4
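
For readers checking the numbers, the Sens and PPV lines in the log can be reproduced from the confusion matrices. The sketch below assumes rows are true labels and columns are predictions (this matches the values in the log), and it reports 0 when a class is never predicted, as the log does. `sens_ppv` is a hypothetical helper, not a function from the repo.

```python
import numpy as np


def sens_ppv(cm):
    """Per-class sensitivity and PPV from a confusion matrix.

    Assumes rows are true labels and columns are predicted labels.
    A class with no samples or no predictions gets 0 instead of NaN.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    n_true = cm.sum(axis=1)  # samples per true class
    n_pred = cm.sum(axis=0)  # samples per predicted class
    sens = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    ppv = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    return sens, ppv


# Baseline evaluation and epoch 1, copied from the log above.
baseline = [[194, 6], [9, 191]]
epoch_1 = [[200, 0], [200, 0]]

for name, cm in [("baseline", baseline), ("epoch 1", epoch_1)]:
    sens, ppv = sens_ppv(cm)
    print(name, np.round(sens, 3), np.round(ppv, 3))
```

The epoch 1 matrix gives a PPV of 0.5 for the negative class only because the test set is balanced (200 per class) and everything is predicted negative.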

This is the command I used (taken from the training instructions):

python train_tf.py \
    --weightspath models/COVIDNet-CXR-2 \
    --metaname model.meta \
    --ckptname model \
    --n_classes 2 \
    --trainfile labels/train_COVIDx8B.txt \
    --testfile labels/test_COVIDx8B.txt \
    --out_tensorname norm_dense_2/Softmax:0 \
    --logit_tensorname norm_dense_2/MatMul:0

Environment:
Ubuntu 20.04 LTS
tensorflow-gpu

I built the dataset by following the dataset generation instructions.

Thank you.

@gwang-kim
Author

gwang-kim commented Jul 27, 2021

Hi @sabuj7177,
Oh, it's almost the same situation as mine.
Unfortunately, I haven't found a solution. I tried adjusting hyperparameters such as the learning rate, but it didn't help.
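
When comparing runs with different learning rates, it may help to pull the per-epoch loss out of the training log and check how fast it grows. The following is only a sketch: it assumes log lines in the format posted earlier in this thread (`Epoch: 0001 Minibatch loss= 370.629089355`), `parse_epoch_losses` and `is_diverging` are hypothetical helpers, and the growth factor of 2 is an arbitrary threshold chosen for illustration.

```python
import re

# Matches lines such as "Epoch: 0001 Minibatch loss= 370.629089355".
EPOCH_LOSS = re.compile(r"Epoch:\s*(\d+)\s+Minibatch loss=\s*([0-9.]+)")


def parse_epoch_losses(log_text):
    """Return a list of (epoch, loss) tuples found in the log text."""
    return [(int(e), float(l)) for e, l in EPOCH_LOSS.findall(log_text)]


def is_diverging(losses, factor=2.0):
    """True if each epoch's loss is at least `factor` times the previous one."""
    values = [loss for _, loss in losses]
    return len(values) > 1 and all(
        b >= factor * a for a, b in zip(values, values[1:])
    )


# Loss lines copied from the log posted in this thread.
log_text = """
Epoch: 0001 Minibatch loss= 370.629089355
Epoch: 0002 Minibatch loss= 3805.902343750
Epoch: 0003 Minibatch loss= 12214.270507812
Epoch: 0004 Minibatch loss= 28461.550781250
"""

losses = parse_epoch_losses(log_text)
print(is_diverging(losses))  # True
```

Running the same check on logs from each learning rate makes it easier to see whether any setting slows the growth at all.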

If you solve the problem, please let me know!

Thank you

@sabuj7177

Hi @lindawangg @haydengunraj,
Can you please suggest a workaround for this issue, or point out what I am doing wrong?

@SmallFan7

Hi @gwang-kim, have you found a solution to this issue? I am facing a similar issue.

This is the training log for the first few epochs:

Output: ./output/COVIDNet-lr0.0002
Dataset length
15952
13794 2158
Saved baseline checkpoint
Baseline eval:
[[194. 6.]
[ 9. 191.]]
Sens Negative: 0.970, Positive: 0.955
PPV Negative: 0.956, Positive: 0.970
Training started
1725/1725 [==============================] - 2445s 1s/step
Epoch: 0001 Minibatch loss= 370.629089355
[[200. 0.]
[200. 0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 1
1725/1725 [==============================] - 4858s 3s/step
Epoch: 0002 Minibatch loss= 3805.902343750
[[199. 1.]
[200. 0.]]
Sens Negative: 0.995, Positive: 0.000
PPV Negative: 0.499, Positive: 0.000
Saving checkpoint at epoch 2
1725/1725 [==============================] - 7348s 4s/step
Epoch: 0003 Minibatch loss= 12214.270507812
[[195. 5.]
[199. 1.]]
Sens Negative: 0.975, Positive: 0.005
PPV Negative: 0.495, Positive: 0.167
Saving checkpoint at epoch 3
1725/1725 [==============================] - 9727s 6s/step
Epoch: 0004 Minibatch loss= 28461.550781250
[[200. 0.]
[200. 0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 4

This is the command I used (taken from the training instructions):

python train_tf.py \
    --weightspath models/COVIDNet-CXR-2 \
    --metaname model.meta \
    --ckptname model \
    --n_classes 2 \
    --trainfile labels/train_COVIDx8B.txt \
    --testfile labels/test_COVIDx8B.txt \
    --out_tensorname norm_dense_2/Softmax:0 \
    --logit_tensorname norm_dense_2/MatMul:0

Environment:
Ubuntu 20.04 LTS
tensorflow-gpu

I built the dataset by following the dataset generation instructions.

Thank you.

I am in the same situation as you. Has the problem been solved?

@gwang-kim
Author

@SmallFan7 Not yet. I think it's just a limitation of this work.
