train loss and validation loss tending to 0 #494

Closed
Ridhwanluthra opened this issue Dec 30, 2017 · 15 comments

Comments

@Ridhwanluthra

I added the functionality to print validation loss based on the code given by #264 (comment)

  • I have a single label in my problem.
  • I have edited the cfg file correctly, with classes set to 1 and filters in the second-to-last layer set to 30.
  • Even before these changes the training loss was tending to 0. I thought it was overfitting, but the validation loss tends to 0 as well, so it is not that.
  • No bounding boxes are generated.
  • The validation loss does not affect training.
  • A snapshot of the losses follows: "moving ave loss" is the training moving-average loss and "val moving ave loss" is the validation one.
step 2163 - moving ave loss 1.31846549355e-05 - val moving ave loss 1.35019140267e-05
step 2164 - moving ave loss 1.311464045e-05 - val moving ave loss 1.33585831669e-05
step 2165 - moving ave loss 1.31301993879e-05 - val moving ave loss 1.33837859452e-05
step 2166 - moving ave loss 1.3074759692e-05 - val moving ave loss 1.32738299211e-05
step 2167 - moving ave loss 1.30020373768e-05 - val moving ave loss 1.32209596895e-05
step 2168 - moving ave loss 1.2891638246e-05 - val moving ave loss 1.31455781394e-05
step 2169 - moving ave loss 1.28760445816e-05 - val moving ave loss 1.31389315505e-05
step 2170 - moving ave loss 1.28287536095e-05 - val moving ave loss 1.30365870197e-05
step 2171 - moving ave loss 1.29534266191e-05 - val moving ave loss 1.29940787821e-05
step 2172 - moving ave loss 1.29592243579e-05 - val moving ave loss 1.30160897623e-05

Any help would be greatly appreciated.
Thank you.
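
For readers checking the same cfg edit: the value 30 matches the usual rule for a YOLOv2-style region layer, where each anchor predicts 4 box coordinates, 1 objectness score, and one score per class. The sketch below assumes 5 anchors, as in the stock YOLOv2 cfg files; the function name is hypothetical and is not part of darkflow.

```python
def last_layer_filters(num_classes: int, num_anchors: int = 5) -> int:
    """Filters for the conv layer that feeds a YOLOv2-style region layer.

    Assumes num_anchors anchors per grid cell (5 is an assumption taken
    from the stock YOLOv2 cfg files).
    """
    # Per anchor: 4 box coordinates + 1 objectness + one score per class.
    return num_anchors * (num_classes + 5)


print(last_layer_filters(1))  # 30, the value used in this issue
```

If the cfg uses a different number of anchors, the expected filter count changes accordingly.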

@gangooteli

You also have to provide labels.txt as an argument.
That file should include only the one class you are using.
I also trained with 1 class and it worked fine.

@Ridhwanluthra
Author

@gangooteli I did provide labels.txt. Also, if there were an inconsistency in it, the training would not have started.

@onurbarut

@gangooteli I am also trying to train with only one class, from scratch. So far I have been training tiny-yolo, and up to 400k steps with 1k epochs I have only reached a loss of around 7. How many steps would it take to train well, i.e. to get the loss below 1?

@gangooteli

@onurbarut I ran it on the GTSRB dataset. I used pre-trained weights and trained on top of them. It took me around 200 epochs to converge.
Using pre-trained weights will make it converge faster.

@Ridhwanluthra
Here are my sample logs:
step 987 - loss 0.6274503469467163 - moving ave loss 0.536738667892405
step 988 - loss 0.45404353737831116 - moving ave loss 0.5284691548409957
step 989 - loss 0.444408118724823 - moving ave loss 0.5200630512293785
step 990 - loss 0.48710036277770996 - moving ave loss 0.5167667823842116
step 991 - loss 0.2570722997188568 - moving ave loss 0.4907973341176761
step 992 - loss 0.37810787558555603 - moving ave loss 0.47952838826446414
step 993 - loss 0.6285824775695801 - moving ave loss 0.49443379719497577
step 994 - loss 0.40015703439712524 - moving ave loss 0.4850061209151907
step 995 - loss 0.2761436104774475 - moving ave loss 0.46411986987141635
step 996 - loss 0.23099832236766815 - moving ave loss 0.44080771512104155
step 997 - loss 0.2307831346988678 - moving ave loss 0.41980525707882416
step 998 - loss 0.5912097096443176 - moving ave loss 0.4369457023353735
step 999 - loss 0.6355569958686829 - moving ave loss 0.45680683168870445
step 1000 - loss 0.3976811468601227 - moving ave loss 0.4508942632058463
Checkpoint at step 1000
Finished saving checkpoint
VALIDATION step 1000 - loss 0.34426450729370117 - moving ave loss 3.480055101792053
Training finished, exit.

Are you using pre-trained weights? If not, try pre-trained weights that suit your cfg file and check the results.
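
A note on reading these logs: the "moving ave loss" column is consistent with an exponential moving average that keeps 0.9 of the previous value and adds 0.1 of the new loss (for example, steps 987 to 988 above). The sketch below reproduces that update. Applying the same formula to the VALIDATION line is an assumption, since the validation patch itself is not shown in this thread; under that assumption the previous validation average would have been about 3.83, which would explain why 3.48 sits so far from the current validation loss of 0.34. Both function names are hypothetical.

```python
def update_moving_average(previous, loss, decay=0.9):
    """One exponential-moving-average step: decay * previous + (1 - decay) * loss."""
    return decay * previous + (1.0 - decay) * loss


def implied_previous(current, loss, decay=0.9):
    """Invert the update: the previous average that gives `current` after `loss`."""
    return (current - (1.0 - decay) * loss) / decay


# Values copied from the training log above (average at step 987, loss at step 988).
step_988 = update_moving_average(0.536738667892405, 0.45404353737831116)
print(step_988)  # compare with the 0.5284691548... logged at step 988

# Values copied from the VALIDATION line at step 1000.
# Assumption: the validation average uses the same 0.9 / 0.1 update.
previous_validation_average = implied_previous(3.480055101792053, 0.34426450729370117)
print(previous_validation_average)
```

A moving average that is updated only once per checkpoint needs many checkpoints to catch up with the current loss, so the raw validation loss is the more informative number early on.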

@onurbarut

@gangooteli My dataset contains 4-band images (RGB plus NIR). Do you know how I can import the pre-trained weights and initialize the extra parameters that come from the 4th channel?

@gangooteli

@onurbarut
For pre-trained weights, you need to add an extra argument:
--load yolo.weights
Basically, --load lets you specify the weights to use at the start of training.

And let's say you have also saved newly trained weights:
to use the latest saved weights, you can use --load -1

Please check the args:

Arguments:
--summary path to TensorBoard summaries directory
--momentum applicable for rmsprop and momentum optimizers
--load how to initialize the net? Either from .weights or a checkpoint, or even from scratch
--saveVideo Records video from input video or camera
--lr learning rate
--labels path to labels file
--verbalise say out loud while building graph
--imgdir path to testing directory with images
--help, --h, -h show this super helpful message and exit
--epoch number of epoch
--savepb save net and weight to a .pb file
--annotation path to annotation directory
--train train the whole net
--queue process demo in batch
--trainer training algorithm
--demo demo on webcam
--batch batch size
--gpu how much gpu (from 0.0 to 1.0)
--metaLoad path to .meta file generated during --savepb that corresponds to .pb file
--model configuration of choice
--gpuName GPU device name
--threshold detection threshold
--config path to .cfg directory
--save save checkpoint every ? training examples
--binary path to .weights directory
--pbLoad path to .pb protobuf file (metaLoad must also be specified)
--json Outputs bounding box information in json format.
--keep Number of most recent training results to save
--dataset path to dataset directory
--backup path to backup folder

For the images, I think you can use an image library to convert the 4-band images to .jpg images.

@onurbarut

@gangooteli I have already modified the code so that it can train on 4-band images. However, there are no pre-trained weights for 4-band inputs. Remember that the first kernel's size is 3x3xCxK, where C is the number of channels. The pre-trained weights contain the first kernel as 3x3x3xK, while I use 3x3x4xK, so there is a mismatch between the number of elements expected and the number imported. I think I can modify the code to import the 3x3x3xK kernels, extend them to 3x3x4xK, and randomly initialize only the parameters that come from the 4th channel. But I haven't gone that deep into the source code yet.
Moreover, I couldn't use --trainer momentum .9; it gives an error. How can I choose momentum optimization?
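
The kernel extension described in this comment can be sketched with NumPy as follows. This is a minimal illustration, not darkflow code: it assumes the kernel is already in memory with the 3x3xCxK layout mentioned above, and the function name, K = 16, the standard deviation, and the seed are all hypothetical choices. The on-disk layout of a .weights file may differ and would need to be converted first.

```python
import numpy as np


def extend_first_kernel(pretrained, extra_channels=1, stddev=0.01, seed=0):
    """Extend a (H, W, C, K) kernel to (H, W, C + extra_channels, K).

    The pre-trained slices are kept unchanged; only the new input-channel
    slices are randomly initialized (small Gaussian noise, an assumption).
    """
    height, width, _, num_filters = pretrained.shape
    rng = np.random.default_rng(seed)
    extra = rng.normal(0.0, stddev, size=(height, width, extra_channels, num_filters))
    # Concatenate along the input-channel axis (axis 2 in the H, W, C, K layout).
    return np.concatenate([pretrained, extra.astype(pretrained.dtype)], axis=2)


# Stand-in for a pre-trained RGB kernel; real values would come from the weights file.
rgb_kernel = np.ones((3, 3, 3, 16), dtype=np.float32)
rgbn_kernel = extend_first_kernel(rgb_kernel)
print(rgbn_kernel.shape)  # (3, 3, 4, 16)
```

Only the first layer needs this treatment, because every later layer sees the same number of input feature maps as before.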

@gangooteli

@onurbarut I understand your issue, and I also understand that you will change the code to make it work with 4 channels.
--trainer selects the training algorithm / optimizer, such as Adam, Adagrad, and the others specified in the code.
Please create another issue for this, since it is off topic here and other people can also help if you open a separate issue.

Thanks
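
Going by the argument list above, --trainer takes only the optimizer name and the momentum value has its own --momentum flag, so one plausible reading of the error is that ".9" was passed as a stray positional value. The sketch below only assembles such a command line and does not run it. The entry-point name `flow`, the cfg path, and the helper name are assumptions, and the thread does not confirm that this resolves the error.

```python
def build_train_command(model, load, trainer="momentum", momentum=0.9,
                        entry_point="flow"):
    """Assemble a training command using only flags from the list above.

    `entry_point` and the example paths are assumptions for illustration.
    """
    return [
        entry_point,
        "--model", model,
        "--load", str(load),
        "--train",
        "--trainer", trainer,         # optimizer name only
        "--momentum", str(momentum),  # momentum value goes in its own flag
    ]


# Hypothetical cfg path; "-1" asks for the latest saved checkpoint, as described above.
command = build_train_command("cfg/my-model.cfg", -1)
print(" ".join(command))
```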

@onurbarut

Hi @Ridhwanluthra, do you reach zero with any model at any learning rate in very few steps, like 10 steps with --lr 1e1? Something happened and my code broke: whatever model, weights, or learning rate I choose, the loss goes to zero with almost zero accuracy. See my #512. Is it the same in your case?
I even deleted darkflow and set it up again, but nothing changes :(. Need help.

@Ridhwanluthra
Author

@onurbarut It's not the same; this only happens when I am working with a single class.

@davie890

Hey @Ridhwanluthra, I am trying to plot a loss graph to analyze my training data, but since I'm new to all this I'm not sure where the loss data gets stored or printed to the screen. Since you were able to write the code that does the outputting, can you point me to where in the code this happens?
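
One way to get plottable data without touching the training code is to save the console output to a file and parse it. The sketch below assumes the line format shown in the sample logs earlier in this thread ("step N - loss X - moving ave loss Y"); the function name is hypothetical, and lines in any other format, including the VALIDATION line and the custom format in the opening post, are skipped.

```python
import re

# Matches lines that start with "step N - loss X - moving ave loss Y".
LINE_RE = re.compile(
    r"step (?P<step>\d+) - loss (?P<loss>\S+) - moving ave loss (?P<mva>\S+)"
)


def parse_training_log(text):
    """Return a list of (step, loss, moving_average) tuples from console output."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(
                (int(match["step"]), float(match["loss"]), float(match["mva"]))
            )
    return rows


sample = """\
step 999 - loss 0.6355569958686829 - moving ave loss 0.45680683168870445
step 1000 - loss 0.3976811468601227 - moving ave loss 0.4508942632058463
Checkpoint at step 1000
VALIDATION step 1000 - loss 0.34426450729370117 - moving ave loss 3.480055101792053
"""
rows = parse_training_log(sample)
print(len(rows))  # 2
```

The step and loss columns can then be passed to any plotting library to draw the loss curve.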

@alvinxiii

Hi @Ridhwanluthra. Can you share your darkflow folder and all the code on git? I tried to modify the code from #264 and encountered some errors. I want to print the val loss values on the command prompt. Thanks.

@Ridhwanluthra
Author

@davie890 take a look here.

@akmeraki

akmeraki commented Jul 9, 2019

@Ridhwanluthra, have you solved this problem? I have the same issue when I'm training (train: 24 images, testing: 8 images, batch: 2). I'm training for a single class as well; still there is no output and no sign of overfitting.

[image: error1]

@Ridhwanluthra
Author

@akmeraki I did solve it. I don't really remember the reason for this error, but I believe it was something along the lines of a silly mistake in modifying the various parameters to work with my network. Make sure nothing like that is happening. I am pretty sure it's not a bug or overfitting.
