train loss and validation loss tending to 0 #494

Ridhwanluthra · 2017-12-30T11:46:28Z

I added the functionality to print validation loss, based on the code given in a comment on #264.

I have a single label in my problem.
I have edited the cfg file correctly, with classes set to 1 and filters in the second-to-last layer set to 30.
Even before these changes the training loss was tending to 0. I thought it was overfitting, but adding the validation loss showed that it is not.
No bounding boxes are generated.
The validation loss does not affect training.
A snapshot of the losses follows; "moving ave loss" is the training moving-average loss and "val moving ave loss" is the validation one:

step 2163 - moving ave loss 1.31846549355e-05 - val moving ave loss 1.35019140267e-05
step 2164 - moving ave loss 1.311464045e-05 - val moving ave loss 1.33585831669e-05
step 2165 - moving ave loss 1.31301993879e-05 - val moving ave loss 1.33837859452e-05
step 2166 - moving ave loss 1.3074759692e-05 - val moving ave loss 1.32738299211e-05
step 2167 - moving ave loss 1.30020373768e-05 - val moving ave loss 1.32209596895e-05
step 2168 - moving ave loss 1.2891638246e-05 - val moving ave loss 1.31455781394e-05
step 2169 - moving ave loss 1.28760445816e-05 - val moving ave loss 1.31389315505e-05
step 2170 - moving ave loss 1.28287536095e-05 - val moving ave loss 1.30365870197e-05
step 2171 - moving ave loss 1.29534266191e-05 - val moving ave loss 1.29940787821e-05
step 2172 - moving ave loss 1.29592243579e-05 - val moving ave loss 1.30160897623e-05
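For context on what "moving ave loss" means: the logged numbers in this thread are consistent with an exponential moving average of the per-step loss, new = 0.9 * old + 0.1 * loss (for example, step 988 in gangooteli's log below). The sketch below reproduces that update; the 0.9 decay is inferred from the logs, not taken from the darkflow source. With this weighting a single step with loss 1.0 would raise the average by about 0.1, so averages around 1e-5 mean the per-step losses themselves have stayed near zero for many steps rather than being smoothed away.

```python
def update_moving_average(prev_mva, loss, decay=0.9):
    """Return the new exponential moving average of the loss.

    Assumption: decay=0.9 matches the .9/.1 weighting that the logged
    numbers in this thread are consistent with.
    """
    if prev_mva is None:
        # First step: start the average at the current loss.
        return loss
    return decay * prev_mva + (1.0 - decay) * loss


def smooth(losses, decay=0.9):
    """Apply update_moving_average over a sequence of per-step losses."""
    mva, out = None, []
    for loss in losses:
        mva = update_moving_average(mva, loss, decay)
        out.append(mva)
    return out


# Step 988 from the log posted later in this thread:
# prints approximately 0.52846915484, the logged moving ave loss.
print(update_moving_average(0.536738667892405, 0.45404353737831116))
```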

Any help would be greatly appreciated.
Thank you.
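The filters value of 30 for one class follows from how a YOLOv2 [region] layer reads its input: each anchor box predicts 4 coordinates, 1 objectness score and one score per class. A minimal sketch of that arithmetic, assuming the usual YOLOv2 settings of num=5 anchors and coords=4 (check the [region] section of your own cfg):

```python
def region_filters(classes, num_anchors=5, coords=4):
    """Filters for the [convolutional] layer feeding a YOLOv2 [region] layer.

    Each anchor predicts `coords` box values, 1 objectness score and
    `classes` class scores. num_anchors=5 and coords=4 are assumed defaults;
    read `num` and `coords` from your own [region] section.
    """
    return num_anchors * (classes + coords + 1)


print(region_filters(1))   # 30, the value used in the post above
print(region_filters(20))  # 125, as in the 20-class VOC configs
```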


gangooteli · 2017-12-31T07:58:13Z

You also have to provide labels.txt as an argument (--labels).
That file should contain only the one class you are using.
I also trained with 1 class and it worked fine.
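Since labels.txt is the file being discussed here, one quick check is that its class name matches the <name> fields in the annotation XML files character for character (for example "car" vs "Car"). If no annotated object matched, the network would only ever see background, which would fit a loss near zero with no boxes predicted; whether darkflow silently skips non-matching objects is an assumption here, not a confirmed diagnosis. The sketch assumes Pascal VOC style XML annotations; the paths in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def object_names(xml_text):
    """Return the <name> of every <object> in one Pascal VOC style annotation."""
    root = ET.fromstring(xml_text)
    return [obj.findtext("name", default="").strip() for obj in root.iter("object")]


def split_by_labels(names, labels):
    """Count annotation names that do and do not appear in the labels list."""
    wanted = {label.strip() for label in labels if label.strip()}
    counts = Counter(names)
    matched = {n: c for n, c in counts.items() if n in wanted}
    unmatched = {n: c for n, c in counts.items() if n not in wanted}
    return matched, unmatched


def check_dataset(labels_path, annotation_dir):
    """Scan every .xml file in annotation_dir against the labels file."""
    labels = Path(labels_path).read_text().splitlines()
    names = []
    for xml_file in sorted(Path(annotation_dir).glob("*.xml")):
        names.extend(object_names(xml_file.read_bytes()))
    return split_by_labels(names, labels)
```

Usage (hypothetical paths): `matched, unmatched = check_dataset("labels.txt", "annotations/")`. An empty `matched` or a non-empty `unmatched` points to a naming mismatch.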

Ridhwanluthra · 2017-12-31T08:11:10Z

@gangooteli I had provided labels.txt; also, if there were an inconsistency there, training would not have started.

onurbarut · 2017-12-31T16:28:11Z

@gangooteli I am also trying to train with only one class from scratch. So far I have been training tiny-yolo, and after 400k steps (1k epochs) I only reached a loss of around 7. How many steps would it take to train well, i.e. to get the loss below 1?
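Step counts depend on dataset size and batch size, since one step is one batch: steps = epochs × (images / batch). That is why epochs (as in the ~200 epochs mentioned in the reply) are easier to compare across setups than raw steps. A minimal sketch, assuming an incomplete final batch is dropped each epoch:

```python
def steps_per_epoch(num_images, batch_size):
    """Optimizer steps in one pass over the data (last partial batch dropped; an assumption)."""
    return num_images // batch_size


def total_steps(num_images, batch_size, epochs):
    """Total optimizer steps for a full training run."""
    return epochs * steps_per_epoch(num_images, batch_size)


# 400k steps over 1k epochs works out to 400 steps per epoch.
print(400_000 // 1_000)
# A 24-image training set with batch 2 (as reported later in the thread) gives 12 steps per epoch.
print(steps_per_epoch(24, 2))
```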

gangooteli · 2017-12-31T18:12:14Z

@onurbarut I ran it on the GTSRB dataset. I used pre-trained weights and trained on top of them. It took around 200 epochs to converge.
Using pre-trained weights will make it converge faster.

@Ridhwanluthra
Here are my sample logs:
step 987 - loss 0.6274503469467163 - moving ave loss 0.536738667892405
step 988 - loss 0.45404353737831116 - moving ave loss 0.5284691548409957
step 989 - loss 0.444408118724823 - moving ave loss 0.5200630512293785
step 990 - loss 0.48710036277770996 - moving ave loss 0.5167667823842116
step 991 - loss 0.2570722997188568 - moving ave loss 0.4907973341176761
step 992 - loss 0.37810787558555603 - moving ave loss 0.47952838826446414
step 993 - loss 0.6285824775695801 - moving ave loss 0.49443379719497577
step 994 - loss 0.40015703439712524 - moving ave loss 0.4850061209151907
step 995 - loss 0.2761436104774475 - moving ave loss 0.46411986987141635
step 996 - loss 0.23099832236766815 - moving ave loss 0.44080771512104155
step 997 - loss 0.2307831346988678 - moving ave loss 0.41980525707882416
step 998 - loss 0.5912097096443176 - moving ave loss 0.4369457023353735
step 999 - loss 0.6355569958686829 - moving ave loss 0.45680683168870445
step 1000 - loss 0.3976811468601227 - moving ave loss 0.4508942632058463
Checkpoint at step 1000
Finished saving checkpoint
VALIDATION step 1000 - loss 0.34426450729370117 - moving ave loss 3.480055101792053
Training finished, exit.

Are you using pre-trained weights? If not, try pre-trained weights that suit your cfg file and check the results.

onurbarut · 2017-12-31T18:26:53Z

@gangooteli My dataset contains 4-band images (RGB plus NIR). Do you know how I can import the pre-trained weights and initialize the extra parameters coming from the 4th channel?

gangooteli · 2017-12-31T19:42:06Z

@onurbarut
For pre-trained weights, you need to add an extra argument:
--load yolo.weights
Basically, --load lets you specify the weights to start training from.

If you have also saved newly trained weights (checkpoints), you can use --load -1 to start from the latest saved ones.
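The same two choices can be sketched as options for darkflow's Python interface, where (per the darkflow README) the keys mirror the command-line flags without the leading "--". The helper name, defaults and paths below are hypothetical; `load` is either a .weights path or -1 for the latest checkpoint.

```python
def training_options(model_cfg, labels, annotation_dir, dataset_dir,
                     load="bin/yolo.weights", epochs=100, batch=8):
    """Build a darkflow-style options dict (hypothetical helper).

    load: a .weights path to start from pre-trained weights, or -1 to
    resume from the most recent checkpoint saved by an earlier run.
    """
    return {
        "model": model_cfg,
        "load": load,
        "labels": labels,
        "annotation": annotation_dir,
        "dataset": dataset_dir,
        "epoch": epochs,
        "batch": batch,
        "train": True,
    }


# Hypothetical paths for a one-class setup.
first_run = training_options("cfg/yolo-1c.cfg", "labels.txt", "data/annotations/", "data/images/")
resumed = training_options("cfg/yolo-1c.cfg", "labels.txt", "data/annotations/", "data/images/", load=-1)
```

With darkflow installed, the README's pattern is `from darkflow.net.build import TFNet` followed by `TFNet(first_run).train()`.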

Please check the args:

Arguments:
--summary path to TensorBoard summaries directory
--momentum applicable for rmsprop and momentum optimizers
--load how to initialize the net? Either from .weights or a checkpoint, or even from scratch
--saveVideo Records video from input video or camera
--lr learning rate
--labels path to labels file
--verbalise say out loud while building graph
--imgdir path to testing directory with images
--help, --h, -h show this super helpful message and exit
--epoch number of epoch
--savepb save net and weight to a .pb file
--annotation path to annotation directory
--train train the whole net
--queue process demo in batch
--trainer training algorithm
--demo demo on webcam
--batch batch size
--gpu how much gpu (from 0.0 to 1.0)
--metaLoad path to .meta file generated during --savepb that corresponds to .pb file
--model configuration of choice
--gpuName GPU device name
--threshold detection threshold
--config path to .cfg directory
--save save checkpoint every ? training examples
--binary path to .weights directory
--pbLoad path to .pb protobuf file (metaLoad must also be specified)
--json Outputs bounding box information in json format.
--keep Number of most recent training results to save
--dataset path to dataset directory
--backup path to backup folder

For the images, I think you can use an image library to convert the 4-band images to .jpg images.
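A minimal sketch of that conversion with NumPy, assuming the image is already loaded as an (H, W, 4) array with bands in R, G, B, NIR order. It drops the NIR band, so it only helps if RGB alone is acceptable, and it min-max stretches non-8-bit data (common for multispectral imagery) to 0-255.

```python
import numpy as np


def to_rgb_uint8(image, rgb_bands=(0, 1, 2)):
    """Keep three bands of an (H, W, C) array and return an 8-bit RGB array.

    Assumes bands are stored last and R, G, B are bands 0, 1, 2; any other
    band (e.g. NIR at index 3) is dropped. Non-uint8 data is stretched to 0-255.
    """
    rgb = image[..., list(rgb_bands)]
    if rgb.dtype == np.uint8:
        return np.ascontiguousarray(rgb)
    rgb = rgb.astype(np.float64)
    lo, hi = rgb.min(), rgb.max()
    if hi > lo:
        rgb = (rgb - lo) / (hi - lo) * 255.0
    else:
        # Constant image: map everything to 0.
        rgb = np.zeros_like(rgb)
    return np.round(rgb).astype(np.uint8)
```

The result can then be written as a JPEG with Pillow, e.g. `Image.fromarray(to_rgb_uint8(arr)).save("out.jpg")`.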

onurbarut · 2018-01-01T11:14:38Z

@gangooteli I have already modified the code to be able to train on 4-band images. However, there are no pre-trained weights for 4-band inputs: the first kernel has size 3x3xCxK, where C is the number of channels. The pre-trained weights contain the first kernel as 3x3x3xK, while I use 3x3x4xK, so the number of elements expected and imported do not match. I think I could modify the code to import the 3x3x3xK kernels, extend them to 3x3x4xK, and randomly initialize only the parameters for the 4th channel, but I haven't gone that deep into the source code yet.
Moreover, I couldn't use --trainer momentum .9; it gives an error. How can I choose momentum optimization?
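The kernel extension described above can be sketched in NumPy. This assumes the kernel is held in (kh, kw, in_channels, out_channels) order, which is TensorFlow's conv2d filter layout; the .weights file may store the values in a different order, so check how your loader reshapes them before applying this. The pre-trained RGB slices are copied unchanged and only the new channel's slice is drawn from a small normal distribution.

```python
import numpy as np


def extend_first_kernel(kernel, extra_channels=1, std=0.01, seed=0):
    """Extend a pre-trained first-layer kernel from C to C + extra_channels inputs.

    kernel is assumed to be (kh, kw, in_channels, out_channels), e.g. 3x3x3xK.
    Existing slices are kept as-is; new input-channel slices ~ N(0, std^2).
    """
    kh, kw, _, k = kernel.shape
    rng = np.random.default_rng(seed)
    extra = rng.normal(0.0, std, size=(kh, kw, extra_channels, k)).astype(kernel.dtype)
    return np.concatenate([kernel, extra], axis=2)


pretrained = np.ones((3, 3, 3, 32), dtype=np.float32)  # stand-in for the loaded RGB kernel
extended = extend_first_kernel(pretrained)
print(extended.shape)  # (3, 3, 4, 32)
```

A small std keeps the layer's response close to the pre-trained RGB behaviour at the start of training; a common alternative is to initialize the NIR slice with the mean of the RGB slices.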

gangooteli · 2018-01-01T16:24:28Z

@onurbarut I understand your issue, and that you will change the code to make it work with 4 channels.
--trainer selects the training algorithm/optimizer, such as Adam, Adagrad and the others specified in the code.
Please create another issue for this, since it is off topic here, and other people can also help if you do.

Thanks
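On the momentum question: the argument list above has --trainer for the algorithm and a separate --momentum flag "applicable for rmsprop and momentum optimizers", so the ".9" after "--trainer momentum" is probably being read as a stray argument. A sketch that builds the command with the value in its own flag; the helper name, the paths and the lr default are hypothetical, and the accepted trainer names should be checked against the darkflow source.

```python
import shlex


def flow_train_command(model, load, trainer="momentum", momentum=0.9, lr=1e-5):
    """Build a `flow` training command as an argument list (hypothetical helper).

    The optimizer name follows --trainer; its momentum value goes after a
    separate --momentum flag, only for the optimizers that use it.
    """
    argv = ["flow", "--model", model, "--load", str(load), "--train",
            "--trainer", trainer, "--lr", str(lr)]
    if trainer in ("momentum", "rmsprop"):
        argv += ["--momentum", str(momentum)]
    return argv


cmd = flow_train_command("cfg/yolo-1c.cfg", "bin/yolo.weights")
print(" ".join(shlex.quote(arg) for arg in cmd))
# flow --model cfg/yolo-1c.cfg --load bin/yolo.weights --train --trainer momentum --lr 1e-05 --momentum 0.9
```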

onurbarut · 2018-01-08T17:23:40Z

Hi @Ridhwanluthra, did any of your models, at any learning rate, reach zero loss in very few steps, e.g. 10 steps with --lr 1e1? Something strange happened and my code broke: whatever model, weights and learning rate I choose, the loss goes to zero with almost zero accuracy (see my #512). Is it the same in your case?
I even deleted darkflow and set it up again, but nothing changes :( . Need help.

Ridhwanluthra · 2018-01-10T15:32:43Z

@onurbarut It's not the same; this only happens when I am working with a single class.

davie890 · 2018-02-15T18:45:45Z

Hey @Ridhwanluthra, I am trying to plot a loss graph to analyze my training, but since I'm new to all this, I'm not sure where the loss data gets stored or printed to the screen. Since you were able to write the code that outputs it, can you point me to where in the code this happens?

alvinxiii · 2018-02-23T05:55:44Z

Hi @Ridhwanluthra. Can you share your darkflow folder and all the code on git? I tried modifying the code from #264 but encountered some errors. I wanted to print the val loss values on the command prompt. Thanks.

Ridhwanluthra · 2018-03-28T05:11:21Z

@davie890 take a look here.
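For plotting without digging into the source, one option is to save the console output (e.g. `flow ... | tee train.log`) and parse the printed lines. The sketch below matches the "step N - loss X - moving ave loss Y" format in gangooteli's log, including the "VALIDATION" lines from that modified version; Ridhwanluthra's format at the top of the thread differs, so the pattern would need adjusting for it.

```python
import re

# Matches lines such as
# "step 988 - loss 0.454... - moving ave loss 0.528..."
# and the "VALIDATION step ..." lines printed by the modified code in this thread.
LINE_RE = re.compile(r"^(VALIDATION )?step (\d+) - loss (\S+) - moving ave loss (\S+)")


def parse_losses(lines):
    """Split log lines into training and validation (step, loss, moving_avg) tuples."""
    series = {"train": [], "val": []}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # checkpoint messages and other output
        key = "val" if m.group(1) else "train"
        series[key].append((int(m.group(2)), float(m.group(3)), float(m.group(4))))
    return series
```

To plot, something like `steps, _, mva = zip(*parse_losses(open("train.log"))["train"])` followed by matplotlib's `plt.plot(steps, mva)` would work (file name hypothetical).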

akmeraki · 2019-07-09T21:44:28Z

@Ridhwanluthra, have you solved this problem? I have the same issue when training (train: 24 images, test: 8 images, batch: 2). I'm training a single class as well; there is still no output and no sign of overfitting.

Ridhwanluthra · 2019-07-10T18:11:18Z

@akmeraki I did solve it. I don't really remember the cause of this error, but I believe it was a silly mistake when modifying the various parameters to work with my network. Make sure nothing like that is happening. I am pretty sure it's not a bug or overfitting.

Ridhwanluthra closed this as completed Jul 10, 2019
