
model.fit() and eager_tf generate different training results #384

Open
silvaurus opened this issue Oct 27, 2021 · 2 comments

Comments


silvaurus commented Oct 27, 2021

Hello!

I didn't change the code, and I trained the network with both model.fit() and eager_tf.

For model.fit(), the average validation loss is already < 50 in the first epoch, and the training loss also drops below 50 at the beginning of the second epoch.

For eager_tf, the validation loss is still ~200 after 10 epochs, while the training loss decreases much more slowly, only reaching ~50 in the 10th epoch, which looks like overfitting.

This is the training result for model.fit():

Epoch 1:
1/358 - loss: 9787.6289 - yolo_output_0_loss: 508.0005 - yolo_output_1_loss: 1342.9556 - yolo_output_2_loss: 7925.9561
...
357/358 - loss: 378.2877 - yolo_output_0_loss: 22.6362 - yolo_output_1_loss: 49.9713 - yolo_output_2_loss: 294.6154
358/358 - loss: 378.0025 - yolo_output_0_loss: 22.6236 - yolo_output_1_loss: 49.9357 - yolo_output_2_loss: 294.3785
val_loss: 51.9096 - val_yolo_output_0_loss: 8.8620 - val_yolo_output_1_loss: 7.8781 - val_yolo_output_2_loss: 24.0912

Epoch 2:
1/358 - loss: 43.6244 - yolo_output_0_loss: 6.2404 - yolo_output_1_loss: 8.0534 - yolo_output_2_loss: 18.2523

Notice the sudden drop of the reported training loss from 378 to 43: model.fit() displays a running average over all iterations in the current epoch, so the displayed value resets at each epoch boundary.
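To illustrate: the number shown by model.fit() behaves like a tf.keras.metrics.Mean that is reset between epochs. A minimal sketch, with made-up per-batch losses:

```python
import tensorflow as tf

# model.fit() displays a running mean of the loss over the current epoch,
# not the loss of the latest batch. A Mean metric reproduces that behaviour.
running_loss = tf.keras.metrics.Mean()

# Dummy per-batch losses, for illustration only.
for step, batch_loss in enumerate([9787.6, 500.0, 120.0, 60.0], start=1):
    running_loss.update_state(batch_loss)
    print(f"step {step}: displayed loss = {running_loss.result().numpy():.4f}")

# Keras resets the metric at the start of the next epoch, so the first
# value shown in epoch 2 is just the current (much smaller) batch loss.
running_loss.reset_states()
```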

This is the training result for eager_tf:
1_train_0, 155262.8125, [5675.242, 34116.484, 115460.375]
...

1_train_356, 523.5953369140625, [124.26721, 100.35405, 287.8407]
1_train_357, 125.0768814086914, [25.127472, 11.3394575, 77.47637]
1_val_0, 565.5044555664062, [86.86941, 158.40671, 309.0946]
...
1_val_363, 694.1661987304688, [114.45209, 213.89682, 354.6836]

(Average) 1, train: 5050.33447265625, val: 590.8134155273438

2_train_0, 788.0953369140625, [132.88559, 241.86014, 402.21585]
2_train_1, 493.3677978515625, [86.920746, 157.22601, 238.08711]

Notice that here the losses are per-iteration values and are not averaged.
From the very first iteration, the loss values are much larger than with model.fit(), and at the end of epoch 1 the training loss is still > 100, which is much worse than the < 50 reached with model.fit().
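For context, the eager_tf path boils down to a training step of roughly this shape (a paraphrased sketch, not the exact train.py code; model, optimizer, loss_fns, images, and labels stand in for the real objects). The printed value is the raw total loss of a single batch:

```python
import tensorflow as tf

# Sketch of an eager_tf-style training step: the number that gets logged
# is total_loss for one batch, with no running average over the epoch.
def train_step(model, optimizer, loss_fns, images, labels):
    with tf.GradientTape() as tape:
        outputs = model(images, training=True)
        regularization_loss = tf.reduce_sum(model.losses)
        # One loss term per YOLO output scale; each loss_fn is assumed to
        # return a per-sample loss vector of shape (batch_size,).
        pred_loss = [loss_fn(label, output)
                     for output, label, loss_fn in zip(outputs, labels, loss_fns)]
        total_loss = tf.reduce_sum(pred_loss) + regularization_loss
    grads = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return total_loss, pred_loss
```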

I strictly followed the tutorial for training and used the dataset / darknet weights downloaded directly from the links provided.

I suspect this is related to how the loss is computed in the two code paths.
Do you by any chance know why?

@silvaurus (Author)

My current guess is that in eager_tf mode, the total losses (minus the regularization loss) are not divided by the batch size.
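If that guess is right, the two printouts should differ by roughly a factor of the batch size: Keras compiled losses default to a mean over the batch (the SUM_OVER_BATCH_SIZE reduction), while a plain tf.reduce_sum in the eager loop does not divide. A minimal sketch with dummy numbers:

```python
import tensorflow as tf

batch_size = 8  # assumption: whatever --batch_size the run actually used

# Dummy stand-in for what one loss head returns: one loss value per image.
per_sample_loss = tf.constant([40.0, 55.0, 35.0, 60.0, 45.0, 50.0, 38.0, 47.0])

eager_style = tf.reduce_sum(per_sample_loss)   # plain sum, as in the eager loop
fit_style = tf.reduce_mean(per_sample_loss)    # SUM_OVER_BATCH_SIZE reduction

print(float(eager_style), float(fit_style))
# eager_style is batch_size times larger; dividing the eager number by the
# batch size would make the two printouts directly comparable.
```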

@ZXTFINAL

You have to make sure the two methods print the same quantity. First, it seems your eager loss is not averaged. Second, the batching may differ: model.fit() may draw random batches, while the other path iterates over all batches in order, so some difference is reasonable... :)
