Evaluation on the test set #33

Open
liang-hou opened this issue Oct 6, 2020 · 1 comment
Labels: enhancement (New feature or request)

liang-hou commented Oct 6, 2020

Hi, thanks for the excellent work.
However, I noticed that the test set cannot be used for evaluation, because the split argument is not specified in the metrics functions, which therefore load the train set for computing FID by default. Many GAN papers, including SSGAN, split the dataset into a train set and a test set, and I believe they evaluated their models on the test set.
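
For illustration, here is a minimal sketch of what I mean (the torchvision calls are real, but how a split choice would be threaded into the metrics functions is only an assumption on my part):

```python
# Sketch only: shows where a train/test split choice matters when collecting
# the real images for FID. The `split='test'` keyword at the end is
# hypothetical, not an existing torch_mimicry argument.
import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()

# Current (assumed) default: real statistics are computed from the train split.
train_set = torchvision.datasets.CIFAR10(
    root='./datasets/cifar10', train=True, download=True, transform=transform)

# Requested: allow the held-out test split to supply the real statistics.
test_set = torchvision.datasets.CIFAR10(
    root='./datasets/cifar10', train=False, download=True, transform=transform)

# e.g. via a hypothetical keyword on the metrics entry point:
# mmc.metrics.evaluate(metric='fid', ..., split='test')
```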

Another minor concern is that the inception model is downloaded and stored separately for each experiment, which wastes time and storage. It would be better to drop the log_dir from the inception_path, so that the inception model is cached once and reused across all experiments.
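
A rough sketch of the caching idea (the paths and the per-experiment expression below are illustrative; I am not pinning down the library's exact internals):

```python
# Sketch only: use one shared cache directory for the inception model
# instead of a per-experiment path derived from log_dir.
import os

SHARED_INCEPTION_DIR = './cache/inception_model'  # shared across experiments
os.makedirs(SHARED_INCEPTION_DIR, exist_ok=True)

# Instead of something like:
#   inception_path = os.path.join(log_dir, 'metrics', 'inception_model')
# every run would point at the shared cache, so the download happens once:
inception_path = SHARED_INCEPTION_DIR
```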

@liang-hou changed the title from "train-test split" to "Evaluation on test set" on Oct 9, 2020
@liang-hou changed the title from "Evaluation on test set" to "Evaluation on the test set" on Oct 9, 2020

kwotsin (Owner) commented Dec 16, 2020

Hi @houliangict, yes, you are right that there are differences in how the field computes FID these days. Since the goal of a GAN in this case is to model the training distribution, evaluating on the training data is seen as a way to measure how well the GAN models that distribution. There are other variants, such as using the test data for evaluation instead (if I recall correctly, particularly in works from Google), and some works propose using both the "train" and "test" splits together for training and evaluation, but I can't say for sure which way is recommended, since there doesn't seem to be a consensus in the field.

Regardless, I think it is important to be consistent in evaluation when comparing between various models. From some initial experience running FID on the test dataset (e.g. CIFAR-10 train vs test), I noticed the scores are very comparable.

That said, I think your suggestions are great and I will certainly look into them! Hope this helps!

@kwotsin added the enhancement (New feature or request) label on Dec 16, 2020