
Table notation for reproducibility #35

Open
AntonioCarta opened this issue Dec 7, 2022 · 10 comments

@AntonioCarta commented Dec 7, 2022

I propose to switch the notation. Right now we have:

  • ✅ Reproduced
  • ❌ Custom setup
  • bug for bugs

IMO, this is very confusing at first glance. If I see a big red cross, I immediately think there is a problem with the strategy. In that case, however, everything is actually correct; we just changed some hyperparameters or tested a new benchmark.

Instead we could have two separate columns (rough sketch below):

  • Reproduced: ✅ if correct, ❌ if bugged
  • Reference: a link to the paper, or an Avalanche or custom tag if not using any paper.
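
For illustration, a row could look something like this (strategy names and the link are placeholders, not actual entries):

```
| Strategy     | Reproduced | Reference                                 |
|--------------|------------|-------------------------------------------|
| <strategy A> | ✅         | [paper](https://arxiv.org/abs/xxxx.xxxxx) |
| <strategy B> | ✅         | Avalanche                                 |
| <strategy C> | ❌         | [paper](https://arxiv.org/abs/xxxx.xxxxx) |
```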
@AndreaCossu

The current meaning is actually different:

  • tick (✅) = we are able to reproduce the target performance of the reference paper (not necessarily with the same setup as the reference paper).
  • cross (❌) = we are not able to reproduce the target performance of the reference paper; we do not know whether this is due to a bug in the strategy.
  • bug = we are certainly not able to reproduce the target performance of the reference paper, due to a bug in the strategy.

@AntonioCarta

Ok, I misunderstood the notation. Maybe we should add how far we are from the target result?

@AndreaCossu

Yes, we can. I didn't want to clutter the table so I put the reference performance inside the comments in the experiments.
I think we could create a separate table in the README to briefly show the gap.
I also created issue #33 to keep track of what's missing. I could also add the gap there.

@AntonioCarta

Maybe we need to strictly separate two types of experiments:

  • paper reproductions, which exactly reproduce a paper
  • baselines, which provide clean implementations but may reach lower accuracy.

IMO CLB is still valuable as long as the methods in Avalanche are correct and the clean implementation provides a reasonable reference value. Reproducing papers requires digging into whatever tricks the authors decided to add. While useful, it's very time-consuming and we cannot afford to do it ourselves, as we have already seen. Of course, we can support external contributions on this.

@AndreaCossu

With paper reproductions, do you also mean the same hyperparameters as the original paper? In the end, I think that is less interesting (and we would only have a few strategies marked as such). One would probably use CL baselines to understand how to reach the same performance as the original paper, even though hyperparameters may differ. I guess that better describes the concept of reproducibility when you use a different codebase than the one you are trying to reproduce.

@AntonioCarta

> With paper reproductions, do you also mean the same hyperparameters as the original paper?

Same performance, scenario, model architectures, and so on. Some hyperparameters (lr, regularization strength) may change due to minor differences in the framework/implementation.

@AndreaCossu

I changed the table in the README. It now shows Avalanche when the experiment is not present in a specific paper. I also added the reference performance with the related paper (when available).
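
Roughly, the structure is like this (strategy names and numbers here are just placeholders, not the actual README values):

```
| Strategy     | Benchmark     | Reference                            |
|--------------|---------------|--------------------------------------|
| <strategy A> | <benchmark A> | xx.x% acc. in [paper](https://...)   |
| <strategy B> | <benchmark B> | Avalanche                            |
```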

@AntonioCarta commented Mar 16, 2023

This is a nice improvement. Do we have any explanation for the gaps in some experiments? E.g. different hyperparameters, fewer epochs, ...

@AndreaCossu

Not really; we can speculate, but nothing more at the moment.

@AntonioCarta

It's fine, but we should keep track of this somewhere: at least a log of attempts and some notes about what failed. I'm not sure about the exact form; a comment in the header of the script may be enough.

For example, maybe we find out that the difference is due to a mistake in the original paper (e.g. they look at the validation instead of the test loss). In that case, we should explain the reason behind the performance difference.
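
For instance, a minimal header note could look like this (all names and numbers are placeholders):

```
# Reproducibility notes -- <strategy> on <benchmark>
# Target: xx.x% average accuracy, from <paper> (Table X)
# Current result: xx.x% with this script
# Attempts / known differences from the paper:
#   - <date>: tried <change>; result xx.x%
#   - suspected cause of the remaining gap: <hypothesis>
```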
