
Consider model variance in bootstrap resampling test #126

Open
odashi opened this issue Oct 30, 2021 · 6 comments

Comments

@odashi

odashi commented Oct 30, 2021

This issue raises a problem with using the so-called "bootstrap resampling test" to evaluate the "statistical significance" of machine translation methods (especially neural MT), and of similar generation tasks that are evaluated with MT metrics.

In this procedure, the evaluator randomly resamples generated sentences to simulate the distribution of model outputs, but does not consider the variance of the trained model itself.

Suppose we have a baseline model to beat and a champion model produced by the proposed method. A champion model can be produced without the authors recognizing it as such: if the model was trained only a few times, nobody can judge whether it is an outlier of the model distribution or not.

In this situation, the "bootstrap resampling test" may judge the proposed model to be significantly better, but the evaluation was actually performed on only a single model variant, which may be a champion, and did not consider any distributional properties of the proposed method.

The "bootstrap resampling test" was introduced on the era of statistical MT, and I guessed the method historically produced reasonable judgements for SMT systems because their study was usually investigating some additions of almost-fixed systems such as Moses (note that I said "almost-fixed" here because they also had random tuning for hyperparameters). In neural MT systems, this assumption had gone because the systems were randomly trained from scratch, and the "bootstrap resampling test" may no longer produce meaningful results but rather give the model a wrong authority.

I keep observing that the "bootstrap resampling test" is still used in many papers to claim "statistical significance" for a model, and I am strongly worried that this misleads the field.
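
For reference, here is a minimal sketch of the paired bootstrap resampling procedure described above (the function name and the corpus-level `metric` callback are illustrative assumptions, not compare-mt's actual API). Note that both systems are single fixed checkpoints, which is exactly the limitation raised in this issue:

```python
import random

def paired_bootstrap(refs, base_hyps, prop_hyps, metric,
                     n_resamples=1000, seed=0):
    """Paired bootstrap resampling over test sentences for two fixed checkpoints.

    `metric(refs, hyps)` is any corpus-level score where higher is better
    (e.g. BLEU). Returns the fraction of resamples in which the proposed
    system beats the baseline.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_resamples):
        # Draw sentence indices with replacement; both systems are scored on
        # the same resample, so only test-set variance is simulated -- the
        # variance of the trained models themselves is never touched.
        idx = [rng.randrange(n) for _ in range(n)]
        sampled_refs = [refs[i] for i in idx]
        prop_score = metric(sampled_refs, [prop_hyps[i] for i in idx])
        base_score = metric(sampled_refs, [base_hyps[i] for i in idx])
        if prop_score > base_score:
            wins += 1
    return wins / n_resamples
```

If the returned win rate exceeds, say, 0.95, the test declares the proposed checkpoint significantly better, but only relative to this one baseline checkpoint.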

@neubig
Contributor

neubig commented Oct 30, 2021

I agree, but at the same time I think any statistical testing is probably better than none.
Rather than removing testing altogether, it would probably be better to implement testing that also accounts for optimizer instability such as that described in this paper: https://aclanthology.org/P11-2031/
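
As a rough sketch of what such a seed-aware test could look like (this simplifies Clark et al.'s procedure, which also resamples over test sentences, and is not compare-mt or multeval code): train each system several times with different random seeds, then ask whether the observed difference in mean corpus-level score survives randomly permuting the run labels.

```python
import random
from statistics import mean

def run_level_permutation_test(base_run_scores, prop_run_scores,
                               n_permutations=10000, seed=0):
    """Randomization test over corpus-level scores from multiple training runs.

    `base_run_scores` / `prop_run_scores` hold e.g. the BLEU of each
    independently trained run (different random seed) of the baseline and the
    proposed method. Returns an approximate two-sided p-value for the observed
    difference in mean score.
    """
    rng = random.Random(seed)
    observed = abs(mean(prop_run_scores) - mean(base_run_scores))
    pooled = list(base_run_scores) + list(prop_run_scores)
    n_base = len(base_run_scores)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle run labels: under H0, which method produced which run
        # should not matter.
        rng.shuffle(pooled)
        if abs(mean(pooled[n_base:]) - mean(pooled[:n_base])) >= observed:
            extreme += 1
    return extreme / n_permutations
```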

@odashi
Author

odashi commented Oct 30, 2021

I think any statistical testing is probably better than none.

I don't fully agree with this. One problem with employing statistical testing is that users and reviewers sometimes believe the results regardless of whether the test is appropriate (this also happens with other tests, e.g. the so-called "p-value faith"). Many papers used the "bootstrap resampling test" despite the property I noted in the first comment, and I would guess that "not sure" is much better than false authority in this case.

@neubig
Contributor

neubig commented Oct 30, 2021

That's a fair point. For the time being I've explained this in a little more detail in the README: https://github.com/neulab/compare-mt#significance-tests

The longer-term solution would be to implement tests such as the ones proposed by Clark et al. above in compare-mt. In the meantime, anyone who finds this issue and wants to control for random seed selection can use multeval instead.

@odashi
Author

odashi commented Oct 30, 2021

Thanks for clarifying!
I think it would be helpful to add a link to this issue in the README, since further discussion may happen here, and to change the heading "statistical testing" to "... for single models" or something similar.

@kpu

kpu commented Oct 30, 2021

I agree, but at the same time I think any statistical testing is probably better than none.

I think a statistical test that always claims significance (and bootstrap does, in my experience) is worse than none at all. The papers that do run tests usually have an effect size too small to be useful and gussy it up behind a probably-random significance test. I find it especially annoying that reviewers demand significance tests when they should know there isn't really one that works.

@odashi
Author

odashi commented Oct 31, 2021

Every statistical test is meaningful if and only if the underlying hypothesis is suitable. For bootstrap resampling, the H0 of the test is that this particular system produces the same accuracy as this particular baseline, so it may be usable if the authors really want to reject that H0. But some papers accidentally introduce this test to argue the significance of a method that involves some model distribution. That kind of judgement should be infeasible unless the method produces the same system every time.
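
To make the distinction explicit (my own formalization, just for illustration): the bootstrap's null hypothesis concerns two specific trained checkpoints, while the claim many papers want to make concerns the distribution of checkpoints induced by random training.

```latex
% H0 actually tested by bootstrap resampling: two fixed checkpoints S_A, S_B
% score the same on test set T under metric m.
H_0^{\mathrm{boot}}:\quad m(S_A, T) = m(S_B, T)

% H0 needed for a claim about the *method*: equality in expectation over the
% training randomness (seed, initialization, data order).
H_0^{\mathrm{method}}:\quad
  \mathbb{E}_{S \sim \mathcal{T}_A}\bigl[m(S, T)\bigr]
  = \mathbb{E}_{S \sim \mathcal{T}_B}\bigl[m(S, T)\bigr]
```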

too small of an effect size

Yes, this is also a problem caused by ignoring model variance...

neubig changed the title from "Consider to avoid bootstrap resampling test" to "Consider model variance in bootstrap resampling test" on Feb 5, 2022