potential issue with SARI n-gram add-score #99

Open
liamcripwell opened this issue Jun 24, 2022 · 2 comments
liamcripwell commented Jun 24, 2022

Hi, I have observed a particular situation with the SARI implementation where a system output can receive a score below 100 even when it is identical to the reference (in the single-reference case).

Basically, if a reference does not introduce any new tokens, an identical system output will receive a 0.00 unigram add-score but 100 for all n-grams with n > 1.

Take the following example:

sources=["Shu Abe (born June 7 1984) is a former Japanese football player."]
predictions=["Shu Abe (born June 7 1984) is a Japanese football player."]
references=[["Shu Abe (born June 7 1984) is a Japanese football player."]]
sari_score = corpus_sari(sources, predictions, references)
print(sari_score)

>>> 91.66666666666667

In this case, the add score will be 75.0 because there are no new unigrams (so the if sys_total > 0: checks in compute_precision_recall_f1() yield 0 for unigrams), but there are technically new bigrams, trigrams, and 4-grams around the location of the deleted word ("a japanese", "a japanese football", "is a japanese", etc.).
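For concreteness, here is a rough sketch of how the component scores appear to combine in this example; the keep/delete values are inferred from the printed total rather than taken from the library's internals:

# Rough breakdown of the example above (values inferred, not printed by the library).
# Add F1 per n-gram order: unigrams give 0 (nothing new is added), while deleting
# "former" creates matching "new" bigrams, trigrams and 4-grams in both the
# prediction and the reference.
add_f1_by_order = [0.0, 100.0, 100.0, 100.0]  # n = 1, 2, 3, 4
add_score = sum(add_f1_by_order) / 4          # 75.0

# Keep and delete are perfect here, since the prediction equals the reference.
keep_score = 100.0
del_score = 100.0

sari = (add_score + keep_score + del_score) / 3
print(sari)  # 91.66666666666667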

I am just curious whether this is the expected behaviour, or whether a definitive 0.00 or 100.0 result for the add-score would be more desirable.

Thanks in advance for any insight.

liamcripwell changed the title from "potential issue with n-gram add-score" to "potential issue with SARI n-gram add-score" on Jun 24, 2022
feralvam (Owner) commented

Hi!

Apologies for the very late reply.

I ran your example with the original implementation of SARI and I got the same score. So, to begin with, this is not an issue with our implementation but rather with the design of the metric itself. I think it would be a good idea to raise this issue in the SARI GitHub repo to get the opinion of the metric's original authors.

It would seem like giving a 0.0 for ADD makes sense in this case, because nothing new is really being added (the "new" bigrams, trigrams, and 4-grams are just artifacts of the deletion). You could possibly extend this logic to the other operations, and always give a zero whenever the unigram-level score for an operation is already zero. I haven't really given this much thought, though. What's your take on this?
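If one wanted to experiment with that idea, a minimal and purely hypothetical sketch could look like the following; adjusted_operation_score and the list-of-F1s input are illustrative assumptions, not part of the current EASSE code:

def adjusted_operation_score(f1_by_order):
    # Hypothetical rule from the comment above: if the unigram-level F1 for an
    # operation (e.g. ADD) is zero, treat the whole operation score as zero, so
    # incidental higher-order n-grams created around an edit cannot inflate it.
    if f1_by_order[0] == 0.0:
        return 0.0
    return sum(f1_by_order) / len(f1_by_order)

# With the example above: [0.0, 100.0, 100.0, 100.0] -> 0.0 instead of 75.0
print(adjusted_operation_score([0.0, 100.0, 100.0, 100.0]))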

liamcripwell (Author) commented

Hi, thanks for the response!

I didn’t expect this to be something wrong with your implementation specifically, but rather a quirk of the metric itself. I just posted here because this library seems to be the best evaluation library for simplification and it already contains some modifications/fixes to the original SARI implementation.

To my intuition, I would expect a score of 100 for cases where the output is identical (or near-identical) to the reference. Even if no n-grams are added, matching the reference suggests the output is about as simple as it can be, so it should receive the highest score even if the transformation only deletes content. It doesn't make sense to me that another example could receive a higher score purely because its reference introduces a new word.

Perhaps it is not necessary to make any changes, since instances of this type should still receive relatively high scores overall (as in the example above), but it seemed like an interesting edge case and I thought it was worth bringing up to see if anyone else had thoughts on it.

Thanks for your time!
