rougeL returns 0 score on perfect prediction in some languages #441

Open

yoavkatz opened this issue Jan 1, 2024 · 1 comment

yoavkatz commented Jan 1, 2024

1. Change xlsum.py to run on all languages (remove the if lang == langs[0]: guard).
2. Run python prepare/cards/xlsum.py

Traceback (most recent call last):
File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 47, in test_preprations
import_module_from_file(file)
File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 27, in import_module_from_file
spec.loader.exec_module(module)
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/home/runner/work/unitxt/unitxt/prepare/cards/xlsum.py", line 42, in
test_card(card, debug=False)
File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 238, in test_card
test_with_eval(
File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 184, in test_with_eval
raise AssertionError(error_message)
AssertionError: The results of running the main metric used in the card (rougeL) over simulated predictions that are equal to the references return a different score than expected.
One would expect a perfect score of 1.0 in this case, but the returned metric score was 0.0.
This usually indicates an error in the metric or post processors, but can also be an acceptable edge case.
In any case, this requires a review. If this is acceptable, set strict=False in the call to test_card().
The predictions passed to the metrics were:
['በታይዋን ከአንዲት ሴት አይን ውስጥ ዶክተሮች አራት ንቦችን አወጡ። በደሴቲቱ እንዲህ አይነት ነገር ታይቶም ተሰምቶም አይታወቅም ሲሉ ተናግረዋል።', 'ከሰሞኑ ባለቤትነታቸው የአረና ትግራይ ፓርቲ አባል ናቸው የተባሉ አስራ ስድስት ፍየሎች የመታሰራቸው ዜና የማህበራዊ ሚዲያ ተጠቃሚዎች መነጋገሪያ ሆኖ ቆይቷል።', 'የአሜሪካው ፕሬዝደንት ዶናልድ ትራምፕ ቲክ ቶክ የተሰኘው የተንቀሳቃሽ ምስሎች መጋሪያ በአሜሪካ ድርጅት ካልተገዛ ሊያግዱት እንደሚችሉ አስጠንቅቀዋል።']

@dafnapension

My 2 cents, having dug into this a bit:
Rouge applies its default tokenizer as the first step of computing the score. When lang = nepali, for example, no tokens are identified in either the prediction or the target, so the score is 0 for all three examples. When lang = marathi (!!) it happily latches onto a '22' that appears in the string, so both target and prediction are 'tokenized' to ['22'], and that counts as a hit; since this is one of three examples, the final score is 0.3333 for this language.

I am not sure how to automatically recognize the language, whether to rely on the known language in this case, or where to pull an adequate tokenizer from. A whitespace tokenizer does some of the work, but that is only a proof of concept (see the sketch below).
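For illustration, here is a minimal sketch of the behaviour described above, calling Google's rouge_score package directly rather than going through the unitxt code path. The WhitespaceTokenizer helper and the use of the tokenizer argument are assumptions made for this sketch, not code taken from the repository:

from rouge_score import rouge_scorer

text = "በታይዋን ከአንዲት ሴት አይን ውስጥ ዶክተሮች አራት ንቦችን አወጡ።"

# The default tokenizer keeps only [a-z0-9] runs, so this Amharic text yields
# no tokens and an identical prediction/reference pair still scores 0.
default_scorer = rouge_scorer.RougeScorer(["rougeL"])
print(default_scorer.score(text, text)["rougeL"].fmeasure)  # 0.0

# A naive whitespace tokenizer (hypothetical helper, just to prove the concept).
class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.split()

ws_scorer = rouge_scorer.RougeScorer(["rougeL"], tokenizer=WhitespaceTokenizer())
print(ws_scorer.score(text, text)["rougeL"].fmeasure)  # 1.0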
