
Adding MIRACL Retrieval #642

Merged

Conversation


@thakur-nandan thakur-nandan commented May 6, 2024

I am adding MIRACL Retrieval as discussed in #198.

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (see the sketch after this list).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). See the discussion below; I am getting lower eval scores than reported.
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
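
For reference, a minimal sketch of how the two models listed above can be run on the task from Python, mirroring the `mteb run` CLI command. It assumes the task is registered under the name "MIRACLRetrieval"; this is illustrative, not code from this PR:

```python
from sentence_transformers import SentenceTransformer

from mteb import MTEB

# "MIRACLRetrieval" is the assumed task name for this PR's addition; adjust if it differs.
evaluation = MTEB(tasks=["MIRACLRetrieval"])

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation.run(model, output_folder=f"results/{model_name}")
```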

Thank you for waiting. I have the MIRACL Retrieval nDCG@10 scores ready for the following model: intfloat/multilingual-e5-small. I achieved much lower scores than those reported in Table 6 of the E5 paper (https://arxiv.org/abs/2402.05672). I am running the mContriever model (link) and will update the PR once I have all scores compared against the MIRACL 2CR (link).

I was hoping someone could look into the reproduction gap and help find the issue.

| MIRACL dev (language) | Original (reported) | MTEB (repro) |
|---|---|---|
| ar | 0.714 | 0.678 |
| bn | 0.682 | 0.672 |
| de | - | 0.434 |
| en | 0.480 | 0.425 |
| es | 0.512 | 0.455 |
| fa | 0.533 | 0.467 |
| fi | 0.733 | 0.699 |
| fr | 0.476 | 0.403 |
| hi | 0.552 | 0.510 |
| id | 0.507 | 0.473 |
| ja | 0.636 | 0.590 |
| ko | 0.612 | 0.591 |
| ru | 0.591 | 0.542 |
| sw | 0.684 | 0.652 |
| te | 0.813 | 0.793 |
| th | 0.750 | 0.697 |
| yo | - | 0.124 |
| zh | 0.459 | 0.375 |

Regards,
Nandan

@thakur-nandan thakur-nandan mentioned this pull request May 6, 2024

thakur-nandan commented May 6, 2024

I'm not sure whether the languages covered by MIRACL count as new languages for the bonus points.

Nevertheless, I have added 2 points for adding the MIRACL dataset.

Hope it helps!


@imenelydiaker imenelydiaker left a comment


All good! Thank you for this great addition; my comments are below 🙂

```python
_LANGS = {"de": ["deu-Latn"], "es": ["spa-Latn"]}
_EVAL_SPLIT = "dev"

_LANGUAGES = {
```

fin, ind, swa and yor are new languages, so you gain 4 × 4 = 16 bonus points!

Comment on lines 137 to 138
```python
n_samples=None,
avg_character_length=None,
```

Here you can sum the corpus lengths across all languages and report their average character length; a quick sketch of one way to compute it is below.
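
A minimal sketch, assuming the usual BEIR-style corpus layout used by MTEB retrieval tasks (`corpus[lang][doc_id] = {"title": ..., "text": ...}`); the helper name and structure are illustrative, not code from this PR:

```python
# Hypothetical helper: average character length across the corpora of all languages.
def average_character_length(corpus: dict[str, dict[str, dict[str, str]]]) -> float:
    lengths = [
        len((doc.get("title", "") + " " + doc.get("text", "")).strip())
        for lang_corpus in corpus.values()  # one corpus per language
        for doc in lang_corpus.values()
    ]
    return sum(lengths) / len(lengths)
```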

"hi": ["hin-Deva"],
"id": ["ind-Latn"],
"ja": ["jpn-Jpan"],
"ko": ["kor-Kore"],

You can remove the Korean MIRACL task file since it's included here.

```
@@ -0,0 +1 @@
{"GitHub": "thakur-nandan", "New dataset": 2}
```

Suggested change:

```diff
- {"GitHub": "thakur-nandan", "New dataset": 2}
+ {"GitHub": "thakur-nandan", "New dataset": 18}
```


Andrian0s commented May 7, 2024

@imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.

@thakur-nandan if you have known results for a non-E5 model, could you rerun with that and confirm that the discrepancy is at least smaller then?
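
For reference, a minimal sketch of the prompt handling being discussed: the multilingual-e5 models expect "query: " / "passage: " prefixes to be prepended before encoding. The wrapper functions below are illustrative, not mteb's evaluator code:

```python
from sentence_transformers import SentenceTransformer

# Illustrative wrappers (not mteb's actual evaluator): multilingual-e5 models expect
# "query: " and "passage: " prefixes to be prepended before encoding.
model = SentenceTransformer("intfloat/multilingual-e5-small")

def encode_queries(queries: list[str]):
    return model.encode(["query: " + q for q in queries], normalize_embeddings=True)

def encode_corpus(passages: list[str]):
    return model.encode(["passage: " + p for p in passages], normalize_embeddings=True)
```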

@imenelydiaker

> @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.

So you're saying there is an issue with the RetrievalEvaluator?

@Andrian0s

> > @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.
>
> So you're saying there is an issue with the RetrievalEvaluator?

Yes. I plan to make a more elaborate check and open an issue with all the necessary information (I am not very available for the next 1.5 days; if it's urgent, feel free to pick it up).

This is also affecting my PR #645 in the same way: multilingual-e5 models underperform because of it.


imenelydiaker commented May 7, 2024

> > > @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.
> >
> > So you're saying there is an issue with the RetrievalEvaluator?
>
> Yes. I plan to make a more elaborate check and open an issue with all the necessary information (I am not very available for the next 1.5 days; if it's urgent, feel free to pick it up).
>
> This is also affecting my PR #645 in the same way: multilingual-e5 models underperform because of it.

I guess that if we're not passing the correct prompt to the evaluator, then the issue is not only with E5 but with other models as well? It would be nice if you could open an issue with your observations; I'll take a look at it then and try to fix it.


thakur-nandan commented May 7, 2024

@Muennighoff @KennethEnevoldsen @imenelydiaker I found an issue with the retrieval dataset evaluation: a (query_id, doc_id) pair is always explicitly removed when the two IDs are identical. This was introduced in BEIR to avoid self-retrieval in Quora and ArguAna, but it leads to lower performance on MIRACL.

After including the following changes, I'm running mContriever on MIRACL Retrieval for all languages and checking the scores. In a quick evaluation on Yoruba I achieve 0.4182 nDCG@10 with MTEB (originally reported: 0.415).

@KennethEnevoldsen

@thakur-nandan it seems like this PR will influence the scores of other tasks, which might be problematic for comparisons. @Muennighoff what is the best approach here?

I see two potential solutions:

  1. Update the scores on Quora and ArguAna to use the new scoring, or do it only for MIRACL (this seems problematic for comparison).
  2. Alternatively, use both scores, nDCG@10 and nDCG@10 without self-retrieval (I believe this approach is best).

@Muennighoff

I think @thakur-nandan probably knows best how to reconcile it with Quora & ArguAna as he created them?
The 2nd approach sounds good to me.

Comment on lines +157 to +164
```python
if len(result_heaps[query_id]) < top_k:
    # Push item onto the heap
    heapq.heappush(result_heaps[query_id], (score, corpus_id))
else:
    # If the item is larger than the smallest in the heap, push it and pop the smallest element
    heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
```

According to the previous discussion and my understanding, I think we can't really remove this condition, since it was introduced by BEIR for Quora & ArguAna, which are included in MTEB.

In this case, I'd say it would be better to have a flag to enable/disable the self-retrieval filtering, so that the evaluator can also be used with MIRACL (a rough sketch of such a flag follows below). This would mean keeping both scores, as @KennethEnevoldsen suggested.

Otherwise, if removing the self-retrieval filter and re-running Quora and ArguAna makes sense, then we should do it. Let me know what you think @thakur-nandan?
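
A minimal sketch of what such a flag could look like around the top-k loop quoted above, with assumed names (`add_hit`, `ignore_identical_ids`); this is not the actual implementation in this PR:

```python
import heapq

# Hypothetical variant of the top-k loop with a flag: when ignore_identical_ids is True,
# hits whose corpus_id equals the query_id are skipped (the BEIR behaviour needed for
# Quora/ArguAna); for MIRACL the flag would be False so such hits still count.
def add_hit(result_heaps, query_id, corpus_id, score, top_k, ignore_identical_ids):
    if ignore_identical_ids and corpus_id == query_id:
        return  # skip self-retrieval hit
    heap = result_heaps.setdefault(query_id, [])
    if len(heap) < top_k:
        # Push item onto the heap
        heapq.heappush(heap, (score, corpus_id))
    else:
        # If the new score beats the smallest in the heap, push it and pop the smallest
        heapq.heappushpop(heap, (score, corpus_id))
```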


thakur-nandan commented May 10, 2024

Thanks for checking this PR.

So, the scores will not be affected, as self-retrieval is also double-checked during evaluation here via the flag `ignore_identical_ids` set to `True`, which is the desirable way to go.

`if ignore_identical_ids:`

Hence, AFAIK we can safely remove the `if corpus_id != query_id:` line that is included in the PR @imenelydiaker @KennethEnevoldsen. A simplified sketch of what that evaluation-time check does follows below.
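
A simplified sketch of that evaluation-time check (assumed helper name; not the exact evaluator code):

```python
# Simplified sketch of the `ignore_identical_ids` check: before computing metrics,
# drop any retrieved document whose ID equals the query ID.
def drop_identical_ids(results: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    for query_id, doc_scores in results.items():
        for doc_id in list(doc_scores):
            if doc_id == query_id:
                doc_scores.pop(doc_id)
    return results
```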

I have two suggestions here:

(1) Keep the code as is with `ignore_identical_ids=True`, but inform users that query IDs and document IDs should be distinct from each other; e.g. for MIRACL I pass `ignore_identical_ids=False`.
(2) Change the default to `ignore_identical_ids=False`, but make sure to either hard-code it or remind task authors to set `ignore_identical_ids=True` for ArguAna and Quora in BEIR.

Since you are the PR reviewers, the veto power lies with you and I'll let you all decide: @Muennighoff @KennethEnevoldsen @imenelydiaker.

Thanks,
Nandan

@KennethEnevoldsen

@thakur-nandan I believe option 2 is the desirable option, though I would not want the user to have to switch it. Instead, I would either a) create two separate scores (one with and one without self-retrieval filtering) or b) allow the argument to be overwritten during dataset construction:

```python
class ArguAna(AbsTaskRetrieval):
    ignore_identical_ids = True

    metadata = TaskMetadata(
        name="ArguAna",
        ...
    )
```

Whichever approach you want to implement is fine with me, but I would probably prefer a) (however, I will accept either; if one is easier to implement, go for that one). A sketch of option a) is below.
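
A minimal sketch of option a), assuming a generic `evaluate` callable and illustrative metric keys (neither is the evaluator's actual API):

```python
# Hypothetical aggregation for option a): report nDCG@10 both with and without the
# self-retrieval filter, so existing tasks (Quora/ArguAna) remain comparable while
# MIRACL also gets the unfiltered score.
def score_with_and_without_self(qrels, results, evaluate):
    filtered = evaluate(qrels, results, ignore_identical_ids=True)
    unfiltered = evaluate(qrels, results, ignore_identical_ids=False)
    return {
        "ndcg_at_10": filtered["ndcg_at_10"],
        "ndcg_at_10_with_self": unfiltered["ndcg_at_10"],
    }
```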

@imenelydiaker

@thakur-nandan we'll go for option 2 as @KennethEnevoldsen described; we would love your help on this! 🙂

@KennethEnevoldsen

@thakur-nandan I would love to get this PR merged in as soon as possible. Would you have the time to do this?


thakur-nandan commented May 21, 2024

Hi @KennethEnevoldsen @imenelydiaker, thanks for your suggestions on the topic. I'll start with the 2 a) suggestion of keeping separate nDCG@10 scores with and without self-retrieval. I haven't had time recently to look at the PR, but I will try to get it done by tomorrow EoD.

Regards,
Nandan

@KennethEnevoldsen

Wonderful to hear, @thakur-nandan! I'll keep an eye out for it so that the review can be resolved quickly.

@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 22, 2024
@KennethEnevoldsen

@imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

@imenelydiaker

> @imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

Yes, will do!

@thakur-nandan

@KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self-retrieval metric as a separate score. Feel free to use it and finish the PR. My cycles for this week are limited, and I will not be able to finish this PR.

Apologies for the delay!

@imenelydiaker

> @KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self-retrieval metric as a separate score. Feel free to use it and finish the PR. My cycles for this week are limited, and I will not be able to finish this PR.
>
> Apologies for the delay!

Thank you @thakur-nandan for this great work, we'll finish it up! 🙂

@imenelydiaker imenelydiaker changed the base branch from main to miracl-retrieval May 27, 2024 14:15
@imenelydiaker

Merging as in #641.

@imenelydiaker imenelydiaker merged commit 99f301a into embeddings-benchmark:miracl-retrieval May 27, 2024