Adding MIRACL Retrieval #642
Conversation
merge latest mteb branch with main
I'm not sure whether the languages covered in MIRACL are new enough to qualify for a bonus. Nevertheless, I have added 2 points for adding the MIRACL dataset. Hope it helps!
All good! Thank you for this great addition, below are my comments 🙂
```python
_LANGS = {"de": ["deu-Latn"], "es": ["spa-Latn"]}
_EVAL_SPLIT = "dev"
```

```python
_LANGUAGES = {
```
`fin`, `ind`, `swa` and `yor` are new languages, you gain 4*4 bonus points!
```python
n_samples=None,
avg_character_length=None,
```
Here you can sum up the corpus lengths over all languages and give their average character length.
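(A minimal sketch of how that could be computed, assuming the corpus is a per-language dict of `{doc_id: {"title": ..., "text": ...}}` as in other MTEB retrieval tasks; the function name and structure are illustrative:)

```python
# Sketch: average character length over all documents, aggregated across
# languages. Assumes corpus[lang] maps doc ids to {"title", "text"} dicts;
# adapt to the actual dataset structure.
def avg_character_length(corpus: dict[str, dict[str, dict[str, str]]]) -> float:
    total_chars = 0
    total_docs = 0
    for lang_corpus in corpus.values():
        for doc in lang_corpus.values():
            total_chars += len(doc.get("title", "")) + len(doc.get("text", ""))
            total_docs += 1
    return total_chars / total_docs if total_docs else 0.0
```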
"hi": ["hin-Deva"], | ||
"id": ["ind-Latn"], | ||
"ja": ["jpn-Jpan"], | ||
"ko": ["kor-Kore"], |
You can remove the Korean MIRACL task file since it's included here.
```
@@ -0,0 +1 @@
{"GitHub": "thakur-nandan", "New dataset": 2}
```
{"GitHub": "thakur-nandan", "New dataset": 2} | |
{"GitHub": "thakur-nandan", "New dataset": 18} |
@imenelydiaker @thakur-nandan I am soon opening an issue about e5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't append the correct prompt. I have verified this by observing what goes into `model.encode`. @thakur-nandan, if you have known results for a non-e5 model, can you rerun with that and confirm that the discrepancy is at least smaller then?
So you're saying there is an issue with the retrieval evaluator?
Yes. I plan to make a more elaborate check and put this up in the issues with all the necessary information (I am not very available in the next 1.5 days; if it's urgent, feel free to pick it up). This is also affecting my PR #645 in the same way: the multilingual-e5 models underperform because of it.
I guess that if we're not appending the correct information in the evaluator, then the issue is not only with E5 but with other models as well? It would be nice if you could open an issue with your observations; I'll take a look at it then and try to fix it.
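(For context: the multilingual-e5 model cards require instruction prefixes on every input, which is what the evaluator was reportedly not adding. A minimal sketch of the expected convention using sentence-transformers; the example strings are illustrative:)

```python
# Sketch: E5-family models expect "query: " / "passage: " prefixes
# (per the intfloat/multilingual-e5 model cards); without them the
# scores degrade, which matches the reproduction gap described above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

queries = ["query: " + q for q in ["what is MIRACL?"]]
passages = ["passage: " + p for p in ["MIRACL is a multilingual retrieval benchmark."]]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # cosine similarities (embeddings are normalized)
```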
…ce on miracl and other datasets
@Muennighoff @KennethEnevoldsen @imenelydiaker I found an issue with the retrieval dataset evaluation: the evaluator drops corpus documents whose id matches the query id (self-retrieval), which is wrong for MIRACL. After including the following changes, I'm running mContriever scores on MIRACL retrieval for all languages and checking them. A quick evaluation on Yoruba: I can achieve 0.4182 nDCG@10 with MTEB (original reported: 0.415).
@thakur-nandan it seems like this PR will influence the scores of other tasks, which might be problematic for comparisons. @Muennighoff what is the best approach here? I see two potential solutions: (1) keep both scores (with and without self-retrieval), or (2) remove the self-retrieval handling and re-run the affected tasks.
I think @thakur-nandan probably knows best how to reconcile it with Quora & ArguAna as he created them?
```python
if len(result_heaps[query_id]) < top_k:
    # Push item on the heap
    heapq.heappush(result_heaps[query_id], (score, corpus_id))
else:
    # If item is larger than the smallest in the heap,
    # push it on the heap then pop the smallest element
    heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
```
According to the previous discussion and my understanding, I think we can't simply remove this condition, since it was introduced by BEIR for Quora & ArguAna, which are included in MTEB.
In this case, I'd say it would be better to have a flag to activate/deactivate self-retrieval, so that the evaluator can also be used with MIRACL. This would mean we keep both scores, as @KennethEnevoldsen suggested.
Otherwise, if removing self-retrieval and re-running Quora and ArguAna makes sense, then we should do it. Let me know what you think @thakur-nandan?
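(A minimal sketch of what such a flag could look like around the heap code quoted above; the function and argument names are illustrative, not the final API:)

```python
import heapq

def push_result(result_heaps, query_id, corpus_id, score, top_k,
                ignore_identical_ids=True):
    """Keep the top_k (score, corpus_id) pairs for each query.

    ignore_identical_ids=True reproduces the BEIR behaviour needed for
    Quora & ArguAna (a document never retrieves itself); MIRACL would
    run with it set to False.
    """
    if ignore_identical_ids and corpus_id == query_id:
        return  # drop self-retrieval hits
    if len(result_heaps[query_id]) < top_k:
        heapq.heappush(result_heaps[query_id], (score, corpus_id))
    else:
        # Replace the smallest element if the new score is larger
        heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
```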
Thanks for checking this PR. So, the scores will not be affected, as self-retrieval is double-checked during evaluation as well, via the flag. Hence, AFAIK we can safely remove the line. I have two suggestions here: (1) keep the code as is, with the flag; or (2) remove the self-retrieval check and expose a flag for tasks that need it. Since you are the PR reviewers, the veto power lies with you and I'll let you all decide: @Muennighoff @KennethEnevoldsen @imenelydiaker. Thanks,
@thakur-nandan I believe option 2 is the desirable option, though I would not want the user to switch it. Instead, I would a) create two separate scores (one with and one without self-retrieval) or b) allow the argument to be overwritten during dataset construction:
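(A minimal sketch of option b; the class and attribute names are assumptions, not the actual mteb API:)

```python
# Sketch of option (b): the base task class carries the default, and a
# subclass (or constructor argument) overrides it at construction time.
# Class and attribute names are illustrative.
class RetrievalTask:
    ignore_identical_ids: bool = True  # BEIR default (Quora, ArguAna)

class MIRACLRetrieval(RetrievalTask):
    # MIRACL qrels legitimately contain documents sharing the query's id,
    # so self-retrieval must not be filtered out here.
    ignore_identical_ids = False
```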
Either approach is fine with me, but I would probably prefer a) (however, I will accept either; if one is easier to implement, go for that one).
@thakur-nandan we would go for option 2 as @KennethEnevoldsen mentioned; we would love your help on this! 🙂
@thakur-nandan I would love to get this PR merged in as soon as possible. Would you have the time to do this?
Hi @KennethEnevoldsen @imenelydiaker, thanks for your suggestions on the topic. I'll start with suggestion 2(a): keeping separate scores for nDCG@10 with and without self-retrieval. I haven't had time recently to look at the PR, but I will try to get it done by EoD tomorrow. Regards,
Wonderful to hear @thakur-nandan! I'll keep an eye out for it so the review can be resolved quickly.
@imenelydiaker any chance you can finish up this PR? I have started finishing up #641.
Will do, yes!
@KennethEnevoldsen @imenelydiaker, I just added the self-retrieval nDCG@10 metric score separately. Feel free to use it and finish the PR. My cycles this week are limited and I will not be able to finish this PR. Apologies for the delay!
Thank you @thakur-nandan for this great work, we'll finish it up! 🙂
Merging as in #641.
I am adding MIRACL Retrieval as discussed in #198.
Checklist for adding MMTEB dataset

Reason for dataset addition:

- I have tested that the dataset runs with the `mteb` package.
- I have run the following models on the task (adding the results to the PR); these can be run using the `mteb run -m {model_name} -t {task_name}` command:
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- I have run `make test` locally to make sure nothing is broken.
- I have run the code formatter with `make lint`.
- I have added points for my submission to the points folder, using the PR number as the filename (e.g. `438.jsonl`).

Thank you for waiting. I have the MIRACL Retrieval nDCG@10 scores ready for the following model: `intfloat/multilingual-e5-small`. I achieved much lower scores than reported in the E5 paper, Table 6 (https://arxiv.org/abs/2402.05672). I am running the mContriever model (link) and will update the PR once I have all scores compared against the MIRACL 2CR (link). I was hoping someone could look into the difference in reproduction and find the issue.
Regards,
Nandan
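(For anyone reproducing these numbers, a minimal sketch of running one of the listed models with the `mteb` package; the task name `MIRACLRetrieval` is an assumption, so check the task file for the registered name:)

```python
# Sketch: reproduce a retrieval run via the mteb Python API.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["MIRACLRetrieval"])  # task name is an assumption
evaluation.run(model, output_folder="results/multilingual-e5-small")
```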