
Adding MIRACL Retrieval #642

Merged

Conversation


@thakur-nandan thakur-nandan commented May 6, 2024

I am adding MIRACL Retrieval as discussed in #198.

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (see the sketch after this list).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). See the discussion below; I am getting lower eval scores than reported.
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
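
For reference, a minimal sketch of how the two models listed above can be run on the task from Python, mirroring the `mteb run` CLI command. It assumes the task is registered under the name "MIRACLRetrieval"; this is illustrative, not code from this PR:

```python
from sentence_transformers import SentenceTransformer

from mteb import MTEB

# "MIRACLRetrieval" is the assumed task name for this PR's addition; adjust if it differs.
evaluation = MTEB(tasks=["MIRACLRetrieval"])

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation.run(model, output_folder=f"results/{model_name}")
```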

Thank you for waiting. I have the MIRACL Retrieval nDCG@10 scores ready for the following model: intfloat/multilingual-e5-small. I achieved much lower scores than those reported in Table 6 of the E5 paper (https://arxiv.org/abs/2402.05672). I am running the mContriever model (link) and will update the PR once I have all scores compared against the MIRACL 2CR (link).

I was hoping someone could look into the reproduction gap and help find the issue.

| MIRACL dev (language) | Original (reported) | MTEB (repro) |
|---|---|---|
| ar | 0.714 | 0.678 |
| bn | 0.682 | 0.672 |
| de | - | 0.434 |
| en | 0.480 | 0.425 |
| es | 0.512 | 0.455 |
| fa | 0.533 | 0.467 |
| fi | 0.733 | 0.699 |
| fr | 0.476 | 0.403 |
| hi | 0.552 | 0.510 |
| id | 0.507 | 0.473 |
| ja | 0.636 | 0.590 |
| ko | 0.612 | 0.591 |
| ru | 0.591 | 0.542 |
| sw | 0.684 | 0.652 |
| te | 0.813 | 0.793 |
| th | 0.750 | 0.697 |
| yo | - | 0.124 |
| zh | 0.459 | 0.375 |

Regards,
Nandan

@thakur-nandan thakur-nandan mentioned this pull request May 6, 2024

thakur-nandan commented May 6, 2024

I'm not sure whether the languages covered by MIRACL count as new languages for the bonus points.

Nevertheless, I have added 2 points for adding the MIRACL dataset.

Hope it helps!


@imenelydiaker imenelydiaker left a comment


All good! Thank you for this great addition; my comments are below 🙂

```python
_LANGS = {"de": ["deu-Latn"], "es": ["spa-Latn"]}
_EVAL_SPLIT = "dev"

_LANGUAGES = {
```

fin, ind, swa and yor are new languages, so you gain 4 × 4 = 16 bonus points!

Comment on lines 137 to 138
```python
n_samples=None,
avg_character_length=None,
```

Here you can sum the corpus lengths across all languages and report their average character length; a quick sketch of one way to compute it is below.
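
A minimal sketch, assuming the usual BEIR-style corpus layout used by MTEB retrieval tasks (`corpus[lang][doc_id] = {"title": ..., "text": ...}`); the helper name and structure are illustrative, not code from this PR:

```python
# Hypothetical helper: average character length across the corpora of all languages.
def average_character_length(corpus: dict[str, dict[str, dict[str, str]]]) -> float:
    lengths = [
        len((doc.get("title", "") + " " + doc.get("text", "")).strip())
        for lang_corpus in corpus.values()  # one corpus per language
        for doc in lang_corpus.values()
    ]
    return sum(lengths) / len(lengths)
```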

"hi": ["hin-Deva"],
"id": ["ind-Latn"],
"ja": ["jpn-Jpan"],
"ko": ["kor-Kore"],

You can remove the Korean MIRACL task file since it's included here.

```
@@ -0,0 +1 @@
{"GitHub": "thakur-nandan", "New dataset": 2}
```

Suggested change:

```diff
- {"GitHub": "thakur-nandan", "New dataset": 2}
+ {"GitHub": "thakur-nandan", "New dataset": 18}
```


Andrian0s commented May 7, 2024

@imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.

@thakur-nandan if you have known results for a non-E5 model, could you rerun with that and confirm that the discrepancy is at least smaller then?
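
For reference, a minimal sketch of the prompt handling being discussed: the multilingual-e5 models expect "query: " / "passage: " prefixes to be prepended before encoding. The wrapper functions below are illustrative, not mteb's evaluator code:

```python
from sentence_transformers import SentenceTransformer

# Illustrative wrappers (not mteb's actual evaluator): multilingual-e5 models expect
# "query: " and "passage: " prefixes to be prepended before encoding.
model = SentenceTransformer("intfloat/multilingual-e5-small")

def encode_queries(queries: list[str]):
    return model.encode(["query: " + q for q in queries], normalize_embeddings=True)

def encode_corpus(passages: list[str]):
    return model.encode(["passage: " + p for p in passages], normalize_embeddings=True)
```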

@imenelydiaker

> @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.

So you're saying there is an issue with the RetrievalEvaluator?

@Andrian0s

> > @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.
>
> So you're saying there is an issue with the RetrievalEvaluator?

Yes. I plan to make a more elaborate check and open an issue with all the necessary information (I am not very available for the next 1.5 days; if it's urgent, feel free to pick it up).

This is also affecting my PR #645 in the same way: multilingual-e5 models underperform because of it.


imenelydiaker commented May 7, 2024

> > > @imenelydiaker @thakur-nandan I will soon open an issue about E5 performance reproduction. The issue (at least through the retrieval evaluator) is that we don't prepend the correct prompt. I have verified this by inspecting the input that goes into `model.encode`.
> >
> > So you're saying there is an issue with the RetrievalEvaluator?
>
> Yes. I plan to make a more elaborate check and open an issue with all the necessary information (I am not very available for the next 1.5 days; if it's urgent, feel free to pick it up).
>
> This is also affecting my PR #645 in the same way: multilingual-e5 models underperform because of it.

I guess that if we're not passing the correct prompt to the evaluator, then the issue is not only with E5 but with other models as well? It would be nice if you could open an issue with your observations; I'll take a look at it then and try to fix it.


thakur-nandan commented May 7, 2024

@Muennighoff @KennethEnevoldsen @imenelydiaker I found an issue with the retrieval dataset evaluation: a (query_id, doc_id) pair is always explicitly removed when the two IDs are identical. This was introduced in BEIR to avoid self-retrieval in Quora and ArguAna, but it leads to lower performance on MIRACL.

After including the following changes, I'm running mContriever on MIRACL Retrieval for all languages and checking the scores. In a quick evaluation on Yoruba I achieve 0.4182 nDCG@10 with MTEB (originally reported: 0.415).

@KennethEnevoldsen

@thakur-nandan it seems like this PR will influence the scores of other tasks, which might be problematic for comparisons. @Muennighoff what is the best approach here?

I see two potential solutions:

  1. Update the scores on Quora and ArguAna to use the new scoring, or do it only for MIRACL (this seems problematic for comparison).
  2. Alternatively, use both scores, nDCG@10 and nDCG@10 without self-retrieval (I believe this approach is best).

@Muennighoff

I think @thakur-nandan probably knows best how to reconcile it with Quora & ArguAna as he created them?
The 2nd approach sounds good to me.

Comment on lines +157 to +164
```python
if len(result_heaps[query_id]) < top_k:
    # Push item onto the heap
    heapq.heappush(result_heaps[query_id], (score, corpus_id))
else:
    # If the item is larger than the smallest in the heap, push it and pop the smallest element
    heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
```

According to the previous discussion and my understanding, I think we can't really remove this condition, since it was introduced by BEIR for Quora & ArguAna, which are included in MTEB.

In this case, I'd say it would be better to have a flag to enable/disable the self-retrieval filtering, so that the evaluator can also be used with MIRACL (a rough sketch of such a flag follows below). This would mean keeping both scores, as @KennethEnevoldsen suggested.

Otherwise, if removing the self-retrieval filter and re-running Quora and ArguAna makes sense, then we should do it. Let me know what you think @thakur-nandan?
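
A minimal sketch of what such a flag could look like around the top-k loop quoted above, with assumed names (`add_hit`, `ignore_identical_ids`); this is not the actual implementation in this PR:

```python
import heapq

# Hypothetical variant of the top-k loop with a flag: when ignore_identical_ids is True,
# hits whose corpus_id equals the query_id are skipped (the BEIR behaviour needed for
# Quora/ArguAna); for MIRACL the flag would be False so such hits still count.
def add_hit(result_heaps, query_id, corpus_id, score, top_k, ignore_identical_ids):
    if ignore_identical_ids and corpus_id == query_id:
        return  # skip self-retrieval hit
    heap = result_heaps.setdefault(query_id, [])
    if len(heap) < top_k:
        # Push item onto the heap
        heapq.heappush(heap, (score, corpus_id))
    else:
        # If the new score beats the smallest in the heap, push it and pop the smallest
        heapq.heappushpop(heap, (score, corpus_id))
```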


thakur-nandan commented May 10, 2024

Thanks for checking this PR.

So, the scores will not be affected, as self-retrieval is also double-checked during evaluation here via the flag `ignore_identical_ids` set to `True`, which is the desirable way to go.

`if ignore_identical_ids:`

Hence, AFAIK we can safely remove the `if corpus_id != query_id:` line that is included in the PR @imenelydiaker @KennethEnevoldsen. A simplified sketch of what that evaluation-time check does follows below.
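
A simplified sketch of that evaluation-time check (assumed helper name; not the exact evaluator code):

```python
# Simplified sketch of the `ignore_identical_ids` check: before computing metrics,
# drop any retrieved document whose ID equals the query ID.
def drop_identical_ids(results: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    for query_id, doc_scores in results.items():
        for doc_id in list(doc_scores):
            if doc_id == query_id:
                doc_scores.pop(doc_id)
    return results
```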

I have two suggestions here:

(1) Keep the code as is with `ignore_identical_ids=True`, but inform users that query IDs and document IDs should be distinct from each other; e.g. for MIRACL I pass `ignore_identical_ids=False`.
(2) Change the default to `ignore_identical_ids=False`, but make sure to either hard-code it or remind task authors to set `ignore_identical_ids=True` for ArguAna and Quora in BEIR.

Since you are the PR reviewers, the veto power lies with you and I'll let you all decide: @Muennighoff @KennethEnevoldsen @imenelydiaker.

Thanks,
Nandan

@KennethEnevoldsen

@thakur-nandan I believe option 2 is the desirable option, though I would not want the user to have to switch it. Instead, I would either a) create two separate scores (one with and one without self-retrieval filtering) or b) allow the argument to be overwritten during dataset construction:

```python
class ArguAna(AbsTaskRetrieval):
    ignore_identical_ids = True

    metadata = TaskMetadata(
        name="ArguAna",
        ...
    )
```

Whichever approach you want to implement is fine with me, but I would probably prefer a) (however, I will accept either; if one is easier to implement, go for that one). A sketch of option a) is below.
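
A minimal sketch of option a), assuming a generic `evaluate` callable and illustrative metric keys (neither is the evaluator's actual API):

```python
# Hypothetical aggregation for option a): report nDCG@10 both with and without the
# self-retrieval filter, so existing tasks (Quora/ArguAna) remain comparable while
# MIRACL also gets the unfiltered score.
def score_with_and_without_self(qrels, results, evaluate):
    filtered = evaluate(qrels, results, ignore_identical_ids=True)
    unfiltered = evaluate(qrels, results, ignore_identical_ids=False)
    return {
        "ndcg_at_10": filtered["ndcg_at_10"],
        "ndcg_at_10_with_self": unfiltered["ndcg_at_10"],
    }
```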

@imenelydiaker

@thakur-nandan we'll go for option 2 as @KennethEnevoldsen described; we would love your help on this! 🙂

@KennethEnevoldsen

@thakur-nandan I would love to get this PR merged in as soon as possible. Would you have the time to do this?


thakur-nandan commented May 21, 2024

Hi @KennethEnevoldsen @imenelydiaker, thanks for your suggestions on the topic. I'll start with the 2 a) suggestion of keeping separate nDCG@10 scores with and without self-retrieval. I haven't had time recently to look at the PR, but I will try to get it done by tomorrow EoD.

Regards,
Nandan

@KennethEnevoldsen

Wonderful to hear, @thakur-nandan! I'll keep an eye out for it so that the review can be resolved quickly.

@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 22, 2024
@KennethEnevoldsen

@imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

@imenelydiaker

> @imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

Yes, will do!

@thakur-nandan

@KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self-retrieval metric as a separate score. Feel free to use it and finish the PR. My cycles for this week are limited, and I will not be able to finish this PR.

Apologies for the delay!

@imenelydiaker

> @KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self-retrieval metric as a separate score. Feel free to use it and finish the PR. My cycles for this week are limited, and I will not be able to finish this PR.
>
> Apologies for the delay!

Thank you @thakur-nandan for this great work, we'll finish it up! 🙂

@imenelydiaker imenelydiaker changed the base branch from main to miracl-retrieval May 27, 2024 14:15
@imenelydiaker

Merging as in #641.

@imenelydiaker imenelydiaker merged commit 99f301a into embeddings-benchmark:miracl-retrieval May 27, 2024