
GPT4-o generated queries for 14 languages #718

Merged

Conversation

rasdani
Contributor

@rasdani rasdani commented May 15, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:
Succinct queries generated by a strong multilingual LLM, grounded in Wikipedia articles nicely chunked by Cohere, should be a strict improvement over many of the machine-translated versions of SQuAD in different languages.
Wikipedia is probably the highest quality (available) corpus in most languages.
see #378

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

WIP: I am running query generation overnight for the remaining 12 languages on this list:

LANG_MAP = {
    "de": "German",
    "bn": "Bengali",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "uk": "Ukrainian",
    "nl": "Dutch",
    "cs": "Czech",
    "ro": "Romanian",
    "bg": "Bulgarian",
    "sr": "Serbian",
    "fi": "Finnish",
    "fa": "Persian",
    "hi": "Hindi",
}

Draft PR for early feedback. @KennethEnevoldsen @Muennighoff happy to hear any suggestions :)

@rasdani
Contributor Author

rasdani commented May 15, 2024

Generated with this prompt and temperature=0.0, max_tokens=512.

Your task is to anticipate possible search queries by users in the form of a question for a given document.
- The question must be written in {{ language }}
- The question should be formulated concretely and precisely and relate to the information from the given document
- The question must be coherent and should make sense without knowing the document
- The question must be answerable by the document
- The question should focus on one aspect and avoid using subclauses connected with 'and'
- The question should not be overly specific and should mimic a request of a user who is just starting to research the given topic
- Do not draw on your prior knowledge

Generate a question in {{ language }} for the following document:
<document>
{{ document }}
</document>

Search query:

During generation "{title}\n\n" was prepended to the chunk.

Query quality was inspected manually by native speakers in German and Bengali.
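
For reference, the generation call was roughly of the following shape (an illustrative sketch, not the actual script; the helper name and the abbreviated prompt constant are my shorthand for the prompt quoted above):

from openai import OpenAI

client = OpenAI()

# Abbreviated; the full instruction block quoted above goes here,
# with {language} and {document} filled in per request.
PROMPT_TEMPLATE = (
    "Your task is to anticipate possible search queries by users in the form of a "
    "question for a given document.\n"
    "- The question must be written in {language}\n"
    "...\n\n"
    "Generate a question in {language} for the following document:\n"
    "<document>\n{document}\n</document>\n\n"
    "Search query:"
)

def generate_query(title: str, chunk: str, language: str) -> str:
    # "{title}\n\n" is prepended to the chunk, as described above
    document = f"{title}\n\n{chunk}"
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": PROMPT_TEMPLATE.format(language=language, document=document),
            }
        ],
    )
    return response.choices[0].message.content.strip()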

@rasdani
Contributor Author

rasdani commented May 15, 2024

I calculated recent log views according to https://huggingface.co/datasets/Cohere/wikipedia-22-12 and applied them to https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.

Per language, I filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k viewed articles.

I selected a random window of 9 consecutive paragraphs per article, chose the middle one as the positive context, and generated a query for it with gpt-4o.
The surrounding 8 paragraphs act as hard negatives and have a score of 0.5 in the qrels dataset.

The 9 paragraphs per article are used for the reranking task, with one positive and 8 negatives.
The one positive, the 8 hard negatives, and the remaining corpus as negatives are used in the retrieval task.

The choice of hard negatives is debatable. I could prepend "{title}\n\n" to the chunks or add more random (true) negatives to the reranking negatives.
As it is now, the German reranking task looks too easy, but the Bengali one is fine.
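
A rough sketch of the sampling and windowing described above (illustrative only; the article field names and the pre-sorted view ranking are assumptions, not the actual pipeline):

import random

WINDOW = 9  # consecutive paragraphs per article

def sample_windows(articles, n_articles=1500, seed=42):
    """articles: list of dicts with "title" and "paragraphs", sorted by recent page views."""
    rng = random.Random(seed)
    # keep only the top 100k viewed articles that have at least 9 paragraphs
    eligible = [a for a in articles[:100_000] if len(a["paragraphs"]) >= WINDOW]
    for article in rng.sample(eligible, n_articles):
        # pick a random window of 9 consecutive paragraphs
        start = rng.randrange(len(article["paragraphs"]) - WINDOW + 1)
        window = article["paragraphs"][start : start + WINDOW]
        positive = window[WINDOW // 2]  # middle paragraph: the query is generated for this one
        hard_negatives = window[: WINDOW // 2] + window[WINDOW // 2 + 1 :]  # the surrounding 8
        yield {"title": article["title"], "positive": positive, "hard_negatives": hard_negatives}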

@rasdani rasdani marked this pull request as ready for review May 15, 2024 22:24
@rasdani
Contributor Author

rasdani commented May 15, 2024

How do I run reranking on a multilingual dataset? I now have the different languages as subsets in https://huggingface.co/datasets/ellamind/wikipedia-2023-11-reranking-multilingual.

But I don't see a way to specify config= in a task. I don't think I can add multiple languages as splits.

Except for the one WikipediaRerankingMultilingual task, I can tick almost all boxes:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

For 11 languages, we have the first retrieval task. So with 11*4 and the other points we already hit the cap of 50 points. Am I correct?

@rasdani
Contributor Author

rasdani commented May 16, 2024

I'm trying to understand this paragraph from the points documentation.

The first dataset for a language x task gains 4 bonus points. If the number of new languages is >= 12 then points for that PR for a new dataset are capped at 50 (12 * 4 + 2 = 48 + 2 = 50).

Not all of my languages are new, so strictly speaking the cap does not apply?
These are the added languages:

languages = [ "de", "bn", "it", "pt", "nl", "cs", "ro", "bg", "sr", "fi", "fa", "hi", "da", "en"]

For these languages the retrieval task is the first of its kind:

["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]

So these would give 11 * 4 = 44 points.

["de", "bn", "en"]

Are new datasets, but already have a retrieval task. So 3 * 2 = 6.

For the WikipediaRerankingMultilingual task, I pulled all the languages together into a single dataset, and there already exists a multilingual reranking task, so 2 points.

This would result in 52 points for this PR.

Or are the 4 bonus points meant to be added on top of the 2 points per dataset?
This would result in 11 * (4+2) + 6 + 2 = 74 points.
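
Just to make the two readings explicit (a quick arithmetic check, not a statement of the points policy):

first_retrieval_for_language = 11   # the 11 languages listed above
already_has_retrieval = 3           # de, bn, en
reranking_dataset = 2               # one combined multilingual reranking dataset

# reading 1: the 4 bonus points replace the 2 dataset points
reading_1 = first_retrieval_for_language * 4 + already_has_retrieval * 2 + reranking_dataset
# reading 2: the 4 bonus points come on top of the 2 dataset points
reading_2 = first_retrieval_for_language * (4 + 2) + already_has_retrieval * 2 + reranking_dataset

print(reading_1, reading_2)  # 52 74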

EDIT:
points.md suggests that the bonus points are added on top per dataset.

{
    "GitHub": "GitHubUser1",
    "New dataset": 2-6,  # 2 points for the dataset and 4 points for the task
    "New task": 2, # e.g. a new style of task (e.g. classification, or retrieval)
    "Dataset annotations": 1, # 1 point for each full dataset annotation
    "Bug fixes": 2-10, # depends on the complexity of the fix
    "Running Models": 1, # pr model run
    "Review PR": 2, # two points pr. reviewer, can be given to multiple reviewers
    "Paper Writing": NA, 
    "Ideation": NA,
    "Coordination": NA
}

EDIT2:
If so, then my points for #197 need to be updated from 4 -> 6.
Can we arrange for my coworkers and me to appear next to each other as coauthors? I can make slight adjustments to the current PR points, if needed.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 17, 2024

Related to points:

I would calculate it as follows:

For the Retrieval dataset (which I do think should be combined):

["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]

is 11 * 4 = 44 and then add 2 for the dataset. So that is 46.

For the reranking you get the scores per language as well:

EVAL_LANGS = {
    "bg": ["bul-Cyrl"],
    "bn": ["ben-Beng"],
    "cs": ["ces-Latn"],
    "da": ["dan-Latn"],
    "de": ["deu-Latn"],  # has reranking
    "en": ["eng-Latn"],  # has reranking
    "fa": ["fas-Arab"],
    "fi": ["fin-Latn"],
    "hi": ["hin-Deva"],
    "it": ["ita-Latn"],
    "nl": ["nld-Latn"],
    "pt": ["por-Latn"],
    "ro": ["ron-Latn"],
    "sr": ["srp-Cyrl"],
}

So that would be 12 * 4 + 2 = 48 + 2 = 50.

So in total, you get 46 + 50 = 96.

You can still max out the bonus for both datasets by adding 3 more languages (up to you if you feel like it is worth it). If you want to do that, I can review Swedish and Norwegian as well.

@rasdani
Contributor Author

rasdani commented May 17, 2024

Great, thanks for reviewing! :)
Points are more than enough, but since I already have wikipedia-no ready, I can add that as well.

Could you give me a hint on how best to upload multilingual datasets? Right now I have the languages as dataset configs, which show up as the subset drop-down menu on the HF Hub. Passing the languages as eval_langs= to a task did not work for me.
I dug deeper into the code base, and the only thing I came up with was to add a config= kwarg at the point where the dataset actually gets loaded. But since this is in core MTEB, I thought there must be another way at the task level.
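
For context, this is roughly what I mean by passing a config at load time (a sketch, assuming the language codes are the config/subset names on the Hub):

from datasets import load_dataset

# each language lives in its own config (the subset drop-down on the Hub),
# so the only way I found is to load per config and key by language
langs = ["de", "bn"]  # etc.
reranking_data = {
    lang: load_dataset("ellamind/wikipedia-2023-11-reranking-multilingual", lang)
    for lang in langs
}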

@x-tabdeveloping
Contributor

Are we sure we want machine-generated datasets? If we don't take machine-translated ones, why should we take machine-generated ones?

@KennethEnevoldsen
Contributor

Are we sure we want machine-generated datasets? If we don't take machine-translated ones, why should we take machine-generated ones?

@x-tabdeveloping we had the discussion in an issue beforehand. I believe the quality is good enough to warrant inclusion (definitely better than e.g. retrieval based on article headlines, I would argue). That being said, it might introduce odd biases. We can definitely examine whether that is the case once we start running models.

@rasdani
Contributor Author

rasdani commented May 17, 2024

Machine-translated ones often translate whole passages, and not all translation services are good.
In the current dataset, passages are human-written and sampled from the top articles by page views. Only short queries are generated, with the strongest currently available multilingual LLM (gpt-4o).

Generating with temperature 0 and the current prompt basically just 'rephrases' the provided human-written document into a single, succinct question.

@KennethEnevoldsen
Contributor

@rasdani and @x-tabdeveloping, this does raise an interesting point for the discussion section of the paper: can datasets such as these approximate the performance of high-quality datasets? E.g. a comparison between MIRACL and these seems reasonable.

Btw. @rasdani, it seems like the tests fail; will you have a look at it? (It seems to be due to the mock test overwriting the datasets concatenate method.)

@KennethEnevoldsen
Contributor

@rasdani I would love to have this PR merged in. Will you have a look at the tests? Then I believe it is ready to merge.

@rasdani
Contributor Author

rasdani commented May 21, 2024 via email

@rasdani
Contributor Author

rasdani commented May 21, 2024

I added "no" and "sv":
https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-no-queries
https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-sv-queries

I managed to fix the MultilingualReranking task and added results.
However, I'm stuck with some missing import for the MultilingualRetrieval task ("Task not found" in the terminal), and I'm hitting HuggingFace upload limits for the multilingual retrieval dataset.

Will try to finish up tomorrow. If you can spot whether I'm missing an import somewhere, please let me know.

When the MultilingualRetrieval task works, I will delete the language-specific retrieval tasks.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Check the Norwegian and Swedish both look reasonable!

docs/mmteb/points.md (outdated)
@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 22, 2024
@rasdani
Contributor Author

rasdani commented May 22, 2024

I modified the points for this current PR such that we all end up with the same number of total points, accounting for the (corrected) 6 points of my old PR.
This way we should end up next to each other on the paper.

{"GitHub": "rasdani", "New dataset": 20}
{"GitHub": "ShawonAshraf", "New dataset": 26}
{"GitHub": "bjoernpl", "New dataset": 26}
{"GitHub": "jphme", "New dataset": 26}
{"GitHub": "KennethEnevoldsen", "Review PR": 2}

qrels_lang_dict = {}
for qrel in qrels_lang:
    if qrel["score"] == 0.5:
        continue
Contributor Author

Unfortunately, I couldn't get WikipediaRetrievalMultilingual running with the hard negatives. I assumed one can just set scores between 0 and 1 in the qrels dataset, e.g. 0.5 for hard negatives.
https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-qrels

But MTEB expects int at some point, so I implemented a workaround that just drops all the hard negatives. Would be nice if we didn't waste them.
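
For reference, the workaround amounts to roughly the following (a sketch; the "query-id"/"corpus-id" field names are assumptions based on common qrels layouts, not necessarily the exact columns):

qrels_lang_dict = {}
for qrel in qrels_lang:
    if qrel["score"] == 0.5:
        # hard negatives carry a score of 0.5; MTEB expects integer relevance
        # judgements, so they are simply dropped for now
        continue
    # assumed column names: "query-id" and "corpus-id"
    qrels_lang_dict.setdefault(qrel["query-id"], {})[qrel["corpus-id"]] = int(qrel["score"])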

Contributor

I would create an issue on this, and then we can come back to it later.

@KennethEnevoldsen KennethEnevoldsen merged commit 411e232 into embeddings-benchmark:main May 23, 2024
7 checks passed
dokato pushed a commit to dokato/mteb that referenced this pull request May 24, 2024
GPT4-o generated queries for 14 languages (#718)

* first proper upload of wikipedia-retrieval dataset

* update license and README of dataset

* fix test split and add WikipediaRetrievalDE task

* add WikipediaRerankingDE task

* add Bengali tasks

* multilingual reranking dataset

* add Multilingual Reranking

* add Retrieval tasks

* update metadata for Reranking task

* run make lint

* fix metadata validation errors

* delete German and Bengali Reranking tasks

* fix more task metadata, tests passing now

* add retrieval results

* WIP: reranking with multilingual dataset

* undo changes to run script

* update points and contributor info

* subcall MultilingualTask for reranking task and add reranking results

* WIP: make retrieval a multilingual dataset, too

* WIP: first run of WikipediaRetrievalMultilingual

* add WikipediaRetrievalMultilingual task and results

* delete language specific retrieval tasks and results

* update points and add Openreview IDs

* make lint

* remove debugging print statement