
GPT4-o generated queries for 14 languages #718

Merged

Conversation

rasdani
Contributor

@rasdani rasdani commented May 15, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:
Succinct queries generated by a strong multilingual LLM, grounded in Wikipedia articles nicely chunked by Cohere, should be a strict improvement over many of the machine-translated versions of SQuAD in different languages.
Wikipedia is probably the highest quality (available) corpus in most languages.
see #378

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

WIP: I am running query generation overnight for the remaining 12 languages on this list:

LANG_MAP = {
    "de": "German",
    "bn": "Bengali",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "uk": "Ukrainian",
    "nl": "Dutch",
    "cs": "Czech",
    "ro": "Romanian",
    "bg": "Bulgarian",
    "sr": "Serbian",
    "fi": "Finnish",
    "fa": "Persian",
    "hi": "Hindi",
}

Draft PR for early feedback. @KennethEnevoldsen @Muennighoff happy to hear any suggestions :)

@rasdani
Contributor Author

rasdani commented May 15, 2024

Generated with this prompt and temperature=0.0, max_tokens=512.

Your task is to anticipate possible search queries by users in the form of a question for a given document.
- The question must be written in {{ language }}
- The question should be formulated concretely and precisely and relate to the information from the given document
- The question must be coherent and should make sense without knowing the document
- The question must be answerable by the document
- The question should focus on one aspect and avoid using subclauses connected with 'and'
- The question should not be overly specific and should mimic a request of a user who is just starting to research the given topic
- Do not draw on your prior knowledge

Generate a question in {{ language }} for the following document:
<document>
{{ document }}
</document>

Search query:

During generation "{title}\n\n" was prepended to the chunk.

Query quality was inspected manually by native speakers in German and Bengali.
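
For reference, the generation call was roughly of the following shape (an illustrative sketch, not the actual script; the helper name and the abbreviated prompt constant are my shorthand for the prompt quoted above):

from openai import OpenAI

client = OpenAI()

# Abbreviated; the full instruction block quoted above goes here,
# with {language} and {document} filled in per request.
PROMPT_TEMPLATE = (
    "Your task is to anticipate possible search queries by users in the form of a "
    "question for a given document.\n"
    "- The question must be written in {language}\n"
    "...\n\n"
    "Generate a question in {language} for the following document:\n"
    "<document>\n{document}\n</document>\n\n"
    "Search query:"
)

def generate_query(title: str, chunk: str, language: str) -> str:
    # "{title}\n\n" is prepended to the chunk, as described above
    document = f"{title}\n\n{chunk}"
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": PROMPT_TEMPLATE.format(language=language, document=document),
            }
        ],
    )
    return response.choices[0].message.content.strip()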

@rasdani
Contributor Author

rasdani commented May 15, 2024

I calculated recent log views according to https://huggingface.co/datasets/Cohere/wikipedia-22-12 and applied them to https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.

Per language, I filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k viewed articles.

I selected a random window of 9 consecutive paragraphs per article, chose the middle one as the positive context, and generated a query for it with gpt-4o.
The surrounding 8 paragraphs act as hard negatives and have a score of 0.5 in the qrels dataset.

The 9 paragraphs per article are used for the reranking task, with one positive and 8 negatives.
The one positive, the 8 hard negatives, and the remaining corpus as negatives are used in the retrieval task.

The choice of hard negatives is debatable. I could prepend "{title}\n\n" to the chunks or add more random (true) negatives to the reranking negatives.
As it is now, the German reranking task looks too easy, but the Bengali one is fine.
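
A rough sketch of the sampling and windowing described above (illustrative only; the article field names and the pre-sorted view ranking are assumptions, not the actual pipeline):

import random

WINDOW = 9  # consecutive paragraphs per article

def sample_windows(articles, n_articles=1500, seed=42):
    """articles: list of dicts with "title" and "paragraphs", sorted by recent page views."""
    rng = random.Random(seed)
    # keep only the top 100k viewed articles that have at least 9 paragraphs
    eligible = [a for a in articles[:100_000] if len(a["paragraphs"]) >= WINDOW]
    for article in rng.sample(eligible, n_articles):
        # pick a random window of 9 consecutive paragraphs
        start = rng.randrange(len(article["paragraphs"]) - WINDOW + 1)
        window = article["paragraphs"][start : start + WINDOW]
        positive = window[WINDOW // 2]  # middle paragraph: the query is generated for this one
        hard_negatives = window[: WINDOW // 2] + window[WINDOW // 2 + 1 :]  # the surrounding 8
        yield {"title": article["title"], "positive": positive, "hard_negatives": hard_negatives}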

@rasdani rasdani marked this pull request as ready for review May 15, 2024 22:24
@rasdani
Contributor Author

rasdani commented May 15, 2024

How do I run reranking on a multilingual dataset? I now have the different languages as subsets in https://huggingface.co/datasets/ellamind/wikipedia-2023-11-reranking-multilingual.

But I don't see a way to specify config= in a task. I don't think I can add multiple languages as splits.

Except for the one WikipediaRerankingMultilingual task, I can tick almost all boxes:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

For 11 languages, we have the first retrieval task. So with 11*4 and the other points we already hit the cap of 50 points. Am I correct?

@rasdani
Contributor Author

rasdani commented May 16, 2024

I'm trying to understand this paragraph from the points documentation.

The first dataset for a language x task gains 4 bonus points. If the number of new languages is >= 12 then points for that PR for a new dataset are capped at 50 (12 * 4 + 2 = 48 + 2 = 50).

Not all of my languages are new, so strictly speaking the cap does not apply?
These are the added languages:

languages = [ "de", "bn", "it", "pt", "nl", "cs", "ro", "bg", "sr", "fi", "fa", "hi", "da", "en"]

For these languages the retrieval task is the first of its kind:

["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]

So these would give 11 * 4 = 44 points.

["de", "bn", "en"]

Are new datasets, but already have a retrieval task. So 3 * 2 = 6.

For the WikipediaRerankingMultilingual task, I pulled all the languages together into a single dataset, and there already exists a multilingual reranking task, so 2 points.

This would result in 52 points for this PR.

Or are the 4 bonus points meant to be added on top of the 2 points per dataset?
This would result in 11 * (4+2) + 6 + 2 = 74 points.
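
Just to make the two readings explicit (a quick arithmetic check, not a statement of the points policy):

first_retrieval_for_language = 11   # the 11 languages listed above
already_has_retrieval = 3           # de, bn, en
reranking_dataset = 2               # one combined multilingual reranking dataset

# reading 1: the 4 bonus points replace the 2 dataset points
reading_1 = first_retrieval_for_language * 4 + already_has_retrieval * 2 + reranking_dataset
# reading 2: the 4 bonus points come on top of the 2 dataset points
reading_2 = first_retrieval_for_language * (4 + 2) + already_has_retrieval * 2 + reranking_dataset

print(reading_1, reading_2)  # 52 74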

EDIT:
points.md suggests that the bonus points are added on top per dataset.

{
    "GitHub": "GitHubUser1",
    "New dataset": 2-6,  # 2 points for the dataset and 4 points for the task
    "New task": 2, # e.g. a new style of task (e.g. classification, or retrieval)
    "Dataset annotations": 1, # 1 point for each full dataset annotation
    "Bug fixes": 2-10, # depends on the complexity of the fix
    "Running Models": 1, # pr model run
    "Review PR": 2, # two points pr. reviewer, can be given to multiple reviewers
    "Paper Writing": NA, 
    "Ideation": NA,
    "Coordination": NA
}

EDIT2:
If so, then my points for #197 need to be updated from 4 -> 6.
Can we arrange for my coworkers and me to appear next to each other as coauthors? I can make slight adjustments to the current PR points, if needed.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 17, 2024

Related to points:

I would calculate it as follows:

For the Retrieval dataset (which I do think should be combined):

["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]

is 11 * 4 = 44 and then add 2 for the dataset. So that is 46.

For the reranking you get the scores per language as well:

EVAL_LANGS = {
    "bg": ["bul-Cyrl"],
    "bn": ["ben-Beng"],
    "cs": ["ces-Latn"],
    "da": ["dan-Latn"],
    "de": ["deu-Latn"],  # has reranking
    "en": ["eng-Latn"],  # has reranking
    "fa": ["fas-Arab"],
    "fi": ["fin-Latn"],
    "hi": ["hin-Deva"],
    "it": ["ita-Latn"],
    "nl": ["nld-Latn"],
    "pt": ["por-Latn"],
    "ro": ["ron-Latn"],
    "sr": ["srp-Cyrl"],
}

So that would be 12 * 4 + 2 = 48 + 2 = 50.

So in total, you get 46 + 50 = 96.

You can still max out the bonus for both datasets by adding 3 more languages (up to you if you feel like it is worth it). If you want to do that, I can review Swedish and Norwegian as well.

@rasdani
Contributor Author

rasdani commented May 17, 2024

Great, thanks for reviewing! :)
Points are more than enough, but since I already have wikipedia-no ready, I can add that as well.

Could you give me a hint on how best to upload multilingual datasets? Right now I have the languages as dataset configs, which show up as the subset drop-down menu on the HF Hub. Passing the languages as eval_langs= to a task did not work for me.
I dug deeper into the code base, and the only thing I came up with was to add a config= kwarg at the point where the dataset actually gets loaded. But since this is in core MTEB, I thought there must be another way at the task level.
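
For context, this is roughly what I mean by passing a config at load time (a sketch, assuming the language codes are the config/subset names on the Hub):

from datasets import load_dataset

# each language lives in its own config (the subset drop-down on the Hub),
# so the only way I found is to load per config and key by language
langs = ["de", "bn"]  # etc.
reranking_data = {
    lang: load_dataset("ellamind/wikipedia-2023-11-reranking-multilingual", lang)
    for lang in langs
}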

@x-tabdeveloping
Contributor

Are we sure we want machine-generated datasets? If we don't take machine-translated ones, why should we take machine-generated ones?

@KennethEnevoldsen
Contributor

Are we sure we want machine-generated datasets? If we don't take machine-translated ones, why should we take machine-generated ones?

@x-tabdeveloping we had the discussion in an issue beforehand. I believe the quality is good enough to warrant inclusion (definitely better than e.g. retrieval based on article headlines, I would argue). That being said, it might introduce odd biases. We can definitely examine whether that is the case once we start running models.

@rasdani
Contributor Author

rasdani commented May 17, 2024

Machine-translated ones often translate whole passages, and not all translation services are good.
In the current dataset, passages are human-written and sampled from the top articles by page views. Only short queries are generated, with the strongest currently available multilingual LLM (gpt-4o).

Generating with temperature 0 and the current prompt basically just 'rephrases' the provided human-written document into a single, succinct question.

@KennethEnevoldsen
Contributor

@rasdani and @x-tabdeveloping, this does raise an interesting point for the discussion section of the paper: can datasets such as these approximate the performance of high-quality datasets? E.g. a comparison between MIRACL and these seems reasonable.

Btw. @rasdani, it seems like the tests fail; will you have a look at it? (It seems to be due to the mock test overwriting the datasets concatenate method.)

@KennethEnevoldsen
Contributor

@rasdani I would love to have this PR merged in. Will you have a look at the tests? Then I believe it is ready to merge.

@rasdani
Contributor Author

rasdani commented May 21, 2024 via email

@rasdani
Contributor Author

rasdani commented May 21, 2024

I added "no" and "sv":
https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-no-queries
https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-sv-queries

I managed to fix the MultilingualReranking task and added results.
However, I'm stuck with some missing import for the MultilingualRetrieval task ("Task not found" in the terminal), and I'm hitting HuggingFace upload limits for the multilingual retrieval dataset.

Will try to finish up tomorrow. If you can spot whether I'm missing an import somewhere, please let me know.

When the MultilingualRetrieval task works, I will delete the language-specific retrieval tasks.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Check the Norwegian and Swedish both look reasonable!

docs/mmteb/points.md (outdated)
@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 22, 2024
@rasdani
Contributor Author

rasdani commented May 22, 2024

I modified the points for this current PR such that we all end up with the same number of total points, accounting for the (corrected) 6 points of my old PR.
This way we should end up next to each other on the paper.

{"GitHub": "rasdani", "New dataset": 20}
{"GitHub": "ShawonAshraf", "New dataset": 26}
{"GitHub": "bjoernpl", "New dataset": 26}
{"GitHub": "jphme", "New dataset": 26}
{"GitHub": "KennethEnevoldsen", "Review PR": 2}

qrels_lang_dict = {}
for qrel in qrels_lang:
    if qrel["score"] == 0.5:
        continue
Contributor Author

Unfortunately, I couldn't get WikipediaRetrievalMultilingual running with the hard negatives. I assumed one can just set scores between 0 and 1 in the qrels dataset, e.g. 0.5 for hard negatives.
https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-qrels

But MTEB expects int at some point, so I implemented a workaround that just drops all the hard negatives. Would be nice if we didn't waste them.
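
For reference, the workaround amounts to roughly the following (a sketch; the "query-id"/"corpus-id" field names are assumptions based on common qrels layouts, not necessarily the exact columns):

qrels_lang_dict = {}
for qrel in qrels_lang:
    if qrel["score"] == 0.5:
        # hard negatives carry a score of 0.5; MTEB expects integer relevance
        # judgements, so they are simply dropped for now
        continue
    # assumed column names: "query-id" and "corpus-id"
    qrels_lang_dict.setdefault(qrel["query-id"], {})[qrel["corpus-id"]] = int(qrel["score"])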

Contributor

I would create an issue on this, and then we can come back to it later.

@KennethEnevoldsen KennethEnevoldsen merged commit 411e232 into embeddings-benchmark:main May 23, 2024
7 checks passed
dokato pushed a commit to dokato/mteb that referenced this pull request May 24, 2024
GPT4-o generated queries for 14 languages (#718)

* first proper upload of wikipedia-retrieval dataset

* update license and README of dataset

* fix test split and add WikipediaRetrievalDE task

* add WikipediaRerankingDE task

* add Bengali tasks

* multilingual reranking dataset

* add Multilingual Reranking

* add Retrieval tasks

* update metadata for Reranking task

* run make lint

* fix metadata validation errors

* delete German and Bengali Reranking tasks

* fix more task metadata, tests passing now

* add retrieval results

* WIP: reranking with multilingual dataset

* undo changes to run script

* update points and contributor info

* subcall MultilingualTask for reranking task and add reranking results

* WIP: make retrieval a multilingual dataset, too

* WIP: first run of WikipediaRetrievalMultilingual

* add WikipediaRetrievalMultilingual task and results

* delete language specific retrieval tasks and results

* update points and add Openreview IDs

* make lint

* remove debugging print statement