GPT4-o generated queries for 14 languages #718
Conversation
Generated with this prompt and
During generation, query quality was inspected manually by native speakers for German and Bengali.
I calculated recent log views according to https://huggingface.co/datasets/Cohere/wikipedia-22-12 and applied them to https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3. Per language, I filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k most-viewed articles. I selected a random window of 9 consecutive paragraphs per article, chose the middle one to be the positive context, and generated a query for it. The 9 paragraphs per article are used for the reranking task, with one positive and 8 negatives. The choice of hard negatives is debatable. I could prepend
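The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the PR's actual script: the function name, the `articles` structure (title mapped to a list of paragraph strings), and the seed are assumptions.

```python
import random

def sample_windows(articles, num_articles=1500, window=9, seed=42):
    """Sketch of the described sampling: filter short articles, sample
    `num_articles`, pick a random window of `window` consecutive paragraphs,
    and use the middle paragraph as the positive context.

    `articles` maps article titles to lists of paragraph strings (assumed
    to already be the top-viewed articles for one language).
    """
    rng = random.Random(seed)
    # Filter out articles with fewer than `window` paragraphs.
    eligible = {t: ps for t, ps in articles.items() if len(ps) >= window}
    titles = rng.sample(sorted(eligible), min(num_articles, len(eligible)))
    samples = []
    for title in titles:
        paras = eligible[title]
        start = rng.randrange(len(paras) - window + 1)
        chunk = paras[start:start + window]
        mid = window // 2
        samples.append({
            "title": title,
            "positive": chunk[mid],                      # query is generated for this
            "negatives": chunk[:mid] + chunk[mid + 1:],  # remaining 8 paragraphs
        })
    return samples
```

Each sample then yields one generated query, one positive, and 8 (hard) negatives for the reranking task.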
How do I run reranking on a multilingual dataset? I now have the different languages as subsets in https://huggingface.co/datasets/ellamind/wikipedia-2023-11-reranking-multilingual. But I don't see a way to specify Except for the one
For 11 languages, this is the first retrieval task. So with 11 * 4 and the other points, we already hit the cap of 50 points. Am I correct?
I'm trying to understand this paragraph from the points documentation.
Not all of my languages are new, so strictly speaking the cap does not apply?

languages = ["de", "bn", "it", "pt", "nl", "cs", "ro", "bg", "sr", "fi", "fa", "hi", "da", "en"]

For these languages the retrieval task is the first of its kind: ["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]. So these would give 11 * 4 = 44 points.

["de", "bn", "en"] are new datasets, but already have a retrieval task. So 3 * 2 = 6. For the

This would result in 52 points for this PR. Or are the 4 bonus points meant to be added on top of the 2 points per dataset?

EDIT:

EDIT2:
Related to points: I would calculate it as follows. For the retrieval dataset (which I do think should be combined): 11 * 4 = 44, plus 2 for the dataset, so that is 46. For the reranking you get the scores per language as well: 2 + 4 * 12 = 48 + 2 = 50. So in total, you get 50 + 48 = 98. You can still max out the bonus for both datasets by adding 3 more languages (up to you if you feel it is worth it). If you want to do that, I can review Swedish and Norwegian as well.
Great, thanks for reviewing! :) Could you give me a hint on how best to upload multilingual datasets? Right now I have the languages as dataset configs, which show up as the subset dropdown menu on the HF Hub. Passing the languages as
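One common pattern for the "languages as dataset configs" layout is to group rows per language and push each group as its own config (the `datasets` library's `push_to_hub` accepts a `config_name` argument, which is what produces the subset dropdown on the Hub). A minimal offline sketch of the grouping step, assuming each row carries a `lang` field (the field name is an assumption):

```python
from collections import defaultdict

def group_by_language(rows):
    """Group flat rows into per-language buckets.

    Each bucket can then be uploaded as its own config, e.g.
    Dataset.from_list(bucket).push_to_hub(repo_id, config_name=lang),
    so that each language appears as a subset on the Hub.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["lang"]].append(row)
    return dict(buckets)
```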
Are we sure we want machine-generated datasets? If we don't take machine-translated ones, why should we take machine-generated ones?
mteb/tasks/Reranking/multilingual/WikipediaRerankingMultilingual.py
@x-tabdeveloping we had the discussion in an issue beforehand. I believe the quality is good enough to warrant inclusion (definitely better than e.g. retrieval based on article headlines, I would argue). That being said, it might introduce odd biases. We can definitely examine whether that is the case once we start running models.
Machine-translated datasets often translate whole passages, and not all translation services are good. Generating with temperature 0 and the current prompt basically just 'rephrases' the provided human-written document into a single, succinct question.
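The temperature-0 generation setup could look roughly like the payload below. The actual prompt used in this PR is not reproduced here; the system message is an illustrative stand-in, and the payload is only assembled, not sent:

```python
def build_query_generation_request(paragraph, language):
    """Assemble a chat-completions payload for deterministic query generation.

    Temperature 0 makes the model essentially rephrase the given paragraph
    into one succinct question. The system prompt below is a hypothetical
    placeholder, not the prompt used in the PR.
    """
    return {
        "model": "gpt-4o",
        "temperature": 0,
        "messages": [
            {
                "role": "system",
                "content": (
                    f"Write one short, natural {language} question that "
                    "the following paragraph answers."
                ),
            },
            {"role": "user", "content": paragraph},
        ],
    }
```

This dict matches the shape of a chat-completions request and could be passed to an OpenAI-compatible client.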
@rasdani and @x-tabdeveloping, this does raise an interesting point for the discussion section of the paper: can datasets such as these approximate the performance of high-quality datasets? E.g. a comparison between MIRACL and these seems reasonable. Btw @rasdani, it seems like the tests fail; will you have a look? (It seems to be due to the mock test overwriting the datasets concatenate method.)
@rasdani I would love to have this PR merged in. Will you have a look at the tests? Then I believe it is ready to merge.
yes, I will! tonight or tomorrow night
I added "no" and "sv". I managed to fix the MultilingualReranking and added results. Will try to finish up tomorrow. If you can spot whether I'm missing an import somewhere, please let me know. When the MultilingualRetrieval works, I will delete the language-specific retrieval tasks.
Checked the Norwegian and Swedish; both look reasonable!
I modified the points for this current PR such that we all end up with the same number of total points, if I account for the (corrected) 6 points of my old PR.
```python
qrels_lang_dict = {}
for qrel in qrels_lang:
    if qrel["score"] == 0.5:
        continue
```
Unfortunately, I couldn't get WikipediaRetrievalMultilingual running with the hard negatives. I assumed one can just set scores between 0 and 1 in the qrels dataset, e.g. 0.5 for hard negatives: https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-qrels. But MTEB expects int scores at some point, so I implemented a workaround that just drops all the hard negatives. Would be nice if we don't waste them.
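The workaround of dropping the 0.5-scored rows could be packaged as a small helper that also builds the nested qrels mapping retrieval evaluation typically expects. The `query-id`/`corpus-id`/`score` field names follow common BEIR-style qrels naming and are an assumption here:

```python
def build_qrels(qrels_rows):
    """Build {query_id: {doc_id: int_score}} from flat qrels rows,
    dropping hard negatives (score 0.5) since integer relevance
    levels are expected downstream."""
    qrels = {}
    for row in qrels_rows:
        if row["score"] == 0.5:
            continue  # hard negative: skipped as a workaround
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels
```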
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create an issue on this, then we can come back to it later
Commit messages from the squashed merge:
- first proper upload of wikipedia-retrieval dataset
- update license and README of dataset
- fix test split and add WikipediaRetrievalDE task
- add WikipediaRerankingDE task
- add Bengali tasks
- multilingual reranking dataset
- add Multilingual Reranking
- add Retrieval tasks
- update metadata for Reranking task
- run make lint
- fix metadata validation errors
- delete German and Bengali Reranking tasks
- fix more task metadata, tests passing now
- add retrieval results
- WIP: reranking with multilingual dataset
- undo changes to run script
- update points and contributor info
- subcall MultilingualTask for reranking task and add reranking results
- WIP: make retrieval a multilingual dataset, too
- WIP: first run of WikipediaRetrievalMultilinugal
- add WikipediaRetrievalMultilinugal task and results
- delete language specific retrieval tasks and results
- update points and add Openreview IDs
- make lint
- remove debugging print statement
Checklist for adding MMTEB dataset

Reason for dataset addition:
- Succinct queries generated by a strong multilingual LLM, grounded in Wikipedia articles nicely chunked by Cohere, should be a strict improvement over a lot of machine-translated versions of SQuAD in different languages.
- Wikipedia is probably the highest-quality (available) corpus in most languages.
- See #378.

- [ ] I have tested that the dataset runs with the `mteb` package.
- [ ] I have run the following models on the task using the `mteb run -m {model_name} -t {task_name}` command:
  - [ ] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - [ ] intfloat/multilingual-e5-small
- [ ] If the dataset is too big, consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [ ] I have run the tests locally using `make test`.
- [ ] I have run the formatter using `make lint`.
- [ ] I have added points for my submission using the PR number as the filename (e.g. `438.jsonl`).

WIP: I am running query generation overnight for the remaining 12 languages on this list:
Draft PR for early feedback. @KennethEnevoldsen @Muennighoff happy to hear any suggestions :)