mmteb | Arabic | Retrieval Task #669

bakrianoo · 2024-05-11T10:53:15Z

Checklist for adding MMTEB dataset

Reason for dataset addition:

This dataset is for the MMTEB initiative.
The dataset is for Arabic retrieval tasks.
The dataset is for keyword-based search tasks (the retrieval part of a RAG pipeline).
Despite the promising capabilities of embeddings for semantic search of queries, we still notice some challenges when the query becomes very short and keyword-style.
I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
[] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
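
As a rough illustration of the run step in the checklist, the sketch below assembles one mteb run command line per baseline model. The model names and the task name SadeemKeywordRetrieval come from this PR; build_commands is a hypothetical helper introduced only for this example, and the script just prints the commands rather than executing them.

```python
import shlex

# The two baseline models listed in the checklist.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]
# Task name as used in this PR's task file.
TASK = "SadeemKeywordRetrieval"


def build_commands(models, task):
    """Return one `mteb run -m ... -t ...` command string per model.

    Hypothetical helper for illustration; it is not part of the mteb package.
    """
    return [shlex.join(["mteb", "run", "-m", model, "-t", task]) for model in models]


if __name__ == "__main__":
    for command in build_commands(MODELS, TASK):
        print(command)
```

Each printed line can be pasted into a shell where the mteb package is installed.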

KennethEnevoldsen

Only a few minor comments. The size in particular concerns me a bit.

Feel free to add points as well.

mteb/tasks/Retrieval/ara/SadeemKeywordRetrieval.py

KennethEnevoldsen · 2024-05-15T11:49:31Z

@bakrianoo it looks like the tests fail. Will you have a look at this?

bakrianoo · 2024-05-17T12:40:45Z

@KennethEnevoldsen

I have tried updating the dataset's metadata values many times, but I cannot work out which metadata field the test rejects. Can you help?

_________________________ test_all_metadata_is_filled __________________________
[gw0] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

    def test_all_metadata_is_filled():
        all_tasks = get_tasks()
    
        unfilled_metadata = []
        for task in all_tasks:
            if task.metadata.name not in _HISTORIC_DATASETS:
                if not task.metadata.is_filled():
                    unfilled_metadata.append(task.metadata.name)
        if unfilled_metadata:
>           raise ValueError(
                f"The metadata of the following datasets is not filled: {unfilled_metadata}"
            )
E           ValueError: The metadata of the following datasets is not filled: ['SadeemKeywordRetrieval']

tests/test_TaskMetadata.py:436: ValueError
=============================== warnings summary ===============================
tests/test_mteb.py::test_mteb_task[average_word_embeddings_levy_dependency-task0]
  /home/runner/.cache/huggingface/modules/datasets_modules/datasets/strombergnlp--bornholmsk_parallel/a93ddacca6042553271bf4c1c0e035df3fccf848c6820417766752d676567815/bornholmsk_parallel.py:27: DeprecationWarning: invalid escape sequence \o
    _CITATION = """\

tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_reranker_same_ndcg1
  /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(

https://github.com/embeddings-benchmark/mteb/actions/runs/9128236014/job/25100164821?pr=669

Ruqyai · 2024-05-17T14:02:56Z

Hi @bakrianoo
I faced a similar error. These are the steps I took to fix it:

Go to the file tests/test_TaskMetadata.py.
Add 'SadeemKeywordRetrieval', to the _HISTORIC_DATASETS list manually.
Save the file and run make test.

results/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/SadeemKeywordRetrieval.json

KennethEnevoldsen · 2024-05-17T16:12:06Z

Please do not do this. We specifically have exceptions for _HISTORIC_DATASETS, but the test is intended to fail for new datasets. @Ruqyai, if you have done this for a previous dataset, please make a PR with the fix.

KennethEnevoldsen · 2024-05-17T16:14:58Z

mteb/tasks/Retrieval/ara/SadeemKeywordRetrieval.py

+        date=None,
+        form=["written"],
+        domains=["Blog"],
+        task_subtypes=None,
+        license=None,
+        socioeconomic_status=None,
+        annotations_creators=None,
+        dialect=None,
+        text_creation=None,
+        bibtex_citation=None,
+        n_samples={_EVAL_SPLIT: 7179},
+        avg_character_length={_EVAL_SPLIT: 500.0},


The test fails because the metadata is not filled in, which it should be.

date is the period during which the text was written (e.g. scraped from Twitter, 2001-2020)
task_subtypes: I would put Keyword Retrieval and add it to the list of allowed subtypes
license is required
socioeconomic_status is the social status of the text writers (e.g. high for lawyers)
dialect should be an empty list if there are no dialects

You can read more about these on the TaskMetadata object.
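
To see which fields trip the check, the sketch below lists every metadata entry from the diff above that is still None. It assumes that is_filled() treats any None value as unfilled, which is consistent with the failure reported here but is not confirmed in this thread. The dictionary is a simplified stand-in for the real TaskMetadata object (which has more fields), unfilled_fields is a hypothetical helper, and "test" is an assumed placeholder for _EVAL_SPLIT, whose value the diff does not show.

```python
# Assumed placeholder: the diff does not show the value of _EVAL_SPLIT.
_EVAL_SPLIT = "test"

# Values copied from the diff excerpt above; a plain dict stands in for
# the real TaskMetadata object, which has additional fields.
metadata = {
    "date": None,
    "form": ["written"],
    "domains": ["Blog"],
    "task_subtypes": None,
    "license": None,
    "socioeconomic_status": None,
    "annotations_creators": None,
    "dialect": None,
    "text_creation": None,
    "bibtex_citation": None,
    "n_samples": {_EVAL_SPLIT: 7179},
    "avg_character_length": {_EVAL_SPLIT: 500.0},
}


def unfilled_fields(meta):
    """Return the names of all fields whose value is None, in order.

    Hypothetical helper for illustration; it is not part of the mteb package.
    """
    return [name for name, value in meta.items() if value is None]


if __name__ == "__main__":
    for name in unfilled_fields(metadata):
        print(f"unfilled: {name}")
```

Note that an empty list, as suggested for dialect, is not None and so would count as filled under this assumption.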

Ruqyai · 2024-05-18T06:27:37Z

Thanks @KennethEnevoldsen. I have opened PR #763 for this.
Please check whether you can merge my PR without needing to comment out the test_all_metadata_is_filled function.

KennethEnevoldsen · 2024-05-21T09:55:27Z

@bakrianoo, we would love to have this PR merged. I will close it for now, but if you have the time, please do re-open it and address the metadata issues. I will make sure it gets a quick review and that we finish up the metadata.

bakrianoo added 12 commits May 7, 2024 09:02

create a new directory for Arabic Retrieval tasks

a4ea22c

Push SadeemQuestionRetrieval dataset

2b916ff

Push SadeemQuestionRetrieval baseline results

d696100

remove invalid comments

2246514

update SadeemQuestionRetrieval metadata

375ae47

update SadeemQuestionRetrieval metadata

7d778a2

add points to the PR

ff68f9f

update points

4d3f0c1

Merge branch 'embeddings-benchmark:main' into main

fa86b69

apply lint

4559cc3

Merge branch 'embeddings-benchmark:main' into main

e8539e9

push SadeemKeywordRetrieval

07161dd

KennethEnevoldsen reviewed May 11, 2024

imenelydiaker assigned KennethEnevoldsen May 11, 2024

bakrianoo added 3 commits May 12, 2024 02:46

remove unused bibtex

cfd9818

update metadata and revision of the dataset

01d6e2e

update SadeemKeywordRetrieval base results

4f023ef

bakrianoo added 3 commits May 17, 2024 15:11

update domain metadata

80408e6

update sadeemQuestions domains value

fbe7939

set some metadata value to None

111ca7f

Ruqyai reviewed May 17, 2024

KennethEnevoldsen reviewed May 17, 2024

This was referenced May 18, 2024

Fix : Error [test_all_metadata_is_filled ] #762

Closed

Fix : Error [test_all_metadata_is_filled ] #763

Closed

Merge branch 'main' into dataset-sadeem-keyword-retrieval

d9c8546

KennethEnevoldsen closed this May 21, 2024
