
Add Dutch CoLA #421

Open · wants to merge 6 commits into main
Conversation

@BramVanroy (Contributor) commented Apr 24, 2024

This PR adds the newly released Dutch CoLA dataset to ScandEval.

For testing, the dataset is currently hard-coded to my account, but I will change it to ScandEval once I have verified that everything works.

Questions:

  • I cannot get the scandeval CLI command to run, nor the script directly, because of import issues (I believe) "ValueError: Could not find a benchmark class for any of the following potential names: dutch-cola, linguistic-acceptability, sequence-classification.". What is the recommended way of running the command for testing with a local editable install?
  • While the dataset has a training set, I only include validation and test set now in the dataset config. I couldn't find where to define which one is actually used during benchmarking. We probably only want to benchmark on the test set, so how can I define which split to use for the benchmark?
  • I am not sure about the num_few_shot_examples so I kept the same number as for scala.

Once this PR is complete and integrated, I can run all models of 7B parameters and below on this benchmark and send you the results to add to the leaderboard, if that is helpful.

closes #419

@saattrupdan (Member)

Thanks for your contribution! 😀

Answers to your questions:

I cannot get the scandeval CLI command to run, nor the script directly, because of import issues (I believe) "ValueError: Could not find a benchmark class for any of the following potential names: dutch-cola, linguistic-acceptability, sequence-classification.". What is the recommended way of running the command for testing with a local editable install?

To install from source, please remove your virtual environment and run make install. If this doesn't work on your side then let me know!

While the dataset has a training set, I only include validation and test set now in the dataset config. I couldn't find where to define which one is actually used during benchmarking. We probably only want to benchmark on the test set, so how can I define which split to use for the benchmark?

All datasets must have a training split. Note that it should also be possible to benchmark encoder models on this dataset, and even decoder models need a training dataset for the few-shot examples. The standard split size is 1024/256/2048 for train/val/test, so that would be great here as well 🙂

I am not sure about the num_few_shot_examples so I kept the same number as for scala.

Sounds good!

@saattrupdan (Member) left a comment


A few comments/suggestions.

src/scandeval/dataset_configs.py (outdated, resolved)

# Download the dataset
dataset = load_dataset(path=repo_id, token=True)
del dataset["train"]
@saattrupdan (Member)

As mentioned in my overall comment, we need train/val/test splits for all datasets. I'd recommend 1024/256/2048 here, ideally sampled from the original train/val/test splits.
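
For illustration, a minimal sketch of what sampling the standard 1024/256/2048 sizes from the original splits could look like. The repo id, the split names, and the assumption that each original split is large enough are assumptions for this sketch, not part of the PR:

from datasets import DatasetDict, load_dataset

# Assumption: the upstream Dutch CoLA repo id; in the creation script this
# would be the existing `repo_id` variable.
repo_id = "GroNLP/dutch-cola"

# Load the original dataset with its own train/validation/test splits
# (assumed to each contain at least 1024/256/2048 rows respectively).
dataset = load_dataset(path=repo_id, token=True)

# Shuffle each original split and keep only the standard split sizes.
dataset = DatasetDict(
    train=dataset["train"].shuffle(seed=4242).select(range(1024)),
    val=dataset["validation"].shuffle(seed=4242).select(range(256)),
    test=dataset["test"].shuffle(seed=4242).select(range(2048)),
)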

@BramVanroy (Contributor, Author)

Thanks for the feedback! Another question: is there a reason why the full test set is not used? Is it clear to users of the leaderboard that the results are based on only a subset of the full test sets? My concern is mostly about academic research, where one may benchmark manually on a full test set and compare with the numbers reported on the ScandEval leaderboard - but since the leaderboard uses subsets, those numbers would not be comparable.

@saattrupdan (Member)

Thanks for the feedback! Another question: is there a reason why the full test set is not used? Is it clear to users of the leaderboard that the results are based on only a subset of the full test sets? My concern is mostly about academic research, where one may benchmark manually on a full test set and compare with the numbers reported on the ScandEval leaderboard - but since the leaderboard uses subsets, those numbers would not be comparable.

That depends. If it's an unofficial dataset, I don't see why not to go for the full dataset. But for the official ones many users care about speed - they're benchmarking on 7 languages x 7 datasets, and they don't want to wait for months for a benchmark. Remember that we do 10 iterations on (bootstrapped versions of) the test set.
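
For context, a rough sketch of what "10 iterations on bootstrapped versions of the test set" means. This is illustrative only, not ScandEval's actual implementation; `evaluate_model` is a hypothetical stand-in for whatever produces a score on a set of examples:

import numpy as np

def bootstrap_scores(test_dataset, evaluate_model, num_iterations=10, seed=4242):
    """Evaluate on several bootstrap resamples of the test set (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(test_dataset)
    scores = []
    for _ in range(num_iterations):
        # Resample the test set with replacement (a bootstrap sample).
        indices = rng.integers(low=0, high=n, size=n)
        sample = [test_dataset[int(i)] for i in indices]
        scores.append(evaluate_model(sample))
    # Report the mean score and the spread across iterations.
    return float(np.mean(scores)), float(np.std(scores))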

That's also why I've set pretty_name (the name that shows up in the terminal when benchmarking) to "Truncated version of ", and why the HF Hub IDs are named ScandEval/<dataset>-mini, to emphasise this.

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>
@BramVanroy (Contributor, Author)

@saattrupdan Thanks for the reply, that makes sense. For academic reporting, having the full test sets included might be useful too. This is just an idea, and something that would take a lot of work I think, but perhaps the full datasets could also be added: either "relatively simply", where you have dataset_name (default) and optionally the full set as dataset_name_full, or via the concept of "subsets", where each dataset can have different subsets (e.g. mini or full) and a CLI flag decides which one to use.

@BramVanroy marked this pull request as ready for review on April 27, 2024 09:52
@BramVanroy (Contributor, Author)

FYI: it was quite a pain to get make install to work nicely, especially the git setup. Instead of using my global git config it required a new local one (unless I missed something), which in turn led to some issues I had to figure out (deleting the local git config file, amending commits with the global config).

@saattrupdan (Member) left a comment


I like your idea of having the -full postfix on datasets with larger splits. I've made some suggestions on how to include those here. I think that's conceptually simpler than having a separate "mini/full" flag.

The failing tests should be fixed if you pull from the main branch.

Also, I get that the Git config setup can be annoying - I think I'll just remove that bit from the make recipe in the future.

Comment on lines +27 to +38
# Create dataset ID
dataset_id = "ScandEval/dutch-cola"

# Remove the dataset from Hugging Face Hub if it already exists
try:
    api = HfApi()
    api.delete_repo(dataset_id, repo_type="dataset", missing_ok=True)
except HTTPError:
    pass

# Push the dataset to the Hugging Face Hub
dataset.push_to_hub(dataset_id, private=True)
@saattrupdan (Member)

Suggested change
-# Create dataset ID
-dataset_id = "ScandEval/dutch-cola"
-
-# Remove the dataset from Hugging Face Hub if it already exists
-try:
-    api = HfApi()
-    api.delete_repo(dataset_id, repo_type="dataset", missing_ok=True)
-except HTTPError:
-    pass
-
-# Push the dataset to the Hugging Face Hub
-dataset.push_to_hub(dataset_id, private=True)
+full_dataset_id = "ScandEval/dutch-cola-full"
+dataset_id = "ScandEval/dutch-cola"
+
+# Remove the dataset from Hugging Face Hub if it already exists
+for id in [dataset_id, full_dataset_id]:
+    try:
+        api = HfApi()
+        api.delete_repo(id, repo_type="dataset", missing_ok=True)
+    except HTTPError:
+        pass
+
+dataset.push_to_hub(full_dataset_id, private=True)
+
+# Convert the dataset to a dataframe
+df = dataset.to_pandas()
+assert isinstance(df, pd.DataFrame)
+
+# Create validation split
+val_size = 256
+traintest_arr, val_arr = train_test_split(df, test_size=val_size, random_state=4242)
+traintest_df = pd.DataFrame(traintest_arr, columns=df.columns)
+val_df = pd.DataFrame(val_arr, columns=df.columns)
+
+# Create test split
+test_size = 2048
+train_arr, test_arr = train_test_split(
+    traintest_df, test_size=test_size, random_state=4242
+)
+train_df = pd.DataFrame(train_arr, columns=df.columns)
+test_df = pd.DataFrame(test_arr, columns=df.columns)
+
+# Create train split
+train_size = 256
+train_df = train_df.sample(train_size, random_state=4242)
+
+# Reset the index
+train_df = train_df.reset_index(drop=True)
+val_df = val_df.reset_index(drop=True)
+test_df = test_df.reset_index(drop=True)
+
+# Collect datasets in a dataset dictionary
+dataset = DatasetDict(
+    train=Dataset.from_pandas(train_df, split=Split.TRAIN),
+    val=Dataset.from_pandas(val_df, split=Split.VALIDATION),
+    test=Dataset.from_pandas(test_df, split=Split.TEST),
+)
+dataset.push_to_hub(dataset_id, private=True)

Comment on lines +565 to +577
DUTCH_COLA_CONFIG = DatasetConfig(
    name="dutch-cola",
    pretty_name="a linguistic acceptability dataset for Dutch, Dutch CoLA, inspired by the original CoLA dataset",
    huggingface_id="ScandEval/dutch-cola",
    task=LA,
    languages=[NL],
    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
    num_few_shot_examples=12,
    max_generated_tokens=3,
    unofficial=True,
)
@saattrupdan (Member)

Suggested change
-DUTCH_COLA_CONFIG = DatasetConfig(
-    name="dutch-cola",
-    pretty_name="a linguistic acceptability dataset for Dutch, Dutch CoLA, inspired by the original CoLA dataset",
-    huggingface_id="ScandEval/dutch-cola",
-    task=LA,
-    languages=[NL],
-    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
-    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
-    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
-    num_few_shot_examples=12,
-    max_generated_tokens=3,
-    unofficial=True,
-)
+DUTCH_COLA_CONFIG = DatasetConfig(
+    name="dutch-cola",
+    pretty_name="the truncated version of the Dutch linguistic acceptability dataset Dutch CoLA",
+    huggingface_id="ScandEval/dutch-cola",
+    task=LA,
+    languages=[NL],
+    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
+    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
+    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
+    num_few_shot_examples=12,
+    max_generated_tokens=3,
+    unofficial=True,
+)
+
+DUTCH_COLA_FULL_CONFIG = DatasetConfig(
+    name="dutch-cola-full",
+    pretty_name="the Dutch linguistic acceptability dataset Dutch CoLA",
+    huggingface_id="ScandEval/dutch-cola-full",
+    task=LA,
+    languages=[NL],
+    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
+    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
+    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
+    num_few_shot_examples=12,
+    max_generated_tokens=3,
+    unofficial=True,
+)
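
For illustration, a minimal sketch of how the prompt fields in these configs could be combined into a few-shot prompt. This is illustrative only, not ScandEval's actual prompt construction; the example sentences are made up:

# Config fields copied from the DatasetConfig above.
prompt_prefix = (
    "Hieronder staan zinnen en of ze grammaticaal correct ('ja') of "
    "incorrect ('nee') zijn."
)
prompt_template = "Zin: {text}\nGrammaticaal correct: {label}"
prompt_label_mapping = dict(correct="ja", incorrect="nee")

# Hypothetical few-shot examples, as if sampled from the train split.
few_shot_examples = [
    {"text": "De kat slaapt op de bank.", "label": "correct"},
    {"text": "De kat slaapt op de.", "label": "incorrect"},
]

sections = [prompt_prefix]
for example in few_shot_examples:
    sections.append(
        prompt_template.format(
            text=example["text"], label=prompt_label_mapping[example["label"]]
        )
    )
# The sentence to classify gets an empty label, for the model to fill in.
sections.append(prompt_template.format(text="Hond de blaft luid.", label=""))
prompt = "\n\n".join(sections)
print(prompt)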

Development

Successfully merging this pull request may close these issues.

[BENCHMARK DATASET REQUEST] dutch-cola
2 participants