
Add Dutch CoLA #421

Open · wants to merge 6 commits into main
Conversation

@BramVanroy (Contributor) commented Apr 24, 2024

This PR adds the newly released Dutch CoLA dataset to ScandEval.

For testing, the dataset is currently hard-coded to my account, but I will change it to ScandEval once I have verified that everything works.

Questions:

  • I cannot get the scandeval CLI command to run, nor the script directly, because of import issues (I believe) "ValueError: Could not find a benchmark class for any of the following potential names: dutch-cola, linguistic-acceptability, sequence-classification.". What is the recommended way of running the command for testing with a local editable install?
  • While the dataset has a training set, I only include validation and test set now in the dataset config. I couldn't find where to define which one is actually used during benchmarking. We probably only want to benchmark on the test set, so how can I define which split to use for the benchmark?
  • I am not sure about the num_few_shot_examples so I kept the same number as for scala.

Once this PR is complete and integrated, I can run all models of 7B parameters and below on this benchmark and send you the results to add to the leaderboard, if that is helpful.

closes #419

@saattrupdan (Member)

Thanks for your contribution! 😀

Answers to your questions:

I cannot get the scandeval CLI command to run, nor the script directly, because of import issues (I believe) "ValueError: Could not find a benchmark class for any of the following potential names: dutch-cola, linguistic-acceptability, sequence-classification.". What is the recommended way of running the command for testing with a local editable install?

To install from source, please remove your virtual environment and run make install. If this doesn't work on your side then let me know!

While the dataset has a training set, I only include validation and test set now in the dataset config. I couldn't find where to define which one is actually used during benchmarking. We probably only want to benchmark on the test set, so how can I define which split to use for the benchmark?

All datasets must have a training split. Note that it should also be possible to benchmark encoder models on this dataset, and even decoder models need a training dataset for the few-shot examples. The standard split size is 1024/256/2048 for train/val/test, so that would be great here as well 🙂

I am not sure about the num_few_shot_examples so I kept the same number as for scala.

Sounds good!

@saattrupdan (Member) left a comment


A few comments/suggestions.

src/scandeval/dataset_configs.py (outdated, resolved)

# Download the dataset
dataset = load_dataset(path=repo_id, token=True)
del dataset["train"]
@saattrupdan (Member)

As mentioned in my overall comment, we need train/val/test splits for all datasets. I'd recommend 1024/256/2048 here, ideally sampled from the original train/val/test splits.
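
For illustration, a minimal sketch of what sampling the standard 1024/256/2048 sizes from the original splits could look like. The repo id, the split names, and the assumption that each original split is large enough are assumptions for this sketch, not part of the PR:

from datasets import DatasetDict, load_dataset

# Assumption: the upstream Dutch CoLA repo id; in the creation script this
# would be the existing `repo_id` variable.
repo_id = "GroNLP/dutch-cola"

# Load the original dataset with its own train/validation/test splits
# (assumed to each contain at least 1024/256/2048 rows respectively).
dataset = load_dataset(path=repo_id, token=True)

# Shuffle each original split and keep only the standard split sizes.
dataset = DatasetDict(
    train=dataset["train"].shuffle(seed=4242).select(range(1024)),
    val=dataset["validation"].shuffle(seed=4242).select(range(256)),
    test=dataset["test"].shuffle(seed=4242).select(range(2048)),
)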

@BramVanroy (Contributor, Author)

Thanks for the feedback! Another question: is there a reason why the full test set is not used? Is it clear to users of the leaderboard that the results are based on only a subset of the full test sets? My concern is mostly about academic research, where one may benchmark manually on a full test set and compare with the numbers reported on the ScandEval leaderboard - but since the leaderboard uses subsets, those numbers would not be comparable.

@saattrupdan (Member)

Thanks for the feedback! Another question: is there a reason why the full test set is not used? Is it clear to users of the leaderboard that the results are based on only a subset of the full test sets? My concern is mostly about academic research, where one may benchmark manually on a full test set and compare with the numbers reported on the ScandEval leaderboard - but since the leaderboard uses subsets, those numbers would not be comparable.

That depends. If it's an unofficial dataset, I don't see why not to go for the full dataset. But for the official ones many users care about speed - they're benchmarking on 7 languages x 7 datasets, and they don't want to wait for months for a benchmark. Remember that we do 10 iterations on (bootstrapped versions of) the test set.
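
For context, a rough sketch of what "10 iterations on bootstrapped versions of the test set" means. This is illustrative only, not ScandEval's actual implementation; `evaluate_model` is a hypothetical stand-in for whatever produces a score on a set of examples:

import numpy as np

def bootstrap_scores(test_dataset, evaluate_model, num_iterations=10, seed=4242):
    """Evaluate on several bootstrap resamples of the test set (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(test_dataset)
    scores = []
    for _ in range(num_iterations):
        # Resample the test set with replacement (a bootstrap sample).
        indices = rng.integers(low=0, high=n, size=n)
        sample = [test_dataset[int(i)] for i in indices]
        scores.append(evaluate_model(sample))
    # Report the mean score and the spread across iterations.
    return float(np.mean(scores)), float(np.std(scores))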

That's also why I've set pretty_name (the name that shows up in the terminal when benchmarking) to "Truncated version of ", and why the HF Hub IDs are named ScandEval/<dataset>-mini, to emphasise this.

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>
@BramVanroy (Contributor, Author)

@saattrupdan Thanks for the reply, that makes sense. For academic reporting, having the full test sets included might be useful too. This is just an idea, and something that would take a lot of work I think, but perhaps the full datasets could also be added: either "relatively simply", where you have dataset_name (default) and optionally the full set as dataset_name_full, or via the concept of "subsets", where each dataset can have different subsets (e.g. mini or full) and a CLI flag decides which one to use.

@BramVanroy marked this pull request as ready for review on April 27, 2024 09:52
@BramVanroy (Contributor, Author)

FYI: it was quite a pain to get make install to work nicely, especially the git setup. Instead of using my global git config it required a new local one (unless I missed something), which in turn led to some issues I had to figure out (deleting the local git config file, amending commits with the global config).

@saattrupdan (Member) left a comment


I like your idea of having the -full postfix on datasets with larger splits. I've made some suggestions on how to include those here. I think that's conceptually simpler than having a separate "mini/full" flag.

The failing tests should be fixed if you pull from the main branch.

Also, I get that the Git config setup can be annoying - I think I'll just remove that bit from the make recipe in the future.

Comment on lines +27 to +38
# Create dataset ID
dataset_id = "ScandEval/dutch-cola"

# Remove the dataset from Hugging Face Hub if it already exists
try:
    api = HfApi()
    api.delete_repo(dataset_id, repo_type="dataset", missing_ok=True)
except HTTPError:
    pass

# Push the dataset to the Hugging Face Hub
dataset.push_to_hub(dataset_id, private=True)
@saattrupdan (Member)

Suggested change
-# Create dataset ID
-dataset_id = "ScandEval/dutch-cola"
-
-# Remove the dataset from Hugging Face Hub if it already exists
-try:
-    api = HfApi()
-    api.delete_repo(dataset_id, repo_type="dataset", missing_ok=True)
-except HTTPError:
-    pass
-
-# Push the dataset to the Hugging Face Hub
-dataset.push_to_hub(dataset_id, private=True)
+full_dataset_id = "ScandEval/dutch-cola-full"
+dataset_id = "ScandEval/dutch-cola"
+
+# Remove the dataset from Hugging Face Hub if it already exists
+for id in [dataset_id, full_dataset_id]:
+    try:
+        api = HfApi()
+        api.delete_repo(id, repo_type="dataset", missing_ok=True)
+    except HTTPError:
+        pass
+
+dataset.push_to_hub(full_dataset_id, private=True)
+
+# Convert the dataset to a dataframe
+df = dataset.to_pandas()
+assert isinstance(df, pd.DataFrame)
+
+# Create validation split
+val_size = 256
+traintest_arr, val_arr = train_test_split(df, test_size=val_size, random_state=4242)
+traintest_df = pd.DataFrame(traintest_arr, columns=df.columns)
+val_df = pd.DataFrame(val_arr, columns=df.columns)
+
+# Create test split
+test_size = 2048
+train_arr, test_arr = train_test_split(
+    traintest_df, test_size=test_size, random_state=4242
+)
+train_df = pd.DataFrame(train_arr, columns=df.columns)
+test_df = pd.DataFrame(test_arr, columns=df.columns)
+
+# Create train split
+train_size = 256
+train_df = train_df.sample(train_size, random_state=4242)
+
+# Reset the index
+train_df = train_df.reset_index(drop=True)
+val_df = val_df.reset_index(drop=True)
+test_df = test_df.reset_index(drop=True)
+
+# Collect datasets in a dataset dictionary
+dataset = DatasetDict(
+    train=Dataset.from_pandas(train_df, split=Split.TRAIN),
+    val=Dataset.from_pandas(val_df, split=Split.VALIDATION),
+    test=Dataset.from_pandas(test_df, split=Split.TEST),
+)
+dataset.push_to_hub(dataset_id, private=True)

Comment on lines +565 to +577
DUTCH_COLA_CONFIG = DatasetConfig(
    name="dutch-cola",
    pretty_name="a linguistic acceptability dataset for Dutch, Dutch CoLA, inspired by the original CoLA dataset",
    huggingface_id="ScandEval/dutch-cola",
    task=LA,
    languages=[NL],
    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
    num_few_shot_examples=12,
    max_generated_tokens=3,
    unofficial=True,
)
@saattrupdan (Member)

Suggested change
-DUTCH_COLA_CONFIG = DatasetConfig(
-    name="dutch-cola",
-    pretty_name="a linguistic acceptability dataset for Dutch, Dutch CoLA, inspired by the original CoLA dataset",
-    huggingface_id="ScandEval/dutch-cola",
-    task=LA,
-    languages=[NL],
-    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
-    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
-    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
-    num_few_shot_examples=12,
-    max_generated_tokens=3,
-    unofficial=True,
-)
+DUTCH_COLA_CONFIG = DatasetConfig(
+    name="dutch-cola",
+    pretty_name="the truncated version of the Dutch linguistic acceptability dataset Dutch CoLA",
+    huggingface_id="ScandEval/dutch-cola",
+    task=LA,
+    languages=[NL],
+    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
+    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
+    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
+    num_few_shot_examples=12,
+    max_generated_tokens=3,
+    unofficial=True,
+)
+
+DUTCH_COLA_FULL_CONFIG = DatasetConfig(
+    name="dutch-cola-full",
+    pretty_name="the Dutch linguistic acceptability dataset Dutch CoLA",
+    huggingface_id="ScandEval/dutch-cola-full",
+    task=LA,
+    languages=[NL],
+    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
+    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
+    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
+    num_few_shot_examples=12,
+    max_generated_tokens=3,
+    unofficial=True,
+)
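
For illustration, a minimal sketch of how the prompt fields in these configs could be combined into a few-shot prompt. This is illustrative only, not ScandEval's actual prompt construction; the example sentences are made up:

# Config fields copied from the DatasetConfig above.
prompt_prefix = (
    "Hieronder staan zinnen en of ze grammaticaal correct ('ja') of "
    "incorrect ('nee') zijn."
)
prompt_template = "Zin: {text}\nGrammaticaal correct: {label}"
prompt_label_mapping = dict(correct="ja", incorrect="nee")

# Hypothetical few-shot examples, as if sampled from the train split.
few_shot_examples = [
    {"text": "De kat slaapt op de bank.", "label": "correct"},
    {"text": "De kat slaapt op de.", "label": "incorrect"},
]

sections = [prompt_prefix]
for example in few_shot_examples:
    sections.append(
        prompt_template.format(
            text=example["text"], label=prompt_label_mapping[example["label"]]
        )
    )
# The sentence to classify gets an empty label, for the model to fill in.
sections.append(prompt_template.format(text="Hond de blaft luid.", label=""))
prompt = "\n\n".join(sections)
print(prompt)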

Development

Successfully merging this pull request may close these issues.

[BENCHMARK DATASET REQUEST] dutch-cola
2 participants