Add Dutch CoLA #421
base: main
Conversation
Thanks for your contribution! 😀 Answers to your questions:
To install from source, please remove your virtual environment and run […]
All datasets must have a training split. Note that encoder models should also be able to be benchmarked on this dataset, but even decoder models need a training split for the few-shot examples. The standard split size is 1024/256/2048 for train/val/test, so that would be great here as well 🙂
Sounds good!
A few comments/suggestions.
src/scripts/create_dutch_cola.py (outdated)
```python
# Download the dataset
dataset = load_dataset(path=repo_id, token=True)
del dataset["train"]
```
As mentioned in my overall comment, we need train/val/test splits for all datasets. I'd recommend 1024/256/2048 here, ideally sampled from the original train/val/test splits.
Thanks for the feedback! Another question: is there a reason why the full test set is not used? Is it clear to users of the leaderboard that the results are based on only a subset of the full test sets? My fear is mostly about academic research, where one may benchmark manually on a test set and compare with numbers reported on the ScandEval leaderboard - but since the leaderboard uses subsets, those numbers would not be comparable.
That depends. If it's an unofficial dataset, I don't see why not to go for the full dataset. But for the official ones, many users care about speed - they're benchmarking on 7 languages x 7 datasets, and they don't want to wait months for a benchmark. Remember that we run 10 iterations on (bootstrapped versions of) the test set. That's also why, when benchmarking, I've set […]
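For readers unfamiliar with the setup: a minimal sketch of what benchmarking on bootstrapped test sets looks like (hypothetical names; not ScandEval's actual implementation):

```python
import random


def bootstrap_benchmark(test_set: list, score_fn, iterations: int = 10) -> list[float]:
    """Score a model on `iterations` bootstrapped (resampled with replacement)
    copies of the test set, yielding a distribution of scores rather than a
    single point estimate. `score_fn` is a hypothetical callable that
    benchmarks the model on a list of examples."""
    scores = []
    for seed in range(iterations):
        rng = random.Random(seed)  # fixed seed per iteration, for reproducibility
        resampled = rng.choices(test_set, k=len(test_set))
        scores.append(score_fn(resampled))
    return scores
```

Running 10 such iterations on a 2048-example subset is what keeps a full 7x7 benchmark run tractable.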
@saattrupdan Thanks for the reply. That makes sense. For academic reporting, having full test sets included might be useful too. This is just an idea, and something that I think will take a lot of work, but perhaps the full datasets could also be added, either "relatively simply", where you have […]
FYI: it was quite a pain to get the […]
I like your idea of having the `-full` postfix on datasets with larger splits. I've made some suggestions on how to include those here. I think that's conceptually simpler than having a separate "mini/full" flag.
The failing tests should be fixed if you pull from the main branch.
Also, I get that the Git config setup can be annoying - I think I'll just remove that bit from the make recipe in the future.
```python
# Create dataset ID
dataset_id = "ScandEval/dutch-cola"

# Remove the dataset from Hugging Face Hub if it already exists
try:
    api = HfApi()
    api.delete_repo(dataset_id, repo_type="dataset", missing_ok=True)
except HTTPError:
    pass

# Push the dataset to the Hugging Face Hub
dataset.push_to_hub(dataset_id, private=True)
```
Suggested change (replacing the block above):

```python
full_dataset_id = "ScandEval/dutch-cola-full"
dataset_id = "ScandEval/dutch-cola"

# Remove the dataset from Hugging Face Hub if it already exists
for id in [dataset_id, full_dataset_id]:
    try:
        api = HfApi()
        api.delete_repo(id, repo_type="dataset", missing_ok=True)
    except HTTPError:
        pass

dataset.push_to_hub(full_dataset_id, private=True)

# Convert the dataset to a dataframe
df = dataset.to_pandas()
assert isinstance(df, pd.DataFrame)

# Create validation split
val_size = 256
traintest_arr, val_arr = train_test_split(df, test_size=val_size, random_state=4242)
traintest_df = pd.DataFrame(traintest_arr, columns=df.columns)
val_df = pd.DataFrame(val_arr, columns=df.columns)

# Create test split
test_size = 2048
train_arr, test_arr = train_test_split(
    traintest_df, test_size=test_size, random_state=4242
)
train_df = pd.DataFrame(train_arr, columns=df.columns)
test_df = pd.DataFrame(test_arr, columns=df.columns)

# Create train split
train_size = 256
train_df = train_df.sample(train_size, random_state=4242)

# Reset the index
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Collect datasets in a dataset dictionary
dataset = DatasetDict(
    train=Dataset.from_pandas(train_df, split=Split.TRAIN),
    val=Dataset.from_pandas(val_df, split=Split.VALIDATION),
    test=Dataset.from_pandas(test_df, split=Split.TEST),
)
dataset.push_to_hub(dataset_id, private=True)
```
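If the suggestion above is applied, a quick sanity check of the pushed splits could look like this (a sketch; assumes access to the private repo):

```python
from datasets import load_dataset

# Expected sizes per the suggestion above: train=256, val=256, test=2048.
ds = load_dataset("ScandEval/dutch-cola", token=True)
for split, expected in [("train", 256), ("val", 256), ("test", 2048)]:
    assert len(ds[split]) == expected, f"{split} has {len(ds[split])} examples"
```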
```python
DUTCH_COLA_CONFIG = DatasetConfig(
    name="dutch-cola",
    pretty_name="a linguistic acceptability dataset for Dutch, Dutch CoLA, inspired by the original CoLA dataset",
    huggingface_id="ScandEval/dutch-cola",
    task=LA,
    languages=[NL],
    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
    num_few_shot_examples=12,
    max_generated_tokens=3,
    unofficial=True,
)
```
Suggested change (replacing the config above):

```python
DUTCH_COLA_CONFIG = DatasetConfig(
    name="dutch-cola",
    pretty_name="the truncated version of the Dutch linguistic acceptability dataset Dutch CoLA",
    huggingface_id="ScandEval/dutch-cola",
    task=LA,
    languages=[NL],
    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
    num_few_shot_examples=12,
    max_generated_tokens=3,
    unofficial=True,
)

DUTCH_COLA_FULL_CONFIG = DatasetConfig(
    name="dutch-cola-full",
    pretty_name="the Dutch linguistic acceptability dataset Dutch CoLA",
    huggingface_id="ScandEval/dutch-cola-full",
    task=LA,
    languages=[NL],
    prompt_prefix="Hieronder staan zinnen en of ze grammaticaal correct ('ja') of incorrect ('nee') zijn.",
    prompt_template="Zin: {text}\nGrammaticaal correct: {label}",
    prompt_label_mapping=dict(correct="ja", incorrect="nee"),
    num_few_shot_examples=12,
    max_generated_tokens=3,
    unofficial=True,
)
```
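For illustration, here is a rough sketch of how the config's prompt template might be rendered into a few-shot prompt (the example sentences are made up, and ScandEval's exact formatting may differ):

```python
prompt_prefix = (
    "Hieronder staan zinnen en of ze grammaticaal correct ('ja') "
    "of incorrect ('nee') zijn."
)
prompt_template = "Zin: {text}\nGrammaticaal correct: {label}"
label_mapping = dict(correct="ja", incorrect="nee")

# Hypothetical few-shot examples; in practice these are drawn from the train split.
few_shot = [
    {"text": "De kat slaapt op de bank.", "label": "correct"},
    {"text": "Kat de slaapt bank op de.", "label": "incorrect"},
]

shots = "\n\n".join(
    prompt_template.format(text=ex["text"], label=label_mapping[ex["label"]])
    for ex in few_shot
)
query = prompt_template.format(text="Hij heeft het boek gelezen.", label="").rstrip()
prompt = f"{prompt_prefix}\n\n{shots}\n\n{query}"
print(prompt)  # the model then generates "ja" or "nee" (max_generated_tokens=3)
```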
This PR adds the newly released Dutch CoLA dataset to ScandEval.
For testing, this is currently hard-coded to my account, but it will be changed to ScandEval once I have tested that everything works.
Questions:
- I could not get the `scandeval` CLI command to run, nor the script directly, because of import issues (I believe): "ValueError: Could not find a benchmark class for any of the following potential names: dutch-cola, linguistic-acceptability, sequence-classification.". What is the recommended way of running the command for testing with a local editable install?

Once this PR is complete and integrated, I can run all models of 7B parameters and below on this benchmark and send over the results to be added to the leaderboard, if that is helpful.
closes #419