task_table.md

| Task Name | Train | Val | Test | Val/Test Docs | Metrics |
|---|---|---|---|---|---|
| anagrams1 | | | | 10000 | acc |
| anagrams2 | | | | 10000 | acc |
| anli_r1 | | | | 1000 | acc |
| anli_r2 | | | | 1000 | acc |
| anli_r3 | | | | 1200 | acc |
| arc_challenge | | | | 1172 | acc, acc_norm |
| arc_easy | | | | 2376 | acc, acc_norm |
| arithmetic_1dc | | | | 2000 | acc |
| arithmetic_2da | | | | 2000 | acc |
| arithmetic_2dm | | | | 2000 | acc |
| arithmetic_2ds | | | | 2000 | acc |
| arithmetic_3da | | | | 2000 | acc |
| arithmetic_3ds | | | | 2000 | acc |
| arithmetic_4da | | | | 2000 | acc |
| arithmetic_4ds | | | | 2000 | acc |
| arithmetic_5da | | | | 2000 | acc |
| arithmetic_5ds | | | | 2000 | acc |
| bigbench_causal_judgement | | | | 190 | multiple_choice_grade, exact_str_match |
| bigbench_date_understanding | | | | 369 | multiple_choice_grade, exact_str_match |
| bigbench_disambiguation_qa | | | | 258 | multiple_choice_grade, exact_str_match |
| bigbench_dyck_languages | | | | 1000 | multiple_choice_grade, exact_str_match |
| bigbench_formal_fallacies_syllogisms_negation | | | | 14200 | multiple_choice_grade, exact_str_match |
| bigbench_geometric_shapes | | | | 359 | multiple_choice_grade, exact_str_match |
| bigbench_hyperbaton | | | | 50000 | multiple_choice_grade, exact_str_match |
| bigbench_logical_deduction_five_objects | | | | 500 | multiple_choice_grade, exact_str_match |
| bigbench_logical_deduction_seven_objects | | | | 700 | multiple_choice_grade, exact_str_match |
| bigbench_logical_deduction_three_objects | | | | 300 | multiple_choice_grade, exact_str_match |
| bigbench_movie_recommendation | | | | 500 | multiple_choice_grade, exact_str_match |
| bigbench_navigate | | | | 1000 | multiple_choice_grade, exact_str_match |
| bigbench_reasoning_about_colored_objects | | | | 2000 | multiple_choice_grade, exact_str_match |
| bigbench_ruin_names | | | | 448 | multiple_choice_grade, exact_str_match |
| bigbench_salient_translation_error_detection | | | | 998 | multiple_choice_grade, exact_str_match |
| bigbench_snarks | | | | 181 | multiple_choice_grade, exact_str_match |
| bigbench_sports_understanding | | | | 986 | multiple_choice_grade, exact_str_match |
| bigbench_temporal_sequences | | | | 1000 | multiple_choice_grade, exact_str_match |
| bigbench_tracking_shuffled_objects_five_objects | | | | 1250 | multiple_choice_grade, exact_str_match |
| bigbench_tracking_shuffled_objects_seven_objects | | | | 1750 | multiple_choice_grade, exact_str_match |
| bigbench_tracking_shuffled_objects_three_objects | | | | 300 | multiple_choice_grade, exact_str_match |
| blimp_adjunct_island | | | | 1000 | acc |
| blimp_anaphor_gender_agreement | | | | 1000 | acc |
| blimp_anaphor_number_agreement | | | | 1000 | acc |
| blimp_animate_subject_passive | | | | 1000 | acc |
| blimp_animate_subject_trans | | | | 1000 | acc |
| blimp_causative | | | | 1000 | acc |
| blimp_complex_NP_island | | | | 1000 | acc |
| blimp_coordinate_structure_constraint_complex_left_branch | | | | 1000 | acc |
| blimp_coordinate_structure_constraint_object_extraction | | | | 1000 | acc |
| blimp_determiner_noun_agreement_1 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_2 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_irregular_1 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_irregular_2 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_with_adj_2 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_with_adj_irregular_1 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_with_adj_irregular_2 | | | | 1000 | acc |
| blimp_determiner_noun_agreement_with_adjective_1 | | | | 1000 | acc |
| blimp_distractor_agreement_relational_noun | | | | 1000 | acc |
| blimp_distractor_agreement_relative_clause | | | | 1000 | acc |
| blimp_drop_argument | | | | 1000 | acc |
| blimp_ellipsis_n_bar_1 | | | | 1000 | acc |
| blimp_ellipsis_n_bar_2 | | | | 1000 | acc |
| blimp_existential_there_object_raising | | | | 1000 | acc |
| blimp_existential_there_quantifiers_1 | | | | 1000 | acc |
| blimp_existential_there_quantifiers_2 | | | | 1000 | acc |
| blimp_existential_there_subject_raising | | | | 1000 | acc |
| blimp_expletive_it_object_raising | | | | 1000 | acc |
| blimp_inchoative | | | | 1000 | acc |
| blimp_intransitive | | | | 1000 | acc |
| blimp_irregular_past_participle_adjectives | | | | 1000 | acc |
| blimp_irregular_past_participle_verbs | | | | 1000 | acc |
| blimp_irregular_plural_subject_verb_agreement_1 | | | | 1000 | acc |
| blimp_irregular_plural_subject_verb_agreement_2 | | | | 1000 | acc |
| blimp_left_branch_island_echo_question | | | | 1000 | acc |
| blimp_left_branch_island_simple_question | | | | 1000 | acc |
| blimp_matrix_question_npi_licensor_present | | | | 1000 | acc |
| blimp_npi_present_1 | | | | 1000 | acc |
| blimp_npi_present_2 | | | | 1000 | acc |
| blimp_only_npi_licensor_present | | | | 1000 | acc |
| blimp_only_npi_scope | | | | 1000 | acc |
| blimp_passive_1 | | | | 1000 | acc |
| blimp_passive_2 | | | | 1000 | acc |
| blimp_principle_A_c_command | | | | 1000 | acc |
| blimp_principle_A_case_1 | | | | 1000 | acc |
| blimp_principle_A_case_2 | | | | 1000 | acc |
| blimp_principle_A_domain_1 | | | | 1000 | acc |
| blimp_principle_A_domain_2 | | | | 1000 | acc |
| blimp_principle_A_domain_3 | | | | 1000 | acc |
| blimp_principle_A_reconstruction | | | | 1000 | acc |
| blimp_regular_plural_subject_verb_agreement_1 | | | | 1000 | acc |
| blimp_regular_plural_subject_verb_agreement_2 | | | | 1000 | acc |
| blimp_sentential_negation_npi_licensor_present | | | | 1000 | acc |
| blimp_sentential_negation_npi_scope | | | | 1000 | acc |
| blimp_sentential_subject_island | | | | 1000 | acc |
| blimp_superlative_quantifiers_1 | | | | 1000 | acc |
| blimp_superlative_quantifiers_2 | | | | 1000 | acc |
| blimp_tough_vs_raising_1 | | | | 1000 | acc |
| blimp_tough_vs_raising_2 | | | | 1000 | acc |
| blimp_transitive | | | | 1000 | acc |
| blimp_wh_island | | | | 1000 | acc |
| blimp_wh_questions_object_gap | | | | 1000 | acc |
| blimp_wh_questions_subject_gap | | | | 1000 | acc |
| blimp_wh_questions_subject_gap_long_distance | | | | 1000 | acc |
| blimp_wh_vs_that_no_gap | | | | 1000 | acc |
| blimp_wh_vs_that_no_gap_long_distance | | | | 1000 | acc |
| blimp_wh_vs_that_with_gap | | | | 1000 | acc |
| blimp_wh_vs_that_with_gap_long_distance | | | | 1000 | acc |
| boolq | | | | 3270 | acc |
| cb | | | | 56 | acc, f1 |
| cola | | | | 1043 | mcc |
| copa | | | | 100 | acc |
| coqa | | | | 500 | f1, em |
| crows_pairs_english | | | | 1677 | likelihood_difference, pct_stereotype |
| crows_pairs_english_age | | | | 91 | likelihood_difference, pct_stereotype |
| crows_pairs_english_autre | | | | 11 | likelihood_difference, pct_stereotype |
| crows_pairs_english_disability | | | | 65 | likelihood_difference, pct_stereotype |
| crows_pairs_english_gender | | | | 320 | likelihood_difference, pct_stereotype |
| crows_pairs_english_nationality | | | | 216 | likelihood_difference, pct_stereotype |
| crows_pairs_english_physical_appearance | | | | 72 | likelihood_difference, pct_stereotype |
| crows_pairs_english_race_color | | | | 508 | likelihood_difference, pct_stereotype |
| crows_pairs_english_religion | | | | 111 | likelihood_difference, pct_stereotype |
| crows_pairs_english_sexual_orientation | | | | 93 | likelihood_difference, pct_stereotype |
| crows_pairs_english_socioeconomic | | | | 190 | likelihood_difference, pct_stereotype |
| crows_pairs_french | | | | 1677 | likelihood_difference, pct_stereotype |
| crows_pairs_french_age | | | | 90 | likelihood_difference, pct_stereotype |
| crows_pairs_french_autre | | | | 13 | likelihood_difference, pct_stereotype |
| crows_pairs_french_disability | | | | 66 | likelihood_difference, pct_stereotype |
| crows_pairs_french_gender | | | | 321 | likelihood_difference, pct_stereotype |
| crows_pairs_french_nationality | | | | 253 | likelihood_difference, pct_stereotype |
| crows_pairs_french_physical_appearance | | | | 72 | likelihood_difference, pct_stereotype |
| crows_pairs_french_race_color | | | | 460 | likelihood_difference, pct_stereotype |
| crows_pairs_french_religion | | | | 115 | likelihood_difference, pct_stereotype |
| crows_pairs_french_sexual_orientation | | | | 91 | likelihood_difference, pct_stereotype |
| crows_pairs_french_socioeconomic | | | | 196 | likelihood_difference, pct_stereotype |
| cycle_letters | | | | 10000 | acc |
| drop | | | | 9536 | em, f1 |
| ethics_cm | | | | 3885 | acc |
| ethics_deontology | | | | 3596 | acc, em |
| ethics_justice | | | | 2704 | acc, em |
| ethics_utilitarianism | | | | 4808 | acc |
| ethics_utilitarianism_original | | | | 4808 | acc |
| ethics_virtue | | | | 4975 | acc, em |
| gsm8k | | | | 1319 | acc |
| headqa | | | | 2742 | acc, acc_norm |
| headqa_en | | | | 2742 | acc, acc_norm |
| headqa_es | | | | 2742 | acc, acc_norm |
| hellaswag | | | | 10042 | acc, acc_norm |
| hendrycksTest-abstract_algebra | | | | 100 | acc, acc_norm |
| hendrycksTest-anatomy | | | | 135 | acc, acc_norm |
| hendrycksTest-astronomy | | | | 152 | acc, acc_norm |
| hendrycksTest-business_ethics | | | | 100 | acc, acc_norm |
| hendrycksTest-clinical_knowledge | | | | 265 | acc, acc_norm |
| hendrycksTest-college_biology | | | | 144 | acc, acc_norm |
| hendrycksTest-college_chemistry | | | | 100 | acc, acc_norm |
| hendrycksTest-college_computer_science | | | | 100 | acc, acc_norm |
| hendrycksTest-college_mathematics | | | | 100 | acc, acc_norm |
| hendrycksTest-college_medicine | | | | 173 | acc, acc_norm |
| hendrycksTest-college_physics | | | | 102 | acc, acc_norm |
| hendrycksTest-computer_security | | | | 100 | acc, acc_norm |
| hendrycksTest-conceptual_physics | | | | 235 | acc, acc_norm |
| hendrycksTest-econometrics | | | | 114 | acc, acc_norm |
| hendrycksTest-electrical_engineering | | | | 145 | acc, acc_norm |
| hendrycksTest-elementary_mathematics | | | | 378 | acc, acc_norm |
| hendrycksTest-formal_logic | | | | 126 | acc, acc_norm |
| hendrycksTest-global_facts | | | | 100 | acc, acc_norm |
| hendrycksTest-high_school_biology | | | | 310 | acc, acc_norm |
| hendrycksTest-high_school_chemistry | | | | 203 | acc, acc_norm |
| hendrycksTest-high_school_computer_science | | | | 100 | acc, acc_norm |
| hendrycksTest-high_school_european_history | | | | 165 | acc, acc_norm |
| hendrycksTest-high_school_geography | | | | 198 | acc, acc_norm |
| hendrycksTest-high_school_government_and_politics | | | | 193 | acc, acc_norm |
| hendrycksTest-high_school_macroeconomics | | | | 390 | acc, acc_norm |
| hendrycksTest-high_school_mathematics | | | | 270 | acc, acc_norm |
| hendrycksTest-high_school_microeconomics | | | | 238 | acc, acc_norm |
| hendrycksTest-high_school_physics | | | | 151 | acc, acc_norm |
| hendrycksTest-high_school_psychology | | | | 545 | acc, acc_norm |
| hendrycksTest-high_school_statistics | | | | 216 | acc, acc_norm |
| hendrycksTest-high_school_us_history | | | | 204 | acc, acc_norm |
| hendrycksTest-high_school_world_history | | | | 237 | acc, acc_norm |
| hendrycksTest-human_aging | | | | 223 | acc, acc_norm |
| hendrycksTest-human_sexuality | | | | 131 | acc, acc_norm |
| hendrycksTest-international_law | | | | 121 | acc, acc_norm |
| hendrycksTest-jurisprudence | | | | 108 | acc, acc_norm |
| hendrycksTest-logical_fallacies | | | | 163 | acc, acc_norm |
| hendrycksTest-machine_learning | | | | 112 | acc, acc_norm |
| hendrycksTest-management | | | | 103 | acc, acc_norm |
| hendrycksTest-marketing | | | | 234 | acc, acc_norm |
| hendrycksTest-medical_genetics | | | | 100 | acc, acc_norm |
| hendrycksTest-miscellaneous | | | | 783 | acc, acc_norm |
| hendrycksTest-moral_disputes | | | | 346 | acc, acc_norm |
| hendrycksTest-moral_scenarios | | | | 895 | acc, acc_norm |
| hendrycksTest-nutrition | | | | 306 | acc, acc_norm |
| hendrycksTest-philosophy | | | | 311 | acc, acc_norm |
| hendrycksTest-prehistory | | | | 324 | acc, acc_norm |
| hendrycksTest-professional_accounting | | | | 282 | acc, acc_norm |
| hendrycksTest-professional_law | | | | 1534 | acc, acc_norm |
| hendrycksTest-professional_medicine | | | | 272 | acc, acc_norm |
| hendrycksTest-professional_psychology | | | | 612 | acc, acc_norm |
| hendrycksTest-public_relations | | | | 110 | acc, acc_norm |
| hendrycksTest-security_studies | | | | 245 | acc, acc_norm |
| hendrycksTest-sociology | | | | 201 | acc, acc_norm |
| hendrycksTest-us_foreign_policy | | | | 100 | acc, acc_norm |
| hendrycksTest-virology | | | | 166 | acc, acc_norm |
| hendrycksTest-world_religions | | | | 171 | acc, acc_norm |
| iwslt17-ar-en | | | | 1460 | bleu, chrf, ter |
| iwslt17-en-ar | | | | 1460 | bleu, chrf, ter |
| lambada_openai | | | | 5153 | ppl, acc |
| lambada_openai_cloze | | | | 5153 | ppl, acc |
| lambada_openai_mt_de | | | | 5153 | ppl, acc |
| lambada_openai_mt_en | | | | 5153 | ppl, acc |
| lambada_openai_mt_es | | | | 5153 | ppl, acc |
| lambada_openai_mt_fr | | | | 5153 | ppl, acc |
| lambada_openai_mt_it | | | | 5153 | ppl, acc |
| lambada_standard | | | | 5153 | ppl, acc |
| lambada_standard_cloze | | | | 5153 | ppl, acc |
| logiqa | | | | 651 | acc, acc_norm |
| math_algebra | | | | 1187 | acc |
| math_asdiv | | | | 2305 | acc |
| math_counting_and_prob | | | | 474 | acc |
| math_geometry | | | | 479 | acc |
| math_intermediate_algebra | | | | 903 | acc |
| math_num_theory | | | | 540 | acc |
| math_prealgebra | | | | 871 | acc |
| math_precalc | | | | 546 | acc |
| mathqa | | | | 2985 | acc, acc_norm |
| mc_taco | | | | 9442 | f1, em |
| mgsm_bn | | | | 250 | acc |
| mgsm_de | | | | 250 | acc |
| mgsm_en | | | | 250 | acc |
| mgsm_es | | | | 250 | acc |
| mgsm_fr | | | | 250 | acc |
| mgsm_ja | | | | 250 | acc |
| mgsm_ru | | | | 250 | acc |
| mgsm_sw | | | | 250 | acc |
| mgsm_te | | | | 250 | acc |
| mgsm_th | | | | 250 | acc |
| mgsm_zh | | | | 250 | acc |
| mnli | | | | 9815 | acc |
| mnli_mismatched | | | | 9832 | acc |
| mrpc | | | | 408 | acc, f1 |
| multirc | | | | 4848 | acc |
| mutual | | | | 886 | r@1, r@2, mrr |
| mutual_plus | | | | 886 | r@1, r@2, mrr |
| nq_open | | | | 3610 | em |
| openbookqa | | | | 500 | acc, acc_norm |
| pawsx_de | | | | 2000 | acc |
| pawsx_en | | | | 2000 | acc |
| pawsx_es | | | | 2000 | acc |
| pawsx_fr | | | | 2000 | acc |
| pawsx_ja | | | | 2000 | acc |
| pawsx_ko | | | | 2000 | acc |
| pawsx_zh | | | | 2000 | acc |
| pile_arxiv | | | | 2407 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_bookcorpus2 | | | | 28 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_books3 | | | | 269 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_dm-mathematics | | | | 1922 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_enron | | | | 1010 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_europarl | | | | 157 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_freelaw | | | | 5101 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_github | | | | 18195 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_gutenberg | | | | 80 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_hackernews | | | | 1632 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_nih-exporter | | | | 1884 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_opensubtitles | | | | 642 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_openwebtext2 | | | | 32925 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_philpapers | | | | 68 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_pile-cc | | | | 52790 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_pubmed-abstracts | | | | 29895 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_pubmed-central | | | | 5911 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_stackexchange | | | | 30378 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_ubuntu-irc | | | | 22 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_uspto | | | | 11415 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_wikipedia | | | | 17511 | word_perplexity, byte_perplexity, bits_per_byte |
| pile_youtubesubtitles | | | | 342 | word_perplexity, byte_perplexity, bits_per_byte |
| piqa | | | | 1838 | acc, acc_norm |
| prost | | | | 18736 | acc, acc_norm |
| pubmedqa | | | | 1000 | acc |
| qa4mre_2011 | | | | 120 | acc, acc_norm |
| qa4mre_2012 | | | | 160 | acc, acc_norm |
| qa4mre_2013 | | | | 284 | acc, acc_norm |
| qasper | | | | 1764 | f1_yesno, f1_abstractive |
| qnli | | | | 5463 | acc |
| qqp | | | | 40430 | acc, f1 |
| race | | | | 1045 | acc |
| random_insertion | | | | 10000 | acc |
| record | | | | 10000 | f1, em |
| reversed_words | | | | 10000 | acc |
| rte | | | | 277 | acc |
| sciq | | | | 1000 | acc, acc_norm |
| scrolls_contractnli | | | | 1037 | em, acc, acc_norm |
| scrolls_govreport | | | | 972 | rouge1, rouge2, rougeL |
| scrolls_narrativeqa | | | | 3425 | f1 |
| scrolls_qasper | | | | 984 | f1 |
| scrolls_qmsum | | | | 272 | rouge1, rouge2, rougeL |
| scrolls_quality | | | | 2086 | em, acc, acc_norm |
| scrolls_summscreenfd | | | | 338 | rouge1, rouge2, rougeL |
| squad2 | | | | 11873 | exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
| sst | | | | 872 | acc |
| swag | | | | 20006 | acc, acc_norm |
| toxigen | | | | 940 | acc, acc_norm |
| triviaqa | | | | 11313 | acc |
| truthfulqa_gen | | | | 817 | bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff |
| truthfulqa_mc | | | | 817 | mc1, mc2 |
| webqs | | | | 2032 | acc |
| wic | | | | 638 | acc |
| wikitext | | | | 62 | word_perplexity, byte_perplexity, bits_per_byte |
| winogrande | | | | 1267 | acc |
| wmt14-en-fr | | | | 3003 | bleu, chrf, ter |
| wmt14-fr-en | | | | 3003 | bleu, chrf, ter |
| wmt16-de-en | | | | 2999 | bleu, chrf, ter |
| wmt16-en-de | | | | 2999 | bleu, chrf, ter |
| wmt16-en-ro | | | | 1999 | bleu, chrf, ter |
| wmt16-ro-en | | | | 1999 | bleu, chrf, ter |
| wmt20-cs-en | | | | 664 | bleu, chrf, ter |
| wmt20-de-en | | | | 785 | bleu, chrf, ter |
| wmt20-de-fr | | | | 1619 | bleu, chrf, ter |
| wmt20-en-cs | | | | 1418 | bleu, chrf, ter |
| wmt20-en-de | | | | 1418 | bleu, chrf, ter |
| wmt20-en-iu | | | | 2971 | bleu, chrf, ter |
| wmt20-en-ja | | | | 1000 | bleu, chrf, ter |
| wmt20-en-km | | | | 2320 | bleu, chrf, ter |
| wmt20-en-pl | | | | 1000 | bleu, chrf, ter |
| wmt20-en-ps | | | | 2719 | bleu, chrf, ter |
| wmt20-en-ru | | | | 2002 | bleu, chrf, ter |
| wmt20-en-ta | | | | 1000 | bleu, chrf, ter |
| wmt20-en-zh | | | | 1418 | bleu, chrf, ter |
| wmt20-fr-de | | | | 1619 | bleu, chrf, ter |
| wmt20-iu-en | | | | 2971 | bleu, chrf, ter |
| wmt20-ja-en | | | | 993 | bleu, chrf, ter |
| wmt20-km-en | | | | 2320 | bleu, chrf, ter |
| wmt20-pl-en | | | | 1001 | bleu, chrf, ter |
| wmt20-ps-en | | | | 2719 | bleu, chrf, ter |
| wmt20-ru-en | | | | 991 | bleu, chrf, ter |
| wmt20-ta-en | | | | 997 | bleu, chrf, ter |
| wmt20-zh-en | | | | 2000 | bleu, chrf, ter |
| wnli | | | | 71 | acc |
| wsc | | | | 104 | acc |
| wsc273 | | | | 273 | acc |
| xcopa_et | | | | 500 | acc |
| xcopa_ht | | | | 500 | acc |
| xcopa_id | | | | 500 | acc |
| xcopa_it | | | | 500 | acc |
| xcopa_qu | | | | 500 | acc |
| xcopa_sw | | | | 500 | acc |
| xcopa_ta | | | | 500 | acc |
| xcopa_th | | | | 500 | acc |
| xcopa_tr | | | | 500 | acc |
| xcopa_vi | | | | 500 | acc |
| xcopa_zh | | | | 500 | acc |
| xnli_ar | | | | 5010 | acc |
| xnli_bg | | | | 5010 | acc |
| xnli_de | | | | 5010 | acc |
| xnli_el | | | | 5010 | acc |
| xnli_en | | | | 5010 | acc |
| xnli_es | | | | 5010 | acc |
| xnli_fr | | | | 5010 | acc |
| xnli_hi | | | | 5010 | acc |
| xnli_ru | | | | 5010 | acc |
| xnli_sw | | | | 5010 | acc |
| xnli_th | | | | 5010 | acc |
| xnli_tr | | | | 5010 | acc |
| xnli_ur | | | | 5010 | acc |
| xnli_vi | | | | 5010 | acc |
| xnli_zh | | | | 5010 | acc |
| xstory_cloze_ar | | | | 1511 | acc |
| xstory_cloze_en | | | | 1511 | acc |
| xstory_cloze_es | | | | 1511 | acc |
| xstory_cloze_eu | | | | 1511 | acc |
| xstory_cloze_hi | | | | 1511 | acc |
| xstory_cloze_id | | | | 1511 | acc |
| xstory_cloze_my | | | | 1511 | acc |
| xstory_cloze_ru | | | | 1511 | acc |
| xstory_cloze_sw | | | | 1511 | acc |
| xstory_cloze_te | | | | 1511 | acc |
| xstory_cloze_zh | | | | 1511 | acc |
| xwinograd_en | | | | 2325 | acc |
| xwinograd_fr | | | | 83 | acc |
| xwinograd_jp | | | | 959 | acc |
| xwinograd_pt | | | | 263 | acc |
| xwinograd_ru | | | | 315 | acc |
| xwinograd_zh | | | | 504 | acc |
| Ceval-valid-computer_network | | | | 19 | acc |
| Ceval-valid-operating_system | | | | 19 | acc |
| Ceval-valid-computer_architecture | | | | 21 | acc |
| Ceval-valid-college_programming | | | | 37 | acc |
| Ceval-valid-college_physics | | | | 19 | acc |
| Ceval-valid-college_chemistry | | | | 24 | acc |
| Ceval-valid-advanced_mathematics | | | | 19 | acc |
| Ceval-valid-probability_and_statistics | | | | 18 | acc |
| Ceval-valid-discrete_mathematics | | | | 16 | acc |
| Ceval-valid-electrical_engineer | | | | 37 | acc |
| Ceval-valid-metrology_engineer | | | | 24 | acc |
| Ceval-valid-high_school_mathematics | | | | 18 | acc |
| Ceval-valid-high_school_physics | | | | 19 | acc |
| Ceval-valid-high_school_chemistry | | | | 19 | acc |
| Ceval-valid-high_school_biology | | | | 19 | acc |
| Ceval-valid-middle_school_mathematics | | | | 19 | acc |
| Ceval-valid-middle_school_biology | | | | 21 | acc |
| Ceval-valid-middle_school_physics | | | | 19 | acc |
| Ceval-valid-middle_school_chemistry | | | | 20 | acc |
| Ceval-valid-veterinary_medicine | | | | 23 | acc |
| Ceval-valid-college_economics | | | | 55 | acc |
| Ceval-valid-business_administration | | | | 33 | acc |
| Ceval-valid-marxism | | | | 19 | acc |
| Ceval-valid-mao_zedong_thought | | | | 24 | acc |
| Ceval-valid-education_science | | | | 29 | acc |
| Ceval-valid-teacher_qualification | | | | 44 | acc |
| Ceval-valid-high_school_politics | | | | 19 | acc |
| Ceval-valid-high_school_geography | | | | 19 | acc |
| Ceval-valid-middle_school_politics | | | | 21 | acc |
| Ceval-valid-middle_school_geography | | | | 12 | acc |
| Ceval-valid-modern_chinese_history | | | | 23 | acc |
| Ceval-valid-ideological_and_moral_cultivation | | | | 19 | acc |
| Ceval-valid-logic | | | | 22 | acc |
| Ceval-valid-law | | | | 24 | acc |
| Ceval-valid-chinese_language_and_literature | | | | 23 | acc |
| Ceval-valid-art_studies | | | | 33 | acc |
| Ceval-valid-professional_tour_guide | | | | 29 | acc |
| Ceval-valid-legal_professional | | | | 23 | acc |
| Ceval-valid-high_school_chinese | | | | 19 | acc |
| Ceval-valid-high_school_history | | | | 20 | acc |
| Ceval-valid-middle_school_history | | | | 22 | acc |
| Ceval-valid-civil_servant | | | | 47 | acc |
| Ceval-valid-sports_science | | | | 19 | acc |
| Ceval-valid-plant_protection | | | | 22 | acc |
| Ceval-valid-basic_medicine | | | | 19 | acc |
| Ceval-valid-clinical_medicine | | | | 22 | acc |
| Ceval-valid-urban_and_rural_planner | | | | 46 | acc |
| Ceval-valid-accountant | | | | 49 | acc |
| Ceval-valid-fire_engineer | | | | 31 | acc |
| Ceval-valid-environmental_impact_assessment_engineer | | | | 31 | acc |
| Ceval-valid-tax_accountant | | | | 49 | acc |
| Ceval-valid-physician | | | | 49 | acc |