marcopoli/bbh_ita

Italian Version of the BBH Dataset

Dataset based on the Italian translation provided by the works cited below.

Citations

@article{suzgun2022challenging,
title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and Wei, Jason},
journal={arXiv preprint arXiv:2210.09261},
year={2022}
}

@inproceedings{RanaldiPRZF23,
title={Teasing LLMs Adapted to Italian},
author={Ranaldi, Leonardo and Pucci, Giulia and Ruzzetti, Elena Sofia and Zanzotto, Fabio Massimo and Freitas, Andr{\'e}},
booktitle={Proceedings of the 9th Italian Conference on Computational Linguistics, Venice, Italy, November 30 - December 2, 2023},
series={{CEUR} Workshop Proceedings},
volume={3596},
publisher={CEUR-WS.org},
year={2023},
url={https://ceur-ws.org/Vol-3596/short18.pdf}
}

@misc{basile2023llamantino,
    title={LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language}, 
    author={Pierpaolo Basile and Elio Musacchio and Marco Polignano and Lucia Siciliani and Giuseppe Fiameni and Giovanni Semeraro},
    year={2023},
    eprint={2312.09993},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Description

BBH focuses on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
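The contrast described above is between answer-only few-shot prompts and few-shot prompts whose exemplars include a written-out rationale. A minimal sketch of how the two prompt styles might be assembled (the example question and rationale are illustrative, not taken from the dataset files):

```python
def build_prompt(examples, question, use_cot):
    """Assemble a few-shot prompt from (question, rationale, answer)
    triples. The rationale is included only when use_cot is True."""
    parts = []
    for q, rationale, answer in examples:
        if use_cot:
            # Chain-of-thought exemplar: rationale precedes the answer.
            parts.append(f"Q: {q}\nA: {rationale} So the answer is {answer}.")
        else:
            # Answer-only exemplar, as in the original BIG-Bench evaluations.
            parts.append(f"Q: {q}\nA: {answer}")
    # The test question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Illustrative exemplar (not from the BBH task files).
examples = [
    ("Is 7 + 5 greater than 11?",
     "7 + 5 equals 12, and 12 is greater than 11.",
     "yes"),
]
print(build_prompt(examples, "Is 3 + 4 greater than 9?", use_cot=True))
```

Evaluating the same model with `use_cot=True` and `use_cot=False` on identical exemplars isolates the effect of the rationale, which is the comparison the paragraph above summarizes.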

Homepage

URL: https://github.com/suzgunmirac/BIG-Bench-Hard
