📄 Evaluation and Reliability

Paper List

Evaluating LLMs at Detecting Errors in LLM Responses2024.04.04

Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, etc


Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers2024.04.04

Yuan Wang, Xuyang Wu, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang


Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models2024.03.29

Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, etc . - 【arXiv.org】


ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models2024.03.08

Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, etc


Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation2024.03.05

Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, etc


Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation2024.03.04

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh


A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision Language Models2024.02.28

Xiujie Song, Mengyue Wu, Ke Zhu, Chunhao Zhang, Yanyi Chen


Evaluating Very Long-Term Conversational Memory of LLM Agents2024.02.27

A. Maharana, Dong-Ho Lee, S. Tulyakov, Mohit Bansal, Francesco Barbieri, etc . - 【arXiv.org】


Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs2024.02.21

Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Hansheng Fang, Aishan Liu, etc . - 【arXiv.org】


TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization2024.02.20

Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, etc


How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis2024.02.08

Federico Bianchi, P. Chia, Mert Yüksekgönül, Jacopo Tagliabue, Daniel Jurafsky, etc . - 【arXiv.org】


Can Large Language Models Understand Context?2024.02.01

Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, etc . - 【arXiv.org】


Evaluating Large Language Models for Generalization and Robustness via Data Compression2024.02.01

Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin . - 【arXiv.org】


VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks2024.01.24

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, etc . - 【arXiv.org】


Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 2023.12.26

S. Bsharat, Aidar Myrzakhan, Zhiqiang Shen . - 【arXiv.org】


TouchStone: Evaluating Vision-Language Models by Language Models2023.08.31

Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xing Zhang, etc . - 【arXiv.org】


Shepherd: A Critic for Language Model Generation2023.08.08

Tianlu Wang, Ping Yu, Xiaoqing Tan, Sean O'Brien, Ramakanth Pasunuru, etc . - 【arXiv.org】


Self-consistency for open-ended generations2023.07.11

Siddhartha Jain, Xiaofei Ma, Anoop Deoras, Bing Xiang . - 【arXiv.org】


Jailbroken: How Does LLM Safety Training Fail?2023.07.05

Alexander Wei, Nika Haghtalab, J. Steinhardt . - 【arXiv.org】


Towards Measuring the Representation of Subjective Global Opinions in Language Models2023.06.28

Esin Durmus, Karina Nguyen, Thomas Liao, Nicholas Schiefer, Amanda Askell, etc . - 【arXiv.org】


On the Reliability of Watermarks for Large Language Models2023.06.07

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, etc . - 【arXiv.org】


SETI: Systematicity Evaluation of Textual Inference2023.05.24

Xiyan Fu, Anette Frank


From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions2023.05.24

Peter Jansen


Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples2023.05.24

Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, etc


EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models2023.05.24

Zhengwei Tao, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yanlin Feng, etc


Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions2023.05.24

Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Chen, Jiajun Chen


HuatuoGPT, towards Taming Language Model to Be a Doctor2023.05.24

Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, etc


Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models2023.05.24

Daman Arora, Himanshu Gaurav Singh, Mausam


Is GPT-4 a Good Data Analyst?2023.05.24

Liying Cheng, Xingxuan Li, Lidong Bing


ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories2023.05.24

Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, etc


Sentiment Analysis in the Era of Large Language Models: A Reality Check2023.05.24

Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing


A RelEntLess Benchmark for Modelling Graded Relations between Named Entities2023.05.24

Asahi Ushio, Jose Camacho Collados, S. Schockaert


IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models2023.05.24

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, etc


GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP2023.05.24

Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed


Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback2023.05.24

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, etc


PESCO: Prompt-enhanced Self Contrastive Learning for Zero-shot Text Classification2023.05.24

Yau-Shian Wang, Ta-Chung Chi, Ruohong Zhang, Yiming Yang


Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective2023.05.24

Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao


ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games2023.05.24

Ruoyao Wang, Graham Todd, Eric Yuan, Ziang Xiao, Marc-Alexandre Côté, etc


Estimating Large Language Model Capabilities without Labeled Test Data2023.05.24

Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia


Faithful Low-Resource Data-to-Text Generation through Cycle Training2023.05.24

Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, etc


Large Language Models as Counterfactual Generator: Strengths and Weaknesses2023.05.24

Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, T. Qian


ChatGPT and Simple Linguistic Inferences: Blind Spots and Blinds2023.05.24

Victoria Basmov, Yoav Goldberg, Reut Tsarfaty


Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models2023.05.24

Amirhossein Kazemnejad, Mehdi Rezagholizadeh, Prasanna Parthasarathi, Sarath Chandar


Using Natural Language Explanations to Rescale Human Judgments2023.05.24

Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett


Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models2023.05.24

Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, etc


ECHo: Event Causality Inference via Human-centric Reasoning2023.05.24

Yuxi Xie, Guanzhen Li, Min-Yen Kan


In-Context Demonstration Selection with Cross Entropy Difference2023.05.24

Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, etc


I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors2023.05.24

Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, etc


Gender Biases in Automatic Evaluation Metrics: A Case Study on Image Captioning2023.05.24

Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng


Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models2023.05.24

Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen


Can Transformers Learn to Solve Problems Recursively?2023.05.24

Shizhuo Dylan Zhang, Curt Tigges, Stella Rose Biderman, M. Raginsky, T. Ringer


Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs2023.05.24

Xiaoyang Song, Akshat Gupta, Kiyan Mohebbizadeh, Shujie Hu, Anant Singh


Enabling Large Language Models to Generate Text with Citations2023.05.24

Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen


Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy2023.05.24

Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal


WikiChat: A Few-Shot LLM-Based Chatbot Grounded with Wikipedia2023.05.23

Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, M. Lam


Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis2023.05.23

Mingda Chen, Xilun Chen, Wen-tau Yih


Learn from Mistakes through Cooperative Interaction with Study Assistant2023.05.23

Danqing Wang, Lei Li


RET-LLM: Towards a General Read-Write Memory for Large Language Models2023.05.23

Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze


ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding2023.05.23

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy


Evaluating Factual Consistency of Summaries with Large Language Models2023.05.23

Shiqi Chen, Siyang Gao, Junxian He


Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge Retrieval from Foundation Language Models2023.05.23

Tim Schott, Daniel Furman, Shreshta Bhat


Detecting and Mitigating Indirect Stereotypes in Word Embeddings2023.05.23

Erin George, Joyce Chew, Deanna Needell


PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents2023.05.23

Simeng Sun, Yang Liu, Shuohang Wang, Chenguang Zhu, Mohit Iyyer


Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations2023.05.23

Tiziano Labruna, Sofia Brenna, Andrea Zaninello, Bernardo Magnini


Sources of Hallucination by Large Language Models on Inference Tasks2023.05.23

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, etc


LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond2023.05.23

Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, etc


Automatic Model Selection with Large Language Models for Reasoning2023.05.23

Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Qizhe Xie


Cascaded Beam Search: Plug-and-Play Terminology-Forcing For Neural Machine Translation2023.05.23

Frédéric Odermatt, Béni Egressy, Roger Wattenhofer


CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation2023.05.23

Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, etc


Deduction under Perturbed Evidence: Probing Student Simulation Capabilities of Large Language Models2023.05.23

Shashank Sonkar, Richard Baraniuk


Prompt position really matters in few-shot and zero-shot NLU tasks2023.05.23

Junyu Mao, Stuart E. Middleton, Mahesan Niranjan


CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains2023.05.23

Xuanyu Zhang, Bingbing Li, Qing Yang


Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA2023.05.23

David Heineman, Yao Dou, Mounica Maddela, Wei Xu


Pre-training Language Models for Comparative Reasoning2023.05.23

Mengxia Yu, Zhihan Zhang, Wenhao Yu, Meng Jiang


Having Beer after Prayer? Measuring Cultural Bias in Large Language Models2023.05.23

Tarek Naous, Michael J. Ryan, Wei Xu


On Robustness of Finetuned Transformer-based NLP Models2023.05.23

Pavan Kalyan Reddy Neerudu, Subba Reddy Oota, Mounika Marreddy, Venkateswara Rao Kagita, Manish Gupta


Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors2023.05.23

Ridong Han, T. Peng, Chaohao Yang, Benyou Wang, Lu Liu, etc


Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions2023.05.23

Zhihan Zhang, Wenhao Yu, Zheng Ning, Mingxuan Ju, Meng Jiang


Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction2023.05.23

Yew Ken Chia, Hui Chen, Wei Han, Guizhen Chen, Sharifah Mahani Aljunied, etc


TaDSE: Template-aware Dialogue Sentence Embeddings2023.05.23

Minsik Oh, Jiwei Li, Guoyin Wang


WebIE: Faithful and Robust Information Extraction on the Web2023.05.23

Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos Christodoulopoulos, Andrea Pierleoni


Exploring Representational Disparities Between Multilingual and Bilingual Translation Models2023.05.23

Neha Verma, Kenton Murray, Kevin Duh


How Old is GPT?: The HumBEL Framework for Evaluating Language Models using Human Demographic Data2023.05.23

Anthony Sicilia, Jennifer C. Gates, Malihe Alikhani


Out-of-Distribution Generalization in Text Classification: Past, Present, and Future2023.05.23

Linyi Yang, Y. Song, Xuan Ren, Chenyang Lyu, Yidong Wang, etc


Target-Agnostic Gender-Aware Contrastive Learning for Mitigating Bias in Multilingual Machine Translation2023.05.23

Minwoo Lee, Hyukhun Koh, Kang-il Lee, Dongdong Zhang, Minsung Kim, etc


LLM-powered Data Augmentation for Enhanced Crosslingual Performance2023.05.23

Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji


FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation2023.05.23

Sewon Min, Kalpesh Krishna, Xinxi Lyu, M. Lewis, Wen-tau Yih, etc


On Learning to Summarize with Large Language Models as References2023.05.23

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan


Reducing Sensitivity on Speaker Names for Text Generation from Dialogues2023.05.23

Qi Jia, Haifeng Tang, Kenny Q. Zhu


Self-Critique Prompting with Large Language Models for Inductive Instructions2023.05.23

Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, etc


ChipGPT: How far are we from natural language hardware design2023.05.23

Kaiyan Chang, Ying Wang, Haimeng Ren, Mengdi Wang, Shengwen Liang, etc


Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning2023.05.23

Xiao Yu, Maximillian Chen, Zhou Yu


Improving Self-training for Cross-lingual Named Entity Recognition with Contrastive and Prototype Learning2023.05.23

Ran Zhou, Xin Li, Lidong Bing, E. Cambria, Chun Miao


Better Low-Resource Entity Recognition Through Translation and Annotation Fusion2023.05.23

Yang Chen, Vedaant Shah, Alan Ritter


Cross-functional Analysis of Generalisation in Behavioural Learning2023.05.22

Pedro Henrique Luz de Araujo, Benjamin Roth


Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations2023.05.22

Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, etc


LM vs LM: Detecting Factual Errors via Cross Examination2023.05.22

Roi Cohen, May Hamri, Mor Geva, A. Globerson


SPARSEFIT: Few-shot Prompting with Sparse Fine-tuning for Jointly Generating Predictions and Natural Language Explanations2023.05.22

Jesus Solano, Oana-Maria Camburu, Pasquale Minervini


SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation2023.05.22

Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, etc


Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models2023.05.22

Hao Wang, Hirofumi Shimizu, Daisuke Kawahara


Beyond Labels: Empowering Human with Natural Language Explanations through a Novel Active-Learning Architecture2023.05.22

Bingsheng Yao, Ishan Jindal, Lucian Popa, Yannis Katsis, Sayan Ghosh, etc


llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology2023.05.22

Masanori Hirano, Masahiro Suzuki, Hiroki Sakaji


AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback2023.05.22

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, etc


Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale2023.05.22

M. Costa-jussà, Pierre Yves Andrews, Eric J. M. Smith, Prangthip Hansanti, Christophe Ropers, etc


Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design2023.05.22

Ibrahim M. Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, L. Beyer


MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering2023.05.22

Vaishali Pal, Andrew Yates, E. Kanoulas, M. de Rijke


Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis2023.05.22

Seraphina Goldfarb-Tarrant, Björn Ross, Adam Lopez


Model Analysis & Evaluation for Ambiguous Question Answering2023.05.21

Konstantinos Papakostas, Irene Papadopoulou


TheoremQA: A Theorem-driven Question Answering dataset2023.05.21

Wenhu Chen, Ming Yin, Max Ku, Yixin Wan, Xueguang Ma, etc


Evaluating the Performance of Large Language Models on GAOKAO Benchmark2023.05.21

Xiaotian Zhang, Chun-yan Li, Yi Zong, Zhengyu Ying, Liang He, etc


Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?2023.05.20

Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan Sarkar, etc


Evaluation of medium-large Language Models at zero-shot closed book generative question answering2023.05.19

René Peinl, Johannes Wirth


Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses2023.05.19

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes . - 【arXiv.org】


OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models2023.05.19

Badr AlKhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz, etc


Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate2023.05.19

Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, Bing Qin . - 【arXiv.org】


TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks2023.05.19

Shubhra (Santu) Karmaker, Dongji Feng . - 【arXiv.org】


Efficient Prompting via Dynamic In-Context Learning2023.05.18

Wangchunshu Zhou, Yuchen Jiang, Ryan Cotterell, Mrinmaya Sachan . - 【arXiv.org】


Qualifying Chinese Medical Licensing Examination with Knowledge Enhanced Generative Pre-training Model2023.05.17

Jiageng Wu, X. Wu, Zhaopeng Qiu, Minghui Li, Yefeng Zheng, etc . - 【arXiv.org】


Can Language Models Solve Graph Problems in Natural Language?2023.05.17

Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, etc . - 【arXiv.org】


AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models2023.04.13

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, etc


GPTEval: NLG Evaluation using GPT-4 with Better Human Alignment2023.03.29

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, etc


How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks2023.03.01

Xuanting Chen, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, etc . - 【ArXiv】


Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints2023.02.17

Albert Lu, Hongxin Zhang, Yanzhe Zhang, Xuezhi Wang, Diyi Yang . - 【ArXiv】


Evaluating the Robustness of Discrete Prompts2023.02.11

Yoichi Ishibashi, D. Bollegala, Katsuhito Sudoh, Satoshi Nakamura . - 【ArXiv】


Controlling for Stereotypes in Multimodal Language Model Evaluation2023.02.03

Manuj Malik, Richard Johansson . - 【BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP】


Large Language Models Can Be Easily Distracted by Irrelevant Context2023.01.31

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, etc . - 【ArXiv】


Emergent Analogical Reasoning in Large Language Models2022.12.19

Taylor W. Webb, K. Holyoak, Hongjing Lu . - 【ArXiv】


Discovering Language Model Behaviors with Model-Written Evaluations2022.12.19

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, etc . - 【ArXiv】


Constitutional AI: Harmlessness from AI Feedback2022.12.15

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, etc . - 【ArXiv】


On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning2022.12.15

Omar Shaikh, Hongxin Zhang, William B. Held, Michael Bernstein, Diyi Yang . - 【ArXiv】


Demystifying Prompts in Language Models via Perplexity Estimation2022.12.08

Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer . - 【ArXiv】


Solving math word problems with process- and outcome-based feedback2022.11.25

J. Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, etc . - 【ArXiv】


Holistic Evaluation of Language Models2022.11.16

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, etc . - 【ArXiv】


Can language models handle recursively nested grammatical structures? A case study on comparing models and humans2022.10.27

Andrew Kyle Lampinen . - 【ArXiv】


Prompting GPT-3 To Be Reliable2022.10.17

Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, etc . - 【ArXiv】


An Interpretability Evaluation Benchmark for Pre-trained Language Models2022.07.28

Ya-Ming Shen, Lijie Wang, Ying Chen, Xinyan Xiao, Jing Liu, etc . - 【ArXiv】


Re-Examining Calibration: The Case of Question Answering2022.05.25

Chenglei Si, Chen Zhao, Sewon Min, Jordan L. Boyd-Graber . - 【Conference on Empirical Methods in Natural Language Processing】


Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing2022.05.24

Linlu Qiu, Peter Shaw, Panupong Pasupat, Tianze Shi, Jonathan Herzig, etc . - 【Conference on Empirical Methods in Natural Language Processing】


The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning2022.05.06

Xi Ye, Greg Durrett


Training Verifiers to Solve Math Word Problems2021.10.27

Karl Cobbe, V. Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, etc . - 【ArXiv】


BBQ: A hand-built bias benchmark for question answering2021.10.15

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, etc . - 【Findings】


BARTScore: Evaluating Generated Text as Text Generation2021.06.22

Weizhe Yuan, Graham Neubig, Pengfei Liu . - 【Neural Information Processing Systems】


Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning2021.06.13

Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, Xiang Ren . - 【Annual Meeting of the Association for Computational Linguistics】


Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity2021.04.18

Yao Lu, Max Bartolo, Alastair Moore, S. Riedel, Pontus Stenetorp . - 【Annual Meeting of the Association for Computational Linguistics】

CONTINUE...