# Awesome-LLM-based-Evaluators

Automated LLM-based evaluation has emerged as a scalable and efficient alternative to human evaluation of large language models. This repo collects papers on LLM-based evaluators.
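Most of the papers below instantiate some variant of the same pattern: prompt a capable LLM with the item to be judged and a scoring rubric, then parse its verdict. Here is a minimal, provider-agnostic sketch of that loop; the rubric wording, the 1-5 scale, and the `complete` callable are illustrative assumptions, not taken from any particular paper.

```python
# Minimal sketch of the "LLM-as-a-judge" pattern surveyed below.
# `complete` stands in for whatever chat/completion client you use
# (a hosted API, a local model, ...); the prompt, 1-5 scale, and
# parsing are illustrative assumptions, not any single paper's recipe.
import re
from typing import Callable, Optional

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the INSTRUCTION for helpfulness, factual accuracy,
and fluency on a scale of 1 (poor) to 5 (excellent).
Reply with only the integer score.

INSTRUCTION:
{instruction}

RESPONSE:
{response}
"""

def judge(instruction: str, response: str,
          complete: Callable[[str], str]) -> Optional[int]:
    """Ask an LLM to grade one response; return the parsed score, or None."""
    verdict = complete(JUDGE_PROMPT.format(instruction=instruction,
                                           response=response))
    match = re.search(r"[1-5]", verdict)
    return int(match.group()) if match else None
```

Much of the literature below is about making this simple loop trustworthy: calibrating the scores, mitigating position and verbosity biases, using multi-agent debate or peer review among judges, and meta-evaluating agreement with human raters.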

## LLM-based evaluators

| Title & Authors | Venue | Year | Citation Count | Code |
|---|---|---|---|---|
| Chateval: Towards better llm-based evaluators through multi-agent debate by Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu | ICLR | 2024 | 107 | GitHub |
| Dyval: Graph-informed dynamic evaluation of large language models by Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie | ICLR | 2024 | 6 | GitHub |
| PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization by Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Wenjin Yao, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang | ICLR | 2024 | 52 | GitHub |
| Can large language models be an alternative to human evaluations? by Cheng-Han Chiang, Hung-yi Lee | ACL | 2023 | 133 | - |
| LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models by Yen-Ting Lin, Yun-Nung Chen | NLP4ConvAI | 2023 | 42 | - |
| Are large language model-based evaluators the solution to scaling up multilingual evaluation? by Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram | EACL Findings | 2024 | 12 | GitHub |
| Judging llm-as-a-judge with mt-bench and chatbot arena by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica | NeurIPS | 2023 | 510 | GitHub |
| Calibrating LLM-Based Evaluator by Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang | arXiv | 2023 | 7 | - |
| LLM-based NLG Evaluation: Current Status and Challenges by Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan | arXiv | 2024 | 1 | - |
| Are LLM-based Evaluators Confusing NLG Quality Criteria? by Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan | arXiv | 2024 | - | - |
| PRE: A Peer Review Based Large Language Model Evaluator by Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu | arXiv | 2024 | 1 | GitHub |
| Allure: A systematic protocol for auditing and improving llm-based evaluation of text using iterative in-context-learning by Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad | arXiv | 2023 | 4 | - |
| Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate by Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu | arXiv | 2024 | - | GitHub |
| Split and merge: Aligning position biases in large language model based evaluators by Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu | arXiv | 2023 | 8 | - |
| One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation by Tejpalsingh Siledar, Swaroop Nath, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Swaprava Nath, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera | arXiv | 2024 | - | GitHub |
| Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation by Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang | arXiv | 2023 | 9 | GitHub |
| Is chatgpt a good nlg evaluator? a preliminary study by Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou | NewSumm@EMNLP | 2023 | 179 | GitHub |
| G-eval: Nlg evaluation using gpt-4 with better human alignment by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu | EMNLP | 2023 | 381 | GitHub |
| GPTScore: Evaluate as You Desire by Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu | EMNLP | 2023 | 228 | GitHub |
| Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study by Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu | arXiv | 2023 | 39 | - |
| Evaluating General-Purpose AI with Psychometrics by Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie | arXiv | 2023 | 3 | - |

## Psychometrics in LLM evaluation

| Title & Authors | Venue | Year | Citation Count | Code |
|---|---|---|---|---|
| InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews by Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, Yanghua Xiao | arXiv | 2023 | 16 | GitHub |
| Who is ChatGPT? Benchmarking LLMs’ Psychological Portrayal Using PsychoBench by Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu | ICLR | 2024 | 10 | GitHub |
| On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs by Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael Lyu | ICLR | 2024 | 2 | GitHub |
| Evaluating and Inducing Personality in Pre-trained Language Models by Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, Yixin Zhu | NeurIPS | 2023 | 47 | GitHub |
| Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective by Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Qingyang Mao, Shijin Wang, Enhong Chen | arXiv | 2023 | 19 | - |
| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao | arXiv | 2023 | 49 | GitHub |
| LLM Agents for Psychology: A Study on Gamified Assessments by Qisen Yang, Zekun Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, Gao Huang | arXiv | 2024 | - | - |
| MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji | ICLR | 2024 | 37 | GitHub |
| GPT-4’s assessment of its performance in a USMLE-based case study by Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal, Chandra Dhakal | arXiv | 2024 | 1 | - |
