
[Bug] Large score gaps on some subsets of the long-text evaluation datasets #1061

Open
bullw opened this issue Apr 19, 2024 · 0 comments
bullw commented Apr 19, 2024

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

OpenCompass 0.2.3
transformers 4.35.2
GPU A100

Reproduces the problem - code/configuration sample

None

Reproduces the problem - command or script

python run.py --datasets longbench leval \
              --hf-path /code/open_model/chatglm2-6b-32k \
              --model-kwargs device_map='auto' \
              --max-seq-len 32768 \
              --batch-size 1 \
              --max-out-len 512 \
              --num-gpus 1 \
              --max-partition-size 5000 \
              --max-workers-per-gpu 3 \
              --engine torch
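
For context, the command above is roughly equivalent to the following Python model config (a sketch only, following OpenCompass 0.2.x config conventions; exact field names may differ between versions, and the abbr is just an illustrative label):

from opencompass.models import HuggingFace

models = [
    dict(
        type=HuggingFace,
        abbr='chatglm2-6b-32k-hf',        # illustrative label, not taken from the command
        path='/code/open_model/chatglm2-6b-32k',
        tokenizer_path='/code/open_model/chatglm2-6b-32k',
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=True,       # ChatGLM checkpoints typically need this
        ),
        tokenizer_kwargs=dict(trust_remote_code=True),
        max_seq_len=32768,
        max_out_len=512,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]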

Reproduces the problem - error message

I reproduced most of the scores using the official configuration, but a few subsets show a large score gap. Please help me analyze them.

The scores below differ substantially from those on the leaderboard:

  • LEval summary
dataset,version,metric,mode,opencompass.models.huggingface.HuggingFace_open_model_chatglm2-6b-32k
LEval_nq,52c33f,rouge1,gen,26.77
LEval_nq,52c33f,rouge2,gen,16.55
LEval_nq,52c33f,rougeL,gen,26.64
LEval_nq,52c33f,rougeLsum,gen,26.54
LEval_narrativeqa,766dd0,rouge1,gen,8.82
LEval_narrativeqa,766dd0,rouge2,gen,1.52
LEval_narrativeqa,766dd0,rougeL,gen,8.01
LEval_narrativeqa,766dd0,rougeLsum,gen,8.18
LEval_coursera,36a006,accuracy,gen,36.05
LEval_topic_retrieval,bf433f,score,gen,52.67
  • LongBench summary
dataset,version,metric,mode,opencompass.models.huggingface.HuggingFace_open_model_chatglm2-6b-32k
LongBench_trec,824187,score,gen,62.00
LongBench_lsht,e8a339,score,gen,29.92
LongBench_narrativeqa,a68305,score,gen,7.30
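
A minimal sketch for computing the per-subset gaps from the summaries above, assuming each summary is saved locally as summary.csv and the None placeholders are replaced by hand with the leaderboard numbers being compared against:

import csv

# Leaderboard reference scores, keyed by (dataset, metric).
# The None values are placeholders; fill them in by hand.
reference = {
    ('LEval_coursera', 'accuracy'): None,
    ('LEval_nq', 'rouge1'): None,
    ('LEval_narrativeqa', 'rouge1'): None,
    ('LEval_topic_retrieval', 'score'): None,
    ('LongBench_trec', 'score'): None,
    ('LongBench_lsht', 'score'): None,
    ('LongBench_narrativeqa', 'score'): None,
}

with open('summary.csv', newline='') as f:    # the CSV summary pasted above
    reader = csv.DictReader(f)
    score_col = reader.fieldnames[-1]         # last column holds the model's scores
    for row in reader:
        ref = reference.get((row['dataset'], row['metric']))
        if ref is None:
            continue
        score = float(row[score_col])
        print(f"{row['dataset']:>24}  {row['metric']:>10}  "
              f"reproduced={score:6.2f}  reference={ref:6.2f}  gap={score - ref:+6.2f}")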

I have the following questions:

  1. The scores on four LEval subsets - Coursera, NarrativeQA, NQ, and Topic Retrieval - differ significantly from the leaderboard, by -9.3, -9.42, -14.29, and 8 points respectively. Could you please tell me what causes these discrepancies? What level of discrepancy is acceptable? How can I reduce it to make the results more accurate?

  2. The scores on three LongBench subsets - NarrativeQA, TREC, and LSHT (zh) - differ significantly from the leaderboard, by -10.94, 31.04, and 7.17 points respectively. Could you please tell me what causes these discrepancies? What level of discrepancy is acceptable? How can I reduce it to make the results more accurate?

  3. For the TREC and LSHT (zh) subsets of LongBench, I found that the corresponding scores on the LongBench leaderboard (https://github.com/THUDM/LongBench/blob/main/README.md) are not significantly different from my results. Should I rely on the OpenCompass leaderboard or the LongBench leaderboard?

Other information

No response
