Cannot reproduce videochatgpt video benchmark #175

Open
Leo-Yuyang opened this issue May 16, 2024 · 2 comments

Comments

@Leo-Yuyang

Dear authors, your paper reports very impressive performance on the VideoChatGPT video benchmark, but I could not find any code for reproducing that experiment. I therefore modified the MVBench evaluation code to run inference on this task, yet I cannot reproduce the reported numbers.

What I got is:
[screenshot: reproduced scores]

The model itself is correct, because the same checkpoint reproduces the reported MVBench performance. The only remaining difference should therefore be the prompt used at test time, but I find it hard to believe that a different prompt alone can cause such a large gap. Could you please share the prompt used for this benchmark, or your inference results? Extraordinary claims require extraordinary evidence.

@Andy1621
Collaborator

Can you reproduce the results of other models? A recent paper showed that the GPT version used for evaluation affects the final results.
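
If the evaluator snapshot is the suspect, one way to rule it out is to pin the model explicitly when scoring predictions. Below is a minimal sketch, assuming the OpenAI Python client (>=1.0); the `judge` helper and the prompt wording are only illustrative, not the prompt used in the paper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str, prediction: str,
          model: str = "gpt-3.5-turbo-1106") -> int:
    """Score one prediction with a pinned GPT snapshot (0-5 scale)."""
    response = client.chat.completions.create(
        model=model,       # pin the snapshot so reruns stay comparable
        temperature=0,     # reduce run-to-run variance of the judge
        messages=[
            {"role": "system",
             "content": ("You are an evaluator. Rate how well the predicted "
                         "answer matches the correct answer on a scale of 0 "
                         "to 5. Reply with a single integer.")},
            {"role": "user",
             "content": (f"Question: {question}\n"
                         f"Correct answer: {answer}\n"
                         f"Predicted answer: {prediction}")},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

Comparing the same predictions judged by two snapshots (e.g. `gpt-3.5-turbo-0613` vs `gpt-3.5-turbo-1106`) would show how much of the gap comes from the evaluator rather than the model.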

@Andy1621
Copy link
Collaborator

I have checked the history: we use gpt-3.5-turbo by default. Given when the testing was done, it likely resolved to gpt-3.5-turbo-1106, and the results are as follows:

completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for detailed orientation: 2.875250501002004
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for contextual understanding: 3.509018036072144
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for temporal understanding: 2.661322645290581
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for consistency: 2.8076152304609217
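
For reference, each average above is just the mean of the per-sample judge scores. A minimal sketch of how such a summary could be computed is below; the directory layout, file naming, and the "score" field are assumptions for illustration, not necessarily the repository's actual output format:

```python
import json
from pathlib import Path

def summarize(result_dir: str, metric: str) -> None:
    """Average per-sample judge scores stored as one JSON file per sample."""
    scores, incomplete = [], 0
    for path in sorted(Path(result_dir).glob("*.json")):
        record = json.loads(path.read_text())
        if "score" in record:                 # assumed field name
            scores.append(float(record["score"]))
        else:
            incomplete += 1                   # sample not yet judged
    print(f"completed_files: {len(scores)}")
    print(f"incomplete_files: {incomplete}")
    print("All evaluation completed!")
    print(f"Average score for {metric}: {sum(scores) / len(scores)}")

# e.g. summarize("results/correctness", "correctness")
```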
