Dear authors, your paper reports very impressive performance on the VideoChatGPT video benchmark, but I could not find any code for reproducing that experiment. I therefore modified the MVBench evaluation code to run inference on this task, but I cannot reproduce the reported numbers.
What I got is:
The model checkpoint should be correct, because I used the same checkpoint and reproduced the reported MVBench results.
So the only remaining difference should be the prompt used at test time, but I find it hard to believe that a prompt change alone could cause such a large gap.
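For concreteness, my adapted loop is roughly the sketch below. The JSON field names and the `chat_fn` callable are placeholders for the MVBench-style code I modified, not this repo's exact interface; the open question is what, if anything, should wrap the raw question before it reaches the model.

```python
# Sketch of the adapted inference loop (placeholder interfaces, not the repo's exact code).
# Assumed input: a VideoChatGPT-style JSON list of {"video_name", "question", "answer"} records.
import json
from typing import Callable

def run_inference(chat_fn: Callable[[str, str], str], qa_file: str, out_file: str) -> None:
    """chat_fn(video_path, prompt) -> answer stands in for the MVBench-style single-turn chat call."""
    with open(qa_file) as f:
        items = json.load(f)
    results = []
    for item in items:
        # The question is passed through bare, with no system prompt or answer-format
        # instruction. If the reported numbers depend on a specific prompt wrapper,
        # this is the line where it would be applied.
        pred = chat_fn(item["video_name"], item["question"])
        results.append({"question": item["question"],
                        "answer": item["answer"],
                        "pred": pred})
    with open(out_file, "w") as f:
        json.dump(results, f, indent=2)
```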
Could you please share the prompt used when testing this benchmark, or your inference results? Extraordinary claims require extraordinary evidence.
I have checked the history: we use gpt-3.5-turbo as the evaluation judge by default. Given when the tests were run, the version was most likely gpt-3.5-turbo-1106, and the results are as follows:
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for detailed orientation: 2.875250501002004
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for contextual understanding: 3.509018036072144
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for temporal understanding: 2.661322645290581
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for consistency: 2.8076152304609217
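For reference, the numbers above come from GPT-based judging of each prediction against the ground truth (gpt-3.5-turbo-1106, per the reply above). Below is a minimal sketch of such scoring; the judge prompt is a generic paraphrase and the file/field names are placeholders, not the exact evaluation script behind the reported numbers.

```python
# Minimal sketch of GPT-judge scoring (generic paraphrase, not the repo's exact eval script).
# Assumes OPENAI_API_KEY is set and predictions are a JSON list of
# {"question", "answer", "pred"} records.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM = (
    "You evaluate video question answering. Given a question, the correct answer, "
    "and a model prediction, rate the prediction's correctness as an integer from 0 to 5. "
    "Reply with only the number."
)

def score_one(question: str, answer: str, pred: str) -> float:
    """Ask the judge model (gpt-3.5-turbo-1106) for a single 0-5 score."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user",
             "content": f"Question: {question}\nCorrect answer: {answer}\nPrediction: {pred}"},
        ],
    )
    return float(resp.choices[0].message.content.strip())

def average_score(pred_file: str) -> float:
    """Score every record and return the mean, i.e. the 'Average score' lines above."""
    with open(pred_file) as f:
        records = json.load(f)
    scores = [score_one(r["question"], r["answer"], r["pred"]) for r in records]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print("Average score for correctness:", average_score("videochatgpt_preds.json"))
```

Because judge scores can drift across GPT versions and judge prompts, pinning the model name (and using temperature 0) matters when comparing numbers across runs.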