How are the metrics: acc, precision, recall of claim level calculated? #35

LeiyanGithub opened this issue Sep 15, 2023 · 4 comments


@LeiyanGithub

How are the claim-level metrics (accuracy, precision, recall, etc.) calculated? The claims extracted by the model will surely differ between runs, so it seems hard to compute these metrics without manual evaluation.
Is the metric calculation based on the ground truth annotated in the dataset? That is, the claims are given in advance and the correctness of each claim is already known (it corresponds to the label in the dataset); evidence is then collected for each given claim, the claim is verified, and the verdict is checked for consistency with the ground truth to perform the factual verification. I'm not sure whether my understanding is correct. Also, isn't this part of the code missing from the repo?
[screenshot attached]

@EthanC111
Collaborator

Hi @LeiyanGithub, thank you for your interest in our paper and for reaching out!

You've understood correctly. We used the dataset we constructed for our experiments. The claims in this dataset were extracted using ChatGPT (gpt-3.5-turbo), and each sample carries both claim-level and response-level annotations. To obtain the score for each metric (accuracy, recall, precision, and F1 score), we compare the labels predicted by Factool against the dataset's annotations.
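
For illustration, here is a minimal sketch of that comparison, assuming the gold annotations and Factool's predictions are available as parallel lists of boolean claim-level labels (the variable names and values below are made up, not from the repo):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical parallel lists: gold claim-level labels from the dataset
# annotations, and the labels predicted by Factool for the same claims.
gold_labels = [True, False, True, True]
pred_labels = [True, False, False, True]

print("accuracy :", accuracy_score(gold_labels, pred_labels))
print("precision:", precision_score(gold_labels, pred_labels))
print("recall   :", recall_score(gold_labels, pred_labels))
print("f1       :", f1_score(gold_labels, pred_labels))
```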

Let me know if you have more questions!

@LeiyanGithub
Author

Thank you so much for your patience and detailed explanation. I am not sure whether the metric evaluation code is present in the repo. Would it be possible to publish this part of the code?

@LeiyanGithub
Author

LeiyanGithub commented Sep 18, 2023

Why don't the results I reproduced on the KB-QA dataset match the paper?
After running the 233 claims in the dataset, all metrics at both the claim level and the response level are lower than those reported in the paper.

[screenshot of the reproduced results attached]

@EthanC111
Collaborator

Hi @LeiyanGithub, thanks for asking! I will upload our results and the metric evaluation code in the next few days. However, I want to apologize in advance: I am currently finalizing another project, so I might not be able to upload it right away.

In general, one of the main reasons the results might not match is that the APIs are constantly changing (both gpt-3.5-turbo and serper). It's also possible that the result returned by gpt-3.5-turbo is a null value, which could be due to issues like OpenAI's unstable API calls or rate limits. For this part, you might need to run it again to ensure it's not a null value. It can be a bit challenging to determine why the results are different based on a screenshot alone, but thank you for sharing it! If you could provide more details, I could offer more support.
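
As a rough illustration of the re-run idea, a small wrapper like the one below (a hypothetical helper, not code from the repo) retries the call whenever the API errors out or returns an empty result:

```python
import time
import openai  # pre-1.0 SDK, as used around the time of this thread


def chat_with_retry(messages, model="gpt-3.5-turbo", max_retries=3):
    """Hypothetical helper: retry the chat call on errors or empty responses."""
    for attempt in range(max_retries):
        try:
            resp = openai.ChatCompletion.create(model=model, messages=messages)
            content = resp["choices"][0]["message"]["content"]
            if content:  # guard against null/empty responses
                return content
        except openai.error.OpenAIError:
            pass  # e.g. rate limit or transient API error
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    return None
```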

I usually recommend using GPT-4 as the foundational model for KB-QA. As you can see in the paper, due to the limited reasoning capability of GPT-3.5, Factool powered by GPT-3.5 is not the best option. Factool powered by GPT-4 offers significantly better user experiences and is generally considered the default choice for Factool.
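
For reference, switching the foundation model is just a matter of passing the model name when constructing the instance. This is a sketch following the usage pattern shown in the repo README; please double-check the exact field names there:

```python
from factool import Factool

# Use GPT-4 as the foundation model (instead of gpt-3.5-turbo).
factool_instance = Factool("gpt-4")

# Input format assumed from the README: prompt, response, and task category.
inputs = [
    {
        "prompt": "Who proved Fermat's Last Theorem?",
        "response": "Fermat's Last Theorem was proved by Andrew Wiles.",
        "category": "kbqa",
    },
]

response_list = factool_instance.run(inputs)
print(response_list)
```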

Let me know if you have more questions!
