
eval generates answer same as dataset #16

Open
shaswati1 opened this issue Mar 19, 2024 · 9 comments

@shaswati1 commented Mar 19, 2024

I finetuned llama2 on the full dataset, ran gradient ascent on forget05, and then evaluated the unlearned model on forget05. Surprisingly, when I looked at the eval_log_forget.json file, all I could see was that the model generates the responses exactly as they appear in the dataset. For example,
Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?
Answer: The author's name is Hina Ameen.

Generated Answer: The author's name is Hina Ameen.

Also, the p-value is substantially low (7.82e-19).
Am I interpreting the evaluated results correctly?

@molereddy

In eval_log_forget.json, the "generated_text" key holds a list of 3-element lists. The 3rd element is just the original answer; the 2nd is the one generated by the unlearned model, and that is the one you should be looking at. I have not run grad_ascent, but I ran grad_diff and got the result below.
["Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?\n", "Answer and and and and and and and and and and and and and and and and and and the and and and and and and and and and the and and and and and and the and and and and and the and and and and and and the and and and and the and and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and and the and and and the and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and the and and the and and and the and and the and and the and and the and and the and and the and and the and and the and and the and", "The author's name is Hina Ameen." ],

@shaswati1 (Author) commented Mar 23, 2024

> In eval_log_forget.json, the "generated_text" key holds a list of 3-element lists. The 3rd element is just the original answer; the 2nd is the one generated by the unlearned model, and that is the one you should be looking at. […]

Did you get the above results on llama2? I am also looking at the 2nd element of that list, and it seems to be identical to the ground truth. Can you share your eval stats with me if possible?

@molereddy

I'm working with Phi-1.5.
I'm not sure what you mean by eval stat.
One disclaimer, though: my results are from before the recent refactor of the eval code. That refactor introduced a bunch of issues, so I reverted it; it's possible that made a difference.

@shaswati1 (Author)

> I'm working with Phi-1.5. I'm not sure what you mean by eval stat. One disclaimer, though: my results are from before the recent refactor of the eval code. […]

My generated answers look similar before and after the refactor, though the aggregate_stat is slightly different! By eval stat I mean the aggregate_stat, where you can see scores like forget quality and model utility.

@molereddy

I see. It would be good to understand what exactly the refactor changed. @zhilif?

@molereddy

@shaswati1 can you run grad_diff and check the generations? That will tell us whether the issue is with the method or with something else in your setup, since grad_diff definitely works for me (with Phi). You could also try Phi instead of Llama.

@zhilif (Collaborator) commented Mar 29, 2024

One thing we noticed is that llama2 results are not exactly reproducible when flash_attention is enabled.
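If flash attention is the source of the nondeterminism, one workaround is to load the model with the standard attention implementation instead; a minimal sketch using the transformers from_pretrained argument (the checkpoint path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Eager attention avoids the flash-attention kernels, which the comment
# above suggests can make llama2 results non-reproducible across runs.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/finetuned-llama2",   # placeholder: your checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # instead of "flash_attention_2"
)
```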

@zhilif (Collaborator) commented Mar 29, 2024

> I finetuned llama2 on the full dataset, ran gradient ascent on forget05, and then evaluated the unlearned model on forget05. […] Also, the p-value is substantially low (7.82e-19). Am I interpreting the evaluated results correctly?

How many steps did you train for? Also, is the p-value tested against the retain model? A small p-value means this model is very different from the retain model, which should be the case in your scenario.
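For context on the p-value: as I understand it, the TOFU forget-quality score is the p-value of a two-sample Kolmogorov-Smirnov test comparing the unlearned model's per-example truth ratios against the retain model's. A minimal sketch with placeholder data (in practice the arrays come from the two eval logs):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Placeholder data: in practice, load per-example truth ratios from the
# unlearned model's and the retain model's eval logs.
unlearned_truth_ratios = rng.uniform(0.0, 1.0, size=200)
retain_truth_ratios = rng.uniform(0.0, 1.0, size=200)

# A small p-value (like 7.82e-19) means the two distributions differ,
# i.e. the unlearned model does NOT behave like the retain model.
stat, p_value = ks_2samp(unlearned_truth_ratios, retain_truth_ratios)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.3e}")
```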
