# `metrics` is assumed to be the list of per-example results returned by the edit call.
print("Reliability: ", sum(i["post"]["rewrite_acc"][0] for i in metrics) / len(metrics))
print("Generalization: ", sum(i["post"]["rephrase_acc"][0] for i in metrics) / len(metrics))
print("Locality: ", sum(i["post"]["locality"]["neighborhood_acc"][0] for i in metrics) / len(metrics))
print("Portability: ", sum(i["post"]["portability"]["one_hop_acc"][0] for i in metrics) / len(metrics))
Thank you for taking the time to reproduce the reliability, generalization, and locality results. EasyEdit's metrics have gone through several iterations, which may explain the inconsistency in the portability measurement. Please give me some time; I will evaluate the corresponding metrics and get back to you soon. 🥹
Thank you for your feedback. I reviewed the experimental results and found that the performance on GPT-J is still below 5% (FT). You can refer to the paper for details: https://arxiv.org/abs/2305.13172
Using Llama, I reproduced your results. I think the cause is a special token in the Llama tokenizer that the original version did not handle specially, which led to poor results. There is no need to be confused; you can rely on the metrics you have now reproduced. I will repeat the experiment on Llama and update the README and the EasyEdit paper in time.
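Hello, I ran into a problem while reproducing the metrics you report for EasyEdit on Llama-2-7B. The figure below shows the test results given in your README:
I tried to reproduce the FT results, using the command:
The contents of llama-7b.yaml are:
The model comes from https://huggingface.co/meta-llama/Llama-2-7b-hf
The test dataset is the first 100 entries of zsre_mend_eval_portability_gpt4.json.
I added four lines of code to compute the final results:
The Reliability, Generalization, and Locality results are close to yours, but Portability reaches 54.05.
I also tried passing summary_metrics=True to the edit function to invoke your built-in metric summary, and the reported portability one-hop-acc is the same value.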
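For anyone who wants to cross-check the averages by hand, here is a minimal sketch of the same aggregation written as a reusable helper. It assumes each entry in the per-example results is a nested dict with the key paths used in the four lines quoted in this thread; the names `get_path`, `METRIC_PATHS`, and `summarize` are hypothetical, and this is not EasyEdit's own `summary_metrics` implementation.

```python
from statistics import mean

# Metric name -> key path into one per-example result dict.
# The paths mirror the four print statements in this thread; they are an
# assumption about the result layout, not a documented schema.
METRIC_PATHS = {
    "Reliability": ("post", "rewrite_acc"),
    "Generalization": ("post", "rephrase_acc"),
    "Locality": ("post", "locality", "neighborhood_acc"),
    "Portability": ("post", "portability", "one_hop_acc"),
}


def get_path(record, path):
    """Follow a sequence of keys into a nested dict and return the leaf."""
    for key in path:
        record = record[key]
    return record


def summarize(metrics):
    """Average the first value of each metric across all examples."""
    summary = {}
    for name, path in METRIC_PATHS.items():
        # Each leaf is a list; take element 0, as the original four lines do.
        values = [get_path(m, path)[0] for m in metrics]
        summary[name] = mean(values)
    return summary


if __name__ == "__main__":
    # Toy data for illustration only; these are not real experimental results.
    toy_metrics = [
        {"post": {"rewrite_acc": [1.0], "rephrase_acc": [0.5],
                  "locality": {"neighborhood_acc": [1.0]},
                  "portability": {"one_hop_acc": [0.5]}}},
        {"post": {"rewrite_acc": [0.5], "rephrase_acc": [0.5],
                  "locality": {"neighborhood_acc": [0.5]},
                  "portability": {"one_hop_acc": [1.0]}}},
    ]
    for name, value in summarize(toy_metrics).items():
        print(f"{name}: {value:.4f}")
```

Keeping the key paths in one table makes it easy to compare a hand-computed average against the value reported by the library when the two are suspected to differ.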
I also tested MEND and found that Portability is likewise much higher, at 53.75.
I am confused. Is my testing method wrong?