
Reproducing test results (LLaMA Portability bug fixed; results pending correction) #227

Open
zhai-yx opened this issue Apr 12, 2024 · 3 comments
Assignees
Labels
question Further information is requested

zhai-yx commented Apr 12, 2024

Hello, I've run into a problem while trying to reproduce your reported metrics for EasyEdit on LLaMA-2-7B. Below are the test results given in your README:

(screenshot of the README results table)

I tried to reproduce the FT result there, using the command

python run_zsre_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./data

The contents of llama-7b.yaml are:

  alg_name: "FT"
  model_name: "../hugging_cache/llama-2-7b"
  device: 0
  layers: [21]
  num_steps: 25
  batch_size: 1
  max_length: 40
  lr: 5e-4
  weight_decay: 0
  kl_factor: 0
  norm_constraint: false
  objective_optimization: "prompt_last"
  rewrite_module_tmp: "model.layers.{}.mlp.down_proj.weight"
  layer_module_tmp: "model.layers.{}"
  mlp_module_tmp: "model.layers.{}.mlp"
  attn_module_tmp: "model.layers.{}.self_attn"
  ln_f_module: "model.norm"
  lm_head_module: "lm_head"
  model_parallel: false

The model is from https://huggingface.co/meta-llama/Llama-2-7b-hf.
The test set is the first 100 entries of zsre_mend_eval_portability_gpt4.json.
I added four lines of code to report the final results:

# Average each per-example metric over the whole run
# (each per-example value is a one-element list)
print("Reliability: ", sum([i["post"]["rewrite_acc"][0] for i in metrics])/len(metrics))
print("Generalization: ", sum([i["post"]["rephrase_acc"][0] for i in metrics])/len(metrics))
print("Locality: ", sum([i["post"]["locality"]["neighborhood_acc"][0] for i in metrics])/len(metrics))
print("Portability: ", sum([i["post"]["portability"]["one_hop_acc"][0] for i in metrics])/len(metrics))
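For anyone else reproducing this, the four lines above assume `metrics` is a list of per-example dicts with the nested shape below. This is a minimal self-contained illustration with made-up values, not EasyEdit output:

```python
# Minimal illustration of the metrics structure the print statements
# above rely on. All numbers here are fabricated for the example.
metrics = [
    {"post": {
        "rewrite_acc": [1.0],
        "rephrase_acc": [0.5],
        "locality": {"neighborhood_acc": [1.0]},
        "portability": {"one_hop_acc": [0.0]},
    }},
    {"post": {
        "rewrite_acc": [0.0],
        "rephrase_acc": [1.0],
        "locality": {"neighborhood_acc": [0.5]},
        "portability": {"one_hop_acc": [1.0]},
    }},
]

# Same averaging as above: take the first (only) element of each
# per-example list, then average over all edited examples.
reliability = sum(m["post"]["rewrite_acc"][0] for m in metrics) / len(metrics)
portability = sum(m["post"]["portability"]["one_hop_acc"][0] for m in metrics) / len(metrics)
print("Reliability:", reliability)  # 0.5
print("Portability:", portability)  # 0.5
```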

The results for Reliability, Generalization, and Locality are close to yours, but Portability comes out at 54.05.
I also tried passing summary_metrics=True to the edit function to use your built-in metric summarization, and the one_hop_acc it reports for portability is the same value.

I also tested MEND and found its Portability is likewise far higher than reported, at 53.75.
I'm quite confused. Is my testing method wrong?

pengzju (Collaborator) commented Apr 13, 2024

Thank you for taking the time to reproduce the reliability, generalization, and locality results. EasyEdit's metrics have gone through several iterations, which may be the reason for the inconsistency in measuring portability. Please give me some time; I will evaluate the corresponding metrics and get back to you promptly. 🥹

@pengzju pengzju added the question Further information is requested label Apr 13, 2024
@pengzju pengzju self-assigned this Apr 13, 2024
pengzju (Collaborator) commented Apr 13, 2024

EasyEdit appreciates your feedback. I reviewed the experimental results and found that the FT portability on GPT-J is still below 5%. You can refer to the paper for this: https://arxiv.org/abs/2305.13172

Using LLaMA, I reproduced your results. I believe the cause is a special token in LLaMA's tokenizer: the original version did no special handling for it, which corrupts the metric. No need to be confused; you can refer to the currently reproduced metrics. I will rerun the experiments on LLaMA and update the README and the EasyEdit paper in due course.
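To make the failure mode concrete: here is a minimal sketch (not EasyEdit's actual code, with entirely hypothetical token ids and a fake tokenizer) of how an unhandled special token such as LLaMA's BOS `<s>` can misalign a token-level accuracy comparison between a model's continuation and the tokenized target:

```python
# Sketch only: a stand-in tokenizer with fabricated ids, illustrating
# how a default-prepended BOS token shifts every position in the target.
BOS = 1  # LLaMA's <s> token id

def tokenize(text, add_bos=True):
    # Fake tokenizer: one id per character; like LLaMA's real tokenizer,
    # it prepends <s> unless told not to.
    ids = [ord(c) for c in text]
    return ([BOS] + ids) if add_bos else ids

def token_accuracy(pred_ids, target_ids):
    # Per-token exact-match accuracy over the target span.
    correct = sum(p == t for p, t in zip(pred_ids, target_ids))
    return correct / max(len(target_ids), 1)

answer = "Paris"
# The model's (perfect) continuation contains no special tokens:
pred = tokenize(answer, add_bos=False)

# Naive target: tokenized with the default BOS, so every position shifts.
naive_target = tokenize(answer)
# Fixed target: strip the special token before comparing.
fixed_target = [t for t in naive_target if t != BOS]

print(token_accuracy(pred, naive_target))  # 0.0 despite a perfect answer
print(token_accuracy(pred, fixed_target))  # 1.0
```

Depending on how a metric truncates, pads, and averages, this kind of off-by-one misalignment can drag a score down or push it up; stripping special tokens before the comparison removes the artifact either way.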

Apologies for our oversight. 😔

zhai-yx (Author) commented Apr 13, 2024

Okay, thank you so much!😀

@zxlzr changed the title from "Reproducing test results" to "Reproducing test results (LLaMA Portability bug fixed; results pending correction)" Apr 15, 2024