Reproducing test results (LLaMA Portability bug fixed, results pending correction) #227

zhai-yx · 2024-04-12T08:42:06Z

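Hello, I ran into a problem while reproducing your reported EasyEdit metrics on LLaMA-2-7B. The figure below shows the test results given in your README.

I tried to reproduce the FT result, using the command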

python run_zsre_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./data

The contents of llama-7b.yaml are

  alg_name: "FT"
  model_name: "../hugging_cache/llama-2-7b"
  device: 0
  layers: [21]
  num_steps: 25
  batch_size: 1
  max_length: 40
  lr: 5e-4
  weight_decay: 0
  kl_factor: 0
  norm_constraint: false
  objective_optimization: "prompt_last"
  rewrite_module_tmp: "model.layers.{}.mlp.down_proj.weight"
  layer_module_tmp: "model.layers.{}"
  mlp_module_tmp: "model.layers.{}.mlp"
  attn_module_tmp: "model.layers.{}.self_attn"
  ln_f_module: "model.norm"
  lm_head_module: "lm_head"
  model_parallel: false

The model comes from https://huggingface.co/meta-llama/Llama-2-7b-hf
The test dataset is the first 100 items of zsre_mend_eval_portability_gpt4.json
I added four lines of code to compute the final results:

print("Reliability: ", sum([i["post"]["rewrite_acc"][0] for i in metrics])/len(metrics))
print("Generalization: ", sum([i["post"]["rephrase_acc"][0] for i in metrics])/len(metrics))
print("Locality: ", sum([i["post"]["locality"]["neighborhood_acc"][0] for i in metrics])/len(metrics))
print("Portability: ", sum([i["post"]["portability"]["one_hop_acc"][0] for i in metrics])/len(metrics))
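The four averages above can also be computed with one small helper. This is a minimal sketch, assuming `metrics` is a list of dicts shaped the way the print statements above index it; the helper name `mean_metric` and the toy records are hypothetical and not part of EasyEdit.

```python
from statistics import mean


def mean_metric(metrics, *path):
    """Average the first value stored under record["post"][path...] over all records."""
    values = []
    for record in metrics:
        node = record["post"]
        for key in path:
            node = node[key]
        values.append(node[0])  # same [0] indexing as the print statements above
    return mean(values)


# Hypothetical toy records, only to show the expected shape.
metrics = [
    {"post": {"rewrite_acc": [1.0], "rephrase_acc": [0.5],
              "locality": {"neighborhood_acc": [1.0]},
              "portability": {"one_hop_acc": [0.5]}}},
    {"post": {"rewrite_acc": [0.5], "rephrase_acc": [0.5],
              "locality": {"neighborhood_acc": [0.0]},
              "portability": {"one_hop_acc": [0.25]}}},
]

print("Reliability: ", mean_metric(metrics, "rewrite_acc"))
print("Generalization: ", mean_metric(metrics, "rephrase_acc"))
print("Locality: ", mean_metric(metrics, "locality", "neighborhood_acc"))
print("Portability: ", mean_metric(metrics, "portability", "one_hop_acc"))
```

Walking a key path keeps the nested and flat metrics on the same code path, so a typo in one key raises a `KeyError` instead of silently averaging the wrong field.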

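The Reliability, Generalization, and Locality results are close to yours, but Portability reached 54.05.
I also tried passing summary_metrics=True to the edit function to invoke your built-in metric summary, and the one-hop-acc reported for portability was the same value.

I also tested MEND and found that its Portability was likewise much higher, at 53.75.
I am confused: is my testing method wrong?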


pengzju · 2024-04-13T01:51:46Z

Thank you for taking the time to reproduce the reliability, generalization, and locality results. EasyEdit's metrics have gone through several iterations, which may explain the inconsistency in the portability measurement. Please give me some time; I will re-evaluate the corresponding metrics and get back to you promptly. 🥹

pengzju · 2024-04-13T04:37:45Z

EasyEdit appreciates your feedback. I reviewed the experimental results and found that the performance on GPT-J is still below 5% (FT). You can refer to the paper for details: https://arxiv.org/abs/2305.13172

Using LLaMA, I reproduced your results. I think the reason is that the LLaMA tokenizer has a special token, and the original version did not handle it specially, which led to the poor results. There is no need to be confused; you can rely on the metrics you have just reproduced. I will repeat the experiment on LLaMA and update the README and the EasyEdit paper promptly.
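The thread does not say which special token was involved or how the evaluation was changed. As a toy illustration only, the sketch below shows one way an unhandled leading token can depress a token-level accuracy score. It assumes a beginning-of-sequence marker that is prepended to the target but absent from the prediction; `BOS_ID`, `VOCAB`, `toy_encode`, `strip_bos`, and `token_accuracy` are all hypothetical and do not reflect EasyEdit's actual code.

```python
BOS_ID = 1  # hypothetical id for a beginning-of-sequence marker
VOCAB = {"New": 10, "York": 11, "City": 12}  # hypothetical toy vocabulary


def toy_encode(text, add_bos):
    """Stand-in for a tokenizer: one id per word, optionally prefixed with BOS."""
    ids = [VOCAB[word] for word in text.split()]
    return [BOS_ID] + ids if add_bos else ids


def strip_bos(ids):
    """Drop a leading BOS id if present."""
    return ids[1:] if ids and ids[0] == BOS_ID else ids


def token_accuracy(pred_ids, target_ids):
    """Fraction of target positions whose id matches the prediction at that position."""
    hits = sum(p == t for p, t in zip(pred_ids, target_ids))
    return hits / len(target_ids)


target = toy_encode("New York City", add_bos=True)      # BOS followed by three word ids
predicted = toy_encode("New York City", add_bos=False)  # the same three word ids, no BOS

naive = token_accuracy(predicted, target)             # positions are shifted by one
fixed = token_accuracy(predicted, strip_bos(target))  # aligned after dropping BOS

print("naive:", naive)
print("fixed:", fixed)
```

In this toy case the prediction is entirely correct, yet the naive comparison scores it as wrong at every position because the extra leading id shifts the alignment.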

We apologize for our oversight. 😔

zhai-yx · 2024-04-13T06:33:05Z

Okay, thank you so much!😀

pengzju added the question Further information is requested label Apr 13, 2024

pengzju self-assigned this Apr 13, 2024

zxlzr changed the title ~~Reproducing test results~~ Reproducing test results (LLaMA Portability bug fixed, results pending correction) Apr 15, 2024
