
Using AutoModel instead of build_transformer_model: its parameters are not updated during training #140

Open
ZayIsAllYouNeed opened this issue Jul 2, 2023 · 14 comments


@ZayIsAllYouNeed

After replacing build_transformer_model(config_path, checkpoint_path) with AutoModel.from_pretrained as the backbone, I found that training no longer updates the backbone's parameters (even though requires_grad=True), while the additional linear layers are still updated normally.
Could you point me to where the problem might be?
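One quick way to narrow this down is to check, right after a backward pass, whether the backbone's parameters actually receive gradients. A minimal diagnostic sketch (plain PyTorch; the train_model and loss names are assumptions, not from this thread):

import torch

def check_gradient_flow(model: torch.nn.Module) -> None:
    # For each parameter, report whether it requires grad and whether the
    # last backward() populated a gradient for it.
    for name, param in model.named_parameters():
        grad = None if param.grad is None else param.grad.norm().item()
        print(f"{name}: requires_grad={param.requires_grad}, grad_norm={grad}")

# Assumed usage inside the training step: call after loss.backward() and
# before optimizer.step(). A parameter whose grad stays None is cut off
# from the loss and can never be updated, whatever requires_grad says.
# loss.backward()
# check_gradient_flow(train_model)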

@ZayIsAllYouNeed
Author

Also, after using convert_deberta_v2.py to rename the pretrained weights and loading them with build_transformer_model, I found after training that train_model.bert.embeddings.word_embeddings.weight was *not* updated, while other layers were (e.g. train_model.bert.encoderLayer[0].multiHeadAttention.o.weight).

@Tongjilibo
Owner

I just printed the weights' sum() every few steps; judging from the output they do change, just by a smaller margin than the other layers.

2023-07-02 21:48:03 - Start Training
2023-07-02 21:48:03 - Epoch: 1/10
   9/1129 [..............................] - ETA: 8:15 - loss: 0.7241 - accuracy: 0.5417 [embedding]:  -11801.388671875  [o.weight]:  7.179624557495117
  19/1129 [..............................] - ETA: 5:35 - loss: 0.6300 - accuracy: 0.6184 [embedding]:  -11801.509765625  [o.weight]:  7.172887325286865
  29/1129 [..............................] - ETA: 4:48 - loss: 0.6178 - accuracy: 0.6422 [embedding]:  -11801.685546875  [o.weight]:  7.165395736694336
  39/1129 [>.............................] - ETA: 4:28 - loss: 0.5923 - accuracy: 0.6603 [embedding]:  -11801.83203125  [o.weight]:  7.176580429077148
  49/1129 [>.............................] - ETA: 4:15 - loss: 0.5714 - accuracy: 0.6862 [embedding]:  -11801.955078125  [o.weight]:  7.194809436798096
  59/1129 [>.............................] - ETA: 4:03 - loss: 0.5740 - accuracy: 0.6833 [embedding]:  -11801.974609375  [o.weight]:  7.193059921264648
  69/1129 [>.............................] - ETA: 3:57 - loss: 0.5531 - accuracy: 0.7029 [embedding]:  -11801.9990234375  [o.weight]:  7.181048393249512
  79/1129 [=>............................] - ETA: 3:50 - loss: 0.5431 - accuracy: 0.7144 [embedding]:  -11801.96484375  [o.weight]:  7.179488182067871
  89/1129 [=>............................] - ETA: 3:44 - loss: 0.5396 - accuracy: 0.7191 [embedding]:  -11801.923828125  [o.weight]:  7.181138038635254
  99/1129 [=>............................] - ETA: 3:39 - loss: 0.5331 - accuracy: 0.7216 [embedding]:  -11801.8671875  [o.weight]:  7.16298246383667
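For reference, a minimal sketch of the check behind the log above (the module paths follow the bert4torch example and are assumptions):

def print_weight_sums(model):
    # Summing a large tensor makes even small per-element drift visible.
    emb_sum = model.bert.embeddings.word_embeddings.weight.sum().item()
    o_sum = model.bert.encoderLayer[0].multiHeadAttention.o.weight.sum().item()
    print('[embedding]: ', emb_sum, ' [o.weight]: ', o_sum)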

@ZayIsAllYouNeed
Author

Yes, the loss does decrease, but only part of the model's parameters get updated.
When AutoModel.from_pretrained replaces build_transformer_model, the resulting backbone (i.e. self.deberta) is not updated, while the added linear layers are.
When using build_transformer_model, deberta.embeddings.word_embeddings.weight is not updated, while the other attention layers are.

@ZayIsAllYouNeed
Author

I'm loading Erlangshen-DeBERTa-v2-97M-Chinese.

@ZayIsAllYouNeed
Author

Are you using build_transformer_model here?

@Tongjilibo
Owner

I just looked at this example, and the weights I print out do change slightly. Could you try it directly with huggingface and see what happens there?

@Tongjilibo
Owner

Try this modification and see whether the printed values change:

class Evaluator(Callback):
    """Evaluate and save."""
    def __init__(self):
        super().__init__()
        self.best_val_acc = 0.

    def on_batch_begin(self, global_step, local_step, logs=None):
        # Every 50 steps, print a fixed 4x4 slice of the word-embedding
        # weights so any drift is directly visible.
        if (global_step + 1) % 50 == 0:
            print('[embedding]: ', model.bert.embeddings.word_embeddings.weight[:4, :4].detach())

@ZayIsAllYouNeed
Author

Sorry, I figured out why the embedding looked unchanged on my side:
I had only been watching the vectors of a few rare tokens, and those tokens never appear in the training corpus, so their vectors are never updated, while the vectors of tokens that do appear in the corpus are updated.
Apologies for the confusion~

@ZayIsAllYouNeed
Author

As for the earlier case where AutoModel.from_pretrained replaced build_transformer_model, the attention layer weights showed no change before and after training.

@Tongjilibo
Owner

Right, a token has to appear in the corpus for its row in the embedding weights to be updated.
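This is easy to verify in isolation; a self-contained sketch (plain PyTorch, not tied to this repo) showing that only the embedding rows of token ids present in a batch receive gradients:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

before = emb.weight.detach().clone()
ids = torch.tensor([1, 2, 3])   # ids 0 and 4..9 never appear in the "corpus"
emb(ids).sum().backward()       # gradient is nonzero only for rows 1-3
opt.step()

# Rows 1-3 change; every unseen-token row stays bitwise identical.
changed = (emb.weight.detach() - before).abs().sum(dim=1) > 0
print(changed)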

@ZayIsAllYouNeed
Author

After replacing build_transformer_model(config_path, checkpoint_path) with AutoModel.from_pretrained as the backbone, training does not update the backbone's parameters (requires_grad=True). Could you help me figure this out?
Below are the loss and the parameter values:

bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
        [-0.0226,  0.0237,  0.0011,  0.0200],
        [ 0.0050,  0.0198, -0.0224,  0.0068],
        [ 0.0352, -0.0158, -0.0098,  0.0337]], device='cuda:7')
10/31 [========>.....................] - ETA: 8s - loss: 0.5647 - subject_loss: 0.1725 - object_loss: 0.3922
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
        [-0.0226,  0.0237,  0.0011,  0.0200],
        [ 0.0050,  0.0198, -0.0224,  0.0068],
        [ 0.0352, -0.0158, -0.0098,  0.0337]], device='cuda:7')
20/31 [==================>...........] - ETA: 4s - loss: 0.4588 - subject_loss: 0.1573 - object_loss: 0.3015
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
        [-0.0226,  0.0237,  0.0011,  0.0200],
        [ 0.0050,  0.0198, -0.0224,  0.0068],
        [ 0.0352, -0.0158, -0.0098,  0.0337]], device='cuda:7')
30/31 [============================>.] - ETA: 0s - loss: 0.4164 - subject_loss: 0.1510 - object_loss: 0.2654
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
        [-0.0226,  0.0237,  0.0011,  0.0200],
        [ 0.0050,  0.0198, -0.0224,  0.0068],
        [ 0.0352, -0.0158, -0.0098,  0.0337]], device='cuda:7')
31/31 [==============================] - 11s 366ms/step - loss: 0.4136 - subject_loss: 0.1505 - object_loss: 0.2631

@Tongjilibo
Owner

The loss is decreasing, so some parameters must be updating. You could try recording the weight sums of every parameter layer to see which layers change and which don't.
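A hedged sketch of that suggestion (train_model is an assumed name): snapshot the per-layer weight sums before training and diff them afterwards.

import torch

def param_sums(model: torch.nn.Module) -> dict:
    return {name: p.detach().sum().item() for name, p in model.named_parameters()}

before = param_sums(train_model)
# ... run training ...
after = param_sums(train_model)
for name, s in before.items():
    # A sum can stay equal by coincidence, but this is a cheap first-pass
    # filter for layers that never moved at all.
    if abs(after[name] - s) < 1e-8:
        print('unchanged:', name)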

@Tongjilibo
Owner

I don't think the framework matters here; whether you train with bert4torch or the HF Trainer shouldn't be the cause of this problem.

@ZayIsAllYouNeed
Author

I'm using the CasRel code, and only the parameters outside self.bert, such as self.linear1, get updated.
Then I swapped the loaded model for a BERT model and it updated normally; the earlier deberta v2 did not, and its final results were very poor:

class Model(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        # Backbone: the bert4torch loader (commented out) vs. the HF AutoModel.
        # self.bert = build_transformer_model(config_path, checkpoint_path, model='deberta_v2')
        self.bert = AutoModel.from_pretrained("../../data/bert/Erlangshen-DeBERTa-v2-97M-Chinese")
        # CasRel heads: subject start/end scores, conditional LayerNorm on the
        # subject representation, and per-predicate object start/end scores.
        self.linear1 = nn.Linear(768, 2)
        self.condLayerNorm = LayerNorm(hidden_size=768, conditional_size=768 * 2)
        self.LayerNorm = LayerNorm(hidden_size=768)
        self.linear2 = nn.Linear(768, len(predicate2id) * 2)

Below is the printed output when loading bert, which looks normal:

bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0147, -0.0067, -0.0006, -0.0297],
        [ 0.0141, -0.0764, -0.1015, -0.0069],
        [-0.0212,  0.0386, -0.0464, -0.0098],
        [ 0.0502,  0.0950, -0.0278, -0.0396]], device='cuda:7')
10/31 [========>.....................] - ETA: 15s - loss: 0.6156 - subject_loss: 0.1724 - object_loss: 0.4431
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0148, -0.0065, -0.0003, -0.0295],
        [ 0.0140, -0.0765, -0.1016, -0.0070],
        [-0.0205,  0.0393, -0.0459, -0.0091],
        [ 0.0505,  0.0953, -0.0279, -0.0393]], device='cuda:7')
20/31 [==================>...........] - ETA: 5s - loss: 0.4846 - subject_loss: 0.1588 - object_loss: 0.3258
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0149, -0.0064, -0.0002, -0.0294],
        [ 0.0141, -0.0764, -0.1016, -0.0069],
        [-0.0203,  0.0395, -0.0458, -0.0089],
        [ 0.0506,  0.0953, -0.0279, -0.0392]], device='cuda:7')
30/31 [============================>.] - ETA: 0s - loss: 0.4405 - subject_loss: 0.1542 - object_loss: 0.2863
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0150, -0.0064, -0.0001, -0.0294],
        [ 0.0141, -0.0764, -0.1016, -0.0069],
        [-0.0202,  0.0397, -0.0458, -0.0088],
        [ 0.0506,  0.0953, -0.0279, -0.0392]], device='cuda:7')
31/31 [==============================] - 13s 430ms/step - loss: 0.4373 - subject_loss: 0.1535 - object_loss: 0.2837
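One difference worth double-checking when swapping the backbone: a transformers AutoModel returns a ModelOutput object, so the CasRel forward has to read the hidden states from .last_hidden_state (equivalently output[0]) before feeding the heads. A minimal sketch of that wiring, with the forward signature and mask handling as assumptions rather than the repo's actual code:

def forward(self, token_ids, attention_mask=None):
    # HF models return a ModelOutput; the token-level hidden states are in
    # .last_hidden_state (same as output[0]), shape [btz, seq_len, 768].
    hidden = self.bert(input_ids=token_ids, attention_mask=attention_mask).last_hidden_state
    subject_logits = self.linear1(hidden)  # subject start/end scores
    return subject_logits

If the tensor handed to the heads were detached anywhere along this path (e.g. computed under torch.no_grad()), the backbone would receive no gradients, which would match the frozen behavior described above.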
