NaN gradients when applying ReZero to BERT or GPT #8

Open
yyht opened this issue Mar 28, 2020 · 5 comments

yyht commented Mar 28, 2020

Hi, nice work. When I apply it to a shallower BERT or GPT, I often get NaN gradients right after initialization (even for deeper architectures).

@calclavia (Collaborator)

@yyht A few questions:

  • Did you initialize \alpha to zero?
  • How did you initialize the embedding matrix? We found that GPT2's embedding initialization doesn't work very well.

yyht (Author) commented Mar 30, 2020

  1. I initialized \alpha to zero.
  2. The initialization follows the official BERT initialization; the embedding matrix and the kernel matrices are initialized via:

         import tensorflow as tf

         def create_initializer(initializer_range=0.02):
             """Creates a `truncated_normal_initializer` with the given range."""
             return tf.truncated_normal_initializer(stddev=initializer_range)

@calclavia (Collaborator)

Try initializing the embedding matrix with a uniform distribution drawn from ±1/d (i.e. uniform on [-1/d, +1/d], where d is the model dimension).
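
In case it's useful, here is a minimal sketch of that suggestion in the same TF1 style as the snippet above. The function name `create_embedding_initializer` and the example sizes are placeholders for illustration, not code from this repo:

    import tensorflow as tf

    def create_embedding_initializer(hidden_size):
        """Uniform initializer on [-1/d, +1/d], where d is the hidden size."""
        bound = 1.0 / hidden_size
        return tf.random_uniform_initializer(minval=-bound, maxval=bound)

    # Placeholder sizes, for illustration only.
    vocab_size, hidden_size = 30522, 768
    embedding_table = tf.get_variable(
        "word_embeddings",
        shape=[vocab_size, hidden_size],
        initializer=create_embedding_initializer(hidden_size))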

sooheon commented Aug 20, 2020

@calclavia can you give a little more insight into the reasoning behind this embedding init recommendation? Curious whether it's motivated by empirical performance or by some theoretical justification.

calclavia (Collaborator) commented Aug 21, 2020

@sooheon It depends on the particular implementation of your Transformer. Some implementations (e.g. Huggingface) scale the embedding by 1/d before passing it into the higher layers, while initializing the embedding with a uniform distribution on (-1, +1). That is effectively the same as initializing it on ±1/d.

The reasoning for this initialization has less to do with our paper; we simply follow what previous work has recommended. I believe the Attention Is All You Need paper recommended 1/d scaling for the attention softmax (when d is large). By scaling by 1/d, the gradients of the softmax layer are better behaved.

The same principle applies to the output softmax when predicting output vocabularies. When ReZero initializes the Transformer layers to zero, the model essentially starts off as a pass-through from the input embedding directly to the output embedding. The 1/d initialization keeps those gradients well behaved.
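
For concreteness, here is a minimal sketch of that pass-through behavior, again in TF1 style; the helper name `rezero_residual` is illustrative and not taken from the repo:

    import tensorflow as tf

    def rezero_residual(x, sublayer_out, name="rezero"):
        """ReZero residual: y = x + alpha * sublayer(x), with a learned scalar
        alpha initialized to zero. At initialization each layer returns x
        unchanged, so the network starts as a pass-through from the input
        embedding to the output softmax."""
        with tf.variable_scope(name):
            alpha = tf.get_variable(
                "alpha", shape=[], initializer=tf.zeros_initializer())
        return x + alpha * sublayer_out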
