add Kaiming He initialization, fixed Xavier initialization #311

Open · wants to merge 4 commits into master
Conversation

@alper111 commented Jun 6, 2018

Xavier initialization is x ~ U(-sqrt(6.0 / (fan_in + fan_out)), +sqrt(6.0 / (fan_in + fan_out))),
or x ~ N(mean = 0, std = sqrt(2.0 / (fan_in + fan_out))).

Kaiming initialization is x ~ U(-sqrt(3.0 / fan_in), +sqrt(3.0 / fan_in)),
or x ~ N(mean = 0, std = sqrt(1.0 / fan_in)).
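For concreteness, a minimal Julia sketch of the two uniform variants exactly as stated above (the *_sketch function names and the rand-based sampling are illustrative assumptions, not Knet's API):

# Draw weights from U(-bound, +bound) with the bounds stated above.
function xavier_uniform_sketch(fanin, fanout)
    bound = sqrt(6.0 / (fanin + fanout))
    return rand(fanout, fanin) .* (2 * bound) .- bound
end

function kaiming_uniform_sketch(fanin, fanout)
    bound = sqrt(3.0 / fanin)
    return rand(fanout, fanin) .* (2 * bound) .- bound
end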

@@ -37,7 +37,7 @@ function xavier(a...)
         fanout = size(w, ndims(w))
         fanin = div(length(w), fanout)
     end
-    s = convert(eltype(w), sqrt(2 / (fanin + fanout)))
+    s = convert(eltype(w), sqrt(6 / (fanin + fanout)))
Collaborator
I think our version is specialized for conv layers with relu activation. The part you changed is called the gain. You may want to update your PR to allow the xavier function to accept a gain parameter, and its default value can be 6.

Author
To be honest, I barely know the theoretical background. I guess you are referring to the "Delving Deep into Rectifiers" paper when you say it is specialized for conv layers with ReLU activation. The paper states that n_l * Var(w_l) = 2 should hold, where n_l is the average number of units per layer. You can check that:
x = xavier(200, 300)
(200 + 300) / 2 * var(x) ~= 0.33, whereas this value should be 1.0 for Xavier and 2.0 for ReLU activations. I also compared xavier with TensorFlow's equivalent initializers: the variance of TF's xavier is consistently ~3 times that of ours, and the variance of TF's kaiming (the ReLU-specialized variant) is ~6 times ours.
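A hedged reproduction of that check, assuming Knet's exported xavier and the standard-library Statistics package:

using Knet, Statistics
x = xavier(200, 300)
(200 + 300) / 2 * var(x)   # ~0.33 with the current sqrt(2 / (fanin + fanout)) scale; Glorot's analysis expects 1.0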

As for your suggestion, I am very new to Julia and I could not find a way to change the arguments while staying compatible with pre-existing models. However, there could be another initializer that takes both gain and n as arguments (as in TF).

Collaborator
You can use keyword arguments for options.

xavier(a...; gain = 6)
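A hedged sketch of what that could look like; the fan computation mirrors the diff above, while the 1-D branch and the final rescaling line are assumptions about the rest of the function body:

function xavier(a...; gain = 6)
    w = rand(a...)                       # assumed: start from U(0, 1)
    if ndims(w) == 1                     # assumed handling of the 1-D (bias) case
        fanout = 1
        fanin = length(w)
    else
        fanout = size(w, ndims(w))
        fanin = div(length(w), fanout)
    end
    s = convert(eltype(w), sqrt(gain / (fanin + fanout)))
    return 2s .* w .- s                  # rescale to U(-s, +s)
end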

@alper111 (Author) commented Dec 6, 2018

In the original version of xavier, the variance is calculated correctly: it is the variance the weights should have so that the variance of the activations (and of the gradients) remains the same across layers. This analysis also assumes linear activations. However, the calculated variance is that of a Gaussian distribution, whereas in the original xavier the weights are drawn from a uniform distribution. To scale the uniform bound correctly, we should multiply it by sqrt(3); this is purely a consequence of drawing from a uniform distribution and has nothing to do with activation functions.

If we also want to take activation functions into account, we can change the default gain value (which is 1).
Gain values for different activation functions: https://pytorch.org/docs/stable/_modules/torch/nn/init.html

I changed the default Kaiming gain value to sqrt(2) (the value for ReLU units), since that is how it is done in the original description. With these definitions, the Xavier and Kaiming initializations give the same variances as in PyTorch.
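A small sketch of the convention being described, following the PyTorch-style gain parameterization (illustrative helper names, nothing Knet-specific):

# Target standard deviation from the fan analysis, then convert it to a
# uniform bound: Var(U(-b, b)) = b^2 / 3, so b = sqrt(3) * std.
xavier_std(fanin, fanout; gain = 1) = gain * sqrt(2.0 / (fanin + fanout))
kaiming_std(fanin; gain = sqrt(2))  = gain * sqrt(1.0 / fanin)
uniform_bound(std) = sqrt(3) * std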

@ekinakyurek (Collaborator)
@ozanarkancan do you think there is a problem with this PR?

@ozanarkancan (Collaborator)
@ekinakyurek @denizyuret The branch can be merged; however, changing the initialization method will possibly break the replicability of experiments that use the current implementation. This should be stated somewhere...

@denizyuret (Owner) commented Dec 18, 2018 via email
