
R1 Regularization #671

denizyuret opened this issue Dec 3, 2021 · 9 comments

denizyuret commented Dec 3, 2021

@Kausta found a bug in the cat/uncat higher-order gradients while implementing R1 regularization. I am moving the discussion from email to this GitHub issue to follow up. Here is his error description:

The current implementation is on the GitHub page (https://github.com/Kausta/HiSD.jl), with the error in the dis_loss_real function in core/networks.py, and the main error I am getting is the following:

ERROR: LoadError: MethodError: no method matching back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{4}}, ::Knet.KnetArrays.KnetMatrix{Float32}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::Int64, ::Int64, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::Knet.KnetArrays.KnetMatrix{Float32})

Here are his references on PyTorch/TF implementations:

I am adding the papers, documentation, and implementations we discussed.
The R1 regularization was defined in the paper https://arxiv.org/pdf/1801.04406.pdf (Which Training Methods for GANs do actually Converge?), which simplifies the gradient regularization from https://arxiv.org/pdf/1705.09367.pdf (Stabilizing Training of Generative Adversarial Networks through Regularization).
The original R1 implementation can be found at https://github.com/ChristophReich1996/Dirac-GAN/blob/decb8283d919640057c50ff5a1ba01b93ed86332/dirac_gan/loss.py#L292, and the paper I am implementing uses the following implementation https://github.com/imlixinyang/HiSD/blob/main/core/networks.py#L80 (paper link: https://arxiv.org/pdf/2103.01456.pdf).

There had been a Variable interface (like AutoGrad.jl's Param) in PyTorch previously, but it has been deprecated in favor of a more unified interface using only Tensors. (PyTorch Autograd automatically supports Tensors with requires_grad set to True, and both gradients and saved forward values are kept directly on the tensors. During the forward pass, an operation is only recorded in the backward graph if at least one of its input tensors requires grad. During the backward pass (.backward()), only leaf tensors with requires_grad=True have gradients accumulated into their .grad fields. Internally, autograd represents this graph as a graph of Function objects (really expressions), and stores the entry points to the graph on the .grad_fn attribute of each torch.Tensor.)
These are documented in https://pytorch.org/docs/stable/notes/autograd.html, with example based explanations in https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html.
The torch.autograd.grad function is documented at https://pytorch.org/docs/stable/generated/torch.autograd.grad.html, and the remaining autograd-related functions are documented at https://pytorch.org/docs/stable/autograd.html?highlight=variable, including the functional higher-level API for computing Jacobians, Hessians, and Jacobian/Hessian products with input vectors.
Moreover, the PyTorch documentation contains a gradient penalty example (the WGAN-GP gradient penalty, similar to the AutoGrad.jl issue denizyuret/AutoGrad.jl#120); however, it is inside the documentation for AMP (automatic mixed precision): https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-penalty.

TensorFlow also documents higher-order gradients with nested tapes at https://www.tensorflow.org/guide/advanced_autodiff#higher-order_gradients, followed (at the same link) by an input gradient penalty example (the gradient, with respect to the model, of the magnitude of the gradient with respect to the inputs), which is similar to R1 regularization. I am adding this example, together with the small hypothetical example I wrote for the R1 implementation, as an attachment. (Here is a gist: https://gist.github.com/denizyuret/1af3577afbe6a53d61bc75f86fed4ac4)
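For orientation only (this is not the gist's contents), here is a minimal sketch of what the R1 term could look like in Knet/AutoGrad using a nested @diff; D (a discriminator returning one score per sample) and the weight γ are hypothetical placeholders:

# R1 penalty sketch: (γ/2) * mean over the batch of ||∇ₓ D(x)||²  (D, γ are placeholders)
function r1_penalty(D, x; γ=10)
    xp = isa(x, Param) ? x : Param(x)    # track the real batch
    g  = @diff sum(D(xp))                # sum of per-sample scores
    gx = grad(g, xp)                     # gradient of D wrt its input
    return (γ/2) * sum(abs2.(gx)) / size(x)[end]
end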

Meanwhile, I will start by writing a minimal reproducible example for the uncat bug and checking the current implementation/unit tests.
Also, PyTorch provides a gradgradcheck method (https://pytorch.org/docs/stable/generated/torch.autograd.gradgradcheck.html#torch.autograd.gradgradcheck) for checking gradients of gradients. I think a similar one for AutoGrad.jl could be a nice addition for easier testing/bug-fixing.


denizyuret commented Dec 3, 2021

Looking at the error message more carefully, it seems to be trying to find the gradient of uncat wrt its 4th argument. The signature of uncat is: uncat(dy, argn, dims, x...). Its operation can be described as follows: cat concatenates a bunch of x's into a y. In the backward pass we receive dy, the gradient of the loss wrt y. uncat takes this dy and extracts the region that corresponds to the argn'th input argument; it is basically an indexing operation into dy. Therefore only the first argument affects its return value; the x's only determine the shape of the return value. The derivative of uncat wrt any argument other than its first is 0. We never defined them because, under normal (first-order) use, back(::uncat,...) never gets called with argn != 1.
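As a tiny illustration (plain CPU arrays, made-up values): the gradient that flows back to each input of vcat is just the corresponding slice of dy:

using AutoGrad
a, b = Param([1.0, 2.0]), Param([3.0, 4.0, 5.0])
w = [1.0, 10.0, 100.0, 1000.0, 10000.0]
J = @diff sum(w .* vcat(a, b))   # dJ/dy == w for y = vcat(a, b)
grad(J, a)                       # == w[1:2], the slice belonging to a
grad(J, b)                       # == w[3:5], the slice belonging to b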

Now I don't quite understand why the second-order code calls uncat's back method for the fourth argument. But assuming it does so for legitimate reasons, the fix is simple. Just define:

AutoGrad.back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{N}}, dy, y, x...) where {N} = nothing
AutoGrad.back(::typeof(AutoGrad.uncat1), ::Type{AutoGrad.Arg{N}}, dy, y, x...) where {N} = nothing

as a catch-all for derivative requests for any argument other than the first (the existing, more specific Arg{1} methods still take precedence under Julia's dispatch rules), and see if the code works with this. If it does, I will add this definition to core.jl.

You can try the following version of AutoGrad which includes the above fix:

pkg> add AutoGrad#dy/fix671


Kausta commented Dec 6, 2021

I wrote a minimal working example to test the issue:

using Knet
using Statistics: mean
atype = Knet.atype()

# A simple model for the example
struct Linear; w; b; end
Linear(in_dim::Int, out_dim::Int) = Linear(param(out_dim,in_dim,atype=atype), param0(out_dim,atype=atype))
(l::Linear)(x) = l.w * x .+ l.b

struct Model; lin1; lin2; lin3; end
Model(in_dim1::Int,in_dim2::Int) = Model(Linear(in_dim1, 1), Linear(in_dim2, 1), Linear(2, 1))
function (m::Model)(x, y)
    out1 = m.lin1(x)
    out2 = m.lin2(y)
    outc = vcat(out1, out2)
    return m.lin3(outc)
end

# A sample loss function
function loss(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

x = convert(atype, randn(10, 8))
y = convert(atype, randn(5, 8))
model = Model(10, 5)

L = @diff loss(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

With AutoGrad 1.2.4, differentiating loss produces the following error, as expected:

ERROR: LoadError: MethodError: no method matching back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{4}}, ::Knet.KnetArrays.KnetMatrix{Float32}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::Int64, ::Int64, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}})

With AutoGrad#dy/fix671, it works and outputs the following as expected:

(value(L), grad(L, model.lin1.w)) = (4.3223014f0, K32(1,10)[0.32105368⋯])

However, the gradients are the same even if we don't include the following block:

gradfn = grad(t -> sum(model(t, y)))
grad_out = gradfn(x)
loss += sum(abs2.(grad_out)) / size(x)[end]

Moreover, the following outputs nothing:

L = @diff sum(abs2.(grad(t -> sum(model(t, y)))(x))) / size(x)[end]
@show grad(L, model.lin1.w)

Hence, it runs without any compile-time issues; however, I don't think it outputs any second-order gradients. Is it possible that the newly defined back functions are too generic and are always used as the gradient of uncat?

denizyuret commented:

First, mixing the old grad interface (i.e. grad(f)) with the new grad interface (grad(result, param)) is not well tested, and part of the problem seems to come from mixing the two. So if you can find a way to express the computation using only the new interface (i.e. only @diff and grad(result, param)), that could solve the problem.
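For example, the inner gradient in the MWE could be written with only the new interface roughly like this (a sketch using x, y, model from the MWE above):

xp = isa(x, Param) ? x : Param(x)    # make the input a Param
g  = @diff sum(model(xp, y))         # inner (first-order) tape
grad_out = grad(g, xp)               # gradient wrt the input
# the penalty term is then sum(abs2.(grad_out)) / size(x)[end]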

Nevertheless I am also trying to figure out what goes wrong when we do mix the two interfaces. I found two problems and pushed a new update to the dy/fix671 branch:

  1. The old grad function got confused when there was more than one Param in the computation; this should be fixed now.

  2. This one is more difficult: there was a PR (Tape confusion fix AutoGrad.jl#75) for fixing "tape confusion", which I understood at some point but have now forgotten what the problem was. The change is at https://github.com/denizyuret/AutoGrad.jl/blob/1daede9b3215c170b5f9f0860042dca39c54805f/src/core.jl#L135 to L139, which is commented out in dy/fix671. What this code does is, if there are multiple tapes, duplicate the Params and Results using the identity function. When I comment this out your code seems to work. However, I presume it was added for a reason and was fixing some other problem which I have now broken, so this needs to be investigated a bit more.


Kausta commented Dec 8, 2021

It now works both when using only the @diff interface and when mixing both interfaces. I updated the MWE as follows, initially testing with only @diff:

using Knet
using Statistics: mean
atype = Knet.atype()

# A simple model for the example
struct Linear; w; b; end
Linear(in_dim::Int, out_dim::Int) = Linear(param(out_dim,in_dim,atype=atype), param0(out_dim,atype=atype))
(l::Linear)(x) = l.w * x .+ l.b

struct Model; lin1; lin2; lin3; end
Model(in_dim1::Int,in_dim2::Int) = Model(Linear(in_dim1, 1), Linear(in_dim2, 1), Linear(2, 1))
function (m::Model)(x, y)
    out1 = m.lin1(x)
    out2 = m.lin2(y)
    outc = vcat(out1, out2)
    return m.lin3(outc)
end

# Loss1: Only first order, Loss2: first+second order, test: only second order
function loss1(model, x, y)
    out = model(x, y)
    return mean(out)
end

function loss2(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    xp = isa(x, Param) ? x : Param(x)
    g = @diff sum(model(xp, y))
    grad_out = grad(g, xp)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

function test(model, x, y)
    xp = Param(x)
    g = @diff sum(model(xp, y))
    grad_out = grad(g, xp)
    return sum(abs2.(grad_out)) / size(x)[end]
end

x = convert(atype, randn(10, 8))
y = convert(atype, randn(5, 8))
model = Model(10, 5)

L = @diff loss1(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff loss2(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff test(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

grad_result = @Knet.gcheck loss2(model, Param(x), y) (verbose=1,)
println("gcheck result: $grad_result")

and it works without an error using the AutoGrad#dy/fix671 branch. We get the output:

(value(L), grad(L, model.lin1.w)) = (-0.1450544f0, K32(1,10)[-0.06908582⋯])
(value(L), grad(L, model.lin1.w)) = (-0.033414274f0, K32(1,10)[-0.017087717⋯])
(value(L), grad(L, model.lin1.w)) = (0.111640126f0, K32(1,10)[0.051998105⋯])
gcheck result: true

Moreover, the gradients are no longer nothing, and gcheck also reports correct gradients.

In addition, the fix for the mixed interface also seems to work for this test case. By adding the following code:

function loss_mixed_interface(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

function test_mixed_interface(model, x, y)
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    return sum(abs2.(grad_out)) / size(x)[end]
end

L = @diff loss_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff test_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

we get the additional output:

(value(L), grad(L, model.lin1.w)) = (-0.033414274f0, K32(1,10)[-0.017087717⋯])
(value(L), grad(L, model.lin1.w)) = (0.111640126f0, K32(1,10)[0.051998105⋯])

which agrees with the results from using only the @diff interface.

Although it now works for higher-order gradients, I think this reintroduces the bug from denizyuret/AutoGrad.jl#75, as I get true for both of the following statements:

grad(x -> x*grad(y -> x+y)(x))(5.0) == 2
grad(x -> x*grad(y -> x+y)(1x))(5.0) == 1

I am trying to understand why the fixes for denizyuret/AutoGrad.jl#75 break higher-order gradients with the mixed interface, and I will update if I can find a solution. I will also re-check whether the @diff-only version works without removing the tape-confusion fix; in that case, requiring a single interface throughout the code could also be an option for now.


BariscanBozkurt commented Dec 16, 2021

I came across a very similar error while implementing an Implicit-GON (Gradient Origin Network) model for an implicit learning task. pkg> add AutoGrad#dy/fix671 seems to fix the problem for small working examples: after this fix, I was able to debug my implementation on a low-dimensional toy dataset and it worked fine. However, for high-dimensional data I could not obtain an output after nearly 10 minutes and I stopped the code. I will share my MWEs below for a detailed explanation.

As I mentioned in the previous issue #670, I was trying to obtain the derivative of a loss function after two forward passes, which leads to a second-order derivative. In the following MWE, I want to take the derivative of the loss_train(theta,x) function, where I first feed the origin to the model and take the negative gradient of the MSE loss w.r.t. this origin as my new latent point. After that, I feed this new latent point to the model and compute the MSE. I am able to take the gradient of loss_train(theta,x) in the following example; note that the dimensions are very small (latent_dim is 2, batch_size is 3, etc.).

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
    z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta, z_in) .+ 0.001
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
    derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 2
i = 4
o = 1
w = 4
batch_size = 3

x = atype(randn(o,w,batch_size)) # Target
theta = Param(atype(randn(o,i))) # Model Weight
mgrid = get_mgrid(2) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
# z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type

derivative_model = @diff loss_train(theta,x) # Differentiate the loss_train.
# It is working in this example

However, if I use higher-dimensional data and more layers in the model, as in the following modification of the above MWE, I cannot obtain an output even after 10 minutes, so I stop the execution of the code. My implementation includes lots of cat operations, reshaping, and permuting of dims, but I am not sure whether these operations slow down the derivative computation.

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
    z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta[1], z_in) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[2], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[3], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[4], z)
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
    derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64

x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)

mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_origin = @diff loss(theta, z, x) # This works fine
println(derivative_origin)
derivative_model = @diff loss_train(theta,x) # This might work but takes too much time (I waited for 10 min and did not obtain an output)

Is there any implementation detail I am missing that makes my code run extremely slowly? Even taking the derivative of loss(theta, z, x) takes 2-3 seconds. I don't think the sine activations inside the model's forward pass slow down the implementation; I cannot obtain the output even if I delete them.


Kausta commented Dec 16, 2021

@BariscanBozkurt, can you try it with a smaller batch size for at least 2 iterations? In my case, the first pass through the model takes significantly longer. Currently, the first 10 iterations complete in approximately 100 seconds, whereas the next 10 iterations take approximately 15 seconds. If we assume only the first iteration is slow (which I suspect is due to precompilation), that would imply the first iteration takes approximately 135 seconds whereas the other iterations take 1.5 seconds; in other words, the first iteration is approximately 90 times slower. Maybe that is the case for your model too, and it would speed up significantly after the first iteration.

BariscanBozkurt commented:

Hi @Kausta. Thank you for your quick reply. I think I understand the problem now. It is not about precompilation, since the other iterations also take a long time. It is most likely due to the custom function batched_linear(theta, x_in; atype = KnetArray{Float32}). I think taking the second-order derivative of a function that includes two passes through batched_linear is very slow, since AutoGrad has to work out the second derivative of this custom function. If I take the derivative of loss(theta, z, x) after compiling, it is fast, and that only involves one pass of the model. However, I do not know how to make the second-order derivative faster. In PyTorch, the default matrix multiplication can perform such a vectorized matrix multiplication for each batch; since Julia's matrix multiplication does not support that, I had to write it myself, and it apparently slows everything down significantly.

BariscanBozkurt commented:

Disregard my previous comment. In my second example, the hcat inside model_forw(theta, z) concatenates z 784 times. Normally, I want to take the gradient of the loss() function with respect to z inside loss_train(). However, if I define z_rep as a Param outside model_forw() and take the gradient of loss() with respect to z_rep inside loss_train(theta,x), the code runs quite fast. Therefore, the following piece of code works well.

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z_rep) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
#     z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta[1], z_in) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[2], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[3], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[4], z)
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector (not a Param this time)
    z_rep = Param(atype(hcat([z for _ = 1:size(c,2)]...)))
    derivative_origin = @diff loss(theta, z_rep, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z_rep) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64

x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)

mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector 
z_rep = Param(hcat([z for _ = 1:size(c,2)]...)) # Make z_rep Param type this time
# The following line (derivative_origin ) works fine again. However, I do not want to obtain the gradient 
# w.r.t. z_rep actually. I need the gradient w.r.t z !!!
derivative_origin = @diff loss(theta, z_rep, x) 
# The following line to take the derivative w.r.t. model weights is fast now.
derivative_model = @diff loss_train(theta,x)

Here, instead of defining z as a Param, I defined z_rep as a Param outside model_forw(). This way, I can take the gradient of loss_train() w.r.t. the model weights (theta) very quickly, and it runs faster in the follow-up iterations. Therefore, I suspect that concatenating lots of matrices inside a forward pass makes taking the derivative very difficult. However, I could not find a workaround, since I need the gradient of loss() w.r.t. z inside loss_train(). If I could use the repeat() function on Param-typed KnetArrays instead of hcat or cat (since I keep concatenating the same matrix), that might solve my problem. This corresponds to issue #635, but since I need z as a Param I cannot use the workaround recommended there.

BariscanBozkurt commented:

I found my workaround. Instead of using hcat to repeat my z matrix 784 times along its second dimension, I used a 1x1 convolution with weights of all ones. At the end of the day, if I use the following lines in the example functions above, my code runs fast. Credit goes to @ugrulas for this solution.

using Knet
using Statistics: mean
atype = Knet.atype()

one_conv_weight = atype(ones(1,1,1,784)) #Globally define convolution weights of all ones

num_latent = 32
batch_size = 64

z = Param(atype(zeros(batch_size, 1, num_latent))) #size : (64,1,32)
# We won't use the following line which includes hcat function to repeat z
# 784 times. Instead, we utilize 1x1 convolution.
# z_rep = hcat([z for _ = 1:784]...) # size : (64,784,32)
z_ = copy(z) # Create a copy of z, so that z_ is not param type
z_ = permutedims(reshape(z_,64,1,1,32),(4,3,2,1)) # size : (32,1,1,64)
z_ = conv4(one_conv_weight, z_)[:,1,:,:] # size : (32, 784, 64)
z_rep = permutedims(z_, (3,2,1)) #size : (64, 784, 32)
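The same lines could be packaged as a small helper (a sketch only, mirroring the snippet above; it relies on the globals one_conv_weight, batch_size, and num_latent defined above) and used inside model_forw in place of the hcat line:

function repeat_latent(z)                           # z: (batch_size, 1, num_latent)
    z_ = copy(z)                                    # copy of z, as in the snippet above
    z_ = permutedims(reshape(z_, batch_size, 1, 1, num_latent), (4, 3, 2, 1))  # (num_latent, 1, 1, batch_size)
    z_ = conv4(one_conv_weight, z_)[:, 1, :, :]     # (num_latent, 784, batch_size)
    return permutedims(z_, (3, 2, 1))               # (batch_size, 784, num_latent)
end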
