register_forward_hook in AWD_LSTM not working #2850

Open
arnaujc91 opened this issue Oct 2, 2020 · 15 comments

@arnaujc91

Why are the forward hooks not working for the AWD_LSTM model?

I tried the following:

import torch 
import torch.nn as nn
from fastai.text.all import *

def hook_fn(m, i, o):
  print(m._get_name())
  print("     HOOK WORKING!    ")

model = AWD_LSTM(3, 5, 6, 2)  # vocab_sz, emb_sz, n_hid, n_layers

for m in flatten_model(model):
    if has_params(m):
        m.register_forward_hook(hook_fn)

model(torch.randint(3, (64,72)))

and the output is just the result of the forward pass; no hooks fired.
Output:

tensor([[[ 0.1744,  0.2180, -0.2876,  0.2426, -0.1670],
         [ 0.1820,  0.2045, -0.2988,  0.2179, -0.1540],
         [ 0.1936,  0.2058, -0.2988,  0.2025, -0.1598],
         ...,
         [ 0.1955,  0.2122, -0.3008,  0.1770, -0.1693],
         [ 0.2033,  0.2148, -0.2982,  0.1771, -0.1725],
         [ 0.2020,  0.2136, -0.2995,  0.1757, -0.1727]],

        [[ 0.1591,  0.2153, -0.2708,  0.2311, -0.1720],
         [ 0.1792,  0.2202, -0.2829,  0.2234, -0.1747],
         [ 0.1858,  0.2211, -0.2892,  0.2106, -0.1770],
         ...,
         [ 0.1911,  0.2190, -0.2972,  0.1798, -0.1790],
         [ 0.1910,  0.2182, -0.2972,  0.1784, -0.1794],
         [ 0.1920,  0.2187, -0.2975,  0.1799, -0.1788]],

        [[ 0.1521,  0.2057, -0.2798,  0.2316, -0.1627],
         [ 0.1764,  0.2048, -0.2896,  0.2143, -0.1631],
         [ 0.1866,  0.2077, -0.2941,  0.2010, -0.1676],
         ...,
         [ 0.1960,  0.2209, -0.2964,  0.1804, -0.1803],
         [ 0.1960,  0.2209, -0.2964,  0.1804, -0.1803],
         [ 0.1960,  0.2209, -0.2964,  0.1804, -0.1803]],

        ...,

        [[ 0.1432,  0.2092, -0.2800,  0.2401, -0.1603],
         [ 0.1666,  0.2100, -0.2891,  0.2253, -0.1622],
         [ 0.1776,  0.2129, -0.2932,  0.2127, -0.1670],
         ...,
         [ 0.1931,  0.2226, -0.2961,  0.1849, -0.1800],
         [ 0.1931,  0.2226, -0.2961,  0.1849, -0.1800],
         [ 0.1931,  0.2226, -0.2961,  0.1849, -0.1800]],

        [[ 0.1734,  0.2117, -0.2915,  0.2975, -0.1528],
         [ 0.2177,  0.2117, -0.3050,  0.2926, -0.1558],
         [ 0.2387,  0.2108, -0.3144,  0.2894, -0.1540],
         ...,
         [ 0.2793,  0.2245, -0.3165,  0.2622, -0.1761],
         [ 0.2824,  0.2274, -0.3155,  0.2660, -0.1752],
         [ 0.2778,  0.2253, -0.3181,  0.2682, -0.1713]],

        [[ 0.1240,  0.1946, -0.2779,  0.2186, -0.1794],
         [ 0.1409,  0.1858, -0.2867,  0.1940, -0.1902],
         [ 0.1542,  0.1844, -0.2905,  0.1751, -0.1994],
         ...,
         [ 0.1659,  0.1914, -0.2969,  0.1420, -0.2113],
         [ 0.1702,  0.1902, -0.2959,  0.1419, -0.2133],
         [ 0.1751,  0.1905, -0.2951,  0.1416, -0.2154]]],
       grad_fn=<TransposeBackward0>)

Clearly the forward hook is not working. Why?

@sutt
Contributor

sutt commented Oct 16, 2020

But registering the hook on a layer and feeding that layer input directly does work.
Your hook is registered on the layer objects inside model, which is a fastai.text.models.awdlstm.AWD_LSTM object.
[screenshot]
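In case the screenshot does not render, here is a minimal sketch of that check (reusing the toy model and hook_fn from the issue; picking the encoder layer is just an example):

emb = model.encoder                      # the plain nn.Embedding inside AWD_LSTM
emb.register_forward_hook(hook_fn)
emb(torch.randint(3, (64, 72)))          # calling the layer directly -> hook_fn prints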

@arnaujc91
Author

But the implementation of fastai does not do it that way. Please try to run the following code:

from fastai.text.all import *

path = untar_data(URLs.IMDB_SAMPLE)
imdb = pd.read_csv(path/'texts.csv')
imdb_lm = TextDataLoaders.from_df(imdb, text_col='text', is_lm=True)
learn = language_model_learner(imdb_lm, AWD_LSTM, cbs = [ActivationStats],  metrics=[accuracy])

learn.fit(1)

learn.activation_stats.stats[0]

The result for me is:

(#9) [None,None,None,None,None,None,None,None,{'mean': 0.13280442357063293, 'std': 2.111562967300415, 'near_zero': 0.5319848539272031}]
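For reference, here is a sketch of how to see which module each of those nine entries corresponds to (as far as I can tell, the hooked modules are the has_params leaves of the model, in this order):

hooked = [m for m in flatten_model(learn.model) if has_params(m)]
for m, s in zip(hooked, learn.activation_stats.stats[0]):
    print(type(m).__name__, s)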

Basically the forward hooks do not fire for almost any of the layers in fastai's ActivationStats implementation. That was my main motivation for this post: from the beginning this piece of code was not working as expected.

Please tell me if I am doing something wrong (or is this a bug?).

Thanks!

@sutt
Contributor

sutt commented Oct 16, 2020

These are excellent examples. I am reproducing your results exactly.
Could you suggest expected behavior?

In fastai/callback/hook.py, this caught my eye:

def module_summary(learn, *xb):
    "Print a summary of `model` using `xb`"
    #Individual parameters wrapped in ParameterModule aren't called through the hooks in `layer_info`,
    #  thus are not counted inside the summary
    #TODO: find a way to have them counted in param number somehow
    infos = layer_info(learn, *xb)

@arnaujc91
Author

Well, the expected behaviour is that you get the statistics of the activations for each layer. So instead of None I would expect to find something of the form {'mean': 0.5, 'std': 0.99, 'near_zero': 0.12}. Apparently there are 9 layers in this language model, but only the last one gets activation statistics. Why does None appear for the first 8 layers? All these Nones are there because the hooks are not firing.

I went fairly deep into the language model and realized that this happens because of the following two layers in AWD_LSTM:

  • WeightDropout(rnn, weight_p) => you will find this inside the method _one_rnn. Just replace it with rnn.
  • RNNDropout(input_p) => you will find this inside __init__. Just replace it with the identity, i.e. do nothing.

If you remove these dropouts, the forward hooks work fine for the AWD_LSTM architecture. Please let me know how we can fix this. Thanks a lot for your time, it is really appreciated!

@sutt
Contributor

sutt commented Oct 16, 2020

I am not able to reproduce your steps above.

My initial testing shows that:
Good: ActivationStats.before_fit() registers a hook on every layer within the model.
Bad: only the LinearDecoder layer actually gets ActivationStats.hook() called.
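A quick way to verify the "Good" part (a sketch; it assumes the learn object from the IMDB example and that the hooks are still attached, e.g. with remove_end=False, since fastai normally removes them after fit):

for m in flatten_model(learn.model):
    if has_params(m):
        print(type(m).__name__, len(m._forward_hooks), 'forward hook(s) registered')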

@arnaujc91
Author

OK, let's try the following.

First, I modify the AWD_LSTM class in the following way:

class AWD_LSTM_M(Module):
    "AWD-LSTM inspired by https://arxiv.org/abs/1708.02182"
    initrange=0.1

    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token=1, hidden_p=0.2, input_p=0.6, embed_p=0.1,
                 weight_p=0.5, bidir=False):
        store_attr('emb_sz,n_hid,n_layers,pad_token')
        self.bs = 1
        self.n_dir = 2 if bidir else 1
        self.encoder = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
#         self.encoder_dp = EmbeddingDropout(self.encoder, embed_p)
        self.encoder_dp = self.encoder
        self.rnns = nn.ModuleList([self._one_rnn(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.n_dir,
                                                 bidir, weight_p, l) for l in range(n_layers)])
        self.encoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.input_dp = RNNDropout(input_p)
        self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])
        self.reset()

    def forward(self, inp, from_embeds=False):
        bs,sl = inp.shape[:2] if from_embeds else inp.shape
        if bs!=self.bs: self._change_hidden(bs)

        output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))
        new_hidden = []
        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
            output, new_h = rnn(output, self.hidden[l])
            new_hidden.append(new_h)
            if l != self.n_layers - 1: output = hid_dp(output)
        self.hidden = to_detach(new_hidden, cpu=False, gather=False)
        return output

    def _change_hidden(self, bs):
        self.hidden = [self._change_one_hidden(l, bs) for l in range(self.n_layers)]
        self.bs = bs

    def _one_rnn(self, n_in, n_out, bidir, weight_p, l):
        "Return one of the inner rnn"
        rnn = nn.LSTM(n_in, n_out, 1, batch_first=True, bidirectional=bidir)
#         return WeightDropout(rnn, weight_p)
        return rnn

    def _one_hidden(self, l):
        "Return one hidden state"
        nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
        return (one_param(self).new_zeros(self.n_dir, self.bs, nh), one_param(self).new_zeros(self.n_dir, self.bs, nh))

    def _change_one_hidden(self, l, bs):
        if self.bs < bs:
            nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
            return tuple(torch.cat([h, h.new_zeros(self.n_dir, bs-self.bs, nh)], dim=1) for h in self.hidden[l])
        if self.bs > bs: return (self.hidden[l][0][:,:bs].contiguous(), self.hidden[l][1][:,:bs].contiguous())
        return self.hidden[l]

    def reset(self):
        "Reset the hidden states"
        [r.reset() for r in self.rnns if hasattr(r, 'reset')]
        self.hidden = [self._one_hidden(l) for l in range(self.n_layers)]

As you can see, I have just modified two lines of the original code. Now let's run the following experiment:

def hook_fn(m, i, o):
  print(f"Working for layer: -- {m._get_name()} --")

awd_lstm_modified =  AWD_LSTM_M(vocab_sz=3,
                  emb_sz=5,
                  n_hid=6,
                  n_layers=2)

awd_lstm_original =  AWD_LSTM(vocab_sz=3,
                  emb_sz=5,
                  n_hid=6,
                  n_layers=2)

for model in [awd_lstm_modified, awd_lstm_original]:
    print('-------')
    for m in flatten_model(model):
        if has_params(m):
            print(m._get_name())
            m.register_forward_hook(hook_fn)

#Check this output
awd_lstm_modified(torch.randint(3, (1,4)))

# Check this output
awd_lstm_original(torch.randint(3, (1,4)))

You will see that the modified AWD_LSTM_M prints the forward hooks, so it is clear that the problem comes from the two layers I commented out in my modified AWD_LSTM_M class.

Check whether this also happens for you. Then we would have to see why these two dropout wrappers do not trigger the hooks; maybe it comes from PyTorch? I do not know, at this point I just stopped. It would be really cool if we could figure it out! :)

@sutt
Contributor

sutt commented Oct 16, 2020

Thanks for this! Let me take a look at it.
I've changed ActivationStats to print any time a hook fires, as you did, and stopped fastai from removing the hooks in after_fit:

class ActivationStats(HookCallback):
    "Callback that record the mean and std of activations."
    run_before=TrainEvalCallback
    def __init__(self, with_hist=False, **kwargs):
        # super().__init__(**kwargs)
        super().__init__(remove_end=False)
        self.with_hist = with_hist

    def before_fit(self):
        "Initialize stats."
        super().before_fit()
        self.stats = L()

    def hook(self, m, i, o):
        o = o.float()
        print(f'hooked a {m}')
        res = {'mean': o.mean().item(), 'std': o.std().item(),
               'near_zero': (o<=0.05).long().sum().item()/o.numel()}
  • This shows that only the LinearDecoder layer is actually hitting its hook.
  • By changing remove_end=False in the callback, the hooks stay on the torch Sequential object that is the learner's model. So we take that object out and call forward on it directly (see the sketch after the screenshot below):
  • As you can see, model[0] (the AWD_LSTM modules) doesn't fire any hooks,
  • but model[1] (the LinearDecoder module) does fire its hook.
  • Finally, we examine the layers and their respective ._forward_hooks attrs: it looks like all layers do have hooks registered here, they just don't fire for some reason.

[screenshot]
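Roughly, the manual check in the bullets above looks like this (a sketch; it assumes learn from the IMDB example and remove_end=False so the hooks survive after fit):

model = learn.model                # Sequential: AWD_LSTM-based encoder, then LinearDecoder
xb, yb = imdb_lm.one_batch()
model(xb)                          # only the hook(s) inside the LinearDecoder print
model[0](xb)                       # the AWD_LSTM part alone: no hooks fire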

@sutt
Contributor

sutt commented Oct 17, 2020

I've found the problem. There are two lines of code you need to edit.

First, the main problem comes from the WeightDropout class, as you suspected. By "calling" the wrapped module instead of calling its forward method directly, we give the registered hooks a chance to fire (this is handled by PyTorch's Module._call_impl()).
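To see why this matters, here is a minimal, self-contained illustration in plain PyTorch (nothing fastai-specific):

import torch
import torch.nn as nn

lin = nn.Linear(2, 2)
lin.register_forward_hook(lambda m, i, o: print('hook fired'))

x = torch.randn(1, 2)
lin(x)          # goes through Module.__call__ / _call_impl, so the hook fires
lin.forward(x)  # bypasses __call__, so no hooks run; nothing is printed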

Second, we need to handle the tuple outputs of the LSTM layers, so we alter the hook function on ActivationStats.

diff --git a/fastai/callback/hook.py b/fastai/callback/hook.py
index 1146d35..d2c715f 100644
--- a/fastai/callback/hook.py
+++ b/fastai/callback/hook.py
@@ -208,6 +208,7 @@ class ActivationStats(HookCallback):
         self.stats = L()

     def hook(self, m, i, o):
+        if isinstance(o, tuple): o = o[0]
         o = o.float()
         res = {'mean': o.mean().item(), 'std': o.std().item(),
                'near_zero': (o<=0.05).long().sum().item()/o.numel()}
diff --git a/fastai/text/models/awdlstm.py b/fastai/text/models/awdlstm.py
index 546dd8a..3dd154b 100644
--- a/fastai/text/models/awdlstm.py
+++ b/fastai/text/models/awdlstm.py
@@ -50,7 +50,8 @@ class WeightDropout(Module):
         with warnings.catch_warnings():
             # To avoid the warning that comes because the weights aren't flattened.
             warnings.simplefilter("ignore", category=UserWarning)
-            return self.module.forward(*args)
+            return self.module(*args)
+            # return self.module.forward(*args)

This gives you largely what you're looking for.
[screenshot]
I'd like to put this functionality into the library. @arnaujc91, could you show me, essentially, what structure of activation stats we should see for this model type?

@arnaujc91
Author

I think basically all the Nones should have some statistics, right? I do not know how the people who coded it expected it to work, but in the code there is a function named has_params that chooses the layers from which statistics are collected. So my guess is that all the Nones should disappear. For example, in the code from my previous post I get the hooks firing for the embeddings as well; here, in the code you showed, the first two elements of the list are None and they correspond to the embedding layers. I have checked with your modified code and the hooks still do not fire for the embedding layers.

The last issue is that I do not really know what the ParameterModule layers are.

Summary: the first two layers should definitely collect statistics for the embeddings, and my bet would be that no element of the list should be None.

Can we maybe ask the people who implemented this functionality?

@arnaujc91
Author

Hey @sutt !

I saw your pull request. I wanted to ask you: what about the Embedding layer? I have spotted the problem: basically, EmbeddingDropout is not captured by flatten_model, because flatten_model only captures the leaf children, as you can see:

[screenshot]
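In case the screenshot does not show, here is a quick sketch of what flatten_model returns for the stock AWD_LSTM (toy sizes as before):

model = AWD_LSTM(3, 5, 6, 2)
for m in flatten_model(model):
    print(type(m).__name__)
# The Embedding shows up twice (encoder and encoder_dp wrap the same module),
# but EmbeddingDropout itself never appears, because it has children and
# flatten_model only returns leaf modules.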

The hooks themselves are actually working, but only EmbeddingDropout's forward is ever called (the inner Embedding is never called directly), as shown in the following screenshot:

[screenshot]

Therefore there are just two options: either modify flatten_model or modify EmbeddingDropout. Right now flatten_model produces duplicates, as you have seen, and it does not capture the EmbeddingDropout layer because that layer has children.

Any ideas? I have been thinking about it, but I do not consider myself experienced enough to put together a decent PR.

Thanks a lot @sutt !

@sutt
Contributor

sutt commented Oct 24, 2020

@arnaujc91 Thanks for these great examples.
Let me take a look this weekend to study your ideas and see if we can finish this issue off with another PR.
Best,

@arnaujc91
Author

Found a solution! I just wrote a new EmbeddingDropout class and modified a couple of lines of the original AWD_LSTM code. Now the hooks work correctly and, in addition, there is no duplication of layers when using flatten_model.

class EmbeddingDropout(nn.Embedding):
    "Apply dropout with probability `embed_p` to an embedding layer `emb`."
    def __init__(self, *args, embed_p, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed_p = embed_p

    def forward(self, words, scale=None):
        if self.training and self.embed_p != 0:
            size = (self.weight.size(0),1)
            mask = dropout_mask(self.weight.data, size, self.embed_p)
            masked_embed = self.weight * mask
        else: masked_embed = self.weight
        if scale: masked_embed.mul_(scale)
        return F.embedding(words, masked_embed, ifnone(self.padding_idx, -1), self.max_norm,
                       self.norm_type, self.scale_grad_by_freq, self.sparse)
        
class AWD_LSTM(Module):
    "AWD-LSTM inspired by https://arxiv.org/abs/1708.02182"
    initrange=0.1

    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token=1, hidden_p=0.2, input_p=0.6, embed_p=0.1,
                 weight_p=0.5, bidir=False):
        store_attr('emb_sz,n_hid,n_layers,pad_token')
        self.bs = 1
        self.n_dir = 2 if bidir else 1
        self.encoder_dp = EmbeddingDropout(vocab_sz, emb_sz, embed_p=embed_p, padding_idx=pad_token)
        self.encoder_dp.weight.data.uniform_(-self.initrange, self.initrange)
        # self.encoder = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
        # self.encoder_dp = EmbeddingDropout(self.encoder, embed_p)
        self.rnns = nn.ModuleList([self._one_rnn(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.n_dir,
                                                 bidir, weight_p, l) for l in range(n_layers)])
        # self.encoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.input_dp = RNNDropout(input_p)
        self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])
        self.reset()

    def forward(self, inp, from_embeds=False):
        bs,sl = inp.shape[:2] if from_embeds else inp.shape
        if bs!=self.bs: self._change_hidden(bs)

        output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))
        new_hidden = []
        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
            output, new_h = rnn(output, self.hidden[l])
            new_hidden.append(new_h)
            if l != self.n_layers - 1: output = hid_dp(output)
        self.hidden = to_detach(new_hidden, cpu=False, gather=False)
        return output

    def _change_hidden(self, bs):
        self.hidden = [self._change_one_hidden(l, bs) for l in range(self.n_layers)]
        self.bs = bs

    def _one_rnn(self, n_in, n_out, bidir, weight_p, l):
        "Return one of the inner rnn"
        rnn = nn.LSTM(n_in, n_out, 1, batch_first=True, bidirectional=bidir)
        return WeightDropout(rnn, weight_p)

    def _one_hidden(self, l):
        "Return one hidden state"
        nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
        return (one_param(self).new_zeros(self.n_dir, self.bs, nh), one_param(self).new_zeros(self.n_dir, self.bs, nh))

    def _change_one_hidden(self, l, bs):
        if self.bs < bs:
            nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
            return tuple(torch.cat([h, h.new_zeros(self.n_dir, bs-self.bs, nh)], dim=1) for h in self.hidden[l])
        if self.bs > bs: return (self.hidden[l][0][:,:bs].contiguous(), self.hidden[l][1][:,:bs].contiguous())
        return self.hidden[l]

    def reset(self):
        "Reset the hidden states"
        [r.reset() for r in self.rnns if hasattr(r, 'reset')]
        self.hidden = [self._one_hidden(l) for l in range(self.n_layers)]

As you can see, I just slightly modified the EmbeddingDropout class. Here is a screenshot where you can compare:

[screenshot]
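As a quick check (a sketch reusing hook_fn and the toy sizes from earlier; note that the hooks on the LSTMs still need the WeightDropout patch from the diff above):

m = AWD_LSTM(vocab_sz=3, emb_sz=5, n_hid=6, n_layers=2)   # the modified class above
for layer in flatten_model(m):
    if has_params(layer):
        layer.register_forward_hook(hook_fn)
m(torch.randint(3, (1, 4)))   # the EmbeddingDropout hook now fires during the forward pass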

Besides, I think it is also a cleaner solution: instead of using two layers, self.encoder and self.encoder_dp, we now have just one layer, self.encoder_dp. Note that previously the layer self.encoder was only used to build self.encoder_dp; it had no other purpose. So I think it is even simpler now!

On the other hand, I do not know whether EmbeddingDropout has other uses, so my modification might affect any other use of this class elsewhere in the code. Still, I think it is cleaner now.

Let me know what you think! :)

@sutt
Contributor

sutt commented Oct 25, 2020

@arnaujc91
I think you need to leave the AWD_LSTM.encoder attribute in. That will also get the test to pass for your PR. Or maybe just change the test; I think you are correct that encoder never gets used on its own.
The forward method may bypass encoder_dp when the from_embeds argument is used:

def forward(self, inp, from_embeds=False):
        bs,sl = inp.shape[:2] if from_embeds else inp.shape
        if bs!=self.bs: self._change_hidden(bs)

        output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))

I've searched the repository for an example that uses from_embeds=True, but there are no examples of how this should work.
As to why it exists, I can't say for certain, but I'd check out Jeremy's paper on ULMFiT (https://arxiv.org/abs/1801.06146), where he uses an AWD_LSTM. I think what happens is that you fit a model on, say, Wikipedia to create a pretrained model, and then you get "expanded" embeddings when fine-tuning on a specific domain (e.g. IMDB) where certain vocabulary might be more or less common.
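For what it's worth, judging only from the forward method quoted above, from_embeds seems intended to be used like this (a hedged sketch reusing awd_lstm_original and the toy sizes from earlier, not something the repo demonstrates):

embeds = torch.randn(1, 4, 5)                  # (bs, seq_len, emb_sz): pre-computed embeddings
awd_lstm_original(embeds, from_embeds=True)    # skips encoder/encoder_dp, goes straight to input_dp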

Nice work!

@arnaujc91
Author

Hey @sutt !

encoder_dp is only used when from_embeds=False, which is the default. When from_embeds=True we basically skip the self.encoder layer of AWD_LSTM, because we are using an external encoder/embedding. So this does not affect my changes; they only take effect for from_embeds=False, which is the default.

Yes, I saw that the test failed because there must be a self.encoder layer. I am thinking about just renaming the current self.encoder_dp to self.encoder. Do you know whether self.encoder_dp must also be present in the class? And do you know where I can find the tests, so I can make my changes pass them?

Thanks a lot Will, I have learned a lot doing this! Thank you for your time and patience!

@jph00
Member

jph00 commented Nov 4, 2020

In progress in #2906 thanks to @arnaujc91

@marii-moe marii-moe added the bug label Jun 24, 2021