
ConvTranspose2d layers not being tracked #43

Open
marthinwurer opened this issue Sep 12, 2021 · 6 comments


@marthinwurer

from torch import nn
from delve import CheckLayerSat

class simple(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.deconv1 = nn.ConvTranspose2d(32, 3, 3)

    def forward(self, x):
        return self.deconv1(self.conv1(x))

simple_model = simple()
tracker2 = CheckLayerSat("my_experiment", save_to="plotcsv",
                         modules=simple_model, device=image.device)

output:

added layer conv1
Skipping deconv1

This is an awesome tool, but I'd love to see how well the decoder part of my autoencoder works.
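For what it's worth, the skip looks consistent with an `isinstance` allow-list over layer types. A minimal sketch (hypothetical, not delve's actual code; `TRACKED_TYPES` and `partition_layers` are made-up names) of how such filtering produces "added layer conv1" / "Skipping deconv1":

```python
import torch.nn as nn

# Hypothetical allow-list: only these layer types get a forward hook.
# Note that nn.ConvTranspose2d is absent, so deconv layers are skipped.
TRACKED_TYPES = (nn.Conv2d, nn.Linear, nn.LSTM)

def partition_layers(model):
    tracked, skipped = [], []
    for name, module in model.named_modules():
        if isinstance(module, TRACKED_TYPES):
            tracked.append(name)
        elif not list(module.children()):  # leaf modules only
            skipped.append(name)
    return tracked, skipped

model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ConvTranspose2d(32, 3, 3))
tracked, skipped = partition_layers(model)
print(tracked)   # ['0'] -- the Conv2d is tracked
print(skipped)   # ['1'] -- the ConvTranspose2d is skipped
```
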

@MLRichter
Collaborator

Hi,
thanks for pointing this out.
I think this should be an easy fix / new feature.
I'll try to get a PR done this week.

@marthinwurer
Author

Awesome!

[image: saturation plot after swapping the deconv for upscaling + regular conv]

If I can ask some more questions, this is the result from swapping the deconv for upscaling and regular conv. When you have high saturation like shown, is it best to add more filters or add more layers, or both?

@MLRichter
Collaborator

MLRichter commented Sep 12, 2021

My research primarily focuses on classifiers, so take everything I say with a grain of salt.

With that said, you should manipulate the network as a whole and never individual layers.
So if you increase the number of filters, you should scale them globally.
This will change the saturation level (you should aim for something around 20-40%, depending on the dataset).

[image: example saturation plot]

However, the width of a network (the number of filters per layer) behaves somewhat like the number of trees in a decision forest, in the sense that increasing it beyond what is needed mainly hurts computational efficiency, while any performance decreases are generally rather minor.
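As a sketch of what "scaling filters globally" could look like in practice (a hypothetical `width_mult` helper, not part of delve):

```python
import torch.nn as nn

def make_encoder(width_mult=1.0):
    """Scale every channel count by one global multiplier instead of
    widening individual layers in isolation."""
    def c(channels):
        return max(1, int(round(channels * width_mult)))
    return nn.Sequential(
        nn.Conv2d(3, c(32), 3, padding=1), nn.ReLU(),
        nn.Conv2d(c(32), c(64), 3, padding=1), nn.ReLU(),
        nn.Conv2d(c(64), c(128), 3, padding=1), nn.ReLU(),
    )

wide = make_encoder(width_mult=2.0)
print(wide[0].out_channels)  # 64  (32 * 2)
print(wide[4].out_channels)  # 256 (128 * 2)
```

Re-measuring saturation after each global change then tells you whether the new width moved the average toward the 20-40% range.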

What is more worrisome is that the first part of the network (which has the explicit purpose of compressing information) is barely utilized according to the saturation values.
This may indicate that the system is too deep, which means you should be able to remove layers from the compressing part of the network without losing too much performance.
Here is an example of what this looks like for a CNN classifier:

[image: saturation plot for a CNN classifier]

The best way of optimizing performance is usually to first reduce the number of layers and then fiddle with the width of the network.
In the end, your goal should be a roughly evenly distributed saturation with an average between roughly 20 and 40%.
Could you maybe provide some additional details on the architecture?

@marthinwurer
Author

marthinwurer commented Sep 12, 2021

So, I'm trying to implement the autoencoder from world models (https://worldmodels.github.io/):

[image: World Models autoencoder diagram]

This was the output from delve on that architecture (minus the deconv, plus an extra layer because I misread the paper, and a straight-through rather than variational latent):

[image: delve saturation output for that architecture]

I wasn't getting reconstruction results as good as the paper's, so I wanted to use delve to try to figure out the bottlenecks. I switched to a different architecture that mostly used regular conv layers (well, CoordConv, because that paper looked useful and I'm already a mess) so I could use it.

The arch in the graph I first posted:

        self.encoder = nn.Sequential(
            CoordConv2d(3, 32, 3, padding=1),  # 64
            activation(),
            SpectralPool2d(.5),  # 32
            CoordConv2d(32, 64, 3, padding=1),
            activation(),
            SpectralPool2d(.5),  # 16
            CoordConv2d(64, 128, 3, padding=1),
            activation(),
            SpectralPool2d(.5),  # 8
            CoordConv2d(128, 256, 3, padding=1),
            activation(),
            SpectralPool2d(.25),  # 2
            CoordConv2d(256, 256, 3, padding=1),
            activation(),
        )
        self.compress = nn.Linear(1024, latent_size)
        self.decompress = nn.Linear(latent_size, 1024)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 256, 4),  # 4x4
            activation(),
            SpectralPool2d(2),  # 8x8
            CoordConv2d(256, 128, 3, padding=1),
            activation(),
            SpectralPool2d(2),  # 16
            CoordConv2d(128, 64, 3, padding=1),
            activation(),
            SpectralPool2d(2),  # 32
            CoordConv2d(64, 32, 3, padding=1),
            activation(),
            SpectralPool2d(2),  # 64
            CoordConv2d(32, 3, 3, padding=1),
        )
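The spatial bookkeeping in the comments checks out, assuming each `SpectralPool2d(f)` rescales the spatial dimensions by factor `f` and the padded convs preserve resolution: the encoder output flattens to exactly the 1024 features that `nn.Linear(1024, latent_size)` expects.

```python
# Shape walk for the encoder above, starting from a 3x64x64 input.
side = 64
for factor in (0.5, 0.5, 0.5, 0.25):  # the four SpectralPool2d stages
    side = int(side * factor)
print(side)  # 2  (matches the "# 2" comment)

channels = 256                 # output channels of the last CoordConv2d
flat = channels * side * side
print(flat)  # 1024 -> matches nn.Linear(1024, latent_size)
```
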

I'm mostly just messing around at this point, I'm definitely not a professional in this field. Thanks for the help!

@marthinwurer
Author

For closure, I figured out what my issue was!

I ended up going back to basics and implementing the network without any modifications. That didn't solve my issue; it actually made it worse! It looked like the network wasn't training at all, and I ended up using some code from here to look at the gradients in the network, which showed that they were tiny. The issue ended up being with my optimizer: I had messed with the Adam hyperparameters in a misguided attempt to fix some previous issue. Resetting those fixed it, and I got much better results than I had before. Now the graph from delve looks like this:

[image: updated delve saturation graph after fixing the optimizer]

Getting a graph of the gradients was super helpful; it might be another good statistic to track with delve. If you want, I can open another issue with a feature request for that.
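The kind of per-layer gradient check described above can be sketched in a few lines (a hypothetical `grad_norms` helper, not the linked code): uniformly tiny norms across layers are the vanishing-gradient signature that pointed to the optimizer problem.

```python
import torch
from torch import nn

def grad_norms(model):
    """Per-parameter gradient L2 norms after a backward pass."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
for name, g in grad_norms(model).items():
    print(f"{name}: {g:.2e}")
```
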

@MLRichter
Collaborator

MLRichter commented Sep 25, 2021

I have two minor updates regarding this.
First, recording the gradients is a bit trickier, since the backward-hook logic is substantially less stable in torch than the forward hook logic.
In the meantime I have built some prototypes, but it will take some time before this becomes a stable feature.

After digging through the code, I think a mid-sized refactoring is necessary to include non-standard layers in the saturation statistics, such as transposed convolutions or more custom, functional-style convolutions like the ones used in the EfficientNet models.
This requires a more modularized, less monolithic approach to layer recordings.
I am currently working on the concept.
If everything goes well, you can expect a PR in the next month or so.
