
Removing the masking out #222

Open
brunosan opened this issue Apr 19, 2024 · 1 comment

Comments

@brunosan
Member

An MAE with a U-Net like ours is effectively a dual learning strategy: 1) building accurate patch-level embeddings to reconstruct the input image, and 2) masking patches out to learn semantics across patches through interpolation.
The latter, masking out, works really well for semantics that span several patches, but can fail badly for semantics fully contained in one patch with little relation to anything else: a small forest clearing or fire, an aquaculture pond, ... Moreover, when some neighbors are semantically mostly empty (e.g. open water), the masked self-attention may place more of the aquaculture semantics on the empty water patch next door than on the patch itself, which also contains the coast and other features.

The current 75% masking ratio overly emphasizes interpolation, diluting the model's focus on learning discrete, isolated semantic features critical for our applications.

I propose we greatly reduce masking (10% at most) or eliminate it entirely, to prioritize direct learning from unmasked, full-patch data. Maybe even tighten the self-attention weights.

Lowering or removing the masking ratio would let the model learn and retain high-fidelity semantic information from each individual patch, aligning with our priority of precise semantic understanding at the patch level, especially when the semantics are fully contained within a single patch.
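To make the proposal concrete, here is a minimal sketch of MAE-style random masking with a configurable ratio (illustrative only; `random_masking` is a hypothetical helper, and the real implementation shuffles per-sample tensors rather than index lists):

```python
import random

def random_masking(num_patches, mask_ratio, seed=None):
    """MAE-style random masking: hide `mask_ratio` of the patches and
    keep the rest visible for the encoder. Returns (visible, masked)
    patch indices. Sketch only, not the repo's actual implementation."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# With the current 75% ratio, only 4 of 16 patches stay visible:
visible, masked = random_masking(16, 0.75, seed=0)
print(len(visible), len(masked))  # 4 12

# The proposed 10% ratio keeps 15 of 16 patches visible:
visible, masked = random_masking(16, 0.10, seed=0)
print(len(visible), len(masked))  # 15 1
```

At 75% the model must interpolate most of the image from a few visible patches; at 10% almost everything is reconstructed from directly observed data, which is the trade-off being debated here.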

@leothomas @MaceGrim @yellowcap @srmsoumya

@brunosan
Member Author

I chatted with @yellowcap and @lukaskondmann and I think I was wrong.

If you don't mask enough, the task is too easy for the MAE and it will not learn meaningful representations, so I would not do this. You could in principle make the embedding space very constrained, which would make the task harder, but that resembles the style of other autoencoders more than an MAE; I don't think it would combine well.


This, combined with the fact that in v1 we can input smaller chip sizes, means that the patch embeddings are less relevant.

The underlying factor is that patch embeddings are not designed to be used in isolation. On the contrary, they are designed to contain the context around them, and are therefore not well suited for isolated similarity search. Opening a ticket on that now.
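One way to illustrate the point about similarity search: since each patch embedding mixes in surrounding context via self-attention, a query built from a single patch is noisy, and a common workaround (a hypothetical sketch here, not this repo's method) is to mean-pool the patch embeddings over the region of interest before comparing with cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pooled_embedding(patch_embeddings):
    """Mean-pool patch embeddings over a region, so the query reflects
    the region's average semantics rather than one context-mixed patch."""
    dim = len(patch_embeddings[0])
    n = len(patch_embeddings)
    return [sum(p[i] for p in patch_embeddings) / n for i in range(dim)]

# Toy 2-D embeddings for three patches covering one region:
region = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
query = pooled_embedding(region)          # [0.9, 0.1]
print(round(cosine(query, [1.0, 0.0]), 3))  # 0.994
```

Pooling dilutes per-patch context noise, but it also dilutes exactly the small, single-patch semantics discussed above, so it is a mitigation rather than a fix.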
