
Removing the masking out #222

Open
brunosan opened this issue Apr 19, 2024 · 1 comment

Comments

@brunosan
Member

An MAE with a U-Net like ours is effectively a dual learning strategy: 1) building accurate patch-level embeddings to reconstruct the input image, and 2) masking patches out to learn semantics across patches through interpolation.
The latter, masking out, works really well for semantics that span several patches, but can fail badly for semantics fully contained in one patch with little relation to anything else: a small forest clearing or fire, an aquaculture pond, ... Moreover, when some neighbors are semantically mostly empty (e.g. open water), the masked self-attention may place more of the aquaculture semantics on the empty water patch next door than on the patch itself, which also contains the coast and other features.

The current 75% masking ratio overly emphasizes interpolation, diluting the model's focus on learning discrete, isolated semantic features critical for our applications.

I propose we greatly reduce masking (10% at most) or eliminate it entirely, to prioritize direct learning from unmasked, full-patch data. Maybe even tighten the self-attention weights.

Lowering or removing the masking ratio would let the model learn and retain high-fidelity semantic information from each individual patch, aligning with our priority of precise semantic understanding at the patch level, especially when the semantics are fully contained within a single patch.
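To make the proposal concrete, here is a minimal sketch of MAE-style random masking with a configurable ratio (illustrative only; `random_masking` is a hypothetical helper, and the real implementation shuffles per-sample tensors rather than index lists):

```python
import random

def random_masking(num_patches, mask_ratio, seed=None):
    """MAE-style random masking: hide `mask_ratio` of the patches and
    keep the rest visible for the encoder. Returns (visible, masked)
    patch indices. Sketch only, not the repo's actual implementation."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# With the current 75% ratio, only 4 of 16 patches stay visible:
visible, masked = random_masking(16, 0.75, seed=0)
print(len(visible), len(masked))  # 4 12

# The proposed 10% ratio keeps 15 of 16 patches visible:
visible, masked = random_masking(16, 0.10, seed=0)
print(len(visible), len(masked))  # 15 1
```

At 75% the model must interpolate most of the image from a few visible patches; at 10% almost everything is reconstructed from directly observed data, which is the trade-off being debated here.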

@leothomas @MaceGrim @yellowcap @srmsoumya

@brunosan
Member Author

I chatted with @yellowcap and @lukaskondmann and I think I was wrong.

If you don't mask enough, the task is too easy for the MAE and it will not learn meaningful representations, so I would not do this. You could in principle make the embedding space very constrained, which would make the task harder, but that resembles the style of other autoencoders more than an MAE; I don't think it would combine well.


This, combined with the fact that in v1 we can input smaller chip sizes, means that the patch embeddings are less relevant.

The underlying factor is that patch embeddings are not designed to be used in isolation. On the contrary, they are designed to contain the context around them, and are therefore not well suited for isolated similarity search. Opening a ticket on that now.
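One way to illustrate the point about similarity search: since each patch embedding mixes in surrounding context via self-attention, a query built from a single patch is noisy, and a common workaround (a hypothetical sketch here, not this repo's method) is to mean-pool the patch embeddings over the region of interest before comparing with cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pooled_embedding(patch_embeddings):
    """Mean-pool patch embeddings over a region, so the query reflects
    the region's average semantics rather than one context-mixed patch."""
    dim = len(patch_embeddings[0])
    n = len(patch_embeddings)
    return [sum(p[i] for p in patch_embeddings) / n for i in range(dim)]

# Toy 2-D embeddings for three patches covering one region:
region = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
query = pooled_embedding(region)          # [0.9, 0.1]
print(round(cosine(query, [1.0, 0.0]), 3))  # 0.994
```

Pooling dilutes per-patch context noise, but it also dilutes exactly the small, single-patch semantics discussed above, so it is a mitigation rather than a fix.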
