Scale-MAE model #2057
base: main
Conversation
Also see changes in https://github.com/microsoft/torchgeo/pull/2052/files
Some minor renaming suggestions to make things more consistent with DOFA, and some major documentation improvement suggestions. I'm willing to help with both if needed.
Two thoughts:
- I wonder if we should move both of these to the new "Sensor-Agnostic" section because they technically work for (RGB-only) imagery from any sensor
- Since both of these have evaluation results on fMoW, can we add additional columns with those performance metrics (assuming they are comparable)? If we move them to "Sensor-Agnostic", we may need two tables, one for things evaluated on GEO-Bench and one for things evaluated on fMoW.
We may also want to add a short summary or table of which sensor-agnostic models provide which features. For example, DOFA enables explicit dynamic spectral band support (via model arch) and implicit dynamic resolution (via training data), while Scale-MAE has no dynamic spectral band support (RGB-only) but explicit dynamic resolution support (via model arch). Not sure about GASSL, maybe only implicit dynamic resolution (via training data)? It's worth mentioning that neither has dynamic temporal resolution support (maybe Satlas does?). I'm planning on highlighting this in our release notes, so I can also write something up if needed. Something like:
"The following pre-trained models offer dynamic spatial (resolution), temporal (time span), and/or spectral (wavelength) support, either via their training data (implicit) or via their model architecture (explicit):"
| Model | Spatial | Temporal | Spectral |
|---|---|---|---|
| DOFA | implicit | - | explicit |
| GASSL | implicit | - | - |
| Scale-MAE | explicit | - | - |
We could also optionally specify the range of resolutions/time spans/wavelengths that the model was pre-trained on. Just want to give users more feedback as to which model to choose.
Should we save this for a different PR?
I'm fine with that, just don't let me forget before the release.
I'll open a PR after this one so we don't forget to finish it
Scale-MAE Vision Transformer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Suggested change:
-Scale-MAE Vision Transformer
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Scale-MAE
+^^^^^^^^^
This is how we named DOFA, which also uses a ViT backbone.
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

"""Pre-trained Scale-MAE Vision Transformer models."""
"""Pre-trained Scale-MAE Vision Transformer models.""" | |
"""Pre-trained Scale-MAE models.""" |
return emb


class ScaleMAEViT(VisionTransformer): # type: ignore[misc]
Suggested change:
-class ScaleMAEViT(VisionTransformer): # type: ignore[misc]
+class ScaleMAE(VisionTransformer): # type: ignore[misc]
Weights.__deepcopy__ = lambda *args, **kwargs: args[0]


class ScaleMAE_ViTLarge16_Weights(WeightsEnum): # type: ignore[misc]
Suggested change:
-class ScaleMAE_ViTLarge16_Weights(WeightsEnum): # type: ignore[misc]
+class ScaleMAELarge16_Weights(WeightsEnum): # type: ignore[misc]
)


def scalemae_vit_large_patch16(
Suggested change:
-def scalemae_vit_large_patch16(
+def scalemae_large_patch16(
Or is the image size customizable?
Can we add additional sizes? I know we only have pre-trained weights available for large, but we might as well add functions to instantiate other sizes like they do in the source repo.
The image size can be changed and the positional embeddings will get interpolated.
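To illustrate why a changed image size is fine: a ViT stores one positional embedding per patch, arranged on an (H/patch) × (W/patch) grid, so a new image size just means resampling that grid to the new patch count. The sketch below is a hypothetical, dependency-free illustration of bilinear grid resampling, not the PR's actual interpolation code (which would operate on torch tensors):

```python
def resize_grid(grid: list[list[float]], new_h: int, new_w: int) -> list[list[float]]:
    """Bilinearly resample a 2D grid of scalars to shape (new_h, new_w).

    Stand-in for interpolating one channel of a ViT positional-embedding grid.
    """
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map output row i back to a fractional row in the old grid.
        y = i * (old_h - 1) / max(new_h - 1, 1)
        y0, y1 = int(y), min(int(y) + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / max(new_w - 1, 1)
            x0, x1 = int(x), min(int(x) + 1, old_w - 1)
            wx = x - x0
            # Blend the four neighboring grid values.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bot = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out

# A 224px image with 16px patches gives a 14x14 grid; 448px gives 28x28.
old = [[float(i + j) for j in range(14)] for i in range(14)]
new = resize_grid(old, 28, 28)
assert len(new) == 28 and len(new[0]) == 28
# Bilinear resampling preserves the corner embeddings exactly.
assert new[0][0] == old[0][0] and new[-1][-1] == old[-1][-1]
```

In practice this kind of resampling is done in one call (e.g. bicubic `torch.nn.functional.interpolate` over the embedding grid); the hand-rolled loop here is only to make the mechanics visible.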
Updated diff to remove 224.
Finally getting around to adding this.
Adds Scale-MAE model (ViT encoder only) and pretrained weights.
I've verified this reproduces KNN performance at different resolutions for UCMerced but will repeat for other datasets.
@RitwikGupta let me know if this looks good. I cleaned up some of the code a bit so it works out of the box with our trainers (this required setting the res when initializing the model instead of dynamically, but I think it should still be fine).
@calebrob6 lmk if you want to team up on this one.