
Datamodule augmentation defaults #1841

Open

calebrob6 opened this issue Feb 1, 2024 · 8 comments

@calebrob6
Member

Summary

Currently the datamodules divide by 255 by default. This can confuse users who would expect no augmentations by default. We should either make this behavior clearly visible (for example, with a WARNING-level message) or not divide by default.
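
Roughly speaking (an illustrative sketch only, not the exact datamodule code), the default behaves like a Kornia normalization with std=255:

import torch
import kornia.augmentation as K

# Illustrative only: the default augmentation behaves roughly like this,
# subtracting 0 and dividing by 255 on every batch.
default_aug = K.Normalize(mean=torch.tensor([0.0]), std=torch.tensor([255.0]))

batch = torch.full((1, 3, 4, 4), 255.0)  # e.g. uint8 imagery cast to float
print(default_aug(batch).max())  # tensor(1.) -- the data was silently rescaled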

Rationale

No response

Implementation

No response

Alternatives

No response

Additional information

No response

@adamjstewart
Collaborator

For the record, torchvision also divides by 255 by default. But yeah, I'm fine with dividing by 1 by default and allowing data modules to override the std dev.

@adamjstewart adamjstewart added the datamodules PyTorch Lightning datamodules label Feb 1, 2024
@calebrob6
Member Author

They only divide by 255 if the input is uint8, which makes more sense (but it still trips me up sometimes).
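
For example, a quick check against torchvision's documented ToTensor behaviour:

import numpy as np
from torchvision import transforms

to_tensor = transforms.ToTensor()

# uint8 input: torchvision rescales to [0, 1]
img_u8 = np.full((4, 4, 3), 255, dtype=np.uint8)
print(to_tensor(img_u8).max())  # tensor(1.)

# float input: returned as-is, no division by 255
img_f32 = np.full((4, 4, 3), 255.0, dtype=np.float32)
print(to_tensor(img_f32).max())  # tensor(255.)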

@adamjstewart
Collaborator

I'm fine with changing the default to 1, we just need to add a backwards-incompatible tag to warn people and convert all existing data modules to 255 (assuming that's right). Want to make a PR?

Note that some of @lcoandrade's confusion in #1822 was that Raster Vision divides by 255 automatically but he wasn't sure if TorchGeo did too. Maybe he has opinions on this.

@lcoandrade

lcoandrade commented Feb 7, 2024

Hi there!

I think the default behavior should be to normalize to [0, 1] according to the data type used. This would avoid confusion.
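
Something along these lines (a hypothetical helper, just to illustrate the idea):

import numpy as np
import torch

def normalize_by_dtype(image: np.ndarray) -> torch.Tensor:
    """Scale integer imagery to [0, 1] based on its dtype (hypothetical helper)."""
    if np.issubdtype(image.dtype, np.integer):
        max_val = np.iinfo(image.dtype).max
        return torch.from_numpy(image.astype(np.float32) / max_val)
    # floating-point data is assumed to already be in a sensible range
    return torch.from_numpy(image.astype(np.float32))

print(normalize_by_dtype(np.array([[0, 127, 255]], dtype=np.uint8)))
# tensor([[0.0000, 0.4980, 1.0000]])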

I also think this information should be clear in the API. I could only verify the normalization behavior by checking the code on GitHub after @adamjstewart mentioned it to me.

@adamjstewart adamjstewart added the backwards-incompatible Changes that are not backwards compatible label Feb 7, 2024
@adamjstewart adamjstewart added this to the 0.6.0 milestone Feb 7, 2024
@johnnv1

johnnv1 commented Feb 10, 2024

Just my two cents here: "automatically" normalizing data is a really painful experience when working with mostly raw satellite data. A simple example is Sentinel-2, where the L2A product gives us int32 data, but it shouldn't be normalized by dividing by 65536; instead, the data should be divided by 10k and then clipped to [0, 1].

One thing that might help is defaulting to the expectation that image inputs are always in [0, 1], or maybe allowing the user to pass in a callable at initialization to handle the data themselves -- this is something I always need to hack around to be able to use these libraries.
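
For instance, something like this (hypothetical names; the input_transform argument does not exist today):

import torch

def sentinel2_l2a_to_reflectance(image: torch.Tensor) -> torch.Tensor:
    """Hypothetical user-supplied transform: Sentinel-2 L2A digital numbers
    become surface reflectance by dividing by 10 000, then clipping to [0, 1]."""
    return torch.clamp(image.float() / 10_000, 0.0, 1.0)

# Sketch of the suggested API: let the user inject a callable at init time, e.g.
# datamodule = SomeDataModule(..., input_transform=sentinel2_l2a_to_reflectance)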

@DimitrisMantas
Contributor

DimitrisMantas commented Feb 27, 2024

I can do a PR with per-channel min-max scaling to [0,1] as the default augmentation.

I'm actually using this type of scaling for my own work but my implementation requires the user to have the global mins and maxes specified somewhere, as is currently the case with mean and standard deviation.

I think you should be responsible for your own data, and knowing its value range is not that harsh a requirement in my opinion, so let me know if this is OK and I can proceed with it.
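
A rough sketch of what I have in mind (illustrative only, not a final implementation):

import torch

class MinMaxNormalize(torch.nn.Module):
    """Per-channel min-max scaling to [0, 1] using global (dataset-wide)
    minima and maxima supplied by the user, analogous to how mean and
    standard deviation are supplied today. Illustrative sketch only."""

    def __init__(self, mins: torch.Tensor, maxs: torch.Tensor) -> None:
        super().__init__()
        # reshape to (1, C, 1, 1) so the stats broadcast over a (B, C, H, W) batch
        self.register_buffer("mins", mins.view(1, -1, 1, 1))
        self.register_buffer("maxs", maxs.view(1, -1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mins) / (self.maxs - self.mins)

# e.g. 3 bands with very different global ranges
norm = MinMaxNormalize(torch.tensor([0.0, 0.0, -1.0]), torch.tensor([255.0, 65535.0, 1.0]))
out = norm(torch.zeros(2, 3, 8, 8))  # each channel scaled by its own range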

@adamjstewart
Collaborator

> I can do a PR with per-channel min-max scaling to [0,1] as the default augmentation.

This isn't a good idea. We don't want batch-wise normalization, we want a single normalization parameter for the entire dataset.

Let's default to Normalization(mean=0, std=1) (i.e., subtract zero and divide by 1, i.e., do nothing). Then each data module can change this default. For backwards compatibility, we should probably set std=255 for most datasets, as this seems to be the most common range anyway.
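
Concretely, something like this (an illustrative skeleton, not the actual TorchGeo classes):

import torch
import kornia.augmentation as K

class BaseDataModule:  # illustrative skeleton, not the real TorchGeo base class
    # proposed default: subtract 0 and divide by 1, i.e. a no-op normalization
    mean = torch.tensor([0.0])
    std = torch.tensor([1.0])

    def __init__(self) -> None:
        self.aug = K.Normalize(mean=self.mean, std=self.std)

class SomeUint8DataModule(BaseDataModule):
    # data modules with 8-bit-style imagery override std to keep the old behaviour
    std = torch.tensor([255.0])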

@DimitrisMantas
Contributor

Ah, please excuse the confusion; by "per-channel" I meant that if you have an n-band raster, each of its channels gets normalised according to its own minimum and maximum values.

This approach was required in my case because half my bands are aerial unsigned 8-bit imagery whereas the remaining ones represent rasterised LiDAR attributes (e.g., reflectance, slope, etc.), each with its own unit of measurement and value range.

But it's true that all batches are treated the same way and are scaled according to global parameters, as you mention.

That being said, I think doing nothing may be a more general approach.

Labels
backwards-incompatible Changes that are not backwards compatible datamodules PyTorch Lightning datamodules

5 participants