
Add methods for extracting true footprint for sampling valid data only #1881

Open
wants to merge 7 commits into base: main
Conversation

@adriantre (Contributor) commented Feb 14, 2024

Fix #1330

RasterDatasets may contain nodata regions, both from reprojecting all files to a common CRS and from nodata regions inherent to the images themselves.
When IntersectionDataset joins such a dataset with a VectorDataset, this may yield

  1. false positive samples (bad for learning)
  2. empty negative samples (may be bad for learning)

The solution can be summarised as:

  • In RasterDataset, when opening each file, extract its valid-data footprint and add it to the rtree index.
  • In IntersectionDataset._merge_dataset_indices, copy the footprint over to the new rtree index.
  • In the same method, we could optimise by shrinking the bbox to cover only the actual intersection of valid data.
  • In RandomGeoSampler.__iter__, use this footprint to validate that the sample bbox actually overlaps valid data, and don't yield until a valid box is found.
  • Enable the same for GridGeoSampler (probably another PR).
  • Mask out labels over any nodata regions inside the sample boundary. (Since the criterion above is overlaps rather than contains, corners of the resulting sample may still contain nodata while the label mask covers them.) (probably another PR)
  • Add the ability to balance positive and negative samples. The VectorDataset can be intersected with the raster's valid-data footprint in the GeoSampler to facilitate balancing positives against negatives; right now torchgeo gives the user no control over this. (probably another PR)

Useful resources:
Rasterio nodata masks:
https://rasterio.readthedocs.io/en/latest/topics/masks.html#nodata-masks

Extract valid data footprint as vector
https://gist.github.com/sgillies/9713809

Reproject valid data footprint with rasterio
https://geopandas.org/en/stable/docs/user_guide/reproject_fiona.html#rasterio-example
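The rejection-sampling idea in the fourth step above (keep drawing random boxes until one overlaps the valid-data footprint) could be sketched as follows. `random_valid_bbox`, the `size` parameter, and the use of shapely's `intersects` predicate are assumptions for illustration, not the PR's actual helper.

```python
import random
from shapely.geometry import box

def random_valid_bbox(footprint, size: float, max_tries: int = 100):
    """Draw random size-by-size boxes within the footprint's bounds until
    one overlaps the valid-data footprint (a shapely geometry)."""
    minx, miny, maxx, maxy = footprint.bounds
    for _ in range(max_tries):
        x = random.uniform(minx, maxx - size)
        y = random.uniform(miny, maxy - size)
        candidate = box(x, y, x + size, y + size)
        # "Overlaps" in the PR's sense: any intersection with valid data
        if candidate.intersects(footprint):
            return candidate
    raise RuntimeError("no sample overlapping valid data found")
```

Bounding the number of retries avoids an infinite loop when a footprint is degenerate or almost entirely nodata.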

github-actions bot added the labels datasets (Geospatial or benchmark datasets) and samplers (Samplers for indexing datasets) on Feb 14, 2024
adamjstewart added this to the 0.6.0 milestone on Feb 14, 2024
torchgeo/datasets/utils.py
# Read valid/nodata-mask
mask = src.read_masks()
# Close holes
sieved_mask = sieve(mask, 500)
Reviewer commented:

It would be better if we didn't have to know a minimum size that works for any raster. Maybe it could use a factor computed from the mask shape instead?

adriantre (author) replied:

The target here is a polygon with no holes. Probably 500 is never too big (22 x 22 pixels). Could increase it, too.

If there are more (bigger) holes left, we could close them using shapely after converting to vector. What do you think?
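Closing the remaining holes in vector space, as suggested here, might look like the following shapely sketch; `close_holes` and the `min_hole_area` threshold are hypothetical names, not the PR's code.

```python
from shapely.geometry import MultiPolygon, Polygon

def close_holes(geom, min_hole_area: float):
    """Remove interior rings (holes) smaller than min_hole_area from a
    Polygon or MultiPolygon, keeping larger holes intact."""
    def _close(poly: Polygon) -> Polygon:
        keep = [ring for ring in poly.interiors
                if Polygon(ring).area >= min_hole_area]
        return Polygon(poly.exterior, keep)
    if isinstance(geom, MultiPolygon):
        return MultiPolygon([_close(p) for p in geom.geoms])
    return _close(geom)
```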

Reviewer replied:

Something I thought of, if possible, was to use the size of the sampling window to compute the hole-closing size for the polygons.

adriantre (author) replied:

Hmm, you are probably on to something. I'm struggling to work out what effect setting the size too big or too small might have.

Reviewer replied:

What I had in mind: if we close holes that are bigger than the window size, we can still get cases where a sample contains only nodata, considering the multi-polygon situation here.

One example: if we mask Sentinel-2 clouds as nodata beforehand and then close holes using a size bigger than the window size, we can still get random samples that fall entirely inside these nodata regions.
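The threshold discussed here (close holes smaller than one sampling window, keep the larger ones so the sampler can still reject samples inside them) could be computed like this; `sieve_size_for_patch` and the `margin` factor are assumptions for illustration.

```python
def sieve_size_for_patch(patch_size: int, margin: float = 1.0) -> int:
    """Hole-closing threshold in pixels derived from the sampler's patch
    size: holes smaller than one (scaled) patch are closed, while holes at
    least as large as a patch are kept so that all-nodata samples can
    still be rejected."""
    return int((patch_size * margin) ** 2)
```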

adriantre (author) replied:

Sounds smart!

adriantre (author) commented:

One problem is that the patch_size the sampler will use is not available at this point in the code: this runs during the RasterDataset init, which is separate from the Sampler init.

torchgeo/datasets/utils.py (outdated)
torchgeo/samplers/single.py (outdated)
# Get the first valid nodata value.
# Usually the same value for all bands
nodata = valid_nodatavals[0]
vrt = WarpedVRT(src, nodata=nodata, crs=self.crs)
adriantre (author) commented Feb 15, 2024:

As far as I can see, Sentinel-2 has no nodata value set. I looked everywhere: even after enabling the alpha layer in the Sentinel-2 GDAL driver and looking through the MSK_QUALIT file, I found nothing.

adriantre (author) commented Feb 15, 2024:

This change sets the nodata value. Some datasets use other nodata values, so we should probably let the user override it, for example in their subclass of RasterDataset.

adriantre (author) commented:

Currently, nodata is only overridden for the warped data sources. The non-warped sources (already in the correct CRS) are opened as is, but would also need the nodata override.

Comment on lines 156 to 157
hit = self.hits[idx]
bounds = BoundingBox(*hit.bounds)
adriantre (author) commented Feb 15, 2024:

Only the first hit (file) is chosen. Currently I only use the footprint previously extracted for this file, but the sample is read from the merged raster, and the footprint of this one file might not cover the others.

Is my understanding correct?

In that case we would need to fetch all hits that overlap the randomly chosen hit's bounds, combine their footprints, crop the result to the bounds, and pass the resulting footprint to get_random_bounding_box_check_valid_overlap.

adriantre (author) commented Feb 15, 2024:

The footprint for a hit is static, so this could be joined and cropped during init, in IntersectionDataset._merge_dataset_indices.

adriantre (author) commented:

I assume I have understood this correctly, and have added this functionality in _merge_dataset_indices.
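Combining the hits' footprints and cropping them to the shared bounds, as described above, could be sketched with shapely like this. `merged_footprint` is a hypothetical name, and the footprint is assumed to be stored as each rtree entry's `object` payload.

```python
from shapely.geometry import box
from shapely.ops import unary_union

def merged_footprint(hits, bounds):
    """Union the valid-data footprints of all rtree hits overlapping
    `bounds` = (minx, miny, maxx, maxy) and crop to those bounds."""
    union = unary_union([hit.object for hit in hits])
    return union.intersection(box(*bounds))
```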

torchgeo/datasets/geo.py (outdated)

adamjstewart mentioned this pull request on Apr 1, 2024
Labels: datasets (Geospatial or benchmark datasets), samplers (Samplers for indexing datasets)
Successfully merging this pull request may close: How to avoid nodata-only patches (#1330)
3 participants