Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RandomGeoSampler for Pre-chipped Datasets #1976

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

yichiac
Copy link
Contributor

@yichiac yichiac commented Apr 2, 2024

This PR replaces RandomBatchGeoSampler with RandomGeoSampler for Agrifieldnet, Sentinel2_CDL, Sentinel2_NCCM, Sentinel2_EuroCrops, and Sentinel2_SouthAmericaSoybean datamodules. RandomGeoSampler works better for the pre-chipped datasets because it returns the correct random bounding boxes.

@github-actions github-actions bot added the datamodules PyTorch Lightning datamodules label Apr 2, 2024
@adamjstewart adamjstewart modified the milestone: 0.6.0 Apr 2, 2024
@adamjstewart
Copy link
Collaborator

We should do this for all the datamodules you're using, not just these two.

adamjstewart
adamjstewart previously approved these changes Apr 15, 2024
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, is there anything else you wanted to add to this PR or can I merge?

@adamjstewart
Copy link
Collaborator

Tests fail because the fake data consists of a single file and 10% of 1 is 0.

@yichiac
Copy link
Contributor Author

yichiac commented Apr 15, 2024

That's all I want to add to this PR. Thanks for reviewing the code, @adamjstewart!

@github-actions github-actions bot added the testing Continuous integration testing label Apr 21, 2024
@adamjstewart
Copy link
Collaborator

This PR adds 1K new files to git. Do we really need that many files?

@yichiac
Copy link
Contributor Author

yichiac commented May 1, 2024

If we change the split portion from [0.8, 0.1, 0.1] to something like [0.6, 0.2, 0.2], we can reduce 50% of test images.
random_bbox_assignment(self.dataset, [0.8, 0.1, 0.1], generator=generator)
For the current split portion, we need at least 10 images to get at least 1 val/test image. For [0.6, 0.2, 0.2], we just need 5 images. The split portion requiring the minimum number of images would be [0.3, 0.3, 0.3], which only needs 3 images.

That being said, are 500 images too many to be added to git? We can manually change the split portion later using [0.8, 0.1, 0.1].

@adamjstewart
Copy link
Collaborator

If we change the split portion from [0.8, 0.1, 0.1] to something like [0.6, 0.2, 0.2], we can reduce 50% of test images.

We shouldn't change the default split just to make testing easier. If you want to make the split ratio an input parameter (like we do for many other data modules), then we could use [0.33, 0.33, 0.33] only in CI and use [0.8, 0.1, 0.1] by default.

we need at least 10 images

I'm fine with at least 10 images, I'm less fine with at least 1000 images.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datamodules PyTorch Lightning datamodules testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants