-
Notifications
You must be signed in to change notification settings - Fork 294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Digital typhoon dataset #1748
base: main
Are you sure you want to change the base?
Conversation
This is really cool! I wonder if there is any generalization between this and the Cyclone dataset |
Stay tuned:) |
@adamjstewart not sure how I can fix the read the docs error, do I need to add the TypedDict to init? |
This RtD error means that:
Some options:
I'm leaning towards 1. What are your thoughts? |
Thanks, yeah option 1 makes sense, because the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is clearly a very complicated dataset. I didn't really review the pandas stuff in detail. Since this is related to your ongoing time-series support work, I'll let you decide the best format for this dataset API.
docs/api/datasets.rst
Outdated
Digitial Typhoon Analysis | ||
^^^^^^^^^^^^^^^^^^^^^^^^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Digitial Typhoon Analysis | |
^^^^^^^^^^^^^^^^^^^^^^^^^ | |
Digital Typhoon Analysis | |
^^^^^^^^^^^^^^^^^^^^^^^^ |
docs/api/non_geo_datasets.csv
Outdated
@@ -7,6 +7,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands | |||
`Cloud Cover Detection`_,S,Sentinel-2,"CC-BY-4.0","22,728",2,512x512,10,MSI | |||
`COWC`_,"C, R","CSUAV AFRL, ISPRS, LINZ, AGRC","AGPL-3.0-only","388,435",2,256x256,0.15,RGB | |||
`Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI | |||
`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be:
`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared | |
`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-4.0","189,364",,512,5km,Infrared |
according to http://agora.ex.nii.ac.jp/digital-typhoon/dataset/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a classification dataset but there are no # classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be both a regression and classification task and for different tasks (analysis, reanalysis and forecasting) https://github.com/kitamoto-lab/benchmarks the number of classes would also vary depending on the target variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So potentially if we want to integrate the other ones, there should be a baseclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Base class makes sense depending on how different they are. We can have a list or range for # classes, doesn't have to be a single number. We do that for some other columns.
tests/data/digital_typhoon/data.py
Outdated
CHUNK_SIZE = 2**12 | ||
|
||
# Define the root directory | ||
root = "./WP" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You defined this twice. I wouldn't include /
since that's not true on Windows
tests/data/digital_typhoon/data.py
Outdated
shutil.rmtree(root) | ||
|
||
# Create the root directory if it doesn't exist | ||
os.makedirs(root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could remove this line, the line below will do this
tests/data/digital_typhoon/data.py
Outdated
# Create a directory under 'root/image/typhoon_id/' | ||
os.makedirs(os.path.join(root, "image", str(typhoon_id)), exist_ok=True) | ||
|
||
# Create dummy .hf files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Create dummy .hf files | |
# Create dummy .h5 files |
torchgeo/datasets/digital_typhoon.py
Outdated
features: which auxiliary features to return | ||
target: which auxiliary features to use as targets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is features plural but target is singular?
torchgeo/datasets/digital_typhoon.py
Outdated
RuntimeError: if ``download=False`` and data is not found, or checksums | ||
don't match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RuntimeError: if ``download=False`` and data is not found, or checksums | |
don't match | |
DatasetNotFoundError: If dataset is not found and *download* is False. |
torchgeo/datasets/digital_typhoon.py
Outdated
self.min_feature_value = min_feature_value | ||
self.max_feature_value = max_feature_value | ||
|
||
assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error message could be made more clear as to which argument was wrong, but not required
assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}" | ||
self.task = task | ||
|
||
assert set(features).issubset(set(self.valid_features)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually use <=
instead of .issubset(...)
but up to you
torchgeo/datasets/digital_typhoon.py
Outdated
for feature, max_value in self.max_feature_value.items(): | ||
self.aux_df = self.aux_df[self.aux_df[feature] <= max_value] | ||
|
||
def get_subsequences(df: pd.DataFrame, k: int) -> list[dict[str, list[int]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be hidden? Don't want people to rely on it if not necessary
@nilsleh when you find the time we should finish this up. |
@adamjstewart I tried adressing your suggestions to finish this up. But wanted to get your thoughts on the following: There is actually a complication that I am not entirely sure how to handle regarding target normalization.
|
Don't think I've ever used target normalization before, but if you have a random train/val/test split, you can either:
1 is much easier, 2 is more formal. |
In case 1 the normalization would only be available throughtthe Datamodule and you would have to implement it in the on after batch transform and it would not be available for the dataset class Case 2 is not possible I think, because the target range will change depending on the args for |
In both cases you can manually compute the normalization, then copy-n-paste it and store it in the dataset class. You don't need to compute it on the fly during training. The great thing about data modules is that they don't change too much. If there are parameters that select what features you are using, this is similar to which bands are used in the So2SatDataModule. |
But in case 2 I actually need to compute it on the fly because the regression target range changes based on the range restriction. |
How is that different from case 1? The only thing that changes is whether the split is recorded on disk or not. |
It's not different from case 1. Since the target range can change and there are no defined train/test splits on the dataset class level, the target normalization needs to happen in the datamodule. But that would imply that the normalization needs to move to datamodule instead of dataset (where it is currently). So I just wanted to inquire about that :) |
Gotcha. Yeah I would definitely move all transforms/data aug from the dataset to the data module to match our other datasets. Lack of a pre-defined train/val/test split shouldn't matter. Allowing the user to specify min/max feature values does matter. Do we need that? Can't we just compute that based on train and store it permanently with no option to override? Is that designed to serve the same purpose as mean/std? |
These cyclone dataset usually have a very imbalanced target distribution (there are few images of high hurricane categories) and the min feature values would allow the user to basically only select a dataset with only hurricane category images and not images that are just clouds with wind speed of 0 for example. I basically rewrote the Tropical Cylcone dataset locally to have that functionality because it makes running experiments a lot easier, so I thought I would put that option in this dataset as well. |
Up to you I guess. You can use no normalization by default and allow the user to subclass and override the normalization values. Then the user is responsible for calculating mean/std themselves based on split/min/max. Or you can just subtract min and divide by (max - min). Is there any reason why this wouldn't be a good idea? |
This PR adds the Digital Typhoon Dataset.
The implementation allows the following features:
TODO:
Sample Image: