Add Digital typhoon dataset #1748

Open · nilsleh wants to merge 36 commits into main

Conversation

@nilsleh (Collaborator) commented Nov 30, 2023

This PR adds the Digital Typhoon Dataset.

The implementation provides the following features (see the usage sketch below the list):

  • create an input sequence of single-channel images, concatenated along the channel dimension, for a nowcasting task (predicting the label of the last image in the sequence)
  • filter samples by min or max feature values
  • a datamodule that lets you split by storm id (disjoint sets over the time domain) or over the time domain (disjoint sets of storm ids)
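
For illustration, a minimal usage sketch of the features above; the class name, argument names, and sample keys are assumptions based on this PR's description, not a confirmed API:

```python
from torchgeo.datasets import DigitalTyphoonAnalysis  # assumed class name

# All argument names below are assumptions based on the feature list above.
ds = DigitalTyphoonAnalysis(
    root="./WP",
    task="regression",               # or "classification"
    features=["wind", "pressure"],   # auxiliary features returned with each sample
    targets=["wind"],                # feature(s) used as the target
    sequence_length=3,               # single-channel images stacked along the channel dim
    min_feature_value={"wind": 15},  # filter out samples below this wind speed
)
sample = ds[0]
# sample["image"]: (sequence_length, H, W); the label belongs to the last
# image in the sequence (nowcasting).
```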

TODO:

  • Target Normalization for regression task

Sample Image: (not reproduced here)

@nilsleh marked this pull request as draft November 30, 2023
@github-actions bot added the datasets and datamodules labels Nov 30, 2023
@adamjstewart added this to the 0.6.0 milestone Nov 30, 2023
@github-actions bot added the testing label Dec 1, 2023
@calebrob6 (Member)

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

@nilsleh (Collaborator, Author) commented Dec 2, 2023

> This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

Stay tuned :)

@github-actions bot added the documentation label Dec 2, 2023
@nilsleh marked this pull request as ready for review December 2, 2023
@nilsleh (Collaborator, Author) commented Dec 18, 2023

@adamjstewart not sure how I can fix the Read the Docs error; do I need to add the TypedDict to init?

@adamjstewart (Collaborator)

This RtD error means that:

  • It's trying to document the data module class
  • And add a link to where the return type is documented
  • But the SampleSequenceDict class itself doesn't appear in the docs

Some options:

  1. Make split_dataset a hidden method so it doesn't appear in the docs
  2. Use Any instead
  3. Add SampleSequenceDict to the docs

I'm leaning towards 1. What are your thoughts?

@nilsleh (Collaborator, Author) commented Dec 21, 2023

Thanks, yeah, option 1 makes sense, because keeping the TypedDict is nicer for understanding the code.
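
For reference, a minimal sketch of option 1, i.e. underscore-prefixing the method so Sphinx autodoc skips it while the TypedDict stays in the code (class name and signature are illustrative, not the final API):

```python
import pandas as pd

from torchgeo.datamodules import NonGeoDataModule


class DigitalTyphoonAnalysisDataModule(NonGeoDataModule):  # assumed class name
    # Leading underscore: autodoc skips private methods, so Read the Docs no
    # longer needs a link target for the SampleSequenceDict return type.
    def _split_dataset(self, aux_df: pd.DataFrame) -> dict:  # illustrative signature
        ...
```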

@adamjstewart (Collaborator) left a comment

This is clearly a very complicated dataset. I didn't really review the pandas stuff in detail. Since this is related to your ongoing time-series support work, I'll let you decide the best format for this dataset API.

Comment on lines 225 to 226
Digitial Typhoon Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^
Suggested change:
-Digitial Typhoon Analysis
-^^^^^^^^^^^^^^^^^^^^^^^^^
+Digital Typhoon Analysis
+^^^^^^^^^^^^^^^^^^^^^^^^

@@ -7,6 +7,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`Cloud Cover Detection`_,S,Sentinel-2,"CC-BY-4.0","22,728",2,512x512,10,MSI
`COWC`_,"C, R","CSUAV AFRL, ISPRS, LINZ, AGRC","AGPL-3.0-only","388,435",2,256x256,0.15,RGB
`Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI
`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared
Collaborator:

Shouldn't this be:

Suggested change:
-`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared
+`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-4.0","189,364",,512,5km,Infrared

according to http://agora.ex.nii.ac.jp/digital-typhoon/dataset/?

Collaborator:

It's a classification dataset but there are no # classes?

Collaborator (Author):

It can be both a regression and a classification task, and for the different tasks (analysis, reanalysis, and forecasting; see https://github.com/kitamoto-lab/benchmarks) the number of classes would also vary depending on the target variable.

Collaborator (Author):

So potentially, if we want to integrate the other ones, there should be a base class?

Collaborator:

A base class makes sense, depending on how different they are. We can use a list or range for # classes; it doesn't have to be a single number. We do that for some other columns.

CHUNK_SIZE = 2**12

# Define the root directory
root = "./WP"
Collaborator:

You defined this twice. Also, I wouldn't include `/` since that's not the path separator on Windows.

shutil.rmtree(root)

# Create the root directory if it doesn't exist
os.makedirs(root)
Collaborator:

Could remove this line; the line below will do this.

# Create a directory under 'root/image/typhoon_id/'
os.makedirs(os.path.join(root, "image", str(typhoon_id)), exist_ok=True)

# Create dummy .hf files

Suggested change:
-# Create dummy .hf files
+# Create dummy .h5 files

Comment on lines 113 to 114
features: which auxiliary features to return
target: which auxiliary features to use as targets
Collaborator:

Why is features plural but target is singular?

Comment on lines 125 to 126
RuntimeError: if ``download=False`` and data is not found, or checksums
don't match

Suggested change:
-RuntimeError: if ``download=False`` and data is not found, or checksums
-    don't match
+DatasetNotFoundError: If dataset is not found and *download* is False.

self.min_feature_value = min_feature_value
self.max_feature_value = max_feature_value

assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}"
Collaborator:

The error message could be made clearer as to which argument was wrong, but that's not required.

assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}"
self.task = task

assert set(features).issubset(set(self.valid_features))
Collaborator:

I usually use `<=` instead of `.issubset(...)`, but up to you.

for feature, max_value in self.max_feature_value.items():
self.aux_df = self.aux_df[self.aux_df[feature] <= max_value]

def get_subsequences(df: pd.DataFrame, k: int) -> list[dict[str, list[int]]]:
Collaborator:

Can this be hidden? Don't want people to rely on it if not necessary

@adamjstewart (Collaborator)

@nilsleh when you find the time we should finish this up.

@nilsleh (Collaborator, Author) commented Feb 8, 2024

@adamjstewart I tried addressing your suggestions to finish this up, but I wanted to get your thoughts on the following:

There is actually a complication regarding target normalization that I am not entirely sure how to handle.

  • I think having target normalization out of the box for regression tasks is a nice feature: it's a bit annoying to handle yourself, since you have to collect the targets from the relevant sources inside the dataset or datamodule and overwrite some methods. In the Tropical Cyclone dataset, for example, this would also be a nice feature.
  • However, this dataset does not have a predefined train/test split from the authors (unlike Tropical Cyclone), and we only implement a random split in the datamodule so people can use the dataset more easily.
  • This implies that computing the target statistics over the entire dataset technically leaks information into the test set.

@adamjstewart (Collaborator)

Don't think I've ever used target normalization before, but if you have a random train/val/test split, you can either:

  1. Use a fixed seed so that it's the same split every time, calculate stats only on train
  2. Generate a random split, save it to disk, distribute on HF and combine with the dataset

1 is much easier, 2 is more formal.
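
A minimal sketch of option 1, given some `dataset`; the `"label"` sample key is an assumption:

```python
import torch
from torch.utils.data import random_split

# Fixed seed -> identical split every run, so statistics computed on `train`
# never see val/test targets.
generator = torch.Generator().manual_seed(0)
train, val, test = random_split(dataset, [0.8, 0.1, 0.1], generator=generator)

# Compute target statistics on the training subset only.
targets = torch.tensor([sample["label"] for sample in train])  # assumed key
target_mean, target_std = targets.mean(), targets.std()
```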

@nilsleh (Collaborator, Author) commented Feb 8, 2024

In case 1, the normalization would only be available through the datamodule: you would have to implement it in `on_after_batch_transfer`, and it would not be available in the dataset class.

Case 2 is not possible, I think, because the target range will change depending on the `min_feature_value` and `max_feature_value` arguments.

@adamjstewart (Collaborator)

In both cases you can manually compute the normalization statistics, then copy-paste them into the dataset class. You don't need to compute them on the fly during training. The great thing about data modules is that they don't change too much. If there are parameters that select which features you are using, that's similar to selecting which bands are used in `So2SatDataModule`.
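
For example, the hard-coded-statistics pattern could look like this (the numbers are placeholders, not real dataset statistics):

```python
# Computed offline on the train split and pasted in, keyed per target
# feature the same way So2Sat keys its band statistics. Placeholder values.
target_stats = {
    "wind": {"mean": 35.0, "std": 25.0},
    "pressure": {"mean": 990.0, "std": 15.0},
}
```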

@nilsleh (Collaborator, Author) commented Feb 8, 2024

But in case 2 I would actually need to compute it on the fly, because the regression target range changes based on the range restriction.

@adamjstewart (Collaborator)

How is that different from case 1? The only thing that changes is whether the split is recorded on disk or not.

@nilsleh (Collaborator, Author) commented Feb 8, 2024

It's not different from case 1. Since the target range can change and there are no predefined train/test splits at the dataset-class level, the target normalization needs to happen in the datamodule. That would imply moving the normalization from the dataset (where it currently lives) to the datamodule, so I just wanted to ask about that :)

@adamjstewart (Collaborator)

Gotcha. Yeah, I would definitely move all transforms/data augmentation from the dataset to the datamodule to match our other datasets. The lack of a predefined train/val/test split shouldn't matter. Allowing the user to specify min/max feature values does matter. Do we need that? Can't we just compute it based on train and store it permanently, with no option to override? Is it designed to serve the same purpose as mean/std?

@nilsleh (Collaborator, Author) commented Feb 8, 2024

These cyclone datasets usually have a very imbalanced target distribution (there are few images of high hurricane categories), and the min feature value lets the user select, for example, only hurricane-category images rather than images that are just clouds with a wind speed of 0. I basically rewrote the Tropical Cyclone dataset locally to have that functionality because it makes running experiments a lot easier, so I thought I would add that option to this dataset as well.

@adamjstewart (Collaborator)

Up to you, I guess. You can use no normalization by default and allow the user to subclass and override the normalization values. Then the user is responsible for calculating mean/std themselves based on split/min/max.

Or you can just subtract min and divide by (max - min). Is there any reason why this wouldn't be a good idea?
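
A sketch of that min-max scaling as a datamodule hook; the class name, `"label"` key, and bounds are placeholders:

```python
from torchgeo.datamodules import NonGeoDataModule


class DigitalTyphoonAnalysisDataModule(NonGeoDataModule):  # assumed class name
    # Placeholder bounds; in practice taken from the train split or from the
    # min/max feature-value restriction.
    target_min, target_max = 0.0, 185.0

    def on_after_batch_transfer(self, batch, dataloader_idx):
        batch = super().on_after_batch_transfer(batch, dataloader_idx)
        # Min-max scale the regression target to [0, 1].
        batch["label"] = (batch["label"] - self.target_min) / (
            self.target_max - self.target_min
        )
        return batch
```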

Labels: datamodules, datasets, documentation, testing
3 participants