Add Digital typhoon dataset #1748

Open · nilsleh wants to merge 36 commits into main

Conversation

@nilsleh (Collaborator) commented Nov 30, 2023

This PR adds the Digital Typhoon Dataset.

The implementation provides the following features (see the usage sketch below the list):

  • create an input sequence of single-channel images, concatenated along the channel dimension, for a nowcasting task (predicting the label of the last image in the sequence)
  • filter samples by min or max feature values
  • a datamodule that lets you split by storm id (disjoint sets over the time domain) or over the time domain (disjoint sets of storm ids)
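
For illustration, a minimal usage sketch of the features above; the class name, argument names, and sample keys are assumptions based on this PR's description, not a confirmed API:

```python
from torchgeo.datasets import DigitalTyphoonAnalysis  # assumed class name

# All argument names below are assumptions based on the feature list above.
ds = DigitalTyphoonAnalysis(
    root="./WP",
    task="regression",               # or "classification"
    features=["wind", "pressure"],   # auxiliary features returned with each sample
    targets=["wind"],                # feature(s) used as the target
    sequence_length=3,               # single-channel images stacked along the channel dim
    min_feature_value={"wind": 15},  # filter out samples below this wind speed
)
sample = ds[0]
# sample["image"]: (sequence_length, H, W); the label belongs to the last
# image in the sequence (nowcasting).
```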

TODO:

  • Target Normalization for regression task

Sample Image: (not reproduced here)

@nilsleh marked this pull request as draft November 30, 2023
@github-actions bot added the datasets and datamodules labels Nov 30, 2023
@adamjstewart added this to the 0.6.0 milestone Nov 30, 2023
@github-actions bot added the testing label Dec 1, 2023
@calebrob6 (Member)

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

@nilsleh (Collaborator, Author) commented Dec 2, 2023

> This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

Stay tuned :)

@github-actions bot added the documentation label Dec 2, 2023
@nilsleh marked this pull request as ready for review December 2, 2023
@nilsleh (Collaborator, Author) commented Dec 18, 2023

@adamjstewart not sure how I can fix the Read the Docs error; do I need to add the TypedDict to init?

@adamjstewart (Collaborator)

This RtD error means that:

  • It's trying to document the data module class
  • And add a link to where the return type is documented
  • But the SampleSequenceDict class itself doesn't appear in the docs

Some options:

  1. Make split_dataset a hidden method so it doesn't appear in the docs
  2. Use Any instead
  3. Add SampleSequenceDict to the docs

I'm leaning towards 1. What are your thoughts?

@nilsleh (Collaborator, Author) commented Dec 21, 2023

Thanks, yeah, option 1 makes sense, because keeping the TypedDict is nicer for understanding the code.
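
For reference, a minimal sketch of option 1, i.e. underscore-prefixing the method so Sphinx autodoc skips it while the TypedDict stays in the code (class name and signature are illustrative, not the final API):

```python
import pandas as pd

from torchgeo.datamodules import NonGeoDataModule


class DigitalTyphoonAnalysisDataModule(NonGeoDataModule):  # assumed class name
    # Leading underscore: autodoc skips private methods, so Read the Docs no
    # longer needs a link target for the SampleSequenceDict return type.
    def _split_dataset(self, aux_df: pd.DataFrame) -> dict:  # illustrative signature
        ...
```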

@adamjstewart (Collaborator) left a comment

This is clearly a very complicated dataset. I didn't really review the pandas stuff in detail. Since this is related to your ongoing time-series support work, I'll let you decide the best format for this dataset API.

Comment on lines 225 to 226
Digitial Typhoon Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^
Suggested change:
-Digitial Typhoon Analysis
-^^^^^^^^^^^^^^^^^^^^^^^^^
+Digital Typhoon Analysis
+^^^^^^^^^^^^^^^^^^^^^^^^

@@ -7,6 +7,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`Cloud Cover Detection`_,S,Sentinel-2,"CC-BY-4.0","22,728",2,512x512,10,MSI
`COWC`_,"C, R","CSUAV AFRL, ISPRS, LINZ, AGRC","AGPL-3.0-only","388,435",2,256x256,0.15,RGB
`Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI
`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared
Collaborator:

Shouldn't this be:

Suggested change:
-`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-SA-4.0","189,364",,512,5km,Infrared
+`Digitial Typhoon Analysis`_,"C, R",Himawari,"CC-BY-4.0","189,364",,512,5km,Infrared

according to http://agora.ex.nii.ac.jp/digital-typhoon/dataset/?

Collaborator:

It's a classification dataset but there are no # classes?

Collaborator (Author):

It can be both a regression and a classification task, and for the different tasks (analysis, reanalysis, and forecasting; see https://github.com/kitamoto-lab/benchmarks) the number of classes would also vary depending on the target variable.

Collaborator (Author):

So potentially, if we want to integrate the other ones, there should be a base class?

Collaborator:

A base class makes sense, depending on how different they are. We can use a list or range for # classes; it doesn't have to be a single number. We do that for some other columns.

CHUNK_SIZE = 2**12

# Define the root directory
root = "./WP"
Collaborator:

You defined this twice. Also, I wouldn't include `/` since that's not the path separator on Windows.

shutil.rmtree(root)

# Create the root directory if it doesn't exist
os.makedirs(root)
Collaborator:

Could remove this line; the line below will do this.

# Create a directory under 'root/image/typhoon_id/'
os.makedirs(os.path.join(root, "image", str(typhoon_id)), exist_ok=True)

# Create dummy .hf files

Suggested change:
-# Create dummy .hf files
+# Create dummy .h5 files

Comment on lines 113 to 114
features: which auxiliary features to return
target: which auxiliary features to use as targets
Collaborator:

Why is features plural but target is singular?

Comment on lines 125 to 126
RuntimeError: if ``download=False`` and data is not found, or checksums
don't match

Suggested change:
-RuntimeError: if ``download=False`` and data is not found, or checksums
-    don't match
+DatasetNotFoundError: If dataset is not found and *download* is False.

self.min_feature_value = min_feature_value
self.max_feature_value = max_feature_value

assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}"
Collaborator:

The error message could be made clearer as to which argument was wrong, but that's not required.

assert task in self.valid_tasks, f"Please choose one of {self.valid_tasks}"
self.task = task

assert set(features).issubset(set(self.valid_features))
Collaborator:

I usually use `<=` instead of `.issubset(...)`, but up to you.

for feature, max_value in self.max_feature_value.items():
self.aux_df = self.aux_df[self.aux_df[feature] <= max_value]

def get_subsequences(df: pd.DataFrame, k: int) -> list[dict[str, list[int]]]:
Collaborator:

Can this be hidden? Don't want people to rely on it if not necessary

@adamjstewart (Collaborator)

@nilsleh when you find the time we should finish this up.

@nilsleh (Collaborator, Author) commented Feb 8, 2024

@adamjstewart I tried addressing your suggestions to finish this up, but I wanted to get your thoughts on the following:

There is actually a complication regarding target normalization that I am not entirely sure how to handle.

  • I think having target normalization out of the box for regression tasks is a nice feature: it's a bit annoying to handle yourself, since you have to collect the targets from the relevant sources inside the dataset or datamodule and overwrite some methods. In the Tropical Cyclone dataset, for example, this would also be a nice feature.
  • However, this dataset does not have a predefined train/test split from the authors (unlike Tropical Cyclone), and we only implement a random split in the datamodule so people can use the dataset more easily.
  • This implies that computing the target statistics over the entire dataset technically leaks information into the test set.

@adamjstewart (Collaborator)

Don't think I've ever used target normalization before, but if you have a random train/val/test split, you can either:

  1. Use a fixed seed so that it's the same split every time, calculate stats only on train
  2. Generate a random split, save it to disk, distribute on HF and combine with the dataset

1 is much easier, 2 is more formal.
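
A minimal sketch of option 1, given some `dataset`; the `"label"` sample key is an assumption:

```python
import torch
from torch.utils.data import random_split

# Fixed seed -> identical split every run, so statistics computed on `train`
# never see val/test targets.
generator = torch.Generator().manual_seed(0)
train, val, test = random_split(dataset, [0.8, 0.1, 0.1], generator=generator)

# Compute target statistics on the training subset only.
targets = torch.tensor([sample["label"] for sample in train])  # assumed key
target_mean, target_std = targets.mean(), targets.std()
```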

@nilsleh (Collaborator, Author) commented Feb 8, 2024

In case 1, the normalization would only be available through the datamodule: you would have to implement it in `on_after_batch_transfer`, and it would not be available in the dataset class.

Case 2 is not possible, I think, because the target range will change depending on the `min_feature_value` and `max_feature_value` arguments.

@adamjstewart (Collaborator)

In both cases you can manually compute the normalization statistics, then copy-paste them into the dataset class. You don't need to compute them on the fly during training. The great thing about data modules is that they don't change too much. If there are parameters that select which features you are using, that's similar to selecting which bands are used in `So2SatDataModule`.
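
For example, the hard-coded-statistics pattern could look like this (the numbers are placeholders, not real dataset statistics):

```python
# Computed offline on the train split and pasted in, keyed per target
# feature the same way So2Sat keys its band statistics. Placeholder values.
target_stats = {
    "wind": {"mean": 35.0, "std": 25.0},
    "pressure": {"mean": 990.0, "std": 15.0},
}
```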

@nilsleh (Collaborator, Author) commented Feb 8, 2024

But in case 2 I would actually need to compute it on the fly, because the regression target range changes based on the range restriction.

@adamjstewart (Collaborator)

How is that different from case 1? The only thing that changes is whether the split is recorded on disk or not.

@nilsleh (Collaborator, Author) commented Feb 8, 2024

It's not different from case 1. Since the target range can change and there are no predefined train/test splits at the dataset-class level, the target normalization needs to happen in the datamodule. That would imply moving the normalization from the dataset (where it currently lives) to the datamodule, so I just wanted to ask about that :)

@adamjstewart (Collaborator)

Gotcha. Yeah, I would definitely move all transforms/data augmentation from the dataset to the datamodule to match our other datasets. The lack of a predefined train/val/test split shouldn't matter. Allowing the user to specify min/max feature values does matter. Do we need that? Can't we just compute it based on train and store it permanently, with no option to override? Is it designed to serve the same purpose as mean/std?

@nilsleh (Collaborator, Author) commented Feb 8, 2024

These cyclone datasets usually have a very imbalanced target distribution (there are few images of high hurricane categories), and the min feature value lets the user select, for example, only hurricane-category images rather than images that are just clouds with a wind speed of 0. I basically rewrote the Tropical Cyclone dataset locally to have that functionality because it makes running experiments a lot easier, so I thought I would add that option to this dataset as well.

@adamjstewart (Collaborator)

Up to you, I guess. You can use no normalization by default and allow the user to subclass and override the normalization values. Then the user is responsible for calculating mean/std themselves based on split/min/max.

Or you can just subtract min and divide by (max - min). Is there any reason why this wouldn't be a good idea?
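
A sketch of that min-max scaling as a datamodule hook; the class name, `"label"` key, and bounds are placeholders:

```python
from torchgeo.datamodules import NonGeoDataModule


class DigitalTyphoonAnalysisDataModule(NonGeoDataModule):  # assumed class name
    # Placeholder bounds; in practice taken from the train split or from the
    # min/max feature-value restriction.
    target_min, target_max = 0.0, 185.0

    def on_after_batch_transfer(self, batch, dataloader_idx):
        batch = super().on_after_batch_transfer(batch, dataloader_idx)
        # Min-max scale the regression target to [0, 1].
        batch["label"] = (batch["label"] - self.target_min) / (
            self.target_max - self.target_min
        )
        return batch
```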

Labels: datamodules, datasets, documentation, testing
3 participants