Downloaded tif files are black #8

Open
blumenstiel opened this issue Apr 26, 2024 · 4 comments

@blumenstiel

I downloaded some data and noticed that some S2 data is completely black, e.g., grid cell 207D_1378R or 438U_1009R. The S1 data looks fine.

I used the filter_download function provided in this repo and tested both with and without by_row. I also tested Image.open(BytesIO(table[col][0].as_py())).show(), with the same result.

The tif files do not include a FillValue. I assume 0 is used for NaN values?

Is it possible that some data got corrupted during the download or upload to HF?

@mikonvergence
Collaborator

Hi @blumenstiel - thanks for bringing this up! I had a look too and it does seem like these two cells are indeed corrupted.

We made no changes to the original values, so like in the original Sentinel-2 data, 0 should represent no data (as far as I'm aware).

It is somewhat unlikely that the corruption occurred during the upload, so we will investigate soon. If needed we can update the corresponding parquet file.

Are there more files that are completely black that you found?

@blumenstiel
Author

Hi @mikonvergence, thanks for looking into it!

I checked another 100 random samples and got 14 corrupted files:

,grid_cell
0,171D_798L
1,160D_805L
2,143D_811L
3,142D_810L
4,142D_803L
5,138D_800L
6,133D_803L
7,128D_793L
8,117D_811L
9,113D_786L
10,110D_813L
11,107D_796L
12,94D_810L
13,451U_259L

So I assume that this potentially affects 10-20% of the grid cells. I did not manually check the samples, but based on my code, each of these grid cells should have only NaN values in either S1 or S2.

Maybe add a quick check after downloading/before uploading to your processing scripts?
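
A minimal sketch of what such a check could look like, assuming the pixel values of one band are available as a flat iterable of numbers and that 0 marks no-data, as discussed above. The names `nodata_fraction` and `is_empty_tile` are hypothetical and not part of this repo:

```python
from typing import Iterable


def nodata_fraction(pixels: Iterable[float], nodata_value: float = 0) -> float:
    """Return the share of pixels equal to nodata_value, as a ratio in [0, 1]."""
    total = 0
    missing = 0
    for value in pixels:
        total += 1
        if value == nodata_value:
            missing += 1
    if total == 0:
        raise ValueError("tile contains no pixels")
    return missing / total


def is_empty_tile(pixels: Iterable[float], nodata_value: float = 0) -> bool:
    """True when every pixel equals nodata_value (the tile renders fully black)."""
    return nodata_fraction(pixels, nodata_value) == 1.0
```

With Pillow, the input could come from something like `Image.open(BytesIO(table[col][0].as_py())).getdata()`, provided the image is single-band; multi-band images yield tuples per pixel and would need to be split by band first.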

@aliFrancis
Collaborator

Hi, we're looking into this! Thanks for bringing this to our attention.

Doing some digging, there is a small percentage of S2 tiles (1.3%) which have 100% no-data (==0). I guess you got very unlucky, or something about your search made them more likely? Regardless, not sure why this has happened in the first place and why it got past our checks. Seems that all the IDs you list here have nodata==1.0 in the metadata (except the last grid tile, which I manually verified and it has an image over the sea, albeit a dark one). So, for now, I recommend explicitly filtering out tiles with 100% nodata percentage (the value is a ratio between 0-1, as sometimes we get images that are partially nodata).

As I say, thanks for bringing this to our attention, we will look into correcting/removing these!!
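
For anyone who wants to apply the suggested filter in the meantime, here is a rough sketch. It assumes the metadata has been loaded as a list of rows with a `grid_cell` field and a `nodata` field holding the 0-1 ratio; both field names and the helper `drop_empty_tiles` are assumptions for illustration, so check them against the actual metadata schema. The sample rows are made up:

```python
def drop_empty_tiles(rows, nodata_key="nodata", threshold=1.0):
    """Keep only rows whose no-data ratio is strictly below threshold.

    With the default threshold of 1.0, tiles that are 100% no-data are
    dropped and partially covered tiles are kept.
    """
    return [row for row in rows if float(row[nodata_key]) < threshold]


# Made-up metadata rows, for illustration only.
rows = [
    {"grid_cell": "cell_a", "nodata": 1.0},   # fully empty, dropped
    {"grid_cell": "cell_b", "nodata": 0.25},  # partially empty, kept
    {"grid_cell": "cell_c", "nodata": 0.0},   # fully valid, kept
]

kept = drop_empty_tiles(rows)
kept_cells = [row["grid_cell"] for row in kept]
```

Lowering `threshold` (for example to 0.5) would also exclude tiles that are mostly, but not entirely, no-data.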

@blumenstiel
Author

Thank you @aliFrancis! I forgot to look at the no-data column, this explains a lot.
