
read_mdim cannot find array but cli gdalmdiminfo finds array just fine? #659

Open · cboettig opened this issue Jan 2, 2024 · 8 comments


cboettig commented Jan 2, 2024

The following fails with error:

public_S3_url <- "https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5"

co2 <- paste0("/vsicurl/", public_S3_url) |> stars::read_mdim("/GRSM/dp04/data/fluxCo2/nsae")
Error: Cannot find array

But the same array is easily located by gdalmdiminfo, so I don't understand why this error occurs:

system(paste("gdalmdiminfo",  paste0("/vsicurl/", public_S3_url), "-array", "/GRSM/dp04/data/fluxCo2/nsae"))

Tested with stars 0.6.4 and the following libraries:

Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

edzer commented Jan 3, 2024

There's some pretty heavy nested grouping going on in that file!


cboettig commented Jan 4, 2024

> There's some pretty heavy nested grouping going on in that file!

🤣 haha so true. There are so many ridiculous examples of nesting in hdf5 that I sometimes think they intentionally serialize data in the worst way possible. (But then I remember that this is the pre-release format -- the current format is then manually .gz-compressed to an h5.gz file.) For context, these esoteric creatures are perhaps the flagship data product (eddy covariance measurements from flux towers) of the 81 NEON sites that the NSF is spending nearly half a billion dollars on: https://data.neonscience.org/data-products/DP4.00200.001


edzer commented Jan 4, 2024

I can get it to work with a hack that passes on the group nesting, giving something like

> stars::read_mdim("/vsicurl/https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5", "nsae", groups = c("GRSM", "dp04", "data", "fluxCo2"))

Read failed for array nsae
stars object with 1 dimensions and 1 attribute
attribute(s):
      Min. 1st Qu. Median Mean 3rd Qu. Max.
nsae     0       0      0    0       0    0
dimension(s):
     from to offset delta
dim0    1 48    0.5     1
Warning message:
In CPL_read_mdim(file, array_name, groups, options, offset, count,  :
  GDAL Error 1: Array data type is not convertible to buffer data type

indicating that it gets to the array, opens it, but cannot read it.

If you look at the actual array values of the array, e.g. with

gdalmdiminfo -detailed /vsicurl/https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5 -array /GRSM/dp04/data/fluxCo2/nsae

I think it may become clear why this may be (and remain?) a problem: the array values are database tuples.

Have you tried reading these data with some hdf5 package (h5?) directly?

edzer added a commit to r-spatial/sf that referenced this issue Jan 4, 2024
edzer added a commit that referenced this issue Jan 4, 2024
edzer added a commit that referenced this issue Jan 4, 2024

edzer commented Jan 4, 2024

These commits are probably going to be replaced by the way gdalmdiminfo does this.


cboettig commented Jan 4, 2024

> Have you tried reading these data with some hdf5 package (h5?) directly?

rhdf5 will subset this file remotely rather well; I couldn't get this to work with the other packages.

  public_S3_url <- "https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5"
  
  #listGrp <- rhdf5::h5ls(file = public_S3_url, s3 = TRUE) # slow! 
  
  # accessing a given array is fast though
  nee <- rhdf5::h5read(file = public_S3_url,
                       name = glue::glue("{site}/dp04/data/fluxCo2/turb",
                                         site="GRSM"), 
                       s3 = TRUE) 

(rhdf5 can't handle the stupid case where these are gzipped; I know /vsigzip/ can be slow, but it works here. Still, you're probably right that gdalmdiminfo isn't ideal for this!)
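Since GDAL virtual file systems can be chained, a remote gzipped file can in principle be opened by stacking /vsigzip/ over /vsicurl/; a minimal sketch, where the .h5.gz URL is a placeholder, not a real file:

```r
# Hypothetical sketch: GDAL virtual file system prefixes chain left to right,
# so /vsigzip/ decompresses what /vsicurl/ streams from the remote URL.
gz_url <- "https://example.com/NEON-flux-file.h5.gz"  # placeholder URL
dsn    <- paste0("/vsigzip//vsicurl/", gz_url)
# co2 <- stars::read_mdim(dsn, "/GRSM/dp04/data/fluxCo2/nsae")
```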

In other news, I'd love to see proxy support for stars::read_mdim, e.g. for large Zarr archives, e.g. so we can do R examples like https://planetarycomputer.microsoft.com/dataset/daymet-daily-na#Example-Notebook


edzer commented Jan 5, 2024

> In other news, I'd love to see proxy support for stars::read_mdim, e.g. for large Zarr archives

what would you like to see beyond the ability to read one or more slices, or sub-cubes (currently in offset, count and step)?
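For reference, the sub-cube interface mentioned above can be sketched like this; the dsn, array name, and dimension order (x, y, time) are assumptions for illustration, and the read call is commented out because it requires network access:

```r
# Sketch of the existing sub-cube interface of stars::read_mdim():
#   offset: 0-based start index per dimension
#   count : number of cells to read per dimension (NA = to the end)
#   step  : stride per dimension
dsn    <- 'ZARR:"/vsicurl/https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr":/tmax'
offset <- c(0, 0, 0)       # start at the first cell in each dimension
count  <- c(100, 100, 12)  # a 100 x 100 spatial window, 12 time steps
step   <- c(1, 1, 1)       # no striding
# x <- stars::read_mdim(dsn, offset = offset, count = count, step = step)
```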

edzer added a commit that referenced this issue Jan 5, 2024

cboettig commented Jan 5, 2024

> what would you like to see beyond the ability to read one or more slices, or sub-cubes (currently in offset, count and step)?

Thanks, and apologies: I probably should have opened a separate issue or a stackoverflow post, as this could just be my lack of understanding.

One part of this is just ergonomics. I imagine I can crop spatial extents by setting the appropriate combination of count and offset, and downsample with step, but this doesn't match the syntax we have elsewhere in stars for things like crop, right? What about things like aggregation? In python, the xarray syntax for all these operations stays entirely consistent whether we're reading from Zarr, reading netcdf, or even reading COGs from STAC.
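To make the ergonomics gap concrete, here is a sketch of how a crop must be translated into offset/count by hand today, assuming a regular lon/lat grid whose origin and cell size are known; the grid parameters below are made up for illustration:

```r
# Hypothetical: derive offset/count for stars::read_mdim() from a bbox,
# assuming a regular grid with known origin and cell size per dimension.
bbox <- c(xmin = -124.4, ymin = 32.5, xmax = -114.1, ymax = 42.0)  # roughly California

x0 <- -179; dx <-  0.25   # assumed x origin and cell size
y0 <-   85; dy <- -0.25   # assumed y origin; negative: y decreases with index

ix <- floor((c(bbox[["xmin"]], bbox[["xmax"]]) - x0) / dx)  # x index range
iy <- floor((c(bbox[["ymax"]], bbox[["ymin"]]) - y0) / dy)  # top row first

offset <- c(ix[1], iy[1], 0)                 # 0-based start per dimension
count  <- c(diff(ix) + 1, diff(iy) + 1, NA)  # NA = all time steps
# x <- stars::read_mdim(dsn, offset = offset, count = count)
```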

I think my idealized workflow would actually look something like:

library(spData)
box <- us_states[us_states$NAME=="California",] |> st_transform(4326)

library(stars)
url <- "https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr"

# desired but not functional syntax:
x <- read_stars(url) |> st_crop(box)

and have a lazy-read object where I can slice times (or other dimensions) on indices. Perhaps my expectations here are just entirely unreasonable. Yes, I know we currently need the GDAL decorators to even get this to parse:

dsn <- glue::glue('ZARR:"/vsicurl/{url}":/tmax')
tmax <- read_stars(dsn)

This works, but already confuses my students -- this isn't the way I previously taught them to specify dimensions (tmax) in read_stars, and the prefix/quotation syntax causes trouble. I understand that mdim is also the preferred interface for Zarr, but the syntax of offset / count / step gets progressively harder to anticipate, and again differs from the way these concepts were introduced previously in stars objects:

x <- read_mdim(dsn, count = c(NA, NA, 492), step=c(4, 4, 12))

The ergonomics are again hard here because we hardwire numerical values into code that could be obtained from the object metadata but are otherwise not immediately obvious to the user. This also gives me a bunch of warnings:

x <- read_mdim(dsn, count = c(NA, NA, 492), step=c(4, 4, 12))
Read failed for array prcp
Read failed for array swe
Read failed for array tmax
Read failed for array tmin
Read failed for array vp
Warning messages:
1: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
2: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
3: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
4: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
5: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
6: In create_dimension(values = x, is_raster = !sf && i %in% 1:2, point = ifelse(length(x) ==  :
  dimension value(s) non-finite

and seems to have an invalid crs which I can't override:

> st_crs(x) <- 4326
Error in st_crs.character(x[[xy[1]]]$refsys) : invalid crs: udunits

and attempting to plot it crashes R on a platform with 100 GB of RAM.
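The GDAL error above spells out the constraint being violated: a request fails when offset + (count - 1) * step reaches at or past the dimension size. With 492 time steps and step = 12, count = 492 overshoots; the largest admissible count can be computed from the metadata rather than guessed (max_count below is a hypothetical helper, not part of stars):

```r
# Largest count satisfying: offset + (count - 1) * step < n,
# the condition GDAL checks (n is the dimension size).
max_count <- function(n, offset = 0, step = 1) {
  (n - 1 - offset) %/% step + 1
}
max_count(492, step = 12)  # 41: last index read is 0 + 40 * 12 = 480 < 492
```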

edzer added a commit to r-spatial/sf that referenced this issue Jan 5, 2024
edzer added a commit that referenced this issue Jan 5, 2024

edzer commented Jan 5, 2024

This gives us

> stars::read_mdim("NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5", "/GRSM/dp04/data/fluxCo2/nsae")
                    timeBgn                  timeEnd        flux
1  2020-04-26T00:00:00.000Z 2020-04-26T00:29:59.000Z         NaN
2  2020-04-26T00:30:00.000Z 2020-04-26T00:59:59.000Z  -0.7405883
3  2020-04-26T01:00:00.000Z 2020-04-26T01:29:59.000Z  -0.2856589
4  2020-04-26T01:30:00.000Z 2020-04-26T01:59:59.000Z   2.7275024
5  2020-04-26T02:00:00.000Z 2020-04-26T02:29:59.000Z  14.2427594
6  2020-04-26T02:30:00.000Z 2020-04-26T02:59:59.000Z  -4.1640435
7  2020-04-26T03:00:00.000Z 2020-04-26T03:29:59.000Z -23.5961781
8  2020-04-26T03:30:00.000Z 2020-04-26T03:59:59.000Z -38.7475553
9  2020-04-26T04:00:00.000Z 2020-04-26T04:29:59.000Z  -5.9633267
10 2020-04-26T04:30:00.000Z 2020-04-26T04:59:59.000Z  -1.9345659
11 2020-04-26T05:00:00.000Z 2020-04-26T05:29:59.000Z   5.7533219
12 2020-04-26T05:30:00.000Z 2020-04-26T05:59:59.000Z   3.6831897
13 2020-04-26T06:00:00.000Z 2020-04-26T06:29:59.000Z  -5.8330796
14 2020-04-26T06:30:00.000Z 2020-04-26T06:59:59.000Z   2.9189290
15 2020-04-26T07:00:00.000Z 2020-04-26T07:29:59.000Z   3.0636229
16 2020-04-26T07:30:00.000Z 2020-04-26T07:59:59.000Z   3.0182718
17 2020-04-26T08:00:00.000Z 2020-04-26T08:29:59.000Z   2.9283239
18 2020-04-26T08:30:00.000Z 2020-04-26T08:59:59.000Z  -3.0542519
19 2020-04-26T09:00:00.000Z 2020-04-26T09:29:59.000Z   5.5374485
20 2020-04-26T09:30:00.000Z 2020-04-26T09:59:59.000Z   0.2513057
21 2020-04-26T10:00:00.000Z 2020-04-26T10:29:59.000Z   0.6168280
22 2020-04-26T10:30:00.000Z 2020-04-26T10:59:59.000Z  -5.4693606
23 2020-04-26T11:00:00.000Z 2020-04-26T11:29:59.000Z   0.3674658
24 2020-04-26T11:30:00.000Z 2020-04-26T11:59:59.000Z   1.4159392
25 2020-04-26T12:00:00.000Z 2020-04-26T12:29:59.000Z  -1.0238224
26 2020-04-26T12:30:00.000Z 2020-04-26T12:59:59.000Z   0.7054844
27 2020-04-26T13:00:00.000Z 2020-04-26T13:29:59.000Z  -8.9968282
28 2020-04-26T13:30:00.000Z 2020-04-26T13:59:59.000Z  12.3038478
29 2020-04-26T14:00:00.000Z 2020-04-26T14:29:59.000Z  -3.2665178
30 2020-04-26T14:30:00.000Z 2020-04-26T14:59:59.000Z  -3.2718385
31 2020-04-26T15:00:00.000Z 2020-04-26T15:29:59.000Z -10.0929240
32 2020-04-26T15:30:00.000Z 2020-04-26T15:59:59.000Z  -5.9556013
33 2020-04-26T16:00:00.000Z 2020-04-26T16:29:59.000Z  -8.2911191
34 2020-04-26T16:30:00.000Z 2020-04-26T16:59:59.000Z  -5.0356033
35 2020-04-26T17:00:00.000Z 2020-04-26T17:29:59.000Z  -8.3811585
36 2020-04-26T17:30:00.000Z 2020-04-26T17:59:59.000Z  -1.6936340
37 2020-04-26T18:00:00.000Z 2020-04-26T18:29:59.000Z  14.2182930
38 2020-04-26T18:30:00.000Z 2020-04-26T18:59:59.000Z -10.0673130
39 2020-04-26T19:00:00.000Z 2020-04-26T19:29:59.000Z  -1.8299035
40 2020-04-26T19:30:00.000Z 2020-04-26T19:59:59.000Z  -0.7272567
41 2020-04-26T20:00:00.000Z 2020-04-26T20:29:59.000Z   8.0110202
42 2020-04-26T20:30:00.000Z 2020-04-26T20:59:59.000Z         NaN
43 2020-04-26T21:00:00.000Z 2020-04-26T21:29:59.000Z  12.1214397
44 2020-04-26T21:30:00.000Z 2020-04-26T21:59:59.000Z  -2.1407618
45 2020-04-26T22:00:00.000Z 2020-04-26T22:29:59.000Z -65.3029157
46 2020-04-26T22:30:00.000Z 2020-04-26T22:59:59.000Z   0.4240327
47 2020-04-26T23:00:00.000Z 2020-04-26T23:29:59.000Z  -0.8366531
48 2020-04-26T23:30:00.000Z 2020-04-26T23:59:59.000Z -12.9685483

although it still needs a fix if the numeric data is not Float64.

I'll come back to your other comments - I agree, interfaces and usability are important.
