read_mdim cannot find array but cli gdalmdiminfo finds array just fine? #659
There's some pretty heavy nested grouping going on in that file!
🤣 haha so true. So many ridiculous examples of nesting in hdf5, I sometimes think they intentionally serialize data in the worst way possible. (But then I remember that this is the pre-release format -- the current format is then manually .gz-compressed to an h5.gz file.) For context, these esoteric creatures are perhaps the flagship data product (eddy covariance measurements from flux towers) of the 81 NEON sites that our NSF is spending nearly half a billion dollars on... https://data.neonscience.org/data-products/DP4.00200.001
I can get it to work with a hack that passes on the group nesting, giving something like

```r
> stars::read_mdim("/vsicurl/https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5",
+     "nsae", groups = c("GRSM", "dp04", "data", "fluxCo2"))
Read failed for array nsae
stars object with 1 dimensions and 1 attribute
attribute(s):
      Min. 1st Qu. Median Mean 3rd Qu. Max.
nsae     0       0      0    0       0    0
dimension(s):
     from to offset delta
dim0    1 48    0.5     1
Warning message:
In CPL_read_mdim(file, array_name, groups, options, offset, count,  :
  GDAL Error 1: Array data type is not convertible to buffer data type
```

indicating that it gets to the array, opens it, but cannot read it. If you look at the actual array values, I think it may become clear why this may be (and remain?) a problem: the array values are database tuples. Have you tried reading these data with some hdf5 package (h5?) directly?
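The "database tuples" above are HDF5 compound datasets. A minimal sketch (in Python/h5py; the group path and field names merely mimic the NEON layout and are invented for illustration) of why a reader expecting a plain numeric array has nothing to convert, while per-field reads work fine:

```python
import numpy as np
import h5py

# A compound dtype loosely mimicking one record of a flux table
# (field names are invented for illustration).
rec = np.dtype([("timeBgn", "S20"), ("flux", "f8"), ("qfFinl", "i4")])
data = np.array([(b"2020-04-26T00:00:00Z", 1.25, 0),
                 (b"2020-04-26T00:30:00Z", -0.75, 1)], dtype=rec)

# In-memory HDF5 file with NEON-style nested groups
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    d = f.create_dataset("GRSM/dp04/data/fluxCo2/nsae", data=data)
    # The dataset as a whole is not numeric, so a reader wanting a
    # plain Float64 buffer fails ("not convertible to buffer data type"):
    print(d.dtype.names)      # the compound type's field names
    # Individual fields, however, read as ordinary numeric arrays:
    flux = d.fields("flux")[:]
    print(flux.dtype, flux)
```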
These commits are probably going to be replaced, by the way.
rhdf5 will subset this file remotely rather well. Couldn't get this to work with the others.

```r
public_S3_url <- "https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5"
# listGrp <- rhdf5::h5ls(file = public_S3_url, s3 = TRUE) # slow!
# accessing a given array is fast though
nee <- rhdf5::h5read(file = public_S3_url,
                     name = glue::glue("{site}/dp04/data/fluxCo2/turb",
                                       site = "GRSM"),
                     s3 = TRUE)
```

(rhdf5 can't handle the stupid case where these are gzipped; I know /vsigzip/ can be slow, but it works here. Still, you're probably right that gdal mdiminfo isn't ideal for this!)

In other news, I'd love to see proxy support for stars::read_mdim, e.g. for large Zarr archives, so we can do R examples like https://planetarycomputer.microsoft.com/dataset/daymet-daily-na#Example-Notebook
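On the gzipped case: since the current-format files are shipped as .h5.gz, one generic stdlib workaround (a sketch, not specific to rhdf5 or GDAL; the helper name is invented) is to decompress to a temporary file before handing it to any HDF5 reader:

```python
import gzip
import shutil
import tempfile

def gunzip_to_tmp(gz_path: str) -> str:
    """Decompress gz_path to a temporary .h5 file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
    with gzip.open(gz_path, "rb") as src, tmp:
        # stream-copy so large files are not read into memory at once
        shutil.copyfileobj(src, tmp)
    return tmp.name

# e.g. h5py.File(gunzip_to_tmp("NEON....h5.gz"), "r")
```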
What would you like to see beyond the ability to read one or more slices, or sub-cubes (currently in
Thanks, and apologies, as I should probably have opened an issue or stackoverflow post since this could just be my lack of understanding. One part of this is just ergonomics. I imagine I can crop spatial extents by setting the appropriate combination of count and offset, and downsample with step, but this doesn't match the syntax we have elsewhere in stars to do things like crop, right? What about things like aggregation? In python, it appears the

I think my idealized workflow would actually look something like:

```r
library(spData)
box <- us_states[us_states$NAME == "California", ] |> st_transform(4326)
library(stars)
url <- "https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr"
# desired but not functional syntax:
x <- read_stars(url) |> st_crop(box)
```

and have a lazy-read object where I can slice times (or other dimensions) on indices. Perhaps my expectations here are just entirely unreasonable. Yes, I know we currently need the GDAL decorators to even get `x <- read_stars(url)` to parse:

```r
dsn <- glue::glue('ZARR:"/vsicurl/{url}":/tmax')
tmax <- read_stars(dsn)
```

This works, but already confuses my students -- this isn't the way I previously taught them to specify dimensions (tmax) in read_stars, and the prefix/quotation syntax causes trouble. I understand that mdim is also the preferred interface for Zarr, but with syntax like

```r
x <- read_mdim(dsn, count = c(NA, NA, 492), step = c(4, 4, 12))
```

the ergonomics are again hard, because we hardwire into code numerical values that could be obtained from the object metadata but are otherwise not immediately obvious to the user. But this also gives me a bunch of warnings:

and seems to have an invalid crs which I can't override:

and on attempting to plot, it crashes on a platform with 100 GB RAM.
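On the hardwired numbers: the offset/count pair that a spatial crop needs can be derived from a dimension's coordinate values rather than typed in by hand. A hedged sketch of that arithmetic in Python/NumPy (the grid and bounds below are invented for illustration, roughly a longitude slice over California):

```python
import numpy as np

def bbox_to_offset_count(coords: np.ndarray, lo: float, hi: float):
    """Return (offset, count) selecting the coords that fall in [lo, hi]."""
    idx = np.nonzero((coords >= lo) & (coords <= hi))[0]
    if idx.size == 0:
        return 0, 0
    # offset = first matching index; count = length of the contiguous run
    return int(idx[0]), int(idx[-1] - idx[0] + 1)

# a regular 0.25-degree longitude grid (invented for illustration)
lon = np.arange(-125.0, -100.0, 0.25)
off, cnt = bbox_to_offset_count(lon, -124.4, -114.1)
```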
This gives us

although it still needs a fix if the numeric data is not Float64. I'll come back to your other comments; I agree, interfaces and usability are important.
The following fails with error:
But the same array is easily located by gdalmdiminfo, so I don't understand why this error occurs:
Testing using stars 0.6.4 with libraries: