Saving and loading arrays with boolean eltypes #189

Open
sethaxen opened this issue Oct 26, 2022 · 3 comments

Comments

@sethaxen

As far as I can tell, netCDF does not support a boolean eltype, so boolean variables need to be written as integers. I'm working with some netCDF files saved using the netcdf library via xarray, which seems to handle this by saving boolean data as integers with the attribute dtype="bool". Is there an option to tell NCDatasets during load time to set the eltype based on an attribute like this?

@Alexander-Barth
Owner

Can you give me some example code in python-xarray that writes and reads such a Boolean array?

@sethaxen
Author

Can you give me some example code in python-xarray that writes and reads such a Boolean array?

Sure, here's an example:

>>> import xarray as xr
>>> import numpy as np
>>> x = np.random.normal(size=(4, 100)) > 0
>>> ds = xr.Dataset(
...     data_vars=dict(x=(['chain', 'draw'], x)),
...     coords=dict(chain=range(4), draw=range(100)),
... )
>>> ds.x.dtype
dtype('bool')
>>> ds.to_netcdf("foo.nc")
>>> ds2 = xr.open_dataset("foo.nc")
>>> ds2.x.dtype
dtype('bool')
>>> np.array_equal(ds.x, ds2.x)
True

When we load foo.nc using NCDatasets, we can see the dtype attribute:

julia> using NCDatasets

julia> ds = NCDataset("foo.nc");

julia> ds["x"]
x (100 × 4)
  Datatype:    Int8
  Dimensions:  draw × chain
  Attributes:
   dtype                = bool
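
The convention visible in this output (Int8 data plus a dtype attribute) can be sketched in plain Python, independent of xarray and NCDatasets. This is only an illustration of the idea, not xarray's actual implementation; the helper names to_stored and from_stored are hypothetical, and it assumes the marker is exactly the attribute dtype with the value "bool".

```python
from array import array


def to_stored(flags):
    """Encode booleans as signed 8-bit integers plus a marker attribute."""
    # Typecode "b" is a signed char, i.e. one byte per element (like Int8).
    data = array("b", (1 if f else 0 for f in flags))
    attrs = {"dtype": "bool"}  # assumed marker, as seen in the output above
    return data, attrs


def from_stored(data, attrs):
    """Return booleans when the marker attribute is present, else plain ints."""
    if attrs.get("dtype") == "bool":
        return [bool(v) for v in data]
    return list(data)


data, attrs = to_stored([True, False, True])
print(data.itemsize, list(data))   # storage size per element and raw values
print(from_stored(data, attrs))    # a reader that honours the attribute
print(from_stored(data, {}))       # a reader that ignores it sees integers
```

A reader that does not know about the attribute still gets valid integer data, which is what NCDatasets shows here.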

@Alexander-Barth
Owner

Alexander-Barth commented Nov 14, 2022

Thanks for the example!

When reading the variable with the python-NetCDF4 package, it seems that the variable is also returned as an integer. I am not aware of any other package (Matlab, Octave or R) that treats the attribute dtype in a special way. dtype is also not mentioned in the CF standard, which I aim to follow.

This reminds me of the discussion about _Unsigned = "true": it was introduced before NetCDF had real unsigned types (now we have them, at least for the HDF5 format), but it led to inconsistencies and errors. Some of these issues have since been fixed by adding the unsigned data types to OPeNDAP as well.

It is also not quite clear to me how to handle the _FillValue, valid_min, valid_max and valid_range attributes in this case, when the dtype attribute modifies the element type of an array.
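
To make the ambiguity concrete, here is a small Python sketch of one possible policy: apply _FillValue on the stored integers first, and only then convert to booleans. The function name and the choice to represent missing values as None are assumptions made for illustration; neither xarray nor NCDatasets is claimed to behave this way.

```python
def decode_bool_with_fill(values, attrs):
    """Decode stored integers to booleans, masking the fill value first."""
    fill = attrs.get("_FillValue")
    decoded = []
    for v in values:
        if fill is not None and v == fill:
            # A plain boolean has no spare value for "missing",
            # so the result type has to widen (here: bool or None).
            decoded.append(None)
        else:
            decoded.append(bool(v))
    return decoded


raw = [0, 1, -127, 1]                           # hypothetical stored values
attrs = {"dtype": "bool", "_FillValue": -127}   # hypothetical attributes
print(decode_bool_with_fill(raw, attrs))
```

If the conversion to booleans happened before masking, the fill value -127 would silently turn into a true value, so the order of the two steps matters. The same question arises for valid_min, valid_max and valid_range, which are expressed in terms of the stored integer type.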

Unfortunately, h5py implements boolean types in a way that is incompatible with xarray's (it uses enums).

So I don't think that we should import this xarray-specific extension into NCDatasets.

Maybe we can provide an API so that the user can implement specific encoding/decoding functions, like

function transformation(v::NCDatasets.Variable)
    if get(v.attrib, "dtype", "") == "bool"
        # return the (encode, decode) pair of functions
        return x -> Int8(x), x -> Bool(x)
    else
        return identity, identity
    end
end

Would this be worth the effort?
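
For readers coming from the Python side, the shape of such a hook can be sketched as follows. This is a rough analogue only: it takes a plain attribute dict instead of an NCDatasets variable, and nothing here is an existing API of either library.

```python
def transformation(attrs):
    """Return an (encode, decode) pair chosen from a variable's attributes."""
    if attrs.get("dtype", "") == "bool":
        # encode: bool -> integer for storage; decode: integer -> bool
        return (lambda x: int(x)), (lambda x: bool(x))

    def identity(x):
        return x

    return identity, identity


encode, decode = transformation({"dtype": "bool"})
stored = [encode(b) for b in [True, False, True]]   # what would be written
restored = [decode(i) for i in stored]              # what the user gets back
print(stored, restored)
```

Keeping the pair user-supplied means the library itself stays neutral about the dtype attribute, while users who depend on xarray-written files can opt in.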

The true fix would be to add a native boolean type to NetCDF/HDF5. Is there any feature request about this?
