
Unable to open binary data with xm.open_mdsdataset #296

Open
SOFIADAL opened this issue Jan 11, 2022 · 5 comments


@SOFIADAL

Hello,
I have recently tried to open MITgcm binary data with the xmitgcm package in IPython, but I am getting an AssertionError and I have no idea where it comes from. An example pair of temperature data files is named like this:
TT_in_R.0007730880.data
TT_in_R.0007730880.meta

and when I use the low-level utilities to look at the contents of the files (in IPython), I get the following:

TOTF = mds.rdmds('TT_in_R.0007730880')
TOTF
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

Similarly for the .meta file:

TOTm = mds.readmeta('TT_in_R.0007730880.meta')
TOTm
((1999, 1999),
 [0, 0],
 [1999, 1999],
 [7730880],
 None,
 None,
 {'nDims': [2],
  'format': ['float32'],
  'nrecords': [1],
  'dimList': [1999, 1999]})

However, when I try to open the same file with xm.open_mdsdataset, I get the following error:

datatot= xm.open_mdsdataset(data_dir='/home/PROGRAMS/MLHB_DATA',prefix=['TT_in_R'],iters=[7730880],delta_t=90,ref_date='1979-01-15 00:00')

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-47-b8dba3ae9da7> in <module>
----> 1 datatot= xm.open_mdsdataset(data_dir='/home/sofi/Documents/POSTDOC_ATHENS/PROGRAMS/MLHB_DATA',prefix=['TT_in_R'],iters=[7730880],delta_t=90,ref_date='1979-01-15 00:00')

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/mds_store.py in open_mdsdataset(data_dir, grid_dir, iters, prefix, read_grid, delta_t, ref_date, calendar, levels, geometry, grid_vars_to_coords, swap_dims, endian, chunks, ignore_unknown_vars, default_dtype, nx, ny, nz, llc_method, extra_metadata, extra_variables)
    271                 return ds
    272 
--> 273     store = _MDSDataStore(data_dir, grid_dir, iternum, delta_t, read_grid,
    274                           prefix, ref_date, calendar,
    275                           geometry, endian,

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/mds_store.py in __init__(self, data_dir, grid_dir, iternum, delta_t, read_grid, file_prefixes, ref_date, calendar, geometry, endian, ignore_unknown_vars, default_dtype, nx, ny, nz, llc_method, levels, extra_metadata, extra_variables)
    583         for p in prefixes:
    584             # use a generator to loop through the variables in each file
--> 585             for (vname, dims, data, attrs) in \
    586                     self.load_from_prefix(p, iternum, extra_metadata):
    587                 # print(vname, dims, data.shape)

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/mds_store.py in load_from_prefix(self, prefix, iternum, extra_metadata)
    658 
    659         try:
--> 660             vardata = read_mds(basename, iternum, endian=self.endian,
    661                                llc=self.llc, llc_method=self.llc_method,
    662                                extra_metadata=extra_metadata, chunks=chunks)

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/utils.py in read_mds(fname, iternum, use_mmap, endian, shape, dtype, use_dask, extra_metadata, chunks, llc, llc_method, legacy)
    205     # get metadata
    206     try:
--> 207         metadata = parse_meta_file(metafile)
    208         nrecs, shape, name, dtype, fldlist = \
    209             _get_useful_info_from_meta_file(metafile)

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/utils.py in parse_meta_file(fname)
     49     needed_keys = ['dimList', 'nDims', 'nrecords', 'dataprec']
     50     for k in needed_keys:
---> 51         assert k in flds
     52     # transform datatypes
     53     flds['nDims'] = int(flds['nDims'])

AssertionError: 

I suspect this AssertionError has something to do with the prefix definition, as I have also tried running the command specifying only data_dir and prefix.

In addition, I tried including a "." at the end of the filename in prefix and got the following:

datatot= xm.open_mdsdataset(data_dir='/home/PROGRAMS/MLHB_DATA',prefix=['TT_in_R.'],iters=[7730880],delta_t=90,ref_date='1979-01-15 00:00')
datatot
<xarray.Dataset>
Dimensions:  (time: 1, XC: 2000, YC: 2000, XG: 2000, YG: 2000, Z: 50, Zp1: 51, Zu: 50, Zl: 50)
Coordinates: (12/29)
    iter     (time) int64 7730880
  * time     (time) datetime64[ns] 2001-02-01
  * XC       (XC) >f4 30.01 30.02 30.03 30.04 30.05 ... 0.0 0.0 0.0 0.0 0.0
  * YC       (YC) >f4 10.01 10.02 10.03 10.04 10.05 ... 0.0 0.0 0.0 0.0 0.0
  * XG       (XG) >f4 30.01 30.02 30.03 30.04 30.05 ... 0.0 0.0 0.0 0.0 0.0
  * YG       (YG) >f4 10.01 10.02 10.03 10.04 10.05 ... 0.0 0.0 0.0 0.0 0.0
    ...       ...
    hFacC    (Z, YC, XC) >f4 dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
    hFacW    (Z, YC, XG) >f4 dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
    hFacS    (Z, YG, XC) >f4 dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
    maskC    (Z, YC, XC) bool dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
    maskW    (Z, YC, XG) bool dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
    maskS    (Z, YG, XC) bool dask.array<chunksize=(50, 2000, 2000), meta=np.ndarray>
Data variables:
    *empty*
Attributes:
    Conventions:  CF-1.6
    title:        netCDF wrapper of MITgcm MDS binary data
    source:       MITgcm
    history:      Created by calling `open_mdsdataset(grid_dir=None, iters=[7...

It seems that it reads all the grid files but finds no variable to read.
Does anyone have any ideas? Does anyone know whether I can add the missing attributes to an MITgcm file using Python?

Thank you in advance for your time and help,
Sofi

@rabernat
Member

Thanks for sharing @SOFIADAL!

Can you post the raw contents of TT_in_R.0007730880.meta? For example, you could run cat TT_in_R.0007730880.meta from the command line to get this.

Was this file generated by MITgcm or some other process?

@SOFIADAL
Author

SOFIADAL commented Jan 12, 2022

Thanks for answering, @rabernat.
Here are the contents of the file:

cat TT_in_R.0007730880.meta
 nDims = [   2 ];
 dimList = [
      1999,    1,  1999,
      1999,    1,  1999,
          ];
 format = [ 'float32' ];
 nrecords = [     1 ];
 timeStepNumber = [    7730880 ];

So the files were generated from an MITgcm model, but they are a reprocessing of basic model output done by someone else, so perhaps the final writing of the files was not done properly? Also, this data comes from a regional model, not a global one. According to the person who generated them, these files open without problems in MATLAB using commands similar to the low-level utility mds.rdmds; however, the Python command I showed above fails.
This file is not just a temperature file; it is an averaged mixed-layer temperature variable, so I am wondering two things:

  1. Is there a way to add a name to the variable in the file, which currently does not exist?
  2. Is it a problem that this variable is not listed in the available_diagnostics.log file?

I uploaded the file to my Dropbox if you have time to take a look (can you let me know whether you have access to it?):
https://www.dropbox.com/sh/hwkz4kw4eztxzw1/AADASpFBO3TAXVj8Mi5mXzREa?dl=0

@SOFIADAL
Author

SOFIADAL commented Jan 12, 2022

I think I might have found the problem. The meta file used format where xmitgcm expects dataprec, and there was an extra "," at the end of dimList; this is what created the AssertionError. I manually modified the file, added a few parameters, and now I have this:

cat TT_in_R.0007730880.meta
 nDims = [   2 ];
 dimList = [
  1999,    1,  1999,
  1999,    1,  1999
];
 dataprec = [ 'float32' ];
 nrecords = [     1 ];
 timeStepNumber = [    7730880 ];
 nFlds = [    1 ];
 fldList = {'TOTTTEND'};
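
For anyone hitting the same thing with many files, a small script along these lines could automate the edit (a sketch, not part of xmitgcm; it assumes the only problems are the format key and the trailing comma in dimList, so nFlds and fldList would still need adding by hand):

```python
import re

def fix_meta(path):
    """Rewrite a .meta file so xmitgcm's parse_meta_file accepts it.

    Assumes the only problems are the 'format' key (xmitgcm expects
    'dataprec') and a trailing comma before the closing ']' of dimList.
    """
    with open(path) as f:
        text = f.read()
    # xmitgcm's needed keys are dimList, nDims, nrecords, dataprec
    text = text.replace("format", "dataprec")
    # drop a dangling comma just before '];'
    text = re.sub(r",(\s*\]\s*;)", r"\1", text)
    with open(path, "w") as f:
        f.write(text)
```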

However, I now have another issue, which I suspect has to do with the way these files were created. When I try to open the file again, I get the following error:

In [32]: datatot= xm.open_mdsdataset(data_dir='/home/sofi/Documents/POSTDOC_ATHENS/PROGRAMS/MLHB_DATA',prefix=['TT_in_R'],iters=[7730880],delta_t=90,ref_date='1979-01-15 00:00')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-b8dba3ae9da7> in <module>
----> 1 datatot= xm.open_mdsdataset(data_dir='/home/sofi/Documents/POSTDOC_ATHENS/PROGRAMS/MLHB_DATA',prefix=['TT_in_R'],iters=[7730880],delta_t=90,ref_date='1979-01-15 00:00')

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xmitgcm/mds_store.py in open_mdsdataset(data_dir, grid_dir, iters, prefix, read_grid, delta_t, ref_date, calendar, levels, geometry, grid_vars_to_coords, swap_dims, endian, chunks, ignore_unknown_vars, default_dtype, nx, ny, nz, llc_method, extra_metadata, extra_variables)
    280                          extra_variables=extra_variables)
    281 
--> 282     ds = xr.Dataset.load_store(store)
    283     if swap_dims:
    284         ds = _swap_dimensions(ds, geometry)

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xarray/core/dataset.py in load_store(cls, store, decoder)
    772         if decoder:
    773             variables, attributes = decoder(variables, attributes)
--> 774         obj = cls(variables, attrs=attributes)
    775         obj.set_close(store.close)
    776         return obj

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xarray/core/dataset.py in __init__(self, data_vars, coords, attrs)
    752             coords = coords.variables
    753 
--> 754         variables, coord_names, dims, indexes, _ = merge_data_and_coords(
    755             data_vars, coords, compat="broadcast_equals"
    756         )

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xarray/core/merge.py in merge_data_and_coords(data, coords, compat, join)
    482     explicit_coords = coords.keys()
    483     indexes = dict(_extract_indexes_from_coords(coords))
--> 484     return merge_core(
    485         objects, compat, join, explicit_coords=explicit_coords, indexes=indexes
    486     )

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
    638     assert_unique_multiindex_level_names(variables)
    639 
--> 640     dims = calculate_dimensions(variables)
    641 
    642     coord_names, noncoord_names = determine_coords(coerced)

/opt/anaconda3/3.8.10/envs/MHWENV/lib/python3.9/site-packages/xarray/core/dataset.py in calculate_dimensions(variables)
    206                 last_used[dim] = k
    207             elif dims[dim] != size:
--> 208                 raise ValueError(
    209                     f"conflicting sizes for dimension {dim!r}: "
    210                     f"length {size} on {k!r} and length {dims[dim]} on {last_used!r}"

ValueError: conflicting sizes for dimension 'j': length 1999 on 'TOTTTEND' and length 2000 on {'i': 'i', 'i_g': 'i_g', 'j': 'j', 'j_g': 'j_g', 'k': 'k', 'k_u': 'k_u', 'k_l': 'k_l', 'k_p1': 'k_p1', 'time': 'iter'}

I suspect this means that the file has different dimensions from the grid files xmitgcm reads automatically, right?

@rabernat
Member

I'm glad you were able to make progress. You're correct that the meta file was not formatted the way xmitgcm expects. We could imagine trying to update our code to be more flexible. In the meantime, you found a workaround by manually correcting the meta file formatting.

I suspect this means that the file has different dimensions from the grid files xmitgcm reads automatically, right?

This is also my interpretation. If you look at your meta file:

 dimList = [
      1999,    1,  1999,
      1999,    1,  1999,
          ];

and compare it to XC.meta, you will almost certainly find different sizes (i.e. 2000 instead of 1999). Xmitgcm (and Xarray more generally) requires all variables in a dataset to have the same size along shared dimensions.

Do you have any idea why they are different sizes?

One workaround would be to bypass xmitgcm's automatic inference of the grid shape and explicitly specify the shape you want, something like this:

data_dir = '/home/sofi/Documents/POSTDOC_ATHENS/PROGRAMS/MLHB_DATA'
datatot = xm.open_mdsdataset(
    data_dir,
    prefix=['TT_in_R'],
    iters=[7730880],
    read_grid=False,
    nx=1999,
    ny=1999,
    delta_t=90,
    ref_date='1979-01-15 00:00'
)

If you want to merge this with any of the rest of your outputs (with nx=2000), you will need to manually extend or pad the data.
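
To sketch what padding could look like (toy arrays standing in for the real field; the dimension names are assumed, not taken from your files):

```python
import numpy as np
import xarray as xr

# a 1999 x 1999 field standing in for TOTTTEND
field = xr.DataArray(np.ones((1999, 1999)), dims=["j", "i"])

# append one NaN row and one NaN column so the field lines up
# with a 2000-point grid along both dimensions
padded = field.pad(j=(0, 1), i=(0, 1), constant_values=np.nan)
```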

@SOFIADAL
Author

SOFIADAL commented Jan 12, 2022

Thanks for the help!
I do not yet have the answer as to why the grid size differs from the size of the variable. I suspect it is because this output comes from a mixed-layer heat budget, and this particular term is the total (mixed-layer) temperature tendency, so when the term is calculated you "lose" one grid point, maybe because of the derivative?

What do you mean by manually "pad" the data?
Also, if I disable the grid reading, does that mean I will lose a lot of the functionality of this file? In other words, will I need to specify more grid details in subsequent code? Isn't there a way to tell Python to just take the first 1999 points of the grid files? I am a beginner at handling binary MITgcm data.
