Silently fails when providing incorrect schema for WEIGHT column #268

Closed
landmanbester opened this issue Oct 31, 2022 · 1 comment · Fixed by #329

@landmanbester
Collaborator

  • dask-ms version: 0.2.14
  • Python version: 3.8
  • Operating System: Ubuntu 20.04

Description

I just noticed that xds_from_ms silently fails to add the WEIGHT column to the dataset if the value given for 'dims' in the provided schema is not a tuple. Strangely, supplying a schema also overwrites the dimension names of known columns like FLAG and DATA, regardless of whether 'dims' is a tuple or not. Everything works as expected when no schema is provided.

What I Did

Running

from daskms import xds_from_ms
schema = {}
schema['WEIGHT'] = {'dims': ('corr')}  # note the mistake here: ('corr') is just the str 'corr', but 'dims' should be a tuple
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)

will produce

In [38]: xds
Out[38]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, FLAG-1: 32, FLAG-2: 4, DATA-1: 32, DATA-2: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, FLAG-1, FLAG-2, DATA-1, DATA-2
 Data variables:
     FLAG     (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
     DATA     (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

Note the dimension names of DATA and FLAG. When giving schema a tuple for 'dims' i.e.

schema['WEIGHT'] = {'dims': ('corr',)}
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)

we get

In [41]: xds
Out[41]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, FLAG-1: 32, FLAG-2: 4, corr: 4, DATA-1: 32, DATA-2: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, FLAG-1, FLAG-2, corr, DATA-1, DATA-2
 Data variables:
     FLAG     (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
     WEIGHT   (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
     DATA     (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

Now WEIGHT is there but DATA and FLAG still have the wrong dimension names. If no schema is given, we get

In [43]: xds
Out[43]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, chan: 32, corr: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, chan, corr
 Data variables:
     FLAG     (row, chan, corr) bool dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
     WEIGHT   (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
     DATA     (row, chan, corr) complex64 dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

which has everything as expected. This is fairly low priority but I thought I would report it anyway. dask-ms should either throw an error if 'dims' is not a tuple or just convert it to a tuple. Either way, the dimension names of known columns should not be altered.
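
As a rough sketch of the "throw an error or convert it to a tuple" option, something like the following could be applied to each schema entry before it is used. The helper name normalise_dims is hypothetical and is not part of dask-ms; this is only an illustration of the idea, not how the library handles it.

```python
def normalise_dims(dims):
    """Hypothetical helper (not a dask-ms function).

    Returns dims as a tuple of str, or raises if that is not possible.
    """
    # A bare string is treated as a single dimension name,
    # rather than being iterated character by character.
    if isinstance(dims, str):
        return (dims,)
    # Lists and tuples of strings are accepted and converted to a tuple.
    if isinstance(dims, (list, tuple)) and all(isinstance(d, str) for d in dims):
        return tuple(dims)
    raise TypeError(f"'dims' must be a tuple of str, got {dims!r}")


# The schema from the reproducer, including the ('corr') mistake.
schema = {"WEIGHT": {"dims": ("corr")}}

# Normalise every entry before passing the schema on.
fixed = {
    col: {**spec, "dims": normalise_dims(spec["dims"])}
    for col, spec in schema.items()
}
print(fixed)  # {'WEIGHT': {'dims': ('corr',)}}
```

Converting is the more forgiving choice, while raising for anything that is not a str, list or tuple avoids the silent failure described above.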

@sjperkins
Member

Thanks for the very thorough reproducer @landmanbester.

Without digging into the code in too much detail, I'd speculate that ("corr") evaluates to the plain string "corr", which is then treated as an Iterable, so the dims end up being evaluated as ("c", "o", "r", "r"). I'd need to dig more to understand why FLAG and DATA aren't getting assigned the default dimension names in this case.
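
The Python behaviour behind that speculation can be checked in isolation. This is plain Python and makes no claim about what dask-ms does internally:

```python
# Parentheses without a trailing comma do not create a tuple.
dims = ("corr")
print(type(dims).__name__)  # str

# Iterating over a string yields its characters one at a time.
as_tuple = tuple(dims)
print(as_tuple)  # ('c', 'o', 'r', 'r')

# The trailing comma is what makes a one-element tuple.
one_tuple = ("corr",)
print(type(one_tuple).__name__)  # tuple
print(len(one_tuple))  # 1
```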

Probably related:
