Use a neutral format to have lossless interface with JSON, scipp, Astropy, pandas #8927

Open
loco-philippe opened this issue Apr 11, 2024 · 4 comments

Comments

@loco-philippe
Contributor

Is your feature request related to a problem?

Each tool has its own structure for processing multidimensional data, with the following consequences:

  • interfaces are dedicated to each tool,
  • only part of the data is carried across an interface,
  • there is no unified representation of data structures.

Describe the solution you'd like

The proposed format (see jupyter notebook, github repository, PyPI package) is based on the following principles:

  • neutral format usable by tabular or multidimensional tools (e.g. NumPy, pandas, xarray, scipp, astropy),
  • support for a wide variety of data types, as defined in the NTV format,
  • high interoperability: reversible (lossless round-trip) interface with tabular or multidimensional tools,
  • reversible and compact JSON format,
  • ease of sharing and exchanging multidimensional and tabular data.

Describe alternatives you've considered

No response

Additional context

numpy/numpy#12481 (comment)
astropy/astropy#16286
scipp/scipp#3422


welcome bot commented Apr 11, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas
Contributor

It's not clear to me what changes you're asking for in xarray. If you want to create a new on-disk storage format you can, and you can teach xarray to read it using the backend entrypoint system. Are you asking for something that falls outside of that framework?

@loco-philippe
Contributor Author

Thank you @TomNicholas for your quick response.

Currently, the interfaces between Xarray and other multidimensional tools such as scipp or NDData only process part of the data, because the internal structures of each tool are different.

To have reversible ("lossless round-trip") interfaces, it is necessary to define a common data structure and a mapping between this structure and the structure of the tool (here Xarray).

This is what the proposed format defines and what the indicated package implements. It shows, for example, that an Xarray Dataset can be converted to a Scipp Dataset and back, or to JSON data and back, without loss.
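To make the idea of a common structure plus per-tool mappings concrete, here is a toy sketch using only the Python standard library. It is not the NTV format or the API of the indicated package: the `NeutralArray` class, its fields, and the `to_json` / `from_json` helpers are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NeutralArray:
    """Hypothetical tool-independent description of one labelled array."""

    name: str
    dims: list            # dimension names, e.g. ["x", "y"]
    shape: list           # length of each dimension
    dtype: str            # dtype kept as text so it survives JSON
    data: list            # flat, row-major values
    attrs: dict = field(default_factory=dict)


def to_json(arrays):
    """Serialize a list of NeutralArray objects to a JSON string."""
    return json.dumps([asdict(a) for a in arrays], sort_keys=True)


def from_json(text):
    """Rebuild the list of NeutralArray objects from a JSON string."""
    return [NeutralArray(**item) for item in json.loads(text)]
```

With a structure like this, each tool would only need one mapping to and from `NeutralArray`, and a round trip is lossless when `from_json(to_json(arrays)) == arrays` holds for everything the tool can express.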

To be clearer, my requests for Xarray are as follows:

  • does Xarray wish to participate in the definition (or validation) of this common data structure (so as to ensure that it covers all the developments envisaged for Xarray)?
  • is Xarray interested in integrating the interface defined towards this structure (or is it better to include it in a third party)?
  • is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?
  • does Xarray have use cases associated with interfaces between tools (or is this a topic better suited to Xarray Discussions)?

@TomNicholas
Contributor

TomNicholas commented Apr 11, 2024

Thanks for the clarification @loco-philippe.

> Xarray Dataset can be transformed reversibly into a Scipp Dataset and vice versa

That's cool to know!

I'll attempt to answer these questions, but others feel free to correct me.

> does Xarray wish to participate in the definition (or validation) of this common data structure (so as to ensure that it covers all the developments envisaged for Xarray)?
>
> is Xarray interested in integrating the interface defined towards this structure (or is it better to include it in a third party)?

I don't really think we need to be active participants until you ask for a specific change in xarray. Our data model is well-defined, and we would need a very good reason to change it.

> is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?

Note that xarray maps well to the zarr format, which already stores all metadata in JSON files. If the numerical data arrays themselves can also be serialized to JSON (e.g. through numpy/numpy#12481), then you have a JSON representation of an entire xarray.Dataset right there.

> is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?

Xarray deliberately tries to make it easy for third parties to write code to serialize/deserialize to any data format they like. Again, see our backend entrypoint system. I don't see a need to add any Dataset.to_new_format() or open_new_format_as_dataset functions to xarray, because these can live in your third-party library (possibly as a BackendEntrypoint subclass). If the new format becomes popular, we could consider accepting a PR.
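As a rough sketch of what such a third-party reader could look like, the example below subclasses `xarray.backends.BackendEntrypoint` and reads a JSON file produced from `Dataset.to_dict()`. The class name `JsonDictBackendEntrypoint`, the `write_json` helper, and the choice of the `to_dict()` layout as the on-disk format are assumptions made for illustration; this is not the format proposed in this issue, and `to_dict()` does not preserve every dtype.

```python
import json

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


def write_json(ds, path):
    """Hypothetical writer: dump Dataset.to_dict() to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ds.to_dict(), f)


class JsonDictBackendEntrypoint(BackendEntrypoint):
    """Hypothetical backend for JSON files written by write_json()."""

    description = "Illustrative reader for JSON produced by Dataset.to_dict()"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Load the nested dict and let xarray rebuild the Dataset from it.
        with open(filename_or_obj, encoding="utf-8") as f:
            payload = json.load(f)
        ds = xr.Dataset.from_dict(payload)
        if drop_variables is not None:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        # Very rough heuristic based on the file extension only.
        return str(filename_or_obj).endswith(".json")
```

A user could then pass the class as the engine, for example `xr.open_dataset("demo.json", engine=JsonDictBackendEntrypoint)`, or the library could register it under the `xarray.backends` entry point group so that a short engine name works.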
