Data format(s) #1

Open
guiomar opened this issue Sep 30, 2020 · 10 comments

Comments

@guiomar

guiomar commented Sep 30, 2020

Hi!

I'll share here some of the discussions we have had on the Google Doc related to BEP021, so we can keep track of them and they don't get forgotten once we resolve the comments.

One of the first things we need to agree on is which data format(s) we should use to store the matrices that result from data preprocessing.

This is what we currently have:

.mat
PROS:
-Open specification
-Well supported I/O in both Matlab and Python
CONS:
-Proprietary format
-Allows for highly complex data structures that might need further documentation
-v7.3, which is based on the HDF5 format (not proprietary), is not supported by SciPy's .mat reader in Python

.npy
PROS:
-Open specification
-Well supported I/O in Python and C++
-Allows only n-dimensional arrays, limited complexity and thus not easily abused
CONS:
-Experimental support for Matlab
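
To make the "open specification, limited complexity" point concrete, here is a minimal sketch of the NPY version 1.0 layout as described in NumPy's format documentation: a magic string, a version, a header length, a small dictionary header, then raw array bytes. It uses only the standard library and handles only little-endian float64 in C order. The helper names `write_npy_v1` and `read_npy_v1` are hypothetical, and this is an illustration rather than a replacement for `numpy.save` / `numpy.load`.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"  # 6-byte magic string that opens every .npy file


def write_npy_v1(values, shape):
    """Serialize a flat list of floats as a version 1.0 .npy byte string."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': %s, }" % (tuple(shape),)
    # Pad with spaces and end with a newline so the data starts on a 64-byte boundary.
    preamble_len = len(MAGIC) + 2 + 2  # magic + version bytes + uint16 header length
    pad = -(preamble_len + len(header) + 1) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    return (
        MAGIC
        + bytes([1, 0])                           # format version 1.0
        + struct.pack("<H", len(header_bytes))    # little-endian header length
        + header_bytes
        + struct.pack(f"<{len(values)}d", *values)
    )


def read_npy_v1(blob):
    """Parse the header and data of a version 1.0, '<f8', C-order .npy byte string."""
    if blob[:6] != MAGIC or blob[6:8] != b"\x01\x00":
        raise ValueError("not a version 1.0 .npy payload")
    (header_len,) = struct.unpack_from("<H", blob, 8)
    meta = ast.literal_eval(blob[10:10 + header_len].decode("latin1"))
    count = 1
    for dim in meta["shape"]:
        count *= dim
    data = struct.unpack_from(f"<{count}d", blob, 10 + header_len)
    return meta, list(data)


blob = write_npy_v1([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
meta, data = read_npy_v1(blob)
print(meta["shape"], data)
```

The whole description of the array fits in three header keys (`descr`, `fortran_order`, `shape`), which is why the format leaves little room for undocumented nested structures.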

.txt
PROS:
-Simple and easy I/O
CONS:
-Large memory footprint, inaccurate numeric representation
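
The two .txt drawbacks can be illustrated with a small standard-library sketch. The sample values and the 6-decimal text convention are assumptions chosen for illustration; the point is only that text is either lossy (few digits) or larger than the 8 bytes per value of binary float64 (enough digits to round-trip).

```python
import struct

# Hypothetical sample values standing in for a few entries of a preprocessed matrix.
values = [0.1, 1.0 / 3.0, 123456.789012345, 2.5e-7]

# Binary storage: 8 bytes per value as little-endian float64.
binary = struct.pack(f"<{len(values)}d", *values)

# Text storage with a fixed 6 decimals (an assumed convention): compact but lossy.
short_text = " ".join(f"{v:.6f}" for v in values)
short_restored = [float(token) for token in short_text.split()]

# Text storage with 17 significant digits: enough to round-trip float64, but larger.
full_text = " ".join(f"{v:.17g}" for v in values)
full_restored = [float(token) for token in full_text.split()]

print("binary bytes:  ", len(binary))
print("lossless text: ", len(full_text), "bytes, exact:", full_restored == values)
print("6-decimal text:", len(short_text), "bytes, exact:", short_restored == values)
```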

.h5
-See blog post for detailed discussion

@ChristophePhillips commented: Any chance of using the NIfTI format? It was devised for images but can easily store any type of 2D/3D/4D signals... and it's typically well interfaced.

@arnodelorme commented: Consider adding the .set EEGLAB format and the .vhdr Brain Vision Exchange Format, both of which support the definition of data epochs and are already included in the BIDS definition of raw EEG data.

@guiomar
Author

guiomar commented Sep 30, 2020

See also discussion here:
bids-standard/bids-specification#197

@CPernet
Collaborator

CPernet commented Sep 30, 2020

summary of #197

  1. HDF5 not recommended by most
  2. in general if data can be in the same format as raw then stay in the same format (segment and average don't need change)
  3. change the file format only if the raw format cannot support the result; if so, nifti and cifti, which are already supported in BIDS, can probably cover most cases up to 8D, with the advantage that the first 4 dimensions are fixed (x, y, z, t) and the other dimensions are left to specify
  4. if nifti and cifti don't work, then so far 'we' seem resolved to use .mat and .npy (but then I suggest we also support R if we go down the road of supporting computing-platform formats)
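
The four points above can be read as a decision procedure. A minimal sketch follows; the function name, its parameters, and the returned labels are hypothetical and only mirror the summary, not any BIDS tooling.

```python
def choose_derivative_format(raw_format, raw_can_store_result, fits_nifti_or_cifti):
    """Pick a derivative format following points 1-4 of the summary.

    raw_format           -- extension of the raw data, e.g. ".vhdr" (illustrative)
    raw_can_store_result -- True if the raw format can hold the derivative (point 2)
    fits_nifti_or_cifti  -- True if the result fits an n-dimensional image (point 3)
    """
    if raw_can_store_result:
        # Point 2: segmenting or averaging does not require a new format.
        return raw_format
    if fits_nifti_or_cifti:
        # Point 3: fall back to formats BIDS already supports.
        return "nifti/cifti"
    # Point 4: last resort; HDF5 is never returned, in line with point 1.
    return ".mat or .npy"


print(choose_derivative_format(".vhdr", True, False))
print(choose_derivative_format(".edf", False, True))
print(choose_derivative_format(".edf", False, False))
```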

I think for ephys we should discuss point 4 only, and see if we can agree for all derivatives on points 1/2/3 within issue 197 @robertoostenveld @arnodelorme @sappelhoff

@guiomar
Author

guiomar commented Sep 30, 2020

Thanks a lot @CPernet for the nice summary!!

@ChristophePhillips

To chip in on this, what about the BrainVision data format? OK, it comes from a company, but AFAIK it's open, simple, and sufficiently flexible. Two text files (.vhdr and .vmrk) with the header (i.e. data description) and marker (i.e. any "event") information, plus a simple binary file (multiplexed) with the signals. Easy to read, easy to write.

And it's already accepted in BIDS-EEG.
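
As a rough illustration of the "easy to read" claim, the sketch below parses a shortened, hypothetical header with the standard library and de-multiplexes a tiny binary payload. The header keys shown are recalled from the BrainVision Core Data Format and should be treated as assumptions, as should the file names, the little-endian INT_16 sample type, and the helper names `parse_header` and `demultiplex`.

```python
import configparser
import struct

# Hypothetical, heavily shortened header text; real .vhdr files carry more sections.
VHDR_TEXT = """Brain Vision Data Exchange Header File Version 1.0
; comment line
[Common Infos]
DataFile=sub-01_task-rest_eeg.eeg
MarkerFile=sub-01_task-rest_eeg.vmrk
DataFormat=BINARY
DataOrientation=MULTIPLEXED
NumberOfChannels=2
SamplingInterval=1000

[Binary Infos]
BinaryFormat=INT_16
"""


def parse_header(text):
    """Parse the INI-like body of a header, skipping the identification line."""
    _, _, body = text.partition("\n")
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key capitalization as written in the file
    parser.read_string(body)
    return parser


def demultiplex(raw, n_channels):
    """Split interleaved int16 samples (ch1, ch2, ..., ch1, ch2, ...) per channel."""
    flat = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [list(flat[ch::n_channels]) for ch in range(n_channels)]


header = parse_header(VHDR_TEXT)
n_channels = header.getint("Common Infos", "NumberOfChannels")

# Three time points for two channels, stored multiplexed: (1, 10), (2, 20), (3, 30).
raw = struct.pack("<6h", 1, 10, 2, 20, 3, 30)
channels = demultiplex(raw, n_channels)
print(channels)
```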

@CPernet
Collaborator

CPernet commented Oct 1, 2020

see above: '2 in general if data can be in the same format as raw then stay in the same format'
--> so if you used .vhdr and .vmrk, keep doing so - I don't see where the question is @ChristophePhillips

@CPernet
Collaborator

CPernet commented Jan 27, 2021

oh oh, HDF5 might still be on the table: Teeters, J., Benda, J., Davison, A., Eglen, S., Gerkin, R. C., Grethe, J., … Wark, B. (2016). Requirements for storing electrophysiology data. Retrieved from http://arxiv.org/abs/1605.07673 ... INCF stuff

@dorahermes
Member

It seems like just stating 'HDF5' is underspecified. Moreover, NWB is already accepted (but not supported) in BIDS-iEEG and is based on HDF5.

For parallel compute (and clinical use cases) MEF3 is also accepted in BIDS-iEEG (and I am currently very happy working with MEF3, since it allows me to easily and efficiently work with large iEEG datasets in nice small chunks).

@robertoostenveld
Collaborator

HDF5 is also used by MATLAB, and hence implicitly supported in BIDS-EEG, as that allows for EEGLAB .set datasets (which are .mat files in disguise, and hence HDF5). HDF5 is also used in SNIRF, which is the format considered for BIDS-NIRS https://bids.neuroimaging.io/bep030.

In all cases (nwb, eeglab, snirf) there is a clear specification on top of HDF5 that is defined and maintained outside of the BIDS ecosystem.

@yarikoptic

FWIW, I think HDF5 needs more pros listed under its item in the OD. Some "pros":

  • used as the base format for BRAIN Initiative NWB, which is also used within BIDS and might later be utilized even more in the course of BEP032
  • It gets increasingly valuable to have a format which could be accessed client-side within browsers. There is https://github.com/brainsatplay/webnwb which uses https://github.com/garrettmflynn/hdf5-io
  • supports various compression and chunking schemes for efficient storage and transfer
  • "pros+cons": by itself it is too flexible. It requires establishing a formalized schema on top of it for our purposes here.
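
The chunking and compression point can be sketched without HDF5 itself. The example below is a stand-in that uses Python's `zlib` and `struct` rather than any HDF5 API: samples are split into fixed-size chunks that are compressed independently, so a single sample can be read by decompressing only the chunk that holds it. The helper names and the int16 sample type are assumptions for illustration.

```python
import struct
import zlib


def write_chunks(samples, chunk_len):
    """Compress int16 samples in independent chunks of chunk_len samples each."""
    chunks = []
    for start in range(0, len(samples), chunk_len):
        block = samples[start:start + chunk_len]
        payload = struct.pack(f"<{len(block)}h", *block)
        chunks.append(zlib.compress(payload))
    return chunks


def read_sample(chunks, chunk_len, index):
    """Read one sample by decompressing only the chunk that contains it."""
    chunk_no, offset = divmod(index, chunk_len)
    payload = zlib.decompress(chunks[chunk_no])
    return struct.unpack_from("<h", payload, offset * 2)[0]


# Hypothetical slowly varying signal: 1000 samples, constant within each block of 100.
samples = [i // 100 for i in range(1000)]
chunks = write_chunks(samples, chunk_len=100)

stored_bytes = sum(len(chunk) for chunk in chunks)
print("chunks:", len(chunks), "stored bytes:", stored_bytes, "raw bytes:", 2 * len(samples))
print("sample 257:", read_sample(chunks, 100, 257))
```

This also shows the flip side raised in the last bullet: nothing in the chunks themselves says what the samples mean, so a schema on top is still needed.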

@CPernet
Collaborator

CPernet commented Jul 29, 2023

now formalized into the derivatives guidelines -- link to follow
HDF5 and zarr are supported when keeping the same format as raw, or using tsv, is not possible
