NF: Draft of export/import functionality for long-term dataset storage #190
base: master
Conversation
The goal is "storing a dataset in the long-term", but no dataset attributes (everything in .a) are stored. A common use case, it would seem to me, is storing volumetric datasets (.nii[.gz]), but then all header information (in .a) is lost. That does not seem desirable for achieving the goal. Would it be desirable to store at least part of what is in .a as well? Writing Dataset instances using niml.write stores dataset attributes if they are 'simple' (i.e. array-like), and we could consider implementing similar behavior for storing to npz.
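A 'simple' (array-like) attribute filter along those lines could be sketched as follows. The helper name `simple_attributes` and the selection rule (anything that coerces to a non-object NumPy array) are illustrative assumptions, not PyMVPA or niml.write internals:

```python
import numpy as np

def simple_attributes(attrs):
    """Return the subset of an attribute dict that is 'simple', i.e.
    convertible to a plain (non-object) NumPy array and hence safe to
    store in an NPZ file.  Hypothetical helper, not actual PyMVPA API."""
    storable = {}
    for name, value in attrs.items():
        try:
            arr = np.asarray(value)
        except Exception:
            continue
        # object-dtype arrays would require pickling -- skip them
        if arr.dtype != object:
            storable[name] = arr
    return storable

# An affine matrix and a label list survive; a callable (stand-in for a
# Mapper instance) does not.
attrs = {'affine': np.eye(4), 'labels': ['rest', 'task'], 'mapper': len}
print(sorted(simple_attributes(attrs)))  # -> ['affine', 'labels']
```

This would let an exporter keep NIfTI header bits such as the affine while silently dropping anything that needs serialization.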
Yes, I agree that implementing some support for simple dataset attributes is necessary. However, that wouldn't help the use case you are describing. It would be easy to store the affine from an input NIfTI file, but I don't see immediately what we can do about Mapper storage.
Agreed - even 'simple' stuff like StaticFeatureSelection cannot be directly stored, and ChainMapper stuff becomes even more tricky. So let's go back to the premise: "h5save() is not a good format approach for storing a dataset in the long-term."
HDF5 itself is not a problem -- what we put in it is. We do serialization with an HDF5 storage backend. In order to load these things, we rely on a more or less stable software environment. If we ever break the API significantly, we will not be able to load from our HDF5 containers anymore.
I understand. So the goal is to write Dataset instances in a 'stable' and simple manner, which at the same time seems to require that only 'simple' data structures can be stored (where arrays and lists are 'simple', but Mapper instances are not). It's nice to have a general solution for all Datasets, but I'm going to suggest something less general that may still cover (probably) >90% of the current use cases:
This has the advantage that surface-based and volumetric datasets are already stored in a standard format.
Paraphrasing what you said: we need additional functions to export datasets into formats that are best suited for a specific domain. Doing NIfTI with extensions makes a lot of sense, NIML and GIFTI likewise. But I see that as being in addition to the basic export into NPZ. If you look at the implementation, you can see that it comes at essentially zero cost -- it is just a formal/executable specification. I anticipate an export to NIfTI/NIML/GIFTI to be relatively simple too, hence I don't see why we shouldn't have all of that -- if somebody writes it down ;-)
I may give this a try for NIfTI. For this to work, however, it would be useful if storing .samples (and maybe some fields in .fa) could be switched off. When storing in NIfTI, for example, it would not make much sense to store .samples (as they are in the image), and .fa.voxel_indices is probably somewhat (though not completely) redundant.
@nno Maybe we should have a less generic function than map2nifti that tries to push as much as possible into a NIfTI file -- without relying on any kind of serialization inside. Maybe something like this would work: we inspect a dataset's mapper and condense it into a simplified version that only does the necessary shape transformations and feature selections. This would get stored in a way that doesn't require serialization. My current concept of that would be to come up with custom reduce implementations for the most relevant mappers. Static feature selection would be stored as a mapper label (which mapper is this) and the respective index array; likewise for boxcar (and maybe flattening too). With this mapper we could forward-map samples from the regular NIfTI array upon load. Dataset attributes (fa and sa) could be stored unmapped in NIfTI header extensions and placed back into the attribute collections after loading and mapping the samples.

Once we have this functionality, we can easily place these simplified bits into an HDF5 file too -- without requiring serialization anymore (in most cases). Likewise, we can have storage in NumPy's NPZ format without much additional effort -- in which case we could keep HDF5 an optional dependency (a good thing, given its weight IMHO).

I would not touch the array(object) use case for now -- I don't think it is used much in common scenarios. The only thing that may need a closer look is efficient storage of array(str). HDF5 has a char datatype for datasets, but I can't recall whether that would work nicely out of the box.
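The mapper-condensation idea could look roughly like this. The step labels ('flatten', 'select'), the (label, payload) tuple layout, and the function names are made up for illustration -- this is a sketch of the proposal, not PyMVPA code:

```python
import numpy as np

# Hypothetical condensed mapper spec: a list of (label, payload) steps,
# where each payload is plain array data -- no pickling needed, so the
# representation survives API changes.
def condense_chain():
    return [('flatten', np.array([2, 3])),     # original feature shape (2, 3)
            ('select', np.array([0, 2, 5]))]   # static feature selection

def forward_map(samples, steps):
    """Re-apply condensed mapper steps to samples loaded from disk."""
    for label, payload in steps:
        if label == 'flatten':
            # collapse all feature axes into one
            samples = samples.reshape(len(samples), -1)
        elif label == 'select':
            # apply the stored feature-selection index array
            samples = samples[:, payload]
        else:
            raise ValueError('unsupported mapper label: %s' % label)
    return samples

data = np.arange(12).reshape(2, 2, 3)   # two samples of shape (2, 3)
mapped = forward_map(data, condense_chain())
print(mapped.shape)  # -> (2, 3): flattened, then feature-selected
```

An unsupported or custom mapper would simply have no label and would make the export fail loudly, which seems preferable to silently storing something unloadable.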
However I wonder what then would be suitable representation for such mappers. The flatten mapper has a reshape operation, while static feature selection needs a list of indices. Can we come up with something general enough to cover most use cases? And what to do when a custom or unsupported mapper is used?
You mean storing the mapper in a nifti extension field?
What do you mean by 'unmapped'? As standard numpy arrays?
Absolutely, that would be nice. Some inspiration may come from mvpa2.base.niml, which already attempts to convert most common .fa, .sa and .a to a string or binary representation. That code is quite NIML-specific, but maybe something along those lines can be generalized to cover most use-cases.
Is that not supported well by numpy?
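For the NPZ side at least, fixed-width string arrays (dtype 'U'/'S') round-trip without any special handling; a quick check (the in-memory buffer stands in for a file on disk):

```python
import io
import numpy as np

# NumPy stores fixed-width string arrays natively, so an array(str)
# survives an NPZ round-trip unchanged -- no pickling involved.
labels = np.array(['rest', 'task', 'baseline'])

buf = io.BytesIO()
np.savez_compressed(buf, labels=labels)
buf.seek(0)
restored = np.load(buf)['labels']

assert (restored == labels).all()
print(restored.dtype)  # a fixed-width unicode dtype such as <U8
```

Whether HDF5's char datatype behaves as conveniently is the open question from the comment above; this only shows the NPZ case.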
This is not complete -- a proposal for discussion.
Idea: because of our serialization magic, h5save() is not a good approach for storing a dataset in the long-term.
This proposal provides two functions that can load/save datasets in NumPy's NPZ format (compressed or uncompressed). Naturally, only the samples, sa, and fa collections are stored -- no mappers or other attributes.
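The shape of such a save/load pair could be sketched as below. The key naming scheme ('samples', 'sa_*', 'fa_*') and the function names are illustrative assumptions, not necessarily the layout used by the patch under review:

```python
import io
import numpy as np

def export_npz(fileobj, samples, sa=None, fa=None, compress=True):
    """Store samples plus sa/fa collections as prefixed NPZ arrays.
    Illustrative sketch -- not the actual PyMVPA implementation."""
    arrays = {'samples': np.asarray(samples)}
    for prefix, coll in (('sa', sa or {}), ('fa', fa or {})):
        for name, value in coll.items():
            arrays['%s_%s' % (prefix, name)] = np.asarray(value)
    saver = np.savez_compressed if compress else np.savez
    saver(fileobj, **arrays)

def import_npz(fileobj):
    """Reverse of export_npz: split prefixed keys back into collections."""
    npz = np.load(fileobj)
    samples, sa, fa = None, {}, {}
    for key in npz.files:
        if key == 'samples':
            samples = npz[key]
        elif key.startswith('sa_'):
            sa[key[3:]] = npz[key]
        elif key.startswith('fa_'):
            fa[key[3:]] = npz[key]
    return samples, sa, fa

# Round-trip through an in-memory buffer standing in for a file on disk
buf = io.BytesIO()
export_npz(buf, np.zeros((4, 10)),
           sa={'targets': ['a', 'a', 'b', 'b']},
           fa={'voxel_indices': np.arange(10)})
buf.seek(0)
samples, sa, fa = import_npz(buf)
```

Since every stored value must coerce to a non-object array, anything requiring serialization (mappers, arbitrary .a content) is rejected by construction, which is exactly the stability property the proposal is after.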
Documentation will be completed once discussed.