NF: Draft of export/import functionality for long-term dataset storage #190
base: master
Conversation
The goal is "storing a dataset in the long-term", but no dataset attributes (everything in .a) are stored. A common use case, it would seem to me, is storing volumetric datasets (.nii[.gz]), but then all header information (in .a) is lost. That does not seem desirable for achieving the goal. Would it be desirable to store at least part of what is in .a as well? Writing Dataset instances using niml.write stores dataset attributes if they are 'simple' (i.e. array-like), and we could consider implementing similar behavior for storing to npz.
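A 'simple' (array-like) attribute filter along those lines could be sketched as follows. The helper name `simple_attributes` and the selection rule (anything that coerces to a non-object NumPy array) are illustrative assumptions, not PyMVPA or niml.write internals:

```python
import numpy as np

def simple_attributes(attrs):
    """Return the subset of an attribute dict that is 'simple', i.e.
    convertible to a plain (non-object) NumPy array and hence safe to
    store in an NPZ file.  Hypothetical helper, not actual PyMVPA API."""
    storable = {}
    for name, value in attrs.items():
        try:
            arr = np.asarray(value)
        except Exception:
            continue
        # object-dtype arrays would require pickling -- skip them
        if arr.dtype != object:
            storable[name] = arr
    return storable

# An affine matrix and a label list survive; a callable (stand-in for a
# Mapper instance) does not.
attrs = {'affine': np.eye(4), 'labels': ['rest', 'task'], 'mapper': len}
print(sorted(simple_attributes(attrs)))  # -> ['affine', 'labels']
```

This would let an exporter keep NIfTI header bits such as the affine while silently dropping anything that needs serialization.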
Yes, I agree that implementing some support for simple dataset attributes is necessary. However, that wouldn't help the use case you are describing. It would be easy to store the affine from an input NIfTI file, but I don't see immediately what we can do about Mapper storage.
Agreed - even 'simple' stuff like StaticFeatureSelection cannot be directly stored, and ChainMapper stuff becomes even more tricky. So let's go back to the premise: "h5save() is not a good format approach for storing a dataset in the long-term."
HDF5 itself is not a problem -- what we put in it is. We do serialization with an HDF5 storage backend. In order to load these things, we rely on a more or less stable software environment. If we ever break the API significantly, we will not be able to load from our HDF5 containers anymore.
I understand. So the goal is to write Dataset instances in a 'stable' and simple manner, which at the same time seems to require that only 'simple' data structures can be stored (where arrays and lists are 'simple', but Mapper instances are not). It's nice to have a general solution for all Datasets, but I'm going to suggest something less general that may still cover (probably) >90% of the current use cases:
This has the advantage that surface-based and volumetric datasets are already stored in a standard format.
Paraphrasing what you said: we need additional functions to export datasets into formats that are best suited for a specific domain. Doing NIfTI with extensions makes a lot of sense, NIML and GIFTI likewise. But I see that as being in addition to the basic export into NPZ. If you look at the implementation, you can see that it comes at essentially zero cost -- it is just a formal/executable specification. I anticipate an export to NIfTI/NIML/GIFTI to be relatively simple too, hence I don't see why we shouldn't have all of that -- if somebody writes it down ;-)
I may give this a try for NIfTI. For this to work, however, it would be useful if storing .samples (and maybe some fields in .fa) could be switched off. When storing in NIfTI, for example, it would not make much sense to store .samples (as they are in the image), and .fa.voxel_indices is probably somewhat (though not completely) redundant.
@nno Maybe we should have a less generic function than map2nifti that tries to push as much as possible into a NIfTI file -- without relying on any kind of serialization inside. Maybe something like this would work: we inspect a dataset's mapper and condense it into a simplified version that only does the necessary shape transformations and feature selections. This would get stored in a way that doesn't require serialization. My current concept of that would be to come up with custom reduce implementations for the most relevant mappers. Static feature selection would be stored as a mapper label (which mapper is this) and the respective index array; likewise for boxcar (and maybe flattening too). With this mapper we could forward-map samples from the regular NIfTI array upon load. Dataset attributes (fa and sa) could be stored unmapped in NIfTI header extensions and placed back into the attribute collections after loading and mapping the samples.

Once we have this functionality, we can easily place these simplified bits into an HDF5 file too -- without requiring serialization anymore (in most cases). Likewise, we can have storage in NumPy's NPZ format without much additional effort -- in which case we could keep HDF5 an optional dependency (a good thing, given its weight IMHO).

I would not touch the array(object) use case for now -- I don't think it is used much in common scenarios. The only thing that may need a closer look is efficient storage of array(str). HDF5 has a char datatype for datasets, but I can't recall whether that would work nicely out of the box.
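The mapper-condensation idea could look roughly like this. The step labels ('flatten', 'select'), the (label, payload) tuple layout, and the function names are made up for illustration -- this is a sketch of the proposal, not PyMVPA code:

```python
import numpy as np

# Hypothetical condensed mapper spec: a list of (label, payload) steps,
# where each payload is plain array data -- no pickling needed, so the
# representation survives API changes.
def condense_chain():
    return [('flatten', np.array([2, 3])),     # original feature shape (2, 3)
            ('select', np.array([0, 2, 5]))]   # static feature selection

def forward_map(samples, steps):
    """Re-apply condensed mapper steps to samples loaded from disk."""
    for label, payload in steps:
        if label == 'flatten':
            # collapse all feature axes into one
            samples = samples.reshape(len(samples), -1)
        elif label == 'select':
            # apply the stored feature-selection index array
            samples = samples[:, payload]
        else:
            raise ValueError('unsupported mapper label: %s' % label)
    return samples

data = np.arange(12).reshape(2, 2, 3)   # two samples of shape (2, 3)
mapped = forward_map(data, condense_chain())
print(mapped.shape)  # -> (2, 3): flattened, then feature-selected
```

An unsupported or custom mapper would simply have no label and would make the export fail loudly, which seems preferable to silently storing something unloadable.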
However I wonder what then would be suitable representation for such mappers. The flatten mapper has a reshape operation, while static feature selection needs a list of indices. Can we come up with something general enough to cover most use cases? And what to do when a custom or unsupported mapper is used?
You mean storing the mapper in a nifti extension field?
What do you mean by 'unmapped'? As standard numpy arrays?
Absolutely, that would be nice. Some inspiration may come from mvpa2.base.niml, which already attempts to convert most common .fa, .sa and .a to a string or binary representation. That code is quite NIML-specific, but maybe something along those lines can be generalized to cover most use-cases.
Is that not supported well by numpy?
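For the NPZ side at least, fixed-width string arrays (dtype 'U'/'S') round-trip without any special handling; a quick check (the in-memory buffer stands in for a file on disk):

```python
import io
import numpy as np

# NumPy stores fixed-width string arrays natively, so an array(str)
# survives an NPZ round-trip unchanged -- no pickling involved.
labels = np.array(['rest', 'task', 'baseline'])

buf = io.BytesIO()
np.savez_compressed(buf, labels=labels)
buf.seek(0)
restored = np.load(buf)['labels']

assert (restored == labels).all()
print(restored.dtype)  # a fixed-width unicode dtype such as <U8
```

Whether HDF5's char datatype behaves as conveniently is the open question from the comment above; this only shows the NPZ case.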
This is not complete -- a proposal for discussion.
Idea: because of our serialization magic, h5save() is not a good approach for storing a dataset in the long-term.
This proposal provides two functions that can load/save datasets in NumPy's NPZ format (compressed or uncompressed). Naturally, only the samples, sa, and fa collections are stored -- no mappers or other attributes.
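The shape of such a save/load pair could be sketched as below. The key naming scheme ('samples', 'sa_*', 'fa_*') and the function names are illustrative assumptions, not necessarily the layout used by the patch under review:

```python
import io
import numpy as np

def export_npz(fileobj, samples, sa=None, fa=None, compress=True):
    """Store samples plus sa/fa collections as prefixed NPZ arrays.
    Illustrative sketch -- not the actual PyMVPA implementation."""
    arrays = {'samples': np.asarray(samples)}
    for prefix, coll in (('sa', sa or {}), ('fa', fa or {})):
        for name, value in coll.items():
            arrays['%s_%s' % (prefix, name)] = np.asarray(value)
    saver = np.savez_compressed if compress else np.savez
    saver(fileobj, **arrays)

def import_npz(fileobj):
    """Reverse of export_npz: split prefixed keys back into collections."""
    npz = np.load(fileobj)
    samples, sa, fa = None, {}, {}
    for key in npz.files:
        if key == 'samples':
            samples = npz[key]
        elif key.startswith('sa_'):
            sa[key[3:]] = npz[key]
        elif key.startswith('fa_'):
            fa[key[3:]] = npz[key]
    return samples, sa, fa

# Round-trip through an in-memory buffer standing in for a file on disk
buf = io.BytesIO()
export_npz(buf, np.zeros((4, 10)),
           sa={'targets': ['a', 'a', 'b', 'b']},
           fa={'voxel_indices': np.arange(10)})
buf.seek(0)
samples, sa, fa = import_npz(buf)
```

Since every stored value must coerce to a non-object array, anything requiring serialization (mappers, arbitrary .a content) is rejected by construction, which is exactly the stability property the proposal is after.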
Documentation will be completed once discussed.