We should consider using HDF5 for reflection storage #1700
-
Just a friendly reminder that we already have code that writes out NXreflections from DIALS input (dials.export format=nxs). It is only missing a few bits, like shoeboxes, and you can add those at no real cost, in the same way you do in your example code, and without needing to go through a committee. It includes serialization and deserialization. I even did a bunch of work to get it to pass NeXus validation. Your code above essentially duplicates what is in dials/util/nexus/nx_reflections.py. I still think it would be nearly trivial to use a real standard, instead of creating a zillion more files that only DIALS will ever be able to deal with.
-
The real trick here (questions of file format put aside) is that using flex's as_numpy_array intrinsically adds a copy step to serialization. Not a big deal for small datasets, but for datasets with millions of reflections it's a performance issue. Here's the use case:
What's the code under the hood that makes that work?
That last one is a lot like dataset_as_flex in dxtbx/format/nexus.py. Anyway, there's a lot more to think about here. How does writing data in an 'on demand' implementation work? Is there a way to include an index to make selections faster? At what point do we realize we are implementing a relational database and should just use a MySQL backend instead? :D
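One way to picture the 'on demand' read side is a thin wrapper that keeps the HDF5 file open and pulls a column into memory only when it is first requested. The following is a minimal sketch using h5py and NumPy. The class name `LazyReflectionTable`, its methods, and the column names are hypothetical and not part of DIALS or dxtbx. It returns NumPy arrays rather than flex arrays, so the flex conversion and its copy are not addressed here.

```python
import os
import tempfile

import h5py
import numpy as np


class LazyReflectionTable:
    """Hypothetical wrapper: a column is read from disk only on first access."""

    def __init__(self, path):
        self._file = h5py.File(path, "r")
        self._cache = {}

    def keys(self):
        # Listing names touches only HDF5 metadata, not the column data.
        return list(self._file.keys())

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._file[name][()]  # read the whole column
        return self._cache[name]

    def rows(self, name, start, stop):
        # Slicing the h5py dataset reads only the requested rows
        # (or the chunks containing them, for chunked storage).
        return self._file[name][start:stop]

    def loaded(self):
        return sorted(self._cache)

    def close(self):
        self._file.close()


# Build a small stand-in file with two made-up columns.
path = os.path.join(tempfile.mkdtemp(), "reflections.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("intensity", data=np.linspace(0.0, 1.0, 1000))
    f.create_dataset("shoebox", data=np.zeros((1000, 5, 5), dtype=np.float32))

table = LazyReflectionTable(path)
intensity = table["intensity"]            # only this column is fully loaded
first_ten = table.rows("shoebox", 0, 10)  # partial read, not cached
loaded_columns = table.loaded()
table.close()
```

The write side and any selection index are left open here, as in the questions above.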
-
Using our existing data model, but storing the data in HDF5 rather than MessagePack, would make interaction with non-DIALS applications trivial and, more importantly, allow analysis steps to read only the columns they need rather than all of them. Output files could hold references to columns in the input file instead of copying them (e.g. indexing currently has to read all the shoeboxes and write them back out without ever using them).
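As an illustration of referencing a column rather than copying it, HDF5 external links can point from one file to a dataset in another. This is a minimal sketch with h5py, not a proposal for the actual layout. The file names and the choice of `shoebox` and `miller_index` as the linked and new columns are assumptions for the example.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.h5")
output_path = os.path.join(workdir, "output.h5")

# Stand-in for an upstream file holding a large column the next step never modifies.
with h5py.File(input_path, "w") as f:
    f.create_dataset("shoebox", data=np.ones((100, 5, 5), dtype=np.float32))

# The downstream step writes its own new column and links to the untouched one.
with h5py.File(output_path, "w") as f:
    f.create_dataset("miller_index", data=np.zeros((100, 3), dtype=np.int32))
    f["shoebox"] = h5py.ExternalLink(input_path, "/shoebox")

# Readers of the output file see both columns; the link is followed on access.
with h5py.File(output_path, "r") as f:
    link = f.get("shoebox", getlink=True)
    shoebox = f["shoebox"][()]
    names = sorted(f.keys())
```

The trade-off is that the output file is no longer self-contained: moving or deleting the input file breaks the link.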
Concrete example:
This code will read a standard DIALS reflection table file and write out the same data - shoeboxes and all - in HDF5:
This appears to work for our principal data files:
This would not need an extended discussion of the file format: we can just use what we have already, with the same names that we use in the code, so that there is a 1:1 mapping to the in-memory representation (rather than trying to convert to and from NXreflections or something; that could for sure be an export format).
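To make the 1:1 mapping concrete, here is a minimal sketch that writes a dictionary of named columns to HDF5 and reads it back under the same names. It uses plain NumPy arrays as a stand-in for a reflection table, so it does not touch flex or the DIALS API. The helper names `write_columns` and `read_columns` are hypothetical, and the column names are only illustrative.

```python
import os
import tempfile

import h5py
import numpy as np


def write_columns(path, columns):
    """Write each named column as one HDF5 dataset, one column at a time."""
    with h5py.File(path, "w") as f:
        for name, values in columns.items():
            f.create_dataset(name, data=values)


def read_columns(path):
    """Read every dataset back into a dict keyed by the same names."""
    with h5py.File(path, "r") as f:
        return {name: f[name][()] for name in f.keys()}


# Plain NumPy arrays standing in for reflection table columns.
columns = {
    "miller_index": np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.int32),
    "intensity.sum.value": np.array([10.5, 3.2, 7.7]),
    "flags": np.array([1, 0, 1], dtype=np.uint64),
}

path = os.path.join(tempfile.mkdtemp(), "table.h5")
write_columns(path, columns)
restored = read_columns(path)
```

Names, shapes, and dtypes survive the round trip unchanged, which is the property the 1:1 mapping relies on.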
I have made no effort to compress the data, but it would be trivial to do so. Also, since this writes one column at a time, we avoid the memory explosion at the end of a process, where multiple copies of the data exist just to write them out in MessagePack.
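For reference, turning on compression in h5py is a per-dataset option. The sketch below compares an uncompressed and a gzip-compressed copy of the same column. The data are synthetic and highly repetitive, so the size ratio says nothing about real shoeboxes, and the compression level of 4 is an arbitrary choice for the example.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.h5")
gzip_path = os.path.join(workdir, "gzip.h5")

# Synthetic, mostly-zero data; real shoeboxes will compress less well.
shoebox = np.zeros((2000, 5, 9, 9), dtype=np.float32)
shoebox[:, 2, 4, 4] = 100.0

with h5py.File(plain_path, "w") as f:
    f.create_dataset("shoebox", data=shoebox)

with h5py.File(gzip_path, "w") as f:
    # Compression requires chunked storage; h5py picks a chunk shape automatically.
    f.create_dataset("shoebox", data=shoebox, compression="gzip", compression_opts=4)

# Reading back is unchanged: decompression happens inside the HDF5 library.
with h5py.File(gzip_path, "r") as f:
    compression = f["shoebox"].compression
    restored = f["shoebox"][()]

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
```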
I am certain that going down this road is a lot more flexible than trying to convert to / from NXreflections or whatever.