Reflection and model serialisation: move to single file (reflex) and back end (HDF5) #1407
Replies: 8 comments
-
User story 1:

```
$ dials.import *.cbf  # -> mydata.rflx
$ dials.find_spots mydata.rflx minimum_spot_size=default
$ dials.find_spots mydata.rflx minimum_spot_size=smol
$ dials.index mydata.rflx  # what does this do?
```

User story 2:

```
$ dials.integrate mydata.rflx
$ dials.scale mydata.rflx
...this is taking ages...
^C
$ # can we *guarantee* that mydata.rflx is consistent?
```
-
From what I can tell, SWMR will not solve this problem, as we would be adding new groups, not just extending existing ones: https://support.hdfgroup.org/HDF5/docNewFeatures/SWMR/HDF5_SWMR_Users_Guide.pdf
-
Concept: every time you look at images (which coincides precisely with creating reflections) you make a new file. There are details here, e.g. if you want to "correct" something in an existing file.
-
Revisiting this after the conversation this afternoon, I realise that the questions of whether we keep the data and models in the same files, and whether we update the input file with new information or create a new output file, could depend on user needs and be decided at run time, provided that the file IO runs through a single interface. For example, the default mode could be similar to the current usage pattern, and we could have an alternate mode for high-throughput use which behaves as above. As a relatively harmless first step, we could also consolidate the file handling using the current formats. This will necessitate dismantling the option parser, but that would be a good thing anyway. If we made sure that programs were explicit about what data they need, we could also copy or link untouched data behind the scenes, which would remove the need to hold reflection shoeboxes in memory during indexing, for example.
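A minimal sketch of what such a single IO entry point might look like (the function name and modes are hypothetical, not real DIALS API): the choice between "update the input file in place" and "write a fresh output file with untouched data carried over" becomes a run-time option rather than something baked into each program.

```python
import os
import shutil
import tempfile

def open_for_output(input_path, output_path, mode="copy"):
    """Return the path a program should extend with its results.

    mode="copy":   like the current usage pattern - each step writes a new
                   file, with untouched data carried over behind the scenes
    mode="update": high-throughput mode - extend the input file in place
    """
    if mode == "update":
        return input_path
    shutil.copyfile(input_path, output_path)
    return output_path

# Demo: both modes against a throwaway file.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "in.rflx")
with open(src, "w") as fh:
    fh.write("reflections")

in_place = open_for_output(src, os.path.join(tmp, "out.rflx"), mode="update")
copied = open_for_output(src, os.path.join(tmp, "out.rflx"), mode="copy")
```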
-
I think this could work.
-
A question which occurs to me here: do programs routinely edit reflection tables, e.g. add / remove / rearrange rows, rather than just annotating them? That could make for a need for expensive sorting to "update" input tables.
-
Yah, the new xfel merging program uses MPI to redistribute reflection tables over many ranks, while adding, modifying, removing and sorting the tables many times. Example. I imagine dials.scale is similar.
-
On the topic of the HDF5 back end: I think there is a trivial implementation of this we could consider, where we use HDF5 as a container with our current data model, i.e. all of our standard naming, rather than getting bogged down in ontological / standards discussions. I will start a thread on precisely this.
-
Introduction
Currently in DIALS we have two data formats for intermediate data: JSON for experiment models and MessagePack for reflection data. This breaks down slightly with the addition of e.g. masks which are stored externally and referenced from the experiment models.
#1238
and related
#1151
This means if we relocate the data to another home the mask references are incorrect which causes issues in debugging as detailed above.
The usual motif of a dials program is to read input files and write new output files, which means if we encode the mask with the experiment we will have multiple copies of it.
In addition, for reflection data we have many files containing the same data, which is expensive in terms of disk space, particularly as most of those bytes are the same (see the Motivation section below for an example). Simple efforts to use e.g. HDF5 for this would be thwarted: even with external references, we still have the issue that the mask file is an external link and therefore needs to be copied.
Proposal
We break this assumption and instead formulate the workflow as follows: we create a file which is (i) HDF5 and (ii) contains reflections and experiments, and carry this from the outset. Nuance: it is quite possible that in routine dials processing we want to create two such files, one from spot finding and one from integration, i.e. every time we look at the images or create new reflections we create a new file. Then, once we have a file, we extend the data therein with new information, viz:
In this case foo.rflx contains e.g.
etc.
Then
will add a new group /data to the working file e.g. with
etc.
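The idea can be sketched with h5py (group and column names here are hypothetical, loosely following current DIALS reflection-table naming; the real layout would need proper design): spot finding creates the file with models and data together, and a later step extends it in place.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "foo.rflx")
n = 1000  # number of strong spots, for illustration

# Spot finding creates the file: experiment models and reflection data
# live in one container.
with h5py.File(path, "w") as f:
    models = f.create_group("models")
    models.attrs["experiments"] = "...serialised experiment models..."
    strong = f.create_group("data/strong")
    strong.create_dataset("xyzobs.px.value", data=np.zeros((n, 3)))
    strong.create_dataset("intensity.sum.value", data=np.ones(n))

# A later step (e.g. indexing) extends the same file with new columns
# instead of writing a whole new file.
with h5py.File(path, "a") as f:
    f["data/strong"].create_dataset("miller_index", data=np.zeros((n, 3), dtype="i4"))
    columns = sorted(f["data/strong"])
```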
Now this will be made more complex, as later on we may want to move these reflections to a different experiment. To achieve this, an experiment which is derived from another will need to keep track of the source hash, and the data will need an additional column recording which experiment the reflections actually belong to. Reflections must belong to exactly one experiment. This is starting to get to be too much detail, but it gives a sense of the thought process behind this.
This touches on an existing issue: extending experiments with new ones which also contain crystals:
#1029
However, in this case this works rather gracefully: in indexing we add a new experiment, as above, for each crystal lattice we find and reassign the experiment indices. At no point will we be deleting reflections or data, so this should be efficient. By keeping everything in one file we can have masks encoded in the data file with no need to copy - they can be updated by e.g. spot finding. This also has the nice feature that we can encode the history into the file as well, e.g. how we got to here and what steps were performed with what input.
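The reassignment step above is cheap because it is just an in-place column write, illustrated here with a plain numpy array standing in for the experiment-id column of a reflection table:

```python
import numpy as np

# Illustrative only: reflections carry an integer id column naming the
# experiment they belong to (exactly one each). When indexing finds a
# second lattice, we reassign ids in place; nothing is deleted or reordered.
ids = np.zeros(6, dtype=int)          # all reflections start on experiment 0
second_lattice = np.array([1, 3, 5])  # rows indexing assigns to lattice 2
ids[second_lattice] = 1               # reassignment is a cheap in-place write
```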
Why is integration special?
Adding new reflections in prediction is straightforward: we can compute the number of reflections we will create and allocate space. Once we are done integrating, however, there is not much need in the subsequent analysis to access the data which gave rise to those integration results - the analysis instead adds more columns to the integrated data.
Workflow therefore looks like:
In addition add
or whatever, which would make foo-models.reflx with just the /models group in it - useful for diagnostics where the data are not needed.
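Extracting just the models group is straightforward with h5py's group copy (a sketch; the file layout is the hypothetical one from above):

```python
import os
import tempfile

import h5py

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "foo.rflx")
dst_path = os.path.join(tmp, "foo-models.rflx")

# Build a small file containing both models and (bulky) reflection data.
with h5py.File(src_path, "w") as f:
    f.create_group("models").attrs["experiments"] = "...models..."
    f.create_group("data").create_dataset("shoeboxes", data=list(range(1000)))

# Copy just /models into a lightweight file for diagnostics.
with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
    src.copy("models", dst)

with h5py.File(dst_path, "r") as f:
    groups = list(f)
```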
May need to extend dials.import to be able to read "old" (i.e. current) formats into this new format.

Side goals / questions:
Work
dials.index does not care about reflection shoeboxes
It does not escape my attention that doing this would resolve the eternal "what should ccp4 do for a future reflection file" problem.
Motivation
```
Grey-Area manual-000 :) $ du -hs *refl
3.6G    indexed.refl
2.6G    integrated.refl
3.6G    refined.refl
2.9G    scaled.refl
3.3G    strong.refl
2.6G    symmetrized.refl
```
Most of this is duplicate information, and reading / writing in this format is expensive: simply writing new columns where necessary could be a lot more efficient, as would loading only the data you need rather than having to read everything.
Storage of reflection data as HDF5 is also much better suited to working with HPC: accessing massive amounts of data by indexing into an HDF5 file is more efficient than reading all the data in one place and scattering it across e.g. MPI processes.
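Both points can be sketched with h5py (file and column names are illustrative): a reader fetches only the column, and only the slice, it asks for, and a later step appends a new column without rewriting the existing data.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.rflx")
n = 100_000

with h5py.File(path, "w") as f:
    g = f.create_group("data")
    g.create_dataset("intensity", data=np.arange(n, dtype="f8"))
    g.create_dataset("xyzobs", data=np.zeros((n, 3)))

# Read a single column, and only a slice of it: HDF5 fetches just those
# bytes, so e.g. an MPI rank can index straight into its share of the table.
with h5py.File(path, "r") as f:
    chunk = f["data/intensity"][1000:2000]

# A later step appends a results column without rewriting existing data.
with h5py.File(path, "a") as f:
    f["data"].create_dataset("scale_factor", data=np.ones(n))
    ncols = len(f["data"])
```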
Annotations
rflx data - we don't need to worry about people getting in a knot. We could even add all the possible reindexing operations to the rflx file so that one may simply be selected for reindexing.
Conclusion
The text above is a proposal - a suggestion for a direction we could consider which I feel would aid our handling of "big data" experiments. If this proposal is agreed, it will need to come forward as a project to be signed off at the development sites, as it will be expensive.