Reflection and model serialisation: move to single file (reflex) and back end (HDF5) #1407
Replies: 8 comments
-
User story 1:

```
$ dials.import *.cbf  # -> mydata.rflx
$ dials.find_spots mydata.rflx minimum_spot_size=default
$ dials.find_spots mydata.rflx minimum_spot_size=smol
$ dials.index mydata.rflx  # what does this do?
```

User story 2:

```
$ dials.integrate mydata.rflx
$ dials.scale mydata.rflx
...this is taking ages...
^C
$ # can we *guarantee* that mydata.rflx is consistent?
```
-
From what I can tell, SWMR will not solve this problem, as we would be adding new groups, not just extending existing ones: https://support.hdfgroup.org/HDF5/docNewFeatures/SWMR/HDF5_SWMR_Users_Guide.pdf
-
Concept: every time you look at images (which coincides precisely with creating reflections) you make a new file. There are details here, e.g. if you want to "correct" something in an existing file.
-
Revisiting this after the conversation this afternoon, I realise that the questions of whether we keep the data and models in the same files, and whether we update the input file with new information or create a new output file, could depend on user needs and be decided at run time, provided that the file IO runs through a single interface. For example, the default mode could be similar to the current usage pattern, and we could have an alternate mode for high-throughput use which behaves as above. As a relatively harmless first step, we could also consolidate the file handling using the current formats. This will necessitate dismantling the option parser, but that would be a good thing anyway. If we made sure that programs were explicit about what data they need, we could also copy or link untouched data behind the scenes, which would remove the need to hold reflection shoeboxes in memory during indexing, for example.
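A minimal sketch of what such a single IO entry point might look like (the function name and modes are hypothetical, not real DIALS API): the choice between "update the input file in place" and "write a fresh output file with untouched data carried over" becomes a run-time option rather than something baked into each program.

```python
import os
import shutil
import tempfile

def open_for_output(input_path, output_path, mode="copy"):
    """Return the path a program should extend with its results.

    mode="copy":   like the current usage pattern - each step writes a new
                   file, with untouched data carried over behind the scenes
    mode="update": high-throughput mode - extend the input file in place
    """
    if mode == "update":
        return input_path
    shutil.copyfile(input_path, output_path)
    return output_path

# Demo: both modes against a throwaway file.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "in.rflx")
with open(src, "w") as fh:
    fh.write("reflections")

in_place = open_for_output(src, os.path.join(tmp, "out.rflx"), mode="update")
copied = open_for_output(src, os.path.join(tmp, "out.rflx"), mode="copy")
```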
-
I think this could work.
-
A question which occurs to me here: do programs routinely edit reflection tables, e.g. add / remove / rearrange rows, rather than just annotating them? That could make for a need for expensive sorting to "update" input tables.
-
Yah, the new xfel merging program uses MPI to redistribute reflection tables over many ranks, while adding, modifying, removing and sorting the tables many times. Example. I imagine dials.scale is similar.
-
On the topic of the HDF5 back end: I think there is a trivial implementation of this we could consider, where we use HDF5 as a container with our current data model, i.e. all of our standard naming, rather than getting bogged down in ontological / standards discussions. I will start a thread on precisely this.
-
Introduction
Currently in DIALS we have two data formats for intermediate data: JSON for experiment models and MessagePack for reflection data. This breaks down slightly with the addition of e.g. masks which are stored externally and referenced from the experiment models.
#1238
and related
#1151
This means if we relocate the data to another home the mask references are incorrect which causes issues in debugging as detailed above.
The usual motif of a dials program is to read input files and write new output files, which means if we encode the mask with the experiment we will have multiple copies of it.
In addition, for reflection data we have many files containing the same data, which is expensive in terms of disk space, particularly as most of those bytes are the same (see the Motivation section below for an example). Simple efforts to use e.g. HDF5 for this would be thwarted: even with external references, we still have the issue that the mask file is an external link and therefore needs to be copied.
Proposal
We break this assumption and instead formulate the workflow as follows: we create a file which is (i) HDF5 and (ii) contains reflections and experiments, and carry this from the outset. Nuance: it is quite possible that in routine dials processing we want to create two such files, one from spot finding and one from integration, i.e. every time we look at the images or create new reflections we create a new file. Then, once we have a file, we extend the data therein with new information, viz:
In this case foo.rflx contains e.g.
etc.
Then
will add a new group /data to the working file e.g. with
etc.
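The idea can be sketched with h5py (group and column names here are hypothetical, loosely following current DIALS reflection-table naming; the real layout would need proper design): spot finding creates the file with models and data together, and a later step extends it in place.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "foo.rflx")
n = 1000  # number of strong spots, for illustration

# Spot finding creates the file: experiment models and reflection data
# live in one container.
with h5py.File(path, "w") as f:
    models = f.create_group("models")
    models.attrs["experiments"] = "...serialised experiment models..."
    strong = f.create_group("data/strong")
    strong.create_dataset("xyzobs.px.value", data=np.zeros((n, 3)))
    strong.create_dataset("intensity.sum.value", data=np.ones(n))

# A later step (e.g. indexing) extends the same file with new columns
# instead of writing a whole new file.
with h5py.File(path, "a") as f:
    f["data/strong"].create_dataset("miller_index", data=np.zeros((n, 3), dtype="i4"))
    columns = sorted(f["data/strong"])
```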
Now this will be made more complex, as later on we may want to move these reflections to a different experiment. To achieve this, an experiment which is derived from another will need to keep track of the source hash, and the data will need an additional column recording which experiment the reflections actually belong to. Reflections must belong to exactly one experiment. This is starting to get to be too much detail, but it gives a sense of the thought process behind this.
This touches on an existing issue: extending experiments with new ones which also contain crystals:
#1029
However, in this case this works rather gracefully: in indexing we add a new experiment, as above, for each crystal lattice we find and reassign the experiment indices. At no point will we be deleting reflections or data, so this should be efficient. By keeping everything in one file we can have masks encoded in the data file with no need to copy - they can be updated by e.g. spot finding. This also has the nice feature that we can encode the history into the file as well, e.g. how we got to here and what steps were performed with what input.
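The reassignment step above is cheap because it is just an in-place column write, illustrated here with a plain numpy array standing in for the experiment-id column of a reflection table:

```python
import numpy as np

# Illustrative only: reflections carry an integer id column naming the
# experiment they belong to (exactly one each). When indexing finds a
# second lattice, we reassign ids in place; nothing is deleted or reordered.
ids = np.zeros(6, dtype=int)          # all reflections start on experiment 0
second_lattice = np.array([1, 3, 5])  # rows indexing assigns to lattice 2
ids[second_lattice] = 1               # reassignment is a cheap in-place write
```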
Why is integration special?
Adding new reflections in prediction is straightforward: we can compute the number of reflections we will create and allocate space. Once we are done integrating, however, there is not much need in the subsequent analysis to access the data which gave rise to those integration results - the analysis instead adds more columns to the integrated data.
Workflow therefore looks like:
In addition add
or whatever, which would make foo-models.reflx with just the /models group in it - useful for diagnostics where the data are not needed.
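Extracting just the models group is straightforward with h5py's group copy (a sketch; the file layout is the hypothetical one from above):

```python
import os
import tempfile

import h5py

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "foo.rflx")
dst_path = os.path.join(tmp, "foo-models.rflx")

# Build a small file containing both models and (bulky) reflection data.
with h5py.File(src_path, "w") as f:
    f.create_group("models").attrs["experiments"] = "...models..."
    f.create_group("data").create_dataset("shoeboxes", data=list(range(1000)))

# Copy just /models into a lightweight file for diagnostics.
with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
    src.copy("models", dst)

with h5py.File(dst_path, "r") as f:
    groups = list(f)
```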
May need to extend dials.import to be able to read "old" (i.e. current) formats into this new format.

Side goals / questions:
Work
dials.index does not care about reflection shoeboxes
It does not escape my attention that doing this would resolve the eternal "what should ccp4 do for a future reflection file" problem.
Motivation
```
Grey-Area manual-000 :) $ du -hs *refl
3.6G    indexed.refl
2.6G    integrated.refl
3.6G    refined.refl
2.9G    scaled.refl
3.3G    strong.refl
2.6G    symmetrized.refl
```
Most of this is duplicate information, and reading / writing in this format is expensive: simply writing new columns where necessary could be a lot more efficient, as would loading only the data you need rather than having to read everything.
Storage of reflection data as HDF5 is also much better suited to working with HPC: accessing massive amounts of data by indexing into an HDF5 file is more efficient than reading all the data in one place and scattering it across e.g. MPI processes.
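Both points can be sketched with h5py (file and column names are illustrative): a reader fetches only the column, and only the slice, it asks for, and a later step appends a new column without rewriting the existing data.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.rflx")
n = 100_000

with h5py.File(path, "w") as f:
    g = f.create_group("data")
    g.create_dataset("intensity", data=np.arange(n, dtype="f8"))
    g.create_dataset("xyzobs", data=np.zeros((n, 3)))

# Read a single column, and only a slice of it: HDF5 fetches just those
# bytes, so e.g. an MPI rank can index straight into its share of the table.
with h5py.File(path, "r") as f:
    chunk = f["data/intensity"][1000:2000]

# A later step appends a results column without rewriting existing data.
with h5py.File(path, "a") as f:
    f["data"].create_dataset("scale_factor", data=np.ones(n))
    ncols = len(f["data"])
```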
Annotations
rflx data - we don't need to worry about people getting in a knot. We could even add all the possible reindexing operations to the rflx file so that one may simply be selected for reindexing.
Conclusion
The text above is a proposal - a suggestion for a direction we could consider which I feel would aid our handling of "big data" experiments. If this proposal is agreed, it will need to come forward as a project to be signed off at the development sites, as it will be expensive.