We should consider using HDF5 for reflection storage #1700

graeme-winter · 2021-05-12T15:10:20Z

graeme-winter
May 12, 2021
Maintainer

Using our existing data model, but storing the data in HDF5 rather than MessagePack, would make interaction with non-DIALS applications trivial. More importantly, it would allow analysis steps to read only the columns they need rather than every column. Output files could also hold references to columns in the input file instead of copying them (e.g. indexing currently has to read all the shoeboxes and write them back out without ever using them).

Concrete example:

This code will read a standard DIALS reflection table file and write out the same data - shoeboxes and all - in HDF5:

from dials.array_family import flex
import h5py
import sys
import numpy as np

data = flex.reflection_table.from_file(sys.argv[1])

fout = h5py.File(sys.argv[1][:-4] + "hdf5", "w")
refl = fout.create_group("refl")

scalars = ("ext.int", "ext.double", "ext.bool", "ext.size_t")

# things for shoeboxes
dt_int = h5py.special_dtype(vlen=np.int32)
dt_float = h5py.special_dtype(vlen=np.double)

for k in data:
    dtype = str(type(data[k]))
    if "ext.int6" in dtype:
        nn = data[k].focus()[0]
        column = data[k].as_int().as_numpy_array().reshape((nn, 6))
        refl.create_dataset(k, data = column)
    elif "ext.vec3_double" in dtype:
        refl.create_dataset(k, data = data[k].as_numpy_array())
    elif "ext.miller_index" in dtype:
        # this feels dumb
        column = data[k].as_vec3_double().iround().as_int().as_numpy_array()
        refl.create_dataset(k, data = column)
    elif any(scalar in dtype for scalar in scalars):
        refl.create_dataset(k, data = data[k].as_numpy_array())
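    elif "ext.shoebox" in dtype:
        # shoeboxes vary in size, so store each one flattened as a
        # variable-length row in separate data and mask datasets
        nn = (data[k].size(),)
        dset_data = refl.create_dataset(f"{k}.data", nn, dtype=dt_float)
        dset_mask = refl.create_dataset(f"{k}.mask", nn, dtype=dt_int)
        for j, sb in enumerate(data[k]):
            dset_data[j] = sb.data.as_1d().as_numpy_array()
            dset_mask[j] = sb.mask.as_1d().as_numpy_array()

# close explicitly so everything is flushed to disk
fout.close()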

This appears to work for our principal data files:

Grey-Area index-extend :( $ dials.python refl_to_h5.py indexed.refl 
Grey-Area index-extend :) $ du -hs indexed.*
 28K	indexed.expt
 46M	indexed.hdf5
 45M	indexed.refl
Grey-Area index-extend :) $ dials.python refl_to_h5.py integrated.refl
Grey-Area index-extend :) $ du -hs integrated.*
180K	integrated.expt
 38M	integrated.hdf5
 38M	integrated.refl

This would not need an extended discussion of the file format. We can just use what we have already, with the same column names we use in the code, so that we have a 1:1 mapping to the in-memory representation (rather than trying to convert to and from NXreflections or something similar; that could for sure be an export format).

Grey-Area index-extend :) $ h5ls -rv integrated.hdf5 
Opened "integrated.hdf5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/refl                    Group
    Location:  1:800
    Links:     1
/refl/background.mean    Dataset {112943/112943}
    Location:  1:1832
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/background.sum.value Dataset {112943/112943}
    Location:  1:905976
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/background.sum.variance Dataset {112943/112943}
    Location:  1:906248
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/bbox               Dataset {112943/112943, 6/6}
    Location:  1:906520
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native int
/refl/d                  Dataset {112943/112943}
    Location:  1:906968
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/entering           Dataset {112943/112943}
    Location:  1:907240
    Links:     1
    Storage:   112943 logical bytes, 112943 allocated bytes, 100.00% utilization
    Type:      enum native signed char {
                   FALSE            = 0
                   TRUE             = 1
               }
/refl/flags              Dataset {112943/112943}
    Location:  1:907512
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native unsigned long
/refl/id                 Dataset {112943/112943}
    Location:  1:7345775
    Links:     1
    Storage:   451772 logical bytes, 451772 allocated bytes, 100.00% utilization
    Type:      native int
/refl/intensity.prf.value Dataset {112943/112943}
    Location:  1:7346047
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/intensity.prf.variance Dataset {112943/112943}
    Location:  1:7346647
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/intensity.sum.value Dataset {112943/112943}
    Location:  1:7347271
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/intensity.sum.variance Dataset {112943/112943}
    Location:  1:7347543
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/lp                 Dataset {112943/112943}
    Location:  1:11413771
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/miller_index       Dataset {338829/338829}
    Location:  1:11414371
    Links:     1
    Storage:   1355316 logical bytes, 1355316 allocated bytes, 100.00% utilization
    Type:      native int
/refl/num_pixels.background Dataset {112943/112943}
    Location:  1:11414643
    Links:     1
    Storage:   451772 logical bytes, 451772 allocated bytes, 100.00% utilization
    Type:      native int
/refl/num_pixels.background_used Dataset {112943/112943}
    Location:  1:11414915
    Links:     1
    Storage:   451772 logical bytes, 451772 allocated bytes, 100.00% utilization
    Type:      native int
/refl/num_pixels.foreground Dataset {112943/112943}
    Location:  1:11415187
    Links:     1
    Storage:   451772 logical bytes, 451772 allocated bytes, 100.00% utilization
    Type:      native int
/refl/num_pixels.valid   Dataset {112943/112943}
    Location:  1:15029995
    Links:     1
    Storage:   451772 logical bytes, 451772 allocated bytes, 100.00% utilization
    Type:      native int
/refl/panel              Dataset {112943/112943}
    Location:  1:7346919
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native unsigned long
/refl/partial_id         Dataset {112943/112943}
    Location:  1:15030971
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native unsigned long
/refl/partiality         Dataset {112943/112943}
    Location:  1:15031243
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/profile.correlation Dataset {112943/112943}
    Location:  1:18194447
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/qe                 Dataset {112943/112943}
    Location:  1:18194719
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
/refl/s1                 Dataset {112943/112943, 3/3}
    Location:  1:18194991
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzcal.mm          Dataset {112943/112943, 3/3}
    Location:  1:18195263
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzcal.px          Dataset {112943/112943, 3/3}
    Location:  1:18195863
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzobs.mm.value    Dataset {112943/112943, 3/3}
    Location:  1:18196135
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzobs.mm.variance Dataset {112943/112943, 3/3}
    Location:  1:30846111
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzobs.px.value    Dataset {112943/112943, 3/3}
    Location:  1:30846383
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/xyzobs.px.variance Dataset {112943/112943, 3/3}
    Location:  1:30846983
    Links:     1
    Storage:   2710632 logical bytes, 2710632 allocated bytes, 100.00% utilization
    Type:      native double
/refl/zeta               Dataset {112943/112943}
    Location:  1:30847255
    Links:     1
    Storage:   903544 logical bytes, 903544 allocated bytes, 100.00% utilization
    Type:      native double
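
As a rough sketch of the "only read the data they need" point: with a layout like the one listed above, a non-DIALS consumer can open the file with plain h5py and pull just the columns it cares about. The file below is a small synthetic stand-in (random values, a temporary file name and an arbitrary resolution cut are all assumptions), so the example runs without DIALS; only the `/refl/<column>` layout is taken from the listing.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small synthetic stand-in for integrated.hdf5 (values are made up).
n = 1000
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "synthetic.hdf5")
with h5py.File(path, "w") as f:
    refl = f.create_group("refl")
    refl.create_dataset("d", data=rng.uniform(1.0, 50.0, size=n))
    refl.create_dataset("intensity.sum.value", data=rng.random(n))
    refl.create_dataset("xyzobs.px.value", data=rng.random((n, 3)))

# The consumer lists the available columns, then reads only two of them;
# xyzobs.px.value is never read from disk.
with h5py.File(path, "r") as f:
    columns = sorted(f["refl"].keys())
    d = f["refl/d"][()]
    intensity = f["refl/intensity.sum.value"][()]

high_res = intensity[d < 2.0]
print(columns)
print(high_res.size, "reflections with d < 2.0")
```

Nothing here depends on flex or on DIALS being installed, which is the interoperability argument in miniature.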

I have made no effort to compress the data, but it would be trivial to do so. Also, since this writes one column at a time, we avoid the memory spike at the end of a process, where multiple copies of the data exist while writing out MessagePack.
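
For what it's worth, a minimal sketch of what per-column compression could look like in h5py, using the built-in gzip filter with byte shuffling. The column contents, dataset names and compression level are illustrative assumptions; real reflection columns will compress differently from this deliberately repetitive test data.

```python
import os
import tempfile

import h5py
import numpy as np

n = 100_000
# Synthetic, highly repetitive integer column (an assumption for the demo).
panel = np.repeat(np.arange(10, dtype=np.uint64), n // 10)

path = os.path.join(tempfile.mkdtemp(), "compressed.hdf5")
with h5py.File(path, "w") as f:
    refl = f.create_group("refl")
    # Same data written twice: once as-is, once through the gzip filter.
    refl.create_dataset("panel.raw", data=panel)
    refl.create_dataset(
        "panel.gzip",
        data=panel,
        compression="gzip",
        compression_opts=4,
        shuffle=True,
    )

# Reopen to compare on-disk sizes and check the data round-trips unchanged.
with h5py.File(path, "r") as f:
    raw_bytes = f["refl/panel.raw"].id.get_storage_size()
    packed_bytes = f["refl/panel.gzip"].id.get_storage_size()
    roundtrip = f["refl/panel.gzip"][()]

print(raw_bytes, packed_bytes)
```

Compression is applied per dataset, so it can be switched on for the columns where it helps and left off elsewhere. Readers need no changes, since h5py decompresses transparently.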

I am certain that going down this road is a lot more flexible than trying to convert to / from NXreflections or whatever.

phyy-nx · 2021-05-12T15:57:29Z

phyy-nx
May 12, 2021
Maintainer

Just a friendly reminder we already have code that writes out NXreflections from dials input (dials.export format=nxs). It's only missing a few bits, like shoeboxes, and you can add those bits without incurring any real cost in the same way you do in your example code, and without needing to go through a committee. It includes serialization and deserialization. I even did a bunch of work to get it to pass NeXus validation.

Your code above essentially duplicates what is in dials/util/nexus/nx_reflections.py. I still think it would be nearly trivial to use a real standard, instead of creating a zillion more files that only dials will ever be able to deal with.

8 replies

phyy-nx May 12, 2021
Maintainer

Again, the mapping is already done in nx_reflections. @jmp1985 worked it out. Example.

Here's the round trip demo, part of our regression tests: https://github.com/dials/dials/blob/main/test/util/test_nexus.py

graeme-winter May 12, 2021
Maintainer Author

I remark that this is our internal working format, not an archival format. That said, I'll investigate the proposal you make.

graeme-winter May 13, 2021
Maintainer Author

Checked this out; unsurprisingly, it looks incomplete.

#1702 is an update to the test which would be necessary, but not sufficient, to accept the assertion that this covers everything we do.

graeme-winter May 13, 2021
Maintainer Author

https://manual.nexusformat.org/classes/base_classes/NXreflections.html needs extending to accommodate:

shoeboxes
everything to do with scaling etc.

So I guess we need to make a proposal for discussion / evaluation by the NeXus people before we proceed with anything.

graeme-winter May 13, 2021
Maintainer Author

Conversely, if we want to make it easy for non-DIALS applications to read the files we generate (one of your justifications above), using a pre-existing standard would make this much less painful. Otherwise, we will need to formally publish our specification so that others may interpret it, and if we then later decide we want to make changes (because we want to change our underlying data model and we have a one-to-one mapping between our data model and what we write to file) then we risk breaking interoperability with those non-DIALS programs. IMO we would need a very good reason to NOT use a pre-existing standard, where one is available.

Does not being able to represent what we need to represent count as a "very good reason"? What about wanting to keep quantities that are 3-vectors, such as the XYZ positions, together as a single 3-vector rather than splitting them into three different data sets (miller index -> h, k, l data sets and xyzcal.mm -> predicted_x, _y, _frame are two examples; there are many)?
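
To make the two layouts concrete, here is a small sketch that writes the same Miller indices both ways: once as a single (N, 3) dataset named after the in-memory column, and once split into `h`, `k`, `l`. The data are random and the `/split` group name is a made-up placeholder; this is not the NXreflections writer in DIALS.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(2)
miller_index = rng.integers(-20, 21, size=(500, 3)).astype(np.int32)

path = os.path.join(tempfile.mkdtemp(), "layouts.hdf5")
with h5py.File(path, "w") as f:
    # 1:1 layout: one (N, 3) dataset with the same name as the column.
    f.create_dataset("refl/miller_index", data=miller_index)
    # Split layout: three 1-D datasets, one per component.
    split = f.create_group("split")
    for axis, name in enumerate(("h", "k", "l")):
        split.create_dataset(name, data=miller_index[:, axis])

with h5py.File(path, "r") as f:
    # Reading the 1:1 layout is a single call.
    together = f["refl/miller_index"][()]
    # Reading the split layout needs a name mapping and a re-assembly step.
    rebuilt = np.column_stack([f["split"][name][()] for name in ("h", "k", "l")])

print(together.shape, rebuilt.shape)
```

Both layouts hold the same information. The difference is that the split form needs a name-mapping and re-assembly step for every vector-valued column, on both read and write.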

The standard calls the rotation angle "phi" everywhere, which annoyingly does not mean the same thing on all of the multi-axis goniometers we deal with. A source of confusion, to me.

There is no place to store the derived scale factors etc. (this would need to be discussed and agreed, as it would then be a replacement for MTZ, not a DIALS working format).

phyy-nx · 2021-05-12T17:26:20Z

phyy-nx
May 12, 2021
Maintainer

The real trick here (questions of file format put aside) is that using flex's as_numpy_array intrinsically adds a copy step to serialization. Not a big deal for small datasets, but for datasets with millions of reflections it's a performance issue.
We have a lot of code in cctbx.xfel.merge to handle deserializing reflection tables from hundreds of thousands of images, each with up to 10s of thousands of reflections, and that code has to handle running out of memory and handing off work between many MPI ranks. Adding a copy step to it would be killer. But, that code doesn't know how to only load a portion of a file at once! How can we do that in hdf5? We need a specific use case to talk about.

Here's the use case:

selection = table['id'] == experiment_id
intensities = table['intensity.sum.value'][selection]

What's the code under the hood that makes that work?

1. Read the whole dataset, calling as_numpy_array on each column. That's essentially what @graeme-winter's example code does, and what dials/util/nexus/nx_reflections.py does. Big performance and memory hits.
2. Make a new hdf5 backend for flex.reflection_table that reads data 'on demand'. This would make flex.reflection_table.from_file be nearly a no-op, as no data is read from disc until you ask for it. Use the h5py/numpy/flex.as_numpy_array() layer because reading only portions of files is presumably fast enough.
3. As 2, but use slicing-aware C++ code to read directly to flex.

That last one is a lot like dataset_as_flex in dxtbx/format/nexus.py.
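
A minimal sketch of the 'on demand' idea from the second option, in plain h5py and numpy. `LazyColumns` is a hypothetical class invented for this illustration, not an existing DIALS or flex API, and the file is synthetic. Opening the table reads no column data; each column is read from disk the first time it is indexed and cached afterwards.

```python
import os
import tempfile

import h5py
import numpy as np


class LazyColumns:
    """Hypothetical read-on-demand view of the /refl group (not a DIALS API)."""

    def __init__(self, path, group="refl"):
        self._file = h5py.File(path, "r")  # opens the file, reads no columns
        self._group = self._file[group]
        self._cache = {}

    def __getitem__(self, name):
        # Read a column from disk on first request, then reuse the copy.
        if name not in self._cache:
            self._cache[name] = self._group[name][()]
        return self._cache[name]

    def loaded(self):
        return sorted(self._cache)

    def close(self):
        self._file.close()


# Synthetic file with three columns (values are made up).
n = 2000
rng = np.random.default_rng(3)
path = os.path.join(tempfile.mkdtemp(), "lazy.hdf5")
with h5py.File(path, "w") as f:
    refl = f.create_group("refl")
    refl.create_dataset("id", data=rng.integers(0, 5, size=n).astype(np.int32))
    refl.create_dataset("intensity.sum.value", data=rng.random(n))
    refl.create_dataset("xyzcal.px", data=rng.random((n, 3)))

# The use case from the post, with numpy arrays standing in for flex arrays.
table = LazyColumns(path)
experiment_id = 1
selection = table["id"] == experiment_id
intensities = table["intensity.sum.value"][selection]
loaded = table.loaded()
table.close()
print(loaded)
```

Only the two columns named in the use case are read; `xyzcal.px` stays on disk. This still reads each requested column in full and says nothing about the flex copy step or about writing.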

Anyway, there's a lot more to think about here. How does writing data in an 'on demand' implementation work? Is there a way to include an index to make selections faster? At what point do we realize we are implementing a relational database and we realize we should use a mysql backend instead? :D

2 replies

graeme-winter May 13, 2021
Maintainer Author

My implementation was a demo, but I take your point. Resolving this would need clever use of chunking but could be doable. Direct C-level reading would also help.
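
One way the chunking idea might look, as a sketch under stated assumptions. Rows are sorted by experiment id before writing, the columns are stored chunked, and a small offsets array (a hypothetical `id_offsets` dataset, not part of any existing format) records where each experiment starts. Selecting one experiment then becomes a contiguous slice, so h5py only reads the chunks that overlap it.

```python
import os
import tempfile

import h5py
import numpy as np

n, n_expt = 10_000, 8
rng = np.random.default_rng(1)
ids = rng.integers(0, n_expt, size=n)
intensity = rng.random(n)

# Sort rows by experiment id so each experiment is one contiguous range.
order = np.argsort(ids, kind="stable")
ids_sorted, intensity_sorted = ids[order], intensity[order]
# offsets[i]:offsets[i + 1] is the row range holding experiment i.
offsets = np.searchsorted(ids_sorted, np.arange(n_expt + 1))

path = os.path.join(tempfile.mkdtemp(), "chunked.hdf5")
with h5py.File(path, "w") as f:
    refl = f.create_group("refl")
    refl.create_dataset("id", data=ids_sorted, chunks=(1024,))
    refl.create_dataset("intensity.sum.value", data=intensity_sorted, chunks=(1024,))
    refl.create_dataset("id_offsets", data=offsets)  # hypothetical index

experiment_id = 3
with h5py.File(path, "r") as f:
    start, stop = (int(v) for v in f["refl/id_offsets"][experiment_id : experiment_id + 2])
    # Slicing the dataset reads only the chunks overlapping [start, stop).
    subset = f["refl/intensity.sum.value"][start:stop]

print(start, stop, subset.size)
```

The chunk size of 1024 rows is arbitrary and would need tuning against real access patterns. The approach also assumes tables are written once in sorted order, which sidesteps the harder question of on-demand writes.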

graeme-winter May 13, 2021
Maintainer Author

Database back end is not compatible with the average user 🙄
