Add dials.filter_blanks #2232

graeme-winter · 2022-09-29T08:05:25Z

New program filter_blanks

To remove blank images from scans in an automated way to allow use within a pipeline. Fixes #2231. Chose not to modify detect_blanks as this is a substantial API change

biochem-fan · 2022-09-29T08:31:04Z

Thanks! I am running this on my 460 crystal MicroED dataset.

Test1: on integrated.{expt,refl}
Test2: on imported.expt/strong.refl

biochem-fan · 2022-09-29T08:34:34Z

        # FIXME find a more graceful way of handling this...
        if len(valid_ranges) > 1:
            sys.exit("Currently multiple blank regions unsupported")

At the moment many crystals are rejected by this line, although probably such crystals are not good anyway (i.e. very weak).
How about selecting the largest valid range?

graeme-winter · 2022-09-29T08:36:44Z

        # FIXME find a more graceful way of handling this...
        if len(valid_ranges) > 1:
            sys.exit("Currently multiple blank regions unsupported")

At the moment many crystals are rejected by this line. How about selecting the largest valid range?

Ah, no, I will need to find a proper fix here then. I was not sure how to wrangle the experiment identifiers. I will contemplate some more.

biochem-fan · 2022-09-29T08:47:47Z

Test1: This is already proving useful!

As you can see, weird R factors in some shells (0.92 - 0.90 and 0.88 - 0.86) are gone.

Note: the number of crystals are slightly different due to the above issue.

After filtering: (39 crystals)

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  6.70   2.15   3488    302   11.55  96.49       5.0    41.6    0.177    0.185    0.053    0.070   0.996*  -0.007
  2.15   1.71   4371    308   14.19 100.00       2.2    23.8    0.242    0.252    0.066    0.096   0.990*  -0.230
  1.71   1.50   4969    312   15.93 100.00       1.2    15.0    0.306    0.316    0.077    0.126   0.985*  -0.104
  1.50   1.36   4793    312   15.36 100.00       0.8     9.7    0.414    0.427    0.105    0.169   0.975*  -0.061
  1.36   1.27   5123    323   15.86 100.00       0.9     9.4    0.359    0.371    0.089    0.170   0.982*  -0.056
  1.27   1.19   4447    296   15.02 100.00       0.9     8.4    0.335    0.346    0.086    0.143   0.988*  -0.314
  1.19   1.13   5239    315   16.63 100.00       0.6     6.8    0.414    0.427    0.101    0.167   0.985*  -0.046
  1.13   1.08   5166    320   16.14 100.00       0.5     5.0    0.467    0.482    0.115    0.216   0.930*  -0.106
  1.08   1.04   4716    307   15.36 100.00       0.4     4.1    0.573    0.591    0.140    0.208   0.952*  -0.021
  1.04   1.01   4483    301   14.89 100.00       0.3     2.9    0.768    0.793    0.192    0.289   0.928*   0.019
  1.01   0.98   5242    320   16.38 100.00       0.3     2.6    0.759    0.782    0.182    0.302   0.895*   0.030
  0.98   0.95   5227    313   16.70 100.00       0.2     2.0    0.899    0.927    0.215    0.421   0.845*   0.046
  0.95   0.92   5226    329   15.88 100.00       0.2     1.7    1.122    1.157    0.273    0.406   0.826*  -0.194
  0.92   0.90   4288    288   14.89 100.00       0.1     1.2    1.636    1.687    0.399    0.470   0.729*   0.036
  0.90   0.88   4558    301   15.14 100.00       0.1     1.1    1.338    1.383    0.337    0.574   0.743*  -0.017
  0.88   0.86   4554    319   14.28  99.38       0.2     1.0    1.545    1.600    0.398    0.564   0.756*  -0.049
  0.86   0.84   3972    300   13.24  92.02       0.2     0.9    1.788    1.879    0.524    1.127   0.440*   0.068
  0.84   0.83   3373    275   12.27  84.88       0.2     0.7    1.489    1.563    0.420    0.914   0.756*   0.001
  0.83   0.81   2420    237   10.21  76.21       0.4     0.6    1.191    1.409    0.666    2.940   0.305*   0.017
  0.81   0.80   2104    197   10.68  66.55       0.3     0.8    1.410    1.554    0.566    1.484   0.456*  -0.092
  6.70   0.80  87759   5975   14.69  95.81       0.8     7.1    0.414    0.430    0.111    0.234   0.695*   0.027

Before filtering: (43 crystals)

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  7.31   2.15   4188    299   14.01  96.76       4.7    42.3    0.250    0.261    0.072    0.082   0.991*  -0.075
  2.15   1.72   5475    304   18.01 100.00       2.0    25.0    0.354    0.365    0.087    0.083   0.992*  -0.335
  1.72   1.50   6364    319   19.95 100.00       1.1    15.4    0.431    0.442    0.100    0.114   0.986*  -0.153
  1.50   1.37   5701    299   19.07 100.00       0.7     9.5    0.643    0.661    0.152    0.166   0.980*  -0.122
  1.37   1.27   6550    327   20.03 100.00       0.7     9.2    0.604    0.620    0.138    0.166   0.979*  -0.275
  1.27   1.19   5636    298   18.91 100.00       0.7     8.4    0.608    0.625    0.144    0.133   0.984*  -0.216
  1.19   1.13   6099    307   19.87 100.00       0.6     6.9    0.643    0.661    0.148    0.176   0.982*  -0.059
  1.13   1.09   6554    324   20.23 100.00       0.4     5.0    0.992    1.018    0.227    0.205   0.945*  -0.349
  1.09   1.04   5777    303   19.07 100.00       0.3     4.1    1.025    1.054    0.242    0.232   0.955*  -0.197
  1.04   1.01   5859    303   19.34 100.00       0.3     3.1    1.157    1.188    0.268    0.253   0.933*  -0.103
  1.01   0.98   5618    293   19.17 100.00       0.2     2.4    1.473    1.516    0.352    0.324   0.836*   0.196*
  0.98   0.95   6767    333   20.32 100.00       0.2     2.0    1.803    1.852    0.419    0.393   0.872*  -0.061
  0.95   0.92   6060    309   19.61 100.00       0.1     1.6    1.794    1.844    0.420    0.488   0.781*  -0.098
  0.92   0.90   5828    306   19.05 100.00       0.1     1.2  -47.824  -49.218  -11.398    0.487   0.716*  -0.111
  0.90   0.88   6080    312   19.49 100.00       0.1     1.1    8.718    8.955    2.020    0.508   0.782*  -0.007
  0.88   0.86   5140    280   18.36 100.00       0.1     0.9   10.552   10.864    2.545    0.633   0.546*  -0.127
  0.86   0.84   5611    316   17.76  95.47       0.1     0.8    3.037    3.129    0.739    0.646   0.849*   0.073
  0.84   0.83   4852    282   17.21  88.12       0.1     0.7    5.649    5.851    1.449    1.025   0.412*   0.249*
  0.83   0.81   3429    233   14.72  77.93       0.2     0.7    6.545    6.810    1.748    0.711   0.800*  -0.004
  0.81   0.80   2981    214   13.93  67.30       0.4     0.7    4.747    4.951    1.306    1.075   0.231*  -0.234
  7.30   0.80 110569   5961   18.55  96.22       0.7     7.2    0.709    0.731    0.172    0.166   0.982*  -0.120

graeme-winter · 2022-09-29T09:16:32Z

@biochem-fan please try again - I think I have a fix in for > 1 non-blank region but would welcome verification

graeme-winter · 2022-09-29T09:46:30Z

Anyone who looks at this please could you check the indexing around experiments in the list and refl["id"]

biochem-fan · 2022-09-29T10:20:47Z

Test2 (filtering on strong.refl): 40 crystals

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  6.23   2.14   3430    292   11.75  98.65       4.7    42.4    0.122    0.128    0.037    0.081   0.977*  -0.006
  2.14   1.71   4986    307   16.24 100.00       2.2    25.9    0.189    0.196    0.049    0.078   0.994*   0.647*
  1.71   1.50   5129    302   16.98 100.00       1.2    15.2    0.249    0.257    0.062    0.122   0.989*  -0.326
  1.50   1.36   5177    305   16.97 100.00       0.8    10.3    0.367    0.379    0.091    0.160   0.979*   0.019
  1.36   1.27   4835    289   16.73 100.00       0.7     8.4    0.360    0.372    0.090    0.152   0.987*  -0.144
  1.27   1.19   5450    308   17.69 100.00       0.8     9.9    0.305    0.314    0.074    0.143   0.989*  -0.074
  1.19   1.13   5240    307   17.07 100.00       0.6     7.1    0.393    0.405    0.097    0.166   0.985*  -0.088
  1.13   1.08   4703    273   17.23 100.00       0.5     5.4    0.484    0.499    0.120    0.175   0.939*  -0.110
  1.08   1.04   5544    314   17.66 100.00       0.3     4.1    0.662    0.683    0.164    0.240   0.938*  -0.003
  1.04   1.01   5718    321   17.81 100.00       0.3     4.2    0.580    0.598    0.140    0.230   0.950*  -0.034
  1.01   0.98   4956    291   17.03 100.00       0.2     2.5    1.065    1.098    0.261    0.332   0.867*  -0.063
  0.98   0.95   5243    301   17.42 100.00       0.2     2.2    0.981    1.011    0.238    0.389   0.836*  -0.019
  0.95   0.92   4653    273   17.04 100.00       0.1     1.6    1.535    1.580    0.370    0.460   0.757*  -0.011
  0.92   0.90   5865    318   18.44 100.00       0.1     1.6    1.325    1.363    0.313    0.457   0.703*  -0.030
  0.90   0.88   5366    306   17.54 100.00       0.1     1.1    1.963    2.025    0.483    0.556   0.543*   0.033
  0.88   0.86   5699    315   18.09 100.00       0.1     1.0    2.055    2.114    0.489    0.610   0.506*   0.016
  0.86   0.84   5033    290   17.36 100.00       0.1     0.8    2.068    2.132    0.507    0.631   0.605*   0.051
  0.84   0.83   5354    311   17.22 100.00       0.1     0.8    2.597    2.676    0.633    0.691   0.454*   0.019
  0.83   0.81   4475    251   17.83 100.00       0.1     0.7    2.682    2.760    0.640    0.734   0.414*   0.025
  0.81   0.80   4792    311   15.41  99.68       0.1     0.6    3.708    3.834    0.955    0.871   0.308*   0.021
  6.23   0.80 101648   5985   16.98  99.92       0.7     7.3    0.416    0.430    0.104    0.150   0.985*   0.159*

Not as good as Test 1 (filtering on integrated.refl), perhaps because spot finding is less robust than integration.

I will re-run my tests on your new code.

biochem-fan · 2022-09-29T10:24:58Z

Those with multiple blank ranges are mostly rubbish.

20 spots found on 38 images (max 3 / bin)
          *                *          
          *                *          
          *                *          
          *          *     *          
          *          *     *          
          *          *     *          
          *          *     *          
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
2               image               39

--------------------------------------------------------------------------------
Saved 20 reflections to strong.refl

Potential blank images: 16 -> 19
Potential blank images: 34 -> 37
Potential blank images: 40 -> 52

Perhaps having a filter (in the pipeline, not necessarily within this command) by the number of spots and/or the number of frames within a valid range might be useful.

graeme-winter · 2022-09-29T10:28:28Z

Those with multiple blank ranges are mostly rubbish.

20 spots found on 38 images (max 3 / bin)
          *                *          
          *                *          
          *                *          
          *          *     *          
          *          *     *          
          *          *     *          
          *          *     *          
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
2               image               39

--------------------------------------------------------------------------------
Saved 20 reflections to strong.refl

Potential blank images: 16 -> 19
Potential blank images: 34 -> 37
Potential blank images: 40 -> 52

Perhaps having a filter (in the pipeline, not necessarily within this command) by the number of spots and/or the number of frames within a valid range might be useful.

For example, perhaps min_total_reflections? 🙂

biochem-fan · 2022-09-29T11:29:03Z

dials.scale is unhappy with the output.

Checking for the existence of a reflection table 
containing multiple datasets 

Detected existence of a multi-dataset reflection table 
containing 2 datasets. 

Found 2 reflection tables & 2 experiments in total.
ValueError: Corrupted identifiers, please check input: in reflections: [], in experiment: fc59ee5d-71d3-e275-b5d3-de8472b9bc8b

During handling of the above exception, another exception occurred:

Sorry: Corrupted identifiers, please check input: in reflections: [], in experiment: fc59ee5d-71d3-e275-b5d3-de8472b9bc8b

In this case, 3 frames in the middle were removed, which I think are false negatives.

220 spots found on 48 images (max 15 / bin)
    *                                           
    *       *                                   
*   *       *                                   
*   *  *  * **                                  
*   *  *  * **   *                              
*  **  ** ****  **    *       *                 
** *** ** ****  **    *       *                 
******************   **** *****  **             
*******************  **** *****  ***            
******************** ********** ********   * * *
1                    image                    48

--------------------------------------------------------------------------------
Saved 220 reflections to strong.refl

Potential blank images: 32 -> 35

biochem-fan · 2022-09-29T11:32:42Z

An option to trim only at the start and the end might be good (just as in the delta cc half filter in dials.scale).

graeme-winter · 2022-09-29T11:34:28Z

I think I need to check the workings & make sure that the experiment identifier assignment is being handled correctly: quite possibly something going quite wrong here.

codecov · 2022-09-29T12:51:32Z

Codecov Report

Merging #2232 (b65599c) into main (141e4fa) will decrease coverage by 0.04%.
The diff coverage is 76.78%.

@@            Coverage Diff             @@
##             main    #2232      +/-   ##
==========================================
- Coverage   80.51%   80.46%   -0.05%     
==========================================
  Files         587      589       +2     
  Lines       66907    67019     +112     
  Branches     8896     8918      +22     
==========================================
+ Hits        53868    53927      +59     
- Misses      10977    11019      +42     
- Partials     2062     2073      +11

graeme-winter · 2022-09-29T12:58:58Z

Just created myself a slightly artificial example case with this script

from dials.array_family import flex

data = flex.reflection_table.from_file("strong.refl")
z = data["xyzobs.px.value"].parts()[2]
data = data.select((z < 300) | (z > 550))
data.as_file("slice.refl")

and everything did appear to work as I expect -> I will have a closer look to make sure I have all my sums right. There was an error in one of the commits where I mis-assigned an experiment ID

graeme-winter · 2022-09-29T13:01:16Z

Also:

DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 3.dev.858-gea3a72878
The following parameters have been modified:
input {
  experiments = symmetrized.expt
  reflections = symmetrized.refl
}

Checking for the existence of a reflection table 
containing multiple datasets 

Detected existence of a multi-dataset reflection table 
containing 4 datasets. 

Found 4 reflection tables & 4 experiments in total.

Dataset ids are: 0,1,2,3

I will keep checking

biochem-fan · 2022-12-02T06:34:43Z

@graeme-winter
I realized that sometimes a scan can be trimmed at earlier steps.

For example, dials.refine may say:

WARN: The reflections for experiment 0 do not fill the scan range.
The scan will be trimmed to images {7,51} to match the range of observed data

In this case, scan.get_array_range at

dials/src/dials/command_line/filter_blanks.py

Line 121 in b65599c

valid = array("b", [1 for _ in range(*scan.get_array_range())])

returns (6, 51) and the valid array has the size of 45.

However the loop below sets flags at the original index, invalidating wrong frames and causing out-of-bound errors when the frames at the end are trimmed. See

dials/src/dials/command_line/filter_blanks.py

Lines 136 to 139 in b65599c

    
           for blank_start, blank_end in strong_results["blank_regions"]: 
        
               logger.info(f"Potential blank images: {blank_start + 1} -> {blank_end}") 
        
               for j in range(blank_start, blank_end): 
        
                   valid[j] = 0

and another loops below.

Replacing the array initialization to below seems to fix the issue but could you please look at this?
I am not sure whether values are 0-indexed or 1-indexed (see the discrepancy above: dials.refine says {7, 51} while get_array_range returns (6, 51)) and whether values are inclusive or exclusive.

        valid = array("b", [0 for _ in range(scan.get_array_range()[1])])
        for i in range(*scan.get_array_range()):
            valid[i] = 1

dagewa · 2022-12-02T10:20:48Z

@biochem-fan this is relevant, though dismal, reading about the array/image range idiosyncrasies in dxtbx and DIALS cctbx/dxtbx#186

dagewa · 2022-12-02T10:45:30Z

In particular, the difference between these concepts is not so much the difference between 0- and 1-based indexing, but rather whether you are labelling images with ordinal numbers, or considering a "z position" in a scan. It turns out that the array_range is always equal to image_range[0] - 1, image_range[1]. It has nothing to do with arrays. If your scan starts at image 50, say, then the lower bound of array_range will be 49, not 0.

Clearly it seems I gave up with cctbx/dxtbx#186, but if I get a clear run at it I would like to pick it up again.

biochem-fan · 2022-12-05T08:35:26Z

Hmm, this is very confusing.

What is the index for slicing, then? My above fix still gives out-of-errors in some cases.

dials/src/dials/command_line/filter_blanks.py

Line 178 in b65599c

_expt.scan = _expt.scan[start:end]

dagewa · 2022-12-05T13:44:41Z

I think scan slicing is done by ordinal image number. I'm not completely sure, but that's the impression I got from https://github.com/cctbx/dxtbx/blob/bb5f2e18067e0963f0beff6b449afaccf4cd0e2f/src/dxtbx/model/boost_python/scan.cc#L325.

In that case, perhaps the problem here is that start and end are ultimately sourced from array ranges, not image ranges.

graeme-winter added 3 commits September 28, 2022 16:47

Implement as filter blanks to allow a new API

b66ce81

New program filter_blanks

2fb27e0

To remove blank images from scans in an automated way to allow use within a pipeline. Fixes #2231. Chose not to modify detect_blanks as this is a substantial API change

News

db22190

Add test

f025f53

Cope with multiple blank regions

12c2b05

graeme-winter marked this pull request as ready for review September 29, 2022 09:17

DiamondLightSource-build-server and others added 4 commits September 29, 2022 09:17

Rename newsfragments/XXX.feature to newsfragments/2232.feature

95a333b

min_total_reflections

e7d5ba5

min_total_reflections redux

a92feeb

Counting is hard

ea3a728

graeme-winter requested a review from jbeilstenedmands September 29, 2022 09:57

graeme-winter and others added 2 commits October 3, 2022 08:41

If indexed, remove unindexed

d07a76f

Merge branch 'main' into blank-refactor

b65599c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dials.filter_blanks #2232

Add dials.filter_blanks #2232

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 •

edited

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 •

edited

graeme-winter commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 •

edited

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 •

edited

biochem-fan commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

codecov bot commented Sep 29, 2022 •

edited

graeme-winter commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

biochem-fan commented Dec 2, 2022

dagewa commented Dec 2, 2022

dagewa commented Dec 2, 2022

biochem-fan commented Dec 5, 2022

dagewa commented Dec 5, 2022

Add dials.filter_blanks #2232

Are you sure you want to change the base?

Add dials.filter_blanks #2232

Conversation

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 • edited

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 • edited

graeme-winter commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 • edited

graeme-winter commented Sep 29, 2022

biochem-fan commented Sep 29, 2022 • edited

biochem-fan commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

codecov bot commented Sep 29, 2022 • edited

Codecov Report

graeme-winter commented Sep 29, 2022

graeme-winter commented Sep 29, 2022

biochem-fan commented Dec 2, 2022

dagewa commented Dec 2, 2022

dagewa commented Dec 2, 2022

biochem-fan commented Dec 5, 2022

dagewa commented Dec 5, 2022

biochem-fan commented Sep 29, 2022 •

edited

biochem-fan commented Sep 29, 2022 •

edited

biochem-fan commented Sep 29, 2022 •

edited

biochem-fan commented Sep 29, 2022 •

edited

codecov bot commented Sep 29, 2022 •

edited