Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dials.filter_blanks #2232

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Add dials.filter_blanks #2232

wants to merge 11 commits into from

Conversation

graeme-winter
Copy link
Contributor

New program filter_blanks

To remove blank images from scans in an automated way to allow use within a pipeline. Fixes #2231. Chose not to modify detect_blanks as this is a substantial API change

To remove blank images from scans in an automated way to allow use within a
pipeline. Fixes #2231. Chose not to modify detect_blanks as this is a
substantial API change
@biochem-fan
Copy link
Member

Thanks! I am running this on my 460 crystal MicroED dataset.

Test1: on integrated.{expt,refl}
Test2: on imported.expt/strong.refl

@biochem-fan
Copy link
Member

biochem-fan commented Sep 29, 2022

        # FIXME find a more graceful way of handling this...
        if len(valid_ranges) > 1:
            sys.exit("Currently multiple blank regions unsupported")

At the moment many crystals are rejected by this line, although probably such crystals are not good anyway (i.e. very weak).
How about selecting the largest valid range?

@graeme-winter
Copy link
Contributor Author

        # FIXME find a more graceful way of handling this...
        if len(valid_ranges) > 1:
            sys.exit("Currently multiple blank regions unsupported")

At the moment many crystals are rejected by this line. How about selecting the largest valid range?

Ah, no, I will need to find a proper fix here then. I was not sure how to wrangle the experiment identifiers. I will contemplate some more.

@biochem-fan
Copy link
Member

biochem-fan commented Sep 29, 2022

Test1: This is already proving useful!

As you can see, weird R factors in some shells (0.92 - 0.90 and 0.88 - 0.86) are gone.

Note: the number of crystals are slightly different due to the above issue.

After filtering: (39 crystals)

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  6.70   2.15   3488    302   11.55  96.49       5.0    41.6    0.177    0.185    0.053    0.070   0.996*  -0.007
  2.15   1.71   4371    308   14.19 100.00       2.2    23.8    0.242    0.252    0.066    0.096   0.990*  -0.230
  1.71   1.50   4969    312   15.93 100.00       1.2    15.0    0.306    0.316    0.077    0.126   0.985*  -0.104
  1.50   1.36   4793    312   15.36 100.00       0.8     9.7    0.414    0.427    0.105    0.169   0.975*  -0.061
  1.36   1.27   5123    323   15.86 100.00       0.9     9.4    0.359    0.371    0.089    0.170   0.982*  -0.056
  1.27   1.19   4447    296   15.02 100.00       0.9     8.4    0.335    0.346    0.086    0.143   0.988*  -0.314
  1.19   1.13   5239    315   16.63 100.00       0.6     6.8    0.414    0.427    0.101    0.167   0.985*  -0.046
  1.13   1.08   5166    320   16.14 100.00       0.5     5.0    0.467    0.482    0.115    0.216   0.930*  -0.106
  1.08   1.04   4716    307   15.36 100.00       0.4     4.1    0.573    0.591    0.140    0.208   0.952*  -0.021
  1.04   1.01   4483    301   14.89 100.00       0.3     2.9    0.768    0.793    0.192    0.289   0.928*   0.019
  1.01   0.98   5242    320   16.38 100.00       0.3     2.6    0.759    0.782    0.182    0.302   0.895*   0.030
  0.98   0.95   5227    313   16.70 100.00       0.2     2.0    0.899    0.927    0.215    0.421   0.845*   0.046
  0.95   0.92   5226    329   15.88 100.00       0.2     1.7    1.122    1.157    0.273    0.406   0.826*  -0.194
  0.92   0.90   4288    288   14.89 100.00       0.1     1.2    1.636    1.687    0.399    0.470   0.729*   0.036
  0.90   0.88   4558    301   15.14 100.00       0.1     1.1    1.338    1.383    0.337    0.574   0.743*  -0.017
  0.88   0.86   4554    319   14.28  99.38       0.2     1.0    1.545    1.600    0.398    0.564   0.756*  -0.049
  0.86   0.84   3972    300   13.24  92.02       0.2     0.9    1.788    1.879    0.524    1.127   0.440*   0.068
  0.84   0.83   3373    275   12.27  84.88       0.2     0.7    1.489    1.563    0.420    0.914   0.756*   0.001
  0.83   0.81   2420    237   10.21  76.21       0.4     0.6    1.191    1.409    0.666    2.940   0.305*   0.017
  0.81   0.80   2104    197   10.68  66.55       0.3     0.8    1.410    1.554    0.566    1.484   0.456*  -0.092
  6.70   0.80  87759   5975   14.69  95.81       0.8     7.1    0.414    0.430    0.111    0.234   0.695*   0.027

Before filtering: (43 crystals)

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  7.31   2.15   4188    299   14.01  96.76       4.7    42.3    0.250    0.261    0.072    0.082   0.991*  -0.075
  2.15   1.72   5475    304   18.01 100.00       2.0    25.0    0.354    0.365    0.087    0.083   0.992*  -0.335
  1.72   1.50   6364    319   19.95 100.00       1.1    15.4    0.431    0.442    0.100    0.114   0.986*  -0.153
  1.50   1.37   5701    299   19.07 100.00       0.7     9.5    0.643    0.661    0.152    0.166   0.980*  -0.122
  1.37   1.27   6550    327   20.03 100.00       0.7     9.2    0.604    0.620    0.138    0.166   0.979*  -0.275
  1.27   1.19   5636    298   18.91 100.00       0.7     8.4    0.608    0.625    0.144    0.133   0.984*  -0.216
  1.19   1.13   6099    307   19.87 100.00       0.6     6.9    0.643    0.661    0.148    0.176   0.982*  -0.059
  1.13   1.09   6554    324   20.23 100.00       0.4     5.0    0.992    1.018    0.227    0.205   0.945*  -0.349
  1.09   1.04   5777    303   19.07 100.00       0.3     4.1    1.025    1.054    0.242    0.232   0.955*  -0.197
  1.04   1.01   5859    303   19.34 100.00       0.3     3.1    1.157    1.188    0.268    0.253   0.933*  -0.103
  1.01   0.98   5618    293   19.17 100.00       0.2     2.4    1.473    1.516    0.352    0.324   0.836*   0.196*
  0.98   0.95   6767    333   20.32 100.00       0.2     2.0    1.803    1.852    0.419    0.393   0.872*  -0.061
  0.95   0.92   6060    309   19.61 100.00       0.1     1.6    1.794    1.844    0.420    0.488   0.781*  -0.098
  0.92   0.90   5828    306   19.05 100.00       0.1     1.2  -47.824  -49.218  -11.398    0.487   0.716*  -0.111
  0.90   0.88   6080    312   19.49 100.00       0.1     1.1    8.718    8.955    2.020    0.508   0.782*  -0.007
  0.88   0.86   5140    280   18.36 100.00       0.1     0.9   10.552   10.864    2.545    0.633   0.546*  -0.127
  0.86   0.84   5611    316   17.76  95.47       0.1     0.8    3.037    3.129    0.739    0.646   0.849*   0.073
  0.84   0.83   4852    282   17.21  88.12       0.1     0.7    5.649    5.851    1.449    1.025   0.412*   0.249*
  0.83   0.81   3429    233   14.72  77.93       0.2     0.7    6.545    6.810    1.748    0.711   0.800*  -0.004
  0.81   0.80   2981    214   13.93  67.30       0.4     0.7    4.747    4.951    1.306    1.075   0.231*  -0.234
  7.30   0.80 110569   5961   18.55  96.22       0.7     7.2    0.709    0.731    0.172    0.166   0.982*  -0.120

@graeme-winter
Copy link
Contributor Author

@biochem-fan please try again - I think I have a fix in for > 1 non-blank region but would welcome verification

@graeme-winter graeme-winter marked this pull request as ready for review September 29, 2022 09:17
@graeme-winter
Copy link
Contributor Author

Anyone who looks at this please could you check the indexing around experiments in the list and refl["id"]

@biochem-fan
Copy link
Member

Test2 (filtering on strong.refl): 40 crystals

 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   r_anom   cc1/2   cc_ano
  6.23   2.14   3430    292   11.75  98.65       4.7    42.4    0.122    0.128    0.037    0.081   0.977*  -0.006
  2.14   1.71   4986    307   16.24 100.00       2.2    25.9    0.189    0.196    0.049    0.078   0.994*   0.647*
  1.71   1.50   5129    302   16.98 100.00       1.2    15.2    0.249    0.257    0.062    0.122   0.989*  -0.326
  1.50   1.36   5177    305   16.97 100.00       0.8    10.3    0.367    0.379    0.091    0.160   0.979*   0.019
  1.36   1.27   4835    289   16.73 100.00       0.7     8.4    0.360    0.372    0.090    0.152   0.987*  -0.144
  1.27   1.19   5450    308   17.69 100.00       0.8     9.9    0.305    0.314    0.074    0.143   0.989*  -0.074
  1.19   1.13   5240    307   17.07 100.00       0.6     7.1    0.393    0.405    0.097    0.166   0.985*  -0.088
  1.13   1.08   4703    273   17.23 100.00       0.5     5.4    0.484    0.499    0.120    0.175   0.939*  -0.110
  1.08   1.04   5544    314   17.66 100.00       0.3     4.1    0.662    0.683    0.164    0.240   0.938*  -0.003
  1.04   1.01   5718    321   17.81 100.00       0.3     4.2    0.580    0.598    0.140    0.230   0.950*  -0.034
  1.01   0.98   4956    291   17.03 100.00       0.2     2.5    1.065    1.098    0.261    0.332   0.867*  -0.063
  0.98   0.95   5243    301   17.42 100.00       0.2     2.2    0.981    1.011    0.238    0.389   0.836*  -0.019
  0.95   0.92   4653    273   17.04 100.00       0.1     1.6    1.535    1.580    0.370    0.460   0.757*  -0.011
  0.92   0.90   5865    318   18.44 100.00       0.1     1.6    1.325    1.363    0.313    0.457   0.703*  -0.030
  0.90   0.88   5366    306   17.54 100.00       0.1     1.1    1.963    2.025    0.483    0.556   0.543*   0.033
  0.88   0.86   5699    315   18.09 100.00       0.1     1.0    2.055    2.114    0.489    0.610   0.506*   0.016
  0.86   0.84   5033    290   17.36 100.00       0.1     0.8    2.068    2.132    0.507    0.631   0.605*   0.051
  0.84   0.83   5354    311   17.22 100.00       0.1     0.8    2.597    2.676    0.633    0.691   0.454*   0.019
  0.83   0.81   4475    251   17.83 100.00       0.1     0.7    2.682    2.760    0.640    0.734   0.414*   0.025
  0.81   0.80   4792    311   15.41  99.68       0.1     0.6    3.708    3.834    0.955    0.871   0.308*   0.021
  6.23   0.80 101648   5985   16.98  99.92       0.7     7.3    0.416    0.430    0.104    0.150   0.985*   0.159*

Not as good as Test 1 (filtering on integrated.refl), perhaps because spot finding is less robust than integration.

I will re-run my tests on your new code.

@biochem-fan
Copy link
Member

biochem-fan commented Sep 29, 2022

Those with multiple blank ranges are mostly rubbish.

20 spots found on 38 images (max 3 / bin)
          *                *          
          *                *          
          *                *          
          *          *     *          
          *          *     *          
          *          *     *          
          *          *     *          
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
2               image               39

--------------------------------------------------------------------------------
Saved 20 reflections to strong.refl

Potential blank images: 16 -> 19
Potential blank images: 34 -> 37
Potential blank images: 40 -> 52

Perhaps having a filter (in the pipeline, not necessarily within this command) by the number of spots and/or the number of frames within a valid range might be useful.

@graeme-winter
Copy link
Contributor Author

Those with multiple blank ranges are mostly rubbish.

20 spots found on 38 images (max 3 / bin)
          *                *          
          *                *          
          *                *          
          *          *     *          
          *          *     *          
          *          *     *          
          *          *     *          
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
* *  *  * ***    *   * *   ****      *
2               image               39

--------------------------------------------------------------------------------
Saved 20 reflections to strong.refl

Potential blank images: 16 -> 19
Potential blank images: 34 -> 37
Potential blank images: 40 -> 52

Perhaps having a filter (in the pipeline, not necessarily within this command) by the number of spots and/or the number of frames within a valid range might be useful.

For example, perhaps min_total_reflections? 🙂

@biochem-fan
Copy link
Member

biochem-fan commented Sep 29, 2022

dials.scale is unhappy with the output.

Checking for the existence of a reflection table 
containing multiple datasets 

Detected existence of a multi-dataset reflection table 
containing 2 datasets. 

Found 2 reflection tables & 2 experiments in total.
ValueError: Corrupted identifiers, please check input: in reflections: [], in experiment: fc59ee5d-71d3-e275-b5d3-de8472b9bc8b

During handling of the above exception, another exception occurred:

Sorry: Corrupted identifiers, please check input: in reflections: [], in experiment: fc59ee5d-71d3-e275-b5d3-de8472b9bc8b

In this case, 3 frames in the middle were removed, which I think are false negatives.

220 spots found on 48 images (max 15 / bin)
    *                                           
    *       *                                   
*   *       *                                   
*   *  *  * **                                  
*   *  *  * **   *                              
*  **  ** ****  **    *       *                 
** *** ** ****  **    *       *                 
******************   **** *****  **             
*******************  **** *****  ***            
******************** ********** ********   * * *
1                    image                    48

--------------------------------------------------------------------------------
Saved 220 reflections to strong.refl

Potential blank images: 32 -> 35

@biochem-fan
Copy link
Member

An option to trim only at the start and the end might be good (just as in the delta cc half filter in dials.scale).

@graeme-winter
Copy link
Contributor Author

I think I need to check the workings & make sure that the experiment identifier assignment is being handled correctly: quite possibly something going quite wrong here.

@codecov
Copy link

codecov bot commented Sep 29, 2022

Codecov Report

Merging #2232 (b65599c) into main (141e4fa) will decrease coverage by 0.04%.
The diff coverage is 76.78%.

@@            Coverage Diff             @@
##             main    #2232      +/-   ##
==========================================
- Coverage   80.51%   80.46%   -0.05%     
==========================================
  Files         587      589       +2     
  Lines       66907    67019     +112     
  Branches     8896     8918      +22     
==========================================
+ Hits        53868    53927      +59     
- Misses      10977    11019      +42     
- Partials     2062     2073      +11     

@graeme-winter
Copy link
Contributor Author

Just created myself a slightly artificial example case with this script

from dials.array_family import flex

data = flex.reflection_table.from_file("strong.refl")
z = data["xyzobs.px.value"].parts()[2]
data = data.select((z < 300) | (z > 550))
data.as_file("slice.refl")

and everything did appear to work as I expect -> I will have a closer look to make sure I have all my sums right. There was an error in one of the commits where I mis-assigned an experiment ID

@graeme-winter
Copy link
Contributor Author

Also:

DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 3.dev.858-gea3a72878
The following parameters have been modified:
input {
  experiments = symmetrized.expt
  reflections = symmetrized.refl
}

Checking for the existence of a reflection table 
containing multiple datasets 

Detected existence of a multi-dataset reflection table 
containing 4 datasets. 

Found 4 reflection tables & 4 experiments in total.

Dataset ids are: 0,1,2,3 

I will keep checking

@biochem-fan
Copy link
Member

@graeme-winter
I realized that sometimes a scan can be trimmed at earlier steps.

For example, dials.refine may say:

WARN: The reflections for experiment 0 do not fill the scan range.
The scan will be trimmed to images {7,51} to match the range of observed data

In this case, scan.get_array_range at

valid = array("b", [1 for _ in range(*scan.get_array_range())])
returns (6, 51) and the valid array has the size of 45.

However the loop below sets flags at the original index, invalidating wrong frames and causing out-of-bound errors when the frames at the end are trimmed. See

for blank_start, blank_end in strong_results["blank_regions"]:
logger.info(f"Potential blank images: {blank_start + 1} -> {blank_end}")
for j in range(blank_start, blank_end):
valid[j] = 0
and another loops below.

Replacing the array initialization to below seems to fix the issue but could you please look at this?
I am not sure whether values are 0-indexed or 1-indexed (see the discrepancy above: dials.refine says {7, 51} while get_array_range returns (6, 51)) and whether values are inclusive or exclusive.

        valid = array("b", [0 for _ in range(scan.get_array_range()[1])])
        for i in range(*scan.get_array_range()):
            valid[i] = 1

@dagewa
Copy link
Member

dagewa commented Dec 2, 2022

@biochem-fan this is relevant, though dismal, reading about the array/image range idiosyncrasies in dxtbx and DIALS cctbx/dxtbx#186

@dagewa
Copy link
Member

dagewa commented Dec 2, 2022

In particular, the difference between these concepts is not so much the difference between 0- and 1-based indexing, but rather whether you are labelling images with ordinal numbers, or considering a "z position" in a scan. It turns out that the array_range is always equal to image_range[0] - 1, image_range[1]. It has nothing to do with arrays. If your scan starts at image 50, say, then the lower bound of array_range will be 49, not 0.

Clearly it seems I gave up with cctbx/dxtbx#186, but if I get a clear run at it I would like to pick it up again.

@biochem-fan
Copy link
Member

Hmm, this is very confusing.

What is the index for slicing, then? My above fix still gives out-of-errors in some cases.

_expt.scan = _expt.scan[start:end]

@dagewa
Copy link
Member

dagewa commented Dec 5, 2022

I think scan slicing is done by ordinal image number. I'm not completely sure, but that's the impression I got from https://github.com/cctbx/dxtbx/blob/bb5f2e18067e0963f0beff6b449afaccf4cd0e2f/src/dxtbx/model/boost_python/scan.cc#L325.

In that case, perhaps the problem here is that start and end are ultimately sourced from array ranges, not image ranges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dials.detect_blanks: add option to filter output
5 participants