
List-mode objective function: minor refactor and extra functionality #1418

Merged: 15 commits into UCL:master on May 14, 2024

Conversation

KrisThielemans (Collaborator)

This PR adds the objective-function value as well as the Hessian to the list-mode objective function. To avoid duplication between the cached and non-cached versions, the plan is as follows (a sketch of the intended structure follows the list):

  1. split reading of a "batch" in a separate function
  2. rewrite gradient to always use LM_distributable_computation by reading the batch either from cache or from list-mode.
  3. add function for objective function
  4. add Hessian
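A minimal sketch of the intended structure, with all names hypothetical (this is not the actual STIR API, just an illustration of steps 1 and 2 under those assumptions):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical record type: one "bin + corrections" entry per event.
struct BinAndCorrection
{
  int bin = 0;        // placeholder for the bin coordinates
  float additive = 0; // placeholder for the additive correction
};

// Step 1: reading a "batch" lives in one place, whether it comes
// from the cache or from the list-mode file.
std::vector<BinAndCorrection> read_batch(bool use_cache, std::size_t batch_num)
{
  (void)use_cache; // a real version would branch on cache vs list-mode here
  return std::vector<BinAndCorrection>(10, BinAndCorrection{ static_cast<int>(batch_num), 0.f });
}

// Steps 2-4: one loop, templated on the per-event operation, so the
// gradient, objective-function value and Hessian can all share it.
template <class EventCallback>
void process_all_batches(bool use_cache, std::size_t num_batches, EventCallback callback)
{
  for (std::size_t b = 0; b < num_batches; ++b)
    for (const auto& record : read_batch(use_cache, b))
      callback(record);
}

int main()
{
  double value = 0.;
  process_all_batches(/*use_cache=*/false, /*num_batches=*/3,
                      [&](const BinAndCorrection& r) { value += r.additive; });
  std::cout << "accumulated: " << value << '\n';
}
```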

@gschramm @NikEfth Feedback welcome!

@KrisThielemans (Collaborator Author)

Step 2 committed, but a test is currently failing. Fixing it will have to wait until tomorrow.

@KrisThielemans (Collaborator Author)

@NikEfth Steps 1&2 completed. Variable naming is a bit awkward, but it minimises text changes. Best to compare while ignoring white-space diffs: https://github.com/UCL/STIR/pull/1418/files?diff=unified&w=1

@KrisThielemans (Collaborator Author)

Preparation is nearly done. Results for the gradient are still identical. I plan to template the loop function, and then call it from the gradient, objective-function and Hessian computations.

@KrisThielemans (Collaborator Author)

The Hessian calculation should now be functional, but it is untested at the moment.
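For reference, the quantity being accumulated is the standard Poisson log-likelihood Hessian-times-input (textbook algebra, not something specific to this PR): with $\bar{y} = P\lambda + a$, the measured-data term gives

$$(Hv)_j = -\sum_b y_b \, \frac{(Pv)_b}{\bar{y}_b^{\,2}} \, P_{bj} = -\sum_{\mathrm{events}\ e} \frac{(Pv)_{b(e)}}{\bar{y}_{b(e)}^{\,2}} \, P_{b(e)j},$$

where the second form replaces the binned counts $y_b$ by a sum over the list-mode events $e$ recorded in bin $b(e)$.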

@KrisThielemans (Collaborator Author)

@gschramm conducted some timings via SIRF (but that shouldn't matter). Original post at SyneRBI/SIRF#1253 (comment)

Timing results of sinogram/list-mode gradient and Hessian multiplies for different numbers of subsets

mMR LM file with 4,332,816 counts and acq_model = sirf.STIR.AcquisitionModelUsingRayTracingMatrix() and 1 tangential LOR, sirf.STIR.AcquisitionData.set_storage_scheme("memory")

7 subsets

In [14]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
4.19 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.35 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.18 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.13 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

14 subsets

In [22]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.75 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [23]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
3.07 s ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [26]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.67 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [27]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.89 s ± 68.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

21 subsets

In [32]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.35 s ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [33]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
2.69 s ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.35 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.14 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CPU info

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7662 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         1499.555
CPU max MHz:                     2000.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4000.00
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-63,128-191
NUMA node1 CPU(s):               64-127,192-255

@gschramm: is In [35] still a gradient? (The other subset counts pair each gradient timing with a Hessian timing.)

My interpretation:

  • For projection data, more subsets mean faster execution (fewer projections), until there are so many subsets that parallelisation becomes inefficient.
  • For list-mode data, it could be that reading the list-mode data dominates. I'm not sure, though, why more subsets would slow it down: it always has to read all of the data. It's possible that the accumulation of the output images in LM_distributable_computation matters:
      if (output_image_ptr != NULL)
        {
          for (int i = 0; i < static_cast<int>(local_output_image_sptrs.size()); ++i)
            if (!is_null_ptr(local_output_image_sptrs[i])) // only accumulate if a thread filled something in
              *output_image_ptr += *(local_output_image_sptrs[i]);
        }
    but this doesn't really depend on the number of subsets. (I guess we could use a reduction operation there; a sketch of that idea follows.)
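A minimal sketch of that reduction idea, using a hypothetical Image type (not STIR's image class) and OpenMP 4.0 user-defined reductions:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical minimal image type, standing in for the thread-local output images.
struct Image
{
  std::vector<float> data;
  explicit Image(std::size_t n = 0) : data(n, 0.f) {}
  Image& operator+=(const Image& other)
  {
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] += other.data[i];
    return *this;
  }
};

// User-defined reduction: every thread gets a private zero-initialised copy,
// and OpenMP combines the copies (typically as a tree) at the end of the
// parallel region, instead of the serial accumulation loop quoted above.
#pragma omp declare reduction(imgsum : Image : omp_out += omp_in) \
    initializer(omp_priv = Image(omp_orig.data.size()))

int main()
{
  const int num_events = 1000000;
  Image output(128);

#pragma omp parallel for reduction(imgsum : output)
  for (int e = 0; e < num_events; ++e)
    // a real version would back-project event e; this just touches one voxel
    output.data[e % output.data.size()] += 1.f;

  std::cout << "voxel 0: " << output.data[0] << '\n';
  return 0;
}
```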

Note that LM times could improve when using set_cache_max_size(5000000), which would do some operations upfront, and then re-read things from myCACHE*.bin.
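For completeness, a sketch of what that looks like from C++. The class and method names below are the ones used by STIR's list-mode objective function, but treat the exact header path, template argument and signature as assumptions:

```cpp
#include "stir/DiscretisedDensity.h"
#include "stir/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin.h"

typedef stir::DiscretisedDensity<3, float> TargetT;

// Enable event caching: events are parsed once up-front and later calls
// re-read them from the myCACHE*.bin files.
void enable_lm_cache(
    stir::PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin<TargetT>& lm_obj_fun)
{
  lm_obj_fun.set_cache_max_size(5000000);
}
```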

@KrisThielemans force-pushed the ListModeObjFuncExtras branch 2 times, most recently from e0e4f55 to 1c86e4a on May 13, 2024. The commit messages:
- split caching into separate functions for reading a batch from the
list-mode file or from the cache
- always use LM_distributable_computation for the gradient

The resulting code avoids duplication and is faster when caching is not used.
The code pre-allocated bins for every thread, but local variables
are allocated on the stack, which should take no time. Removing
the pre-allocations simplifies the code and prepares for re-use
in the Hessian calculation etc.
- make the list-mode loop into a templated function
- introduce separate functions for gradient and Hessian
- move the LM_distributable_function implementation to .txx as it now uses
  a template for the callback
- implement actual_accumulate_Hessian... for LM
- try to implement accumulation of log-likelihood

WARNING: LM_distributable_computation is now generic and
incompatible with the previous one (which was only for the gradient)

The Hessian function and log-likelihood value are currently untested.
Also expand the test to use all subsets, making it slower of course.
test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData and test_priors
contained nearly identical code, so it was moved to a new class.

This needed minor modifications to the tests, but also the addition of
GeneralisedObjectiveFunction::compute_value()
- compute_objective_function() is now implemented as well
- Tests are (almost) a duplicate of the ProjData version, but the
  gradient test is disabled as it is too slow for this sample data
Also update to upload-artefacts@v4
@KrisThielemans (Collaborator Author) commented May 14, 2024

The MacOS job fails on the Hessian-concavity test. The output of H*dx is 0 on MacOS, but not on other systems. I have no idea why. Most likely it isn't reading the events properly, yet it does read the correct number of events, which is hard to understand. Here are the failure diagnostics:

----- testing concavity via Hessian-vector product (accumulate_Hessian_times_input)
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.03s, wall-clock 0.037424s
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.04s, wall-clock 0.037748s
Error : 0 is larger than 0, 
FAIL: PoissonLLListModeData: Computation of x^T H x = 0.000000 > 0 (Hessian) and is therefore NOT concave
 >target image max=0.999849
 >target image min=1.16359e-05
 >output image max=0
 >output image min=0
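For context on what the concavity test checks: the same algebra as above gives the quadratic form

$$v^\top H v = -\sum_{\mathrm{events}\ e} \frac{(Pv)_{b(e)}^{2}}{\bar{y}_{b(e)}^{\,2}} \le 0,$$

so getting exactly 0 with a strictly positive input image means $(Pv)_{b(e)}$ is 0 for every recorded event, i.e. the events contribute nothing to the projections, consistent with the event-reading problem suspected above.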

@KrisThielemans (Collaborator Author) commented May 14, 2024

All done, or as far as I can go. The objective function value is also computed and tested.

Note that some of the numerical tests aren't very stringent, and they probably can't be as there are only 8019 prompts in the file.
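For reference, the value being computed is the standard list-mode Poisson log-likelihood (up to constants that do not depend on the image; treat this as a sketch of the maths rather than the exact STIR implementation):

$$L(\lambda) = \sum_{\mathrm{events}\ e} \log \bar{y}_{b(e)} - \sum_b \bar{y}_b, \qquad \bar{y} = P\lambda + a,$$

where the first sum runs over the recorded prompts and the second is the usual sensitivity term.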

I will have to disable the MacOS job and leave fixing it for later.

Cannot figure out what the problem is, but the test works on Ubuntu and Windows
@KrisThielemans KrisThielemans merged commit 4f30cbf into UCL:master May 14, 2024
8 checks passed
@KrisThielemans KrisThielemans deleted the ListModeObjFuncExtras branch May 14, 2024 11:22