
List-mode objective function: minor refactor and extra functionality #1418

Merged: 15 commits into UCL:master on May 14, 2024

Conversation

KrisThielemans (Collaborator)

This PR adds the objective-function value as well as the Hessian to the list-mode objective function. To avoid duplication between the cached and non-cached versions, the plan is as follows (a sketch of the intended structure follows the list):

  1. split reading of a "batch" in a separate function
  2. rewrite gradient to always use LM_distributable_computation by reading the batch either from cache or from list-mode.
  3. add function for objective function
  4. add Hessian
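A minimal sketch of the intended structure, with all names hypothetical (this is not the actual STIR API, just an illustration of steps 1 and 2 under those assumptions):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical record type: one "bin + corrections" entry per event.
struct BinAndCorrection
{
  int bin = 0;        // placeholder for the bin coordinates
  float additive = 0; // placeholder for the additive correction
};

// Step 1: reading a "batch" lives in one place, whether it comes
// from the cache or from the list-mode file.
std::vector<BinAndCorrection> read_batch(bool use_cache, std::size_t batch_num)
{
  (void)use_cache; // a real version would branch on cache vs list-mode here
  return std::vector<BinAndCorrection>(10, BinAndCorrection{ static_cast<int>(batch_num), 0.f });
}

// Steps 2-4: one loop, templated on the per-event operation, so the
// gradient, objective-function value and Hessian can all share it.
template <class EventCallback>
void process_all_batches(bool use_cache, std::size_t num_batches, EventCallback callback)
{
  for (std::size_t b = 0; b < num_batches; ++b)
    for (const auto& record : read_batch(use_cache, b))
      callback(record);
}

int main()
{
  double value = 0.;
  process_all_batches(/*use_cache=*/false, /*num_batches=*/3,
                      [&](const BinAndCorrection& r) { value += r.additive; });
  std::cout << "accumulated: " << value << '\n';
}
```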

@gschramm @NikEfth Feedback welcome!

@KrisThielemans (Collaborator Author)

Step 2 committed, but a test is currently failing. Fixing it will have to wait until tomorrow.

@KrisThielemans (Collaborator Author)

@NikEfth Steps 1&2 completed. Variable naming is a bit awkward, but it minimises text changes. Best to compare while ignoring white-space diffs: https://github.com/UCL/STIR/pull/1418/files?diff=unified&w=1

@KrisThielemans (Collaborator Author)

Preparation is nearly done. Results for the gradient are still identical. I plan to template the loop function, and then call it from the gradient, objective-function and Hessian computations.

@KrisThielemans (Collaborator Author)

The Hessian calculation should now be functional, but it is untested at the moment.
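For reference, the quantity being accumulated is the standard Poisson log-likelihood Hessian-times-input (textbook algebra, not something specific to this PR): with $\bar{y} = P\lambda + a$, the measured-data term gives

$$(Hv)_j = -\sum_b y_b \, \frac{(Pv)_b}{\bar{y}_b^{\,2}} \, P_{bj} = -\sum_{\mathrm{events}\ e} \frac{(Pv)_{b(e)}}{\bar{y}_{b(e)}^{\,2}} \, P_{b(e)j},$$

where the second form replaces the binned counts $y_b$ by a sum over the list-mode events $e$ recorded in bin $b(e)$.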

@KrisThielemans (Collaborator Author)

@gschramm conducted some timings via SIRF (but that shouldn't matter). Original post at SyneRBI/SIRF#1253 (comment)

Timing results of sinogram/list-mode gradient and Hessian multiplies for different numbers of subsets

mMR LM file with 4,332,816 counts and acq_model = sirf.STIR.AcquisitionModelUsingRayTracingMatrix() and 1 tangential LOR, sirf.STIR.AcquisitionData.set_storage_scheme("memory")

7 subsets

In [14]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
4.19 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.35 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.18 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.13 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

14 subsets

In [22]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.75 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [23]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
3.07 s ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [26]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.67 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [27]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.89 s ± 68.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

21 subsets

In [32]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.35 s ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [33]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
2.69 s ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.35 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.14 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CPU info

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7662 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         1499.555
CPU max MHz:                     2000.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4000.00
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-63,128-191
NUMA node1 CPU(s):               64-127,192-255

@gschramm: is In [35] still a gradient? (The other subset counts pair each gradient timing with a Hessian timing.)

My interpretation:

  • For projection data, more subsets mean faster execution (fewer projections), until there are so many subsets that parallelisation becomes inefficient.
  • For list-mode data, it could be that reading the list-mode data dominates. I'm not sure, though, why more subsets would slow it down: it always has to read all of the data. It's possible that the accumulation of the output images in LM_distributable_computation matters:
      if (output_image_ptr != NULL)
        {
          for (int i = 0; i < static_cast<int>(local_output_image_sptrs.size()); ++i)
            if (!is_null_ptr(local_output_image_sptrs[i])) // only accumulate if a thread filled something in
              *output_image_ptr += *(local_output_image_sptrs[i]);
        }
    but this doesn't really depend on the number of subsets. (I guess we could use a reduction operation there; a sketch of that idea follows.)
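A minimal sketch of that reduction idea, using a hypothetical Image type (not STIR's image class) and OpenMP 4.0 user-defined reductions:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical minimal image type, standing in for the thread-local output images.
struct Image
{
  std::vector<float> data;
  explicit Image(std::size_t n = 0) : data(n, 0.f) {}
  Image& operator+=(const Image& other)
  {
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] += other.data[i];
    return *this;
  }
};

// User-defined reduction: every thread gets a private zero-initialised copy,
// and OpenMP combines the copies (typically as a tree) at the end of the
// parallel region, instead of the serial accumulation loop quoted above.
#pragma omp declare reduction(imgsum : Image : omp_out += omp_in) \
    initializer(omp_priv = Image(omp_orig.data.size()))

int main()
{
  const int num_events = 1000000;
  Image output(128);

#pragma omp parallel for reduction(imgsum : output)
  for (int e = 0; e < num_events; ++e)
    // a real version would back-project event e; this just touches one voxel
    output.data[e % output.data.size()] += 1.f;

  std::cout << "voxel 0: " << output.data[0] << '\n';
  return 0;
}
```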

Note that LM times could improve when using set_cache_max_size(5000000), which would do some operations upfront, and then re-read things from myCACHE*.bin.
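For completeness, a sketch of what that looks like from C++. The class and method names below are the ones used by STIR's list-mode objective function, but treat the exact header path, template argument and signature as assumptions:

```cpp
#include "stir/DiscretisedDensity.h"
#include "stir/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin.h"

typedef stir::DiscretisedDensity<3, float> TargetT;

// Enable event caching: events are parsed once up-front and later calls
// re-read them from the myCACHE*.bin files.
void enable_lm_cache(
    stir::PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin<TargetT>& lm_obj_fun)
{
  lm_obj_fun.set_cache_max_size(5000000);
}
```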

@KrisThielemans force-pushed the ListModeObjFuncExtras branch 2 times, most recently from e0e4f55 to 1c86e4a on May 13, 2024. The commit messages:
- split caching into separate functions for reading a batch from the
list-mode file or from the cache
- always use LM_distributable_computation for the gradient

The resulting code avoids duplication and is faster when caching is not used.
The code pre-allocated bins for every thread, but local variables
are allocated on the stack, which should take no time. Removing
the pre-allocations simplifies the code and prepares for re-use
in the Hessian calculation etc.
- make the list-mode loop into a templated function
- introduce separate functions for gradient and Hessian
- move the LM_distributable_function implementation to .txx as it now uses
  a template for the callback
- implement actual_accumulate_Hessian... for LM
- try to implement accumulation of log-likelihood

WARNING: LM_distributable_computation is now generic and
incompatible with the previous one (which was only for the gradient)

The Hessian function and log-likelihood value are currently untested.
Also expand the test to use all subsets, making it slower of course.
test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData and test_priors
contained nearly identical code, so it was moved to a new class.

This needed minor modifications to the tests, but also the addition of
GeneralisedObjectiveFunction::compute_value()
- compute_objective_function() is now implemented as well
- Tests are (almost) a duplicate of the ProjData version, but the
  gradient test is disabled as it is too slow for this sample data
Also update to upload-artefacts@v4
@KrisThielemans (Collaborator Author) commented May 14, 2024

The MacOS job fails on the Hessian-concavity test. The output of H*dx is 0 on MacOS, but not on other systems. I have no idea why. Most likely it isn't reading the events properly, yet it does read the correct number of events, which is hard to understand. Here are the failure diagnostics:

----- testing concavity via Hessian-vector product (accumulate_Hessian_times_input)
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.03s, wall-clock 0.037424s
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.04s, wall-clock 0.037748s
Error : 0 is larger than 0, 
FAIL: PoissonLLListModeData: Computation of x^T H x = 0.000000 > 0 (Hessian) and is therefore NOT concave
 >target image max=0.999849
 >target image min=1.16359e-05
 >output image max=0
 >output image min=0
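For context on what the concavity test checks: the same algebra as above gives the quadratic form

$$v^\top H v = -\sum_{\mathrm{events}\ e} \frac{(Pv)_{b(e)}^{2}}{\bar{y}_{b(e)}^{\,2}} \le 0,$$

so getting exactly 0 with a strictly positive input image means $(Pv)_{b(e)}$ is 0 for every recorded event, i.e. the events contribute nothing to the projections, consistent with the event-reading problem suspected above.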

@KrisThielemans (Collaborator Author) commented May 14, 2024

All done, or as far as I can go. The objective function value is also computed and tested.

Note that some of the numerical tests aren't very stringent, and they probably can't be as there are only 8019 prompts in the file.
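For reference, the value being computed is the standard list-mode Poisson log-likelihood (up to constants that do not depend on the image; treat this as a sketch of the maths rather than the exact STIR implementation):

$$L(\lambda) = \sum_{\mathrm{events}\ e} \log \bar{y}_{b(e)} - \sum_b \bar{y}_b, \qquad \bar{y} = P\lambda + a,$$

where the first sum runs over the recorded prompts and the second is the usual sensitivity term.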

I will have to disable the MacOS job and leave fixing it for later.

Cannot figure out what the problem is, but the test works on Ubuntu and Windows
@KrisThielemans KrisThielemans merged commit 4f30cbf into UCL:master May 14, 2024
8 checks passed
@KrisThielemans KrisThielemans deleted the ListModeObjFuncExtras branch May 14, 2024 11:22