Seeking ideas for performance improvements with large segmentation files #1728

Closed · CPBridge opened this issue Nov 12, 2022 · 26 comments

@CPBridge (Contributor) commented Nov 12, 2022

Hi,

A group of us in the highdicom community (@hackermd @pieper @fedorov) have been playing around with large DICOM Segmentation files and have found some performance issues with pydicom that are affecting us quite badly. We are happy to contribute to pydicom to try and fix these issues but would appreciate any thoughts/guidance as to what the most likely avenues for improvements may be.

Specifically we are trying to parse segmentations with large numbers of segments. See for example this file shared by @pieper, which is a segmentation of a CT scan with 98 organs of interest segmented in 643 slices, giving over 12000 frames in the segmentation dataset. So certainly large, but something that may occur relatively often in practice. Simply calling dcmread on this file from the local filesystem takes over 5s on my latest gen arm64 MacBook, which we feel is much slower than desirable for many applications (obviously times on lower-end hardware are worse; it was taking @pieper around 15s in a Colab notebook). The file is huge and obviously the bulk of this is pixel data; however, to our surprise, specifying stop_before_pixels=True does not significantly affect the read time. We conclude that the bottleneck is related to reading in the highly nested metadata, most likely dominated by the items of PerFrameFunctionalGroupsSequence, which contains over 12000 items, each of which is a nested dataset with a relatively small number of elements within it.
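
A quick way to quantify this is to time the two reads directly. A minimal sketch, assuming a local copy of the sample file named as below (absolute numbers will vary by machine):

import timeit

import pydicom

path = 'ABD_LYMPH_008_SEG.dcm'  # the large multi-frame SEG referenced above

full = timeit.timeit(lambda: pydicom.dcmread(path), number=3) / 3
meta = timeit.timeit(lambda: pydicom.dcmread(path, stop_before_pixels=True), number=3) / 3
print(f'full read: {full:.2f} s, stop_before_pixels=True: {meta:.2f} s')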

Does anyone in the pydicom community have any idea how/whether we may be able to make improvements to pydicom to bring this down to something more practical?

See also this related highdicom issue (note that we are also exploring other means to improve the general issue of highdicom's slow performance with large segmentations, including improvements to highdicom and possibly changes to the standard, but here we are narrowly interested in what can be done about this slow dcmread).

Thanks in advance for any suggestions!

@CPBridge (Contributor, Author)

Here is a dumb profile of the dcmread call on the file using the profile module:

import profile
import pydicom

profile.run("pydicom.dcmread('ABD_LYMPH_008_SEG.dcm')")

Output:

         23560826 function calls (22416545 primitive calls) in 42.767 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   245566    0.185    0.000    0.185    0.000 :0(__new__)
   205722    0.160    0.000    0.160    0.000 :0(append)
   102861    0.066    0.000    0.066    0.000 :0(b2a_hex)
        1    0.000    0.000    0.000    0.000 :0(calcsize)
        1    0.000    0.000    0.000    0.000 :0(close)
   334954    0.492    0.000    0.492    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(endswith)
        1    0.000    0.000   42.767   42.767 :0(exec)
        1    0.000    0.000    0.000    0.000 :0(format)
        2    0.000    0.000    0.000    0.000 :0(fspath)
  2057329    1.079    0.000    1.079    0.000 :0(get)
        1    0.000    0.000    0.000    0.000 :0(getattr)
        1    0.000    0.000    0.000    0.000 :0(hash)
  3003963    1.883    0.000    1.883    0.000 :0(isinstance)
        1    0.000    0.000    0.000    0.000 :0(items)
   102863    0.609    0.000    1.197    0.000 :0(join)
        2    0.000    0.000    0.000    0.000 :0(keys)
   835791    0.895    0.000    0.895    0.000 :0(len)
  1954447    2.014    0.000    2.014    0.000 :0(match)
334952/66    0.998    0.000   42.763    0.648 :0(next)
        1    0.001    0.001    0.001    0.001 :0(open)
   965014    1.421    0.000    1.421    0.000 :0(read)
        2    0.000    0.000    0.000    0.000 :0(rstrip)
   102867    0.136    0.000    0.136    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sorted)
        2    0.000    0.000    0.000    0.000 :0(split)
        2    0.000    0.000    0.000    0.000 :0(stat)
        4    0.000    0.000    0.000    0.000 :0(strip)
  1103941    1.672    0.000    1.672    0.000 :0(tell)
   719447    1.024    0.000    1.024    0.000 :0(unpack)
   245564    0.342    0.000    0.527    0.000 <string>:1(<lambda>)
        1    0.000    0.000   42.767   42.767 <string>:1(<module>)
   192250    0.259    0.000    0.394    0.000 __init__.py:1455(debug)
   192250    0.135    0.000    0.135    0.000 __init__.py:1724(isEnabledFor)
   178775    0.104    0.000    0.104    0.000 config.py:199(reading_validation_mode)
  1954449    2.774    0.000    3.797    0.000 datadict.py:459(tag_for_keyword)
  1954446    2.229    0.000    2.229    0.000 datadict.py:498(repeater_has_keyword)
    89388    0.275    0.000    0.926    0.000 dataelem.py:153(__init__)
   102866    0.141    0.000    0.141    0.000 dataelem.py:43(empty_value_for_VR)
        6    0.000    0.000    0.000    0.000 dataelem.py:430(value)
    89385    0.223    0.000    0.541    0.000 dataelem.py:435(value)
        1    0.000    0.000    0.000    0.000 dataelem.py:452(VM)
    89385    0.165    0.000    0.216    0.000 dataelem.py:497(_convert_value)
        3    0.000    0.000    0.000    0.000 dataelem.py:786(DataElement_from_raw)
        7    0.000    0.000    0.000    0.000 dataset.py:1114(get_item)
   102865    0.321    0.000    2.386    0.000 dataset.py:1180(set_original_encoding)
        8    0.000    0.000    0.000    0.000 dataset.py:1242(elements)
  1954446    5.742    0.000   13.781    0.000 dataset.py:2107(__setattr__)
        1    0.000    0.000    0.000    0.000 dataset.py:2164(_set_file_meta)
        3    0.000    0.000    0.000    0.000 dataset.py:2181(__setitem__)
        1    0.000    0.000    0.000    0.000 dataset.py:2320(update)
        1    0.000    0.000    0.000    0.000 dataset.py:2619(__init__)
        1    0.000    0.000    0.000    0.000 dataset.py:2815(__init__)
        1    0.000    0.000    0.000    0.000 dataset.py:2847(validate)
        1    0.000    0.000    0.000    0.000 dataset.py:2873(<listcomp>)
        3    0.000    0.000    0.000    0.000 dataset.py:2880(__setitem__)
   102866    1.328    0.000   11.080    0.000 dataset.py:368(__init__)
        1    0.000    0.000    0.000    0.000 dataset.py:484(__contains__)
      6/4    0.000    0.000    0.000    0.000 dataset.py:726(get)
        1    0.000    0.000    0.000    0.000 dataset.py:771(items)
        2    0.000    0.000    0.000    0.000 dataset.py:805(__getattr__)
        4    0.000    0.000    0.000    0.000 dataset.py:836(_character_set)
     11/5    0.000    0.000    0.000    0.000 dataset.py:853(__getitem__)
   102864    0.501    0.000    0.852    0.000 filereader.py:287(_is_implicit_vr)
 102864/3    1.422    0.000   42.764   14.255 filereader.py:358(read_dataset)
437813/66    5.322    0.000   42.763    0.648 filereader.py:41(data_element_generator)
  89385/6    0.866    0.000   42.702    7.117 filereader.py:461(read_sequence)
192246/12846    2.094    0.000   42.489    0.003 filereader.py:497(read_sequence_item)
        1    0.000    0.000    0.000    0.000 filereader.py:560(_read_command_set_elements)
        2    0.000    0.000    0.000    0.000 filereader.py:581(_not_group_0000)
        1    0.000    0.000    0.001    0.001 filereader.py:593(_read_file_meta_info)
        8    0.000    0.000    0.000    0.000 filereader.py:614(_not_group_0002)
        1    0.000    0.000    0.000    0.000 filereader.py:671(read_preamble)
        1    0.000    0.000   42.765   42.765 filereader.py:738(read_partial)
        1    0.000    0.000   42.767   42.767 filereader.py:897(dcmread)
        2    0.000    0.000    0.000    0.000 fileutil.py:414(path_from_pathlike)
        1    0.000    0.000    0.000    0.000 fileutil.py:436(_unpack_tag)
        1    0.000    0.000    0.000    0.000 genericpath.py:16(exists)
   102861    1.007    0.000    2.388    0.000 hexutil.py:43(bytes2hex)
   925749    0.588    0.000    0.588    0.000 hexutil.py:46(<genexpr>)
   102865    0.074    0.000    0.074    0.000 misc.py:15(size_in_bytes)
    89385    0.422    0.000    0.783    0.000 multival.py:31(__init__)
        1    0.000    0.000   42.767   42.767 profile:0(dcm = pydicom.dcmread('/Users/cpb28/Downloads/ABD_LYMPH_008_SEG.dcm'))
        0    0.000             0.000          profile:0(profiler)
        2    0.000    0.000    0.000    0.000 re.py:187(match)
        2    0.000    0.000    0.000    0.000 re.py:288(_compile)
   102861    0.112    0.000    0.166    0.000 sequence.py:15(validate_dataset)
    89385    0.227    0.000    1.063    0.000 sequence.py:34(__init__)
        6    0.000    0.000    0.000    0.000 tag.py:157(__lt__)
   875648    1.305    0.000    4.594    0.000 tag.py:176(__eq__)
   102867    0.123    0.000    1.348    0.000 tag.py:187(__ne__)
       23    0.000    0.000    0.000    0.000 tag.py:204(group)
        3    0.000    0.000    0.000    0.000 tag.py:216(is_private)
   334952    0.249    0.000    0.249    0.000 tag.py:230(TupleTag)
   295132    1.578    0.000    2.811    0.000 tag.py:42(Tag)
        1    0.000    0.000    0.000    0.000 typing.py:1037(__hash__)
   102876    0.058    0.000    0.058    0.000 typing.py:1736(cast)
   102871    0.076    0.000    0.076    0.000 typing.py:306(inner)
        1    0.000    0.000    0.000    0.000 uid.py:186(name)
        2    0.000    0.000    0.000    0.000 uid.py:70(__new__)
        2    0.000    0.000    0.000    0.000 valuerep.py:105(validate_regex)
        2    0.000    0.000    0.000    0.000 valuerep.py:1180(MultiString)
        2    0.000    0.000    0.000    0.000 valuerep.py:134(validate_length_and_regex)
    89385    0.068    0.000    0.068    0.000 valuerep.py:1917(__getattr__)
        2    0.000    0.000    0.000    0.000 valuerep.py:255(validate_value)
        2    0.000    0.000    0.000    0.000 valuerep.py:80(validate_vr_length)
        1    0.000    0.000    0.000    0.000 values.py:351(convert_numbers)
        2    0.000    0.000    0.000    0.000 values.py:644(convert_UI)
        3    0.000    0.000    0.000    0.000 values.py:708(convert_value)

@mrbean-bremen (Member)

Thanks for the analysis!

From your profiling, it looks like read_sequence_item is indeed the main bottleneck.
The sequence items themselves are not parsed when reading the file (they are read in as a raw data blob until accessed, or at least that is how it should be), but given that there are a lot of them, the per-item overhead may be a problem. The reading itself may also be problematic: if the length of a sequence item is not known, it has to be scanned for the end tag, and maybe reading the data from file is not done efficiently. Using a read-ahead buffer might help there.

This is all guesswork until more profiling has been done, but it is certainly something to be looked at.
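
One cheap experiment on the read-ahead idea: hand dcmread an already-buffered file object instead of a path. A sketch only (assuming the sample file from above); this changes I/O behaviour, not the per-element parsing overhead:

import pydicom

# open with a large OS-level read buffer (16 MiB) instead of the default
with open('ABD_LYMPH_008_SEG.dcm', 'rb', buffering=16 * 1024 * 1024) as f:
    ds = pydicom.dcmread(f)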

@darcymason (Member)

I enjoy optimization problems, and find it very satisfying when improvements are added to pydicom, so I welcome any speed-ups that can come out of this, assuming they don't overly complicate code or slow down other types of reading.

I'm downloading the file and will have a closer look when I can.

Meanwhile, one question is about use cases: the optimization may well depend on how many of these nested items you need to access. If all of them, I'm not sure there is much that can be done; the cost will have to be paid at some point. In other words, it might just come down to the serial nature of DICOM and having to read through to get at any particular data element.

Also, do you have any comparison of time to read with other DICOM toolkits?

@pieper (Contributor) commented Nov 12, 2022

Thanks @darcymason and @mrbean-bremen 👍 I agree, this can make for an interesting programming challenge, but we also feel it's important to characterize the performance of this use case in order to see how the standard will scale. As @CPBridge points out, this particular SEG is large, but the CT is of a pretty standard clinical size, and the ML tool we are using, TotalSegmentator, turns out to be very robust at generating these segmentations on pretty much any CT I've given it (not always perfect, but pretty remarkable really).

Also, do you have any comparison of time to read with other DICOM toolkits?

We have started experimenting in OHIF, where we use dcmjs, and in Slicer, where we currently use dcmqi, built on dcmtk. @dclunie is also testing with his Java tools (PixelMed, I believe, but we can verify). It's too early to make definitive statements, but the short answer is that they all seem slower than we'd like them to be. As one might expect, each language and its design decisions bring their own tradeoffs, and the optimization approaches may differ.

I'd think we can define a common set of operations to compare the implementations. Something like: time to read header, time to unpack pixel data, time to assemble the segments into a labelmap. In addition to timings there are also memory footprint issues depending on how you want to use the pixel data (e.g. one common way to access the segmentation in highdicom returns 100 volumes of 512x512x643).
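
As a strawman for the pydicom side, such a harness might look like the sketch below (file name as in the earlier comments; the labelmap-assembly step is omitted since it lives in higher-level tools like highdicom, and pixel_array requires NumPy):

import time

import pydicom

path = 'ABD_LYMPH_008_SEG.dcm'

t0 = time.perf_counter()
header = pydicom.dcmread(path, stop_before_pixels=True)  # time to read header
t1 = time.perf_counter()
ds = pydicom.dcmread(path)
frames = ds.pixel_array  # time to read + unpack pixel data
t2 = time.perf_counter()

print(f'header read: {t1 - t0:.2f} s')
print(f'full read + pixel unpack: {t2 - t1:.2f} s')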

I'm hoping we can use this kind of cross-toolkit research not only to improve performance (and possibly the standard itself) but also to test interoperability of these derived objects.

@hackermd (Contributor) commented Nov 13, 2022

Thanks @darcymason and @mrbean-bremen for your help in trying to get to the bottom of it.

@CPBridge and @pieper already nicely described the use case for which we need to optimize performance. I would just like to add that there are additional use cases. In digital pathology, VL Whole Slide Microscopy Image instances with TILED_SPARSE Dimension Organization Type suffer from the same problem. While we are working on addressing the bottleneck by improving the standard, there will continue to be instances with > 100,000 Per Frame Functional Groups Sequence items and we will need to improve performance in pydicom.

Playing around with a test dataset, I made some interesting observations:

from datetime import datetime

import highdicom as hd
from pydicom.dataset import Dataset
from pydicom.uid import VLWholeSlideMicroscopyImageStorage


if __name__ == '__main__':

    print('create dataset...')
    dataset = Dataset()
    dataset.SOPClassUID = VLWholeSlideMicroscopyImageStorage
    dataset.ImageOrientationSlide = [1, 0, 0, 0, 1, 0]
    dataset.Rows = 256
    dataset.Columns = 256
    dataset.TotalPixelMatrixRows = 10000
    dataset.TotalPixelMatrixColumns = 40000

    ops_item = Dataset()
    dataset.OpticalPathSequence = [ops_item]

    tpmos_item = Dataset()
    tpmos_item.XOffsetInSlideCoordinateSystem = 0.0
    tpmos_item.YOffsetInSlideCoordinateSystem = 0.0
    dataset.TotalPixelMatrixOriginSequence = [tpmos_item]

    sfgs_item = Dataset()
    pms_item = Dataset()
    pms_item.PixelSpacing = [0.0001, 0.0001]
    sfgs_item.PixelMeasuresSequence = [pms_item]
    dataset.SharedFunctionalGroupsSequence = [sfgs_item]

    dataset.PerFrameFunctionalGroupsSequence = []
    for pps in hd.utils.compute_plane_position_slide_per_frame(dataset):
        ppfgs_item = Dataset()
        ppfgs_item.PlanePositionSequence = pps
        dataset.PerFrameFunctionalGroupsSequence.append(ppfgs_item)
    dataset.NumberOfFrames = str(len(dataset.PerFrameFunctionalGroupsSequence))

    print('iterate over PFFG sequence items...')
    start = datetime.now()
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
    end = datetime.now()
    diff = end - start
    print(f'iteration took {diff}')

    print('iterate over PFFG sequence item indices...')
    start = datetime.now()
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = dataset.PerFrameFunctionalGroupsSequence[i]
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
    end = datetime.now()
    diff = end - start
    print(f'iteration took {diff}')
Output:

create dataset...
iterate over PFFG sequence items...
iteration took 0:00:00.040761
iterate over PFFG sequence item indices...
iteration took 0:00:34.294024

It appears that indexing into large sequences is a major performance bottleneck.

@mrbean-bremen (Member) commented Nov 13, 2022

Hm, as I wrote, the sequence items are first read in as a blob for performance reasons (so that they can just be written back if needed), but the moment values in a sequence item are accessed, they are parsed and DataElement instances are created, which is most likely what takes the time here.
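
For illustration, that raw-until-accessed split is visible with pydicom's public pieces. A contrived sketch with a hand-built element rather than one read from a file:

from pydicom.dataelem import DataElement_from_raw, RawDataElement
from pydicom.tag import Tag

# fields: tag, VR, length, raw value bytes, byte offset, is_implicit_VR, is_little_endian
raw = RawDataElement(Tag(0x0010, 0x0010), 'PN', 4, b'Doe^', 0, False, True)

elem = DataElement_from_raw(raw)  # conversion to a full DataElement happens here
print(type(elem).__name__, elem.value)
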
As @darcymason commented above:

the optimization may well depend on how many of these nested items you need to access. If all of them, I'm not sure there is much that can be done; the cost will have to be paid at some point

@hackermd (Contributor)

Note that in the above example no data is actually read from any file. The dataset is entirely created in memory.

@darcymason (Member)

Note that in the above example no data is actually read from any file. The dataset is entirely created in memory.

Right, that makes it really interesting then... I'll try to take a closer look at this later today.

@hackermd (Contributor)

I've updated the example to also profile the two for loops:

import profile

import highdicom as hd
from pydicom.dataset import Dataset
from pydicom.uid import VLWholeSlideMicroscopyImageStorage


def iterate(dataset):
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )


def index(dataset):
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = dataset.PerFrameFunctionalGroupsSequence[i]
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )


if __name__ == '__main__':

    print('create dataset...')
    dataset = Dataset()
    dataset.SOPClassUID = VLWholeSlideMicroscopyImageStorage
    dataset.ImageOrientationSlide = [1, 0, 0, 0, 1, 0]
    dataset.Rows = 256
    dataset.Columns = 256
    dataset.TotalPixelMatrixRows = 10000
    dataset.TotalPixelMatrixColumns = 40000

    ops_item = Dataset()
    dataset.OpticalPathSequence = [ops_item]

    tpmos_item = Dataset()
    tpmos_item.XOffsetInSlideCoordinateSystem = 0.0
    tpmos_item.YOffsetInSlideCoordinateSystem = 0.0
    dataset.TotalPixelMatrixOriginSequence = [tpmos_item]

    sfgs_item = Dataset()
    pms_item = Dataset()
    pms_item.PixelSpacing = [0.0001, 0.0001]
    sfgs_item.PixelMeasuresSequence = [pms_item]
    dataset.SharedFunctionalGroupsSequence = [sfgs_item]

    dataset.PerFrameFunctionalGroupsSequence = []
    for pps in hd.utils.compute_plane_position_slide_per_frame(dataset):
        ppfgs_item = Dataset()
        ppfgs_item.PlanePositionSequence = pps
        dataset.PerFrameFunctionalGroupsSequence.append(ppfgs_item)
    dataset.NumberOfFrames = str(len(dataset.PerFrameFunctionalGroupsSequence))

    print('iterate over PFFG sequence items...')
    profile.run('iterate(dataset)')

    print('index PFFG sequence items...')
    profile.run('index(dataset)')
Output:

create dataset...
iterate over PFFG sequence items...
         427070 function calls in 0.522 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.520    0.520 :0(exec)
    31401    0.015    0.000    0.015    0.000 :0(get)
   157009    0.094    0.000    0.094    0.000 :0(isinstance)
     6281    0.003    0.000    0.003    0.000 :0(len)
    12560    0.010    0.000    0.010    0.000 :0(match)
        1    0.002    0.002    0.002    0.002 :0(setprofile)
        1    0.000    0.000    0.520    0.520 <string>:1(<module>)
    31401    0.031    0.000    0.046    0.000 datadict.py:459(tag_for_keyword)
    12560    0.012    0.000    0.012    0.000 datadict.py:498(repeater_has_keyword)
    31403    0.015    0.000    0.015    0.000 dataelem.py:430(value)
     6281    0.007    0.000    0.016    0.000 dataset.py:1270(__ne__)
    12560    0.034    0.000    0.075    0.000 dataset.py:2107(__setattr__)
     6281    0.007    0.000    0.010    0.000 dataset.py:693(__eq__)
    18841    0.074    0.000    0.492    0.000 dataset.py:805(__getattr__)
    18841    0.080    0.000    0.271    0.000 dataset.py:853(__getitem__)
     6281    0.003    0.000    0.003    0.000 multival.py:107(__iter__)
     6281    0.006    0.000    0.009    0.000 multival.py:110(__len__)
     6280    0.003    0.000    0.003    0.000 multival.py:99(__getitem__)
        1    0.000    0.000    0.522    0.522 profile:0(iterate(dataset))
        0    0.000             0.000          profile:0(profiler)
     6281    0.017    0.000    0.108    0.000 sequence.py:111(parent)
    37682    0.039    0.000    0.057    0.000 tag.py:176(__eq__)
    18841    0.048    0.000    0.081    0.000 tag.py:42(Tag)
        1    0.021    0.021    0.520    0.520 test.py:9(iterate)

index PFFG sequence items...
         197738381 function calls in 251.935 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  251.935  251.935 :0(exec)
 39469801   17.952    0.000   17.952    0.000 :0(get)
   213528    0.132    0.000    0.132    0.000 :0(isinstance)
    12560    0.006    0.000    0.006    0.000 :0(len)
 39444680   30.652    0.000   30.652    0.000 :0(match)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000  251.935  251.935 <string>:1(<module>)
 39469801   37.668    0.000   55.620    0.000 datadict.py:459(tag_for_keyword)
 39444680   36.890    0.000   36.890    0.000 datadict.py:498(repeater_has_keyword)
    50241    0.026    0.000    0.026    0.000 dataelem.py:430(value)
    12560    0.015    0.000    0.035    0.000 dataset.py:1270(__ne__)
 39444680   99.159    0.000  222.281    0.000 dataset.py:2107(__setattr__)
    12560    0.014    0.000    0.020    0.000 dataset.py:693(__eq__)
    25121    0.104    0.000  251.892    0.010 dataset.py:805(__getattr__)
    25121    0.127    0.000  251.579    0.010 dataset.py:853(__getitem__)
    12560    0.014    0.000    0.020    0.000 multival.py:110(__len__)
    12560    0.008    0.000    0.008    0.000 multival.py:99(__getitem__)
        1    0.000    0.000  251.935  251.935 profile:0(index(dataset))
        0    0.000             0.000          profile:0(profiler)
    12560   29.011    0.002  251.328    0.020 sequence.py:111(parent)
    50242    0.054    0.000    0.078    0.000 tag.py:176(__eq__)
    25121    0.067    0.000    0.115    0.000 tag.py:42(Tag)
        1    0.036    0.036  251.935  251.935 test.py:18(index)

@hackermd (Contributor)

It's interesting that Dataset.__setattr__ is called so often. Could Sequence.parent be part of the problem?

pydicom/pydicom/sequence.py

Lines 120 to 121 in a8be738

for item in self._list:
    item.parent = self._parent

@darcymason (Member)

One thing that jumps out at me is the tremendous increase in calls to tag_for_keyword, which brings up two points:

  1. The lookup from dataset should be moved outside the loop:
def index(dataset):
    func_group_seq = dataset.PerFrameFunctionalGroupsSequence
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = func_group_seq[i]
        ...

and
2. Since tag_for_keyword is just a one-liner:

return keyword_dict.get(keyword)

it could just be in-lined in dataset.py to save function call overhead.
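
To gauge how much that call overhead matters, here is a micro-benchmark sketch of the two forms (both names exist in pydicom.datadict):

import timeit

from pydicom.datadict import keyword_dict, tag_for_keyword

n = 1_000_000
t_func = timeit.timeit(lambda: tag_for_keyword('Rows'), number=n)
t_dict = timeit.timeit(lambda: keyword_dict.get('Rows'), number=n)
print(f'tag_for_keyword: {t_func:.2f} s, keyword_dict.get: {t_dict:.2f} s')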

That's only a part of the problem, but on my PC the index-run is taking so long, I'm still waiting for it... I'll need a simpler example to work with to dig deeper.

@pieper (Contributor) commented Nov 13, 2022

@darcymason here are a multiframe MR and corresponding SEG that may be good for testing. We can also dig up even smaller ones if that would help.

https://s3.amazonaws.com/IsomicsPublic/SampleData/MRHead/MRHead-multiframe.dcm

https://s3.amazonaws.com/IsomicsPublic/SampleData/MRHead/aseg-t1space.dcm

@darcymason (Member)

Probs going to be okay. So far moving that lookup outside the loop has taken almost all the time difference away. I'll post results shortly, just double-checking I didn't change something else.

I'm guessing it is related to the parent setting @hackermd mentioned - probably causing a recursive cascade each time through the loop somehow... but I'll keep digging.

@darcymason (Member)

Okay, confirmed. I'm getting different times on each run, but the index one is ~10% longer, sometimes the same or even less, if I add a single line to pydicom to avoid setting the Sequence parent if it is already set.

So ... definitely a bug in pydicom related to setting its parent when returning a Sequence. In the iterator, the sequence is only looked up once, likewise if the dataset.<sequence keyword> is moved outside the loop, so this recursive parent setting is not triggered nearly as often.

I'm not sure the one-line solution I did is the best answer, I'll think it out a bit more and put in a PR in the next couple of days.

I don't know if this helps at all with the performance issues raised earlier in this issue - probably not, but we can keep looking.
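
For concreteness, the guard could look something like the toy sketch below; this is the idea only, not necessarily the change that will land in the PR. The point is that the existing value != self._parent check compares a Dataset against a weakref, so it never short-circuits:

import weakref

class ToySequence:
    # toy stand-in for pydicom.Sequence, showing only the guard idea
    def __init__(self, items):
        self._list = items
        self._parent = None  # held as a weakref once set

    def set_parent(self, value):
        # compare the weakref's referent rather than the weakref object,
        # so repeated retrievals of the same sequence skip the item walk
        if self._parent is None or self._parent() is not value:
            self._parent = weakref.ref(value)
            for item in self._list:
                item.parent = self._parent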

@hackermd (Contributor)

Thanks so much for looking into the issue @darcymason.

In the iterator, the sequence is only looked up once, likewise if the dataset.<sequence keyword> is moved outside the loop, so this recursive parent setting is not triggered nearly as often.

Good point! This is certainly helpful and we will go through the highdicom code base to identify statements that could be improved in this regard.

I don't know if this helps at all with the performance issues raised earlier in this issue - probably not, but we can keep looking.

Not sure either. It would probably be worth carefully reviewing the implementation of the __getattr__, __setattr__, __getitem__, and __setitem__ methods from a performance perspective.

@darcymason (Member)

Another thought/observation on @hackermd's recent iterating example:

Another way to move something out of the loop is to avoid the repeated keyword->tag lookup followed by returning the .value of the DataElement, which pydicom does under the hood. You can look up the tag number yourself outside the loop and then get the .value:

from pydicom.tag import Tag
def iterate_preload_tags(dataset):
    row_pos_tag = Tag("RowPositionInTotalImagePixelMatrix")
    col_pos_tag = Tag("ColumnPositionInTotalImagePixelMatrix")
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item[row_pos_tag].value,
            pps_item[col_pos_tag].value,
        )

When I profile this against the other iterator, I see ~20% speed improvement. You lose a little in code readability (although you could use the full keyword for the tag name if you wanted), but in really critical areas where a few tags are used repeatedly, this could help a bit.

@hackermd (Contributor)

Thanks for investigating further @darcymason. I think it's a good idea to rely on the lower-level interface and use the tags directly for repeated access of data elements in loops.

Do you think it would help if the Dataset performed the lookup operations directly on the keyword_dict rather than calling tag_for_keyword() repeatedly? I'm not sure how much interpreter overhead there is in function calls, but small things like this may ultimately add up as well.

@darcymason (Member) commented Nov 17, 2022

Do you think it would help if the Dataset performed the lookup operations directly on the keyword_dict rather than calling tag_for_keyword() repeatedly?

It would help a little, but not significantly, I don't think.

I always feel that when speed-ups are needed, one really wants at least a factor of 2, or better yet a factor of 10.

So I tried something very interesting, just to get an idea of the ultimate speed this little bit of code could reach in Python:

  • I did dcmwrite of your created dataset to a DICOM file.
  • I ran pydicom codify on that file.
  • I removed the ds.save_as line at the end (not needed).
  • At the top, I replaced the pydicom imports with:

class Dataset:
    pass

class FileMetaDataset:
    pass

Sequence = list

  • Then in the profiling code I exec(open(filename).read()) that file and set dataset = ds from the codified code.

This has the effect of removing pydicom's overhead, including __getattr__, __getitem__ etc., to speak to your points about those earlier perhaps being costly.

(Found mistake, see next post)
When I first ran it, I got 0 time (!), so I added a for i in range(100): loop around the existing loop. Then I got a total time of about 0.25 seconds, vs 1.5 seconds on my laptop for the original single-pass iteration run.

So using simple Python classes led to about a factor-600 speed-up. I thought maybe I had done something wrong, but I added some print statements, checked the len of the sequence, etc., and the original data structure does seem to be there. I expected a big difference, but not that much. However, moving most operations into CPython code vs. actual Python code would be expected to yield a big increase.

Of course, this only works in this limited example. These simple structures would not handle private data elements, writing the file out again, or many other pydicom API capabilities.

However, it might be possible to create an 'alternate pydicom' (not fully compatible with the existing one), which started from these kinds of basic structures and worked around those other problems. I have often thought about trying to toy with these ideas (even as I write all this I'm thinking through ways to get back much of pydicom functionality), but that would be a massive undertaking.
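
Put together, the harness is roughly the sketch below, where 'codified_stub.py' stands for the edited codify output described above (pydicom imports replaced by the bare classes, ds.save_as removed):

import time

namespace = {}
exec(open('codified_stub.py').read(), namespace)  # rebuilds the dataset using the stub classes
dataset = namespace['ds']  # codify names the top-level dataset `ds`

start = time.perf_counter()
for _ in range(100):  # a single pass was too fast to measure
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
print(f'100 iterations took {time.perf_counter() - start:.4f} s')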

@darcymason (Member)

So using simple Python classes led to about a factor-600 speed-up. I thought maybe I had done something wrong, but I added some print statements, checked the len of the sequence, etc., and the original data structure does seem to be there

So a change to this - I realized for that test I had reverted to current pydicom, which still had the bug of setting .parent throughout the nested structure when a sequence is retrieved (only once in the iterator run, but still a costly operation). When I removed that, I am getting about a factor-30 speedup by using simple Python classes for Dataset and Sequence in this restricted example. That seems more reasonable.

@hackermd (Contributor)

That's an impressive speedup! It's also good to know that a pure Python implementation can be fast.

So a change to this - I realized for that test I had reverted to current pydicom, which still had the bug of setting .parent throughout the nested structure when a sequence is retrieved (only once in the iterator run, but still a costly operation).

What changes did you make? Could you please share those?

When I just comment out the following lines

pydicom/pydicom/sequence.py

Lines 118 to 121 in 01ae3fc

if value != self._parent:
    self._parent = weakref.ref(value)
    for item in self._list:
        item.parent = self._parent

I am observing a drastic speedup:

create dataset...
iterate over PFFG sequence items...
         345427 function calls in 0.428 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.426    0.426 :0(exec)
    18841    0.009    0.000    0.009    0.000 :0(get)
   150728    0.094    0.000    0.094    0.000 :0(isinstance)
     6281    0.003    0.000    0.003    0.000 :0(len)
        1    0.002    0.002    0.002    0.002 :0(setprofile)
        1    0.000    0.000    0.426    0.426 <string>:1(<module>)
    18841    0.019    0.000    0.028    0.000 datadict.py:459(tag_for_keyword)
    31403    0.016    0.000    0.016    0.000 dataelem.py:430(value)
    18841    0.072    0.000    0.398    0.000 dataset.py:806(__getattr__)
    18841    0.086    0.000    0.176    0.000 dataset.py:854(__getitem__)
     6280    0.004    0.000    0.004    0.000 multival.py:101(__getitem__)
     6281    0.003    0.000    0.003    0.000 multival.py:109(__iter__)
     6281    0.006    0.000    0.009    0.000 multival.py:112(__len__)
        1    0.000    0.000    0.428    0.428 profile:0(iterate(dataset))
        0    0.000             0.000          profile:0(profiler)
     6281    0.003    0.000    0.003    0.000 sequence.py:111(parent)
    37682    0.041    0.000    0.059    0.000 tag.py:176(__eq__)
    18841    0.049    0.000    0.083    0.000 tag.py:42(Tag)
        1    0.021    0.021    0.426    0.426 test.py:8(iterate)


index PFFG sequence items...
         477301 function calls in 0.585 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.585    0.585 :0(exec)
    25121    0.012    0.000    0.012    0.000 :0(get)
   200968    0.126    0.000    0.126    0.000 :0(isinstance)
    12560    0.006    0.000    0.006    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.585    0.585 <string>:1(<module>)
    25121    0.025    0.000    0.038    0.000 datadict.py:459(tag_for_keyword)
    50241    0.025    0.000    0.025    0.000 dataelem.py:430(value)
    25121    0.095    0.000    0.552    0.000 dataset.py:806(__getattr__)
    25121    0.123    0.000    0.255    0.000 dataset.py:854(__getitem__)
    12560    0.007    0.000    0.007    0.000 multival.py:101(__getitem__)
    12560    0.013    0.000    0.019    0.000 multival.py:112(__len__)
        1    0.000    0.000    0.585    0.585 profile:0(index(dataset))
        0    0.000             0.000          profile:0(profiler)
    12560    0.006    0.000    0.006    0.000 sequence.py:111(parent)
    50242    0.054    0.000    0.079    0.000 tag.py:176(__eq__)
    25121    0.066    0.000    0.112    0.000 tag.py:42(Tag)
        1    0.026    0.026    0.585    0.585 test.py:17(index)

As you have also stated, the parent setter appears to be the major bottleneck and should be improved/avoided.

This has the effect of removing pydicom's overhead, including __getattr__, __getitem__ etc., to speak to your points about those earlier perhaps being costly.

@CPBridge and I have thought about ways to reduce the overhead. One idea was to implement DataElement using __slots__, which could speed up low-level data access operations.
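
As a rough illustration of the idea, a minimal slotted element might look like this sketch (not pydicom's actual DataElement; three fields stand in for the real set):

class SlimDataElement:
    # fixed attribute set, no per-instance __dict__: smaller and faster to access
    __slots__ = ('tag', 'VR', 'value')

    def __init__(self, tag, VR, value):
        self.tag = tag
        self.VR = VR
        self.value = value

elem = SlimDataElement(0x00280010, 'US', 512)  # (0028,0010) Rows
print(hex(elem.tag), elem.VR, elem.value)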

When I simply change the following lines

pydicom/pydicom/dataset.py

Lines 824 to 828 in 01ae3fc

tag = tag_for_keyword(name)
if tag is not None:  # `name` isn't a DICOM element keyword
    tag = Tag(tag)
    if tag in self._dict:  # DICOM DataElement not in the Dataset
        return self[tag].value

to

        try:
            tag = tag_for_keyword(name)
            return self[tag].value
        except KeyError:
            pass

I am already getting a noticeable additional speedup:

create dataset...
iterate over PFFG sequence items...
         307745 function calls in 0.381 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.379    0.379 :0(exec)
    18841    0.009    0.000    0.009    0.000 :0(get)
   131887    0.084    0.000    0.084    0.000 :0(isinstance)
     6281    0.003    0.000    0.003    0.000 :0(len)
        1    0.002    0.002    0.002    0.002 :0(setprofile)
        1    0.000    0.000    0.379    0.379 <string>:1(<module>)
    18841    0.019    0.000    0.028    0.000 datadict.py:459(tag_for_keyword)
    31403    0.016    0.000    0.016    0.000 dataelem.py:430(value)
    18841    0.045    0.000    0.352    0.000 dataset.py:807(__getattr__)
    18841    0.098    0.000    0.270    0.000 dataset.py:855(__getitem__)
     6280    0.003    0.000    0.003    0.000 multival.py:101(__getitem__)
     6281    0.003    0.000    0.003    0.000 multival.py:109(__iter__)
     6281    0.006    0.000    0.009    0.000 multival.py:112(__len__)
        1    0.000    0.000    0.381    0.381 profile:0(iterate(dataset))
        0    0.000             0.000          profile:0(profiler)
     6281    0.003    0.000    0.003    0.000 sequence.py:111(parent)
    18841    0.020    0.000    0.029    0.000 tag.py:176(__eq__)
    18841    0.050    0.000    0.084    0.000 tag.py:42(Tag)
        1    0.020    0.020    0.379    0.379 test.py:8(iterate)


index PFFG sequence items...
         427059 function calls in 0.522 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.522    0.522 :0(exec)
    25121    0.012    0.000    0.012    0.000 :0(get)
   175847    0.111    0.000    0.111    0.000 :0(isinstance)
    12560    0.006    0.000    0.006    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.522    0.522 <string>:1(<module>)
    25121    0.025    0.000    0.037    0.000 datadict.py:459(tag_for_keyword)
    50241    0.024    0.000    0.024    0.000 dataelem.py:430(value)
    25121    0.060    0.000    0.489    0.000 dataset.py:807(__getattr__)
    25121    0.138    0.000    0.380    0.000 dataset.py:855(__getitem__)
    12560    0.007    0.000    0.007    0.000 multival.py:101(__getitem__)
    12560    0.013    0.000    0.019    0.000 multival.py:112(__len__)
        1    0.000    0.000    0.522    0.522 profile:0(index(dataset))
        0    0.000             0.000          profile:0(profiler)
    12560    0.006    0.000    0.006    0.000 sequence.py:111(parent)
    25121    0.027    0.000    0.039    0.000 tag.py:176(__eq__)
    25121    0.065    0.000    0.111    0.000 tag.py:42(Tag)
        1    0.026    0.026    0.522    0.522 test.py:17(index)

The repeated dictionary lookup operations certainly add up, and there seems to be room for improvement in several places, e.g., in Dataset.__eq__, Dataset.__contains__, etc.

However, moving most operations into CPython code vs. actual Python code would be expected to yield a big increase

I've thought about that for some time and we have started to develop a lightweight C library (https://github.com/hackermd/libdicom) that could be used for that purpose. The library doesn't have Python bindings yet, but that's on our roadmap. cc @jcupitt for awareness

We've also been brainstorming about using nuitka to compile the package into a C extension. However, it's unclear whether, and by how much, that would improve performance.

@jcupitt commented Nov 18, 2022

I've thought about that for some time and we have started to develop a lightweight C library (https://github.com/hackermd/libdicom) that could be used for that purpose. The library doesn't have Python bindings yet, but that's on our roadmap. cc @jcupitt for awareness

Huh, interesting discussion; thank you for pinging me.

Yes, we're planning to do a little work on @hackermd's libdicom, then make a Python binding with cffi. In API mode, cffi will generate and compile the CPython/libdicom bridge at setup time, so on Linux and macOS at least, you get direct calls from Python into native code. On Windows you need to generate the link code at runtime in Python, so it's a bit slower, but hopefully still not too bad. It's the approach I've taken with pyvips, and it seems to work well.

This should all happen within the next six months, with a little luck around funding.

@darcymason (Member)

When I just comment out the following lines

pydicom/pydicom/sequence.py

Lines 118 to 121 in 01ae3fc

if value != self._parent:
    self._parent = weakref.ref(value)
    for item in self._list:
        item.parent = self._parent

That's the section. I've branched locally and will do a PR on this. Commenting out is fine for this profiling example, but unfortunately code elsewhere does need those .parent attributes to be set. Figuring out how to do that correctly is not obvious - I know there were a couple of other attempts initially with various problems - so I'll try to go back and find that PR/discussion and see why this parent setting landed here (my intuition would be that those should be set when the SQ data element is added to the Dataset, not when it is retrieved).

@CPBridge and I have thought about ways to reduce the overhead. One idea was to implement DataElement using __slots__, which could speed up low-level data access operations.

Yes, I've been thinking perhaps DataElement could be a Data Class with slots.

When I simply change the following lines

I suspect there are many cases where try/except would improve performance. Way back, the conventional wisdom was that setting up the exception frame was costly, but now it seems that it is faster to 'fail' sometimes and recover in Python than to check things ahead of time.
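
A toy comparison of the two styles (a sketch; exact numbers depend heavily on the Python version and the hit/miss ratio):

import timeit

d = {'a': 1}

def lbyl(key):
    # look before you leap: an extra membership test on every call
    if key in d:
        return d[key]
    return None

def eafp(key):
    # easier to ask forgiveness: no pre-check, cost paid only on misses
    try:
        return d[key]
    except KeyError:
        return None

for fn in (lbyl, eafp):
    hit = timeit.timeit(lambda: fn('a'), number=1_000_000)
    miss = timeit.timeit(lambda: fn('x'), number=1_000_000)
    print(f'{fn.__name__}: hit {hit:.2f} s, miss {miss:.2f} s')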

I've thought about that for some time and we have started to develop a lightweight C library (https://github.com/hackermd/libdicom) that could be used for that purpose. The library doesn't have Python bindings yet, but that's on our roadmap. cc @jcupitt for awareness

I wasn't aware of that - good to know. I toyed with some C code to read DICOM many years ago and never had the time to further pursue it. I think some targeted small parts could be beneficial to add as an optional import to pydicom (somewhat like there used to be some Python libs like StringIO/cStringIO). I've often wondered whether just defining a couple of well-defined objects (e.g. Tag, DataElement) in C could result in significant gains from a very small, highly portable C extension.

We've also been brainstorming about using nuitka

Nuitka is also new to me - looks interesting, let us know if anything comes of that. I'm more interested in performance improvements in the basic logic of pydicom like we have been discussing, but perhaps some changes can be made with an eye to being compatible with compilers like that.

Finally I will point out that pydicom does have a benchmark folder (that I believe @scaramallion set up years ago) where some speed tests were meant to be tracked using (IIRC) airspeed velocity, but ... we were busy with other things, I guess, so not much has happened with it since. It would be nice, however, to incorporate some kind of permanent performance checks like this, to catch things like the .parent bug in future.

@hackermd (Contributor)

@CPBridge and I have thought about ways to reduce the overhead. One idea was to implement DataElement using __slots__, which could speed up low-level data access operations.

Yes, I've been thinking (#1232 (comment)) perhaps DataElement could be a Data Class with slots.

I think that's a good idea.

When I simply change the following lines

I suspect there are many cases where try/except would improve performance. Way back, the conventional wisdom was that setting up the exception frame was costly, but now it seems that it is faster to 'fail' sometimes and recover in Python than to check things ahead of time.

I am not familiar with the CPython implementation details, but I had the impression that it can be faster to try and handle the KeyError for a small code block. Note that the change I made also removed additional statements and function calls, which may have contributed to the speedup.

I've thought about that for some time and we have started to develop a lightweight C library (https://github.com/hackermd/libdicom) that could be used for that purpose. The library doesn't have Python bindings yet, but that's on our roadmap. cc @jcupitt for awareness

I wasn't aware of that - good to know. I toyed with some C code to read DICOM many years ago and never had the time to further pursue it. I think some targeted small parts could be beneficial to add as an optional import to pydicom (somewhat like there used to be some Python libs like StringIO/cStringIO). I've often wondered whether just defining a couple of well-defined objects (e.g. Tag, DataElement) in C could result in significant gains from a very small, highly portable C extension.

Exactly! It may also help improve the performance of data access methods, because some of the expensive operations (e.g., comparing two Dataset instances for equality) could be performed entirely in C.

@jcupitt I'm not sure how we may want to map the C data structures to Python classes, but it would be nice for the libdicom Python bindings (I am going to call them cdicom for now) to implement the pydicom API for key data structures and use the corresponding C types and functions under the hood:

  • pydicom.DataElement -> cdicom.DataElement using DcmElement
  • pydicom.Dataset -> cdicom.Dataset using DcmDataSet
  • pydicom.Sequence -> cdicom.Sequence using DcmSequence

@darcymason it would probably help if we could more formally define the "public" pydicom API and clarify expectations regarding types like pydicom.MultiValue.

@jcupitt commented Nov 18, 2022

I agree, an API compatibility layer would be a good plan. Hopefully we can just paint a thin skin of python over libdicom and support pydicom programs. Though I've not looked into the details yet, of course.

@darcymason (Member)

Hi all, I've just merged the parent fix and other changes from #1734 ... some good speed improvements there; in particular, access by indexing is orders of magnitude better. The original file (ABD_LYMPH_008_SEG) read is ~20% faster, I think; the timing varies greatly from run to run.

#1734 has some graphs snipped from the asv outputs which give an idea of the speed-up, at least on my laptop.

There's probably lots more to be done, as we discussed, but I think this was a (relatively) easy start, and gets rid of the extreme slow-down when accessing nested sequences by index.

Please give it a try if you can and let us know if any other issues crop up.

@darcymason (Member)

I'm going to close this issue in favor of another one (eventually) - we solved some of the immediate problems here. The next level is a re-write of the core file reading code that I've done in an experimental (currently private) repo. But that has so many knock-on problems for integration into pydicom, and at the moment there's not enough time to look into it. I will hopefully come back to it at some point.
