Seeking ideas for performance improvements with large segmentation files #1728
Here is a dumb profile of the dcmread call on the file, using the profile module:

import profile
import pydicom

profile.run("pydicom.dcmread('ABD_LYMPH_008_SEG.dcm')")

Output:
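For a more digestible view, the same call can also be profiled with cProfile and the stats sorted by cumulative time; a minimal sketch using only the standard library, with the file name from this thread:

import cProfile
import pstats

import pydicom

# profile the read and dump the stats to a file
cProfile.run("pydicom.dcmread('ABD_LYMPH_008_SEG.dcm')", 'dcmread.prof')

# show the 20 most expensive calls by cumulative time
stats = pstats.Stats('dcmread.prof')
stats.sort_stats('cumulative').print_stats(20)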
Thanks for the analysis! From your profiling there are some candidate hot spots. This is all guesswork until more profiling has been done, but it is certainly something to be looked at.
I enjoy optimization problems, and find it very satisfying when improvements are added to pydicom, so I welcome any speed-ups that can come out of this, assuming they don't overly complicate the code or slow down other types of reading. I'm downloading the file and will have a closer look when I can. Meanwhile, one question is about use cases: the optimization may well depend on how many of these nested items you need to access. If all of them, I'm not sure there is much that can be done; the cost will have to be paid at some point. In other words, it might just come down to the serial nature of DICOM and having to read through to get at any particular data element. Also, do you have any comparison of time to read with other DICOM toolkits?
Thanks @darcymason and @mrbean-bremen 👍 I agree, this can make for an interesting programming challenge, but we also feel it's important to characterize the performance of this use case in order to see how the standard will scale. As @CPBridge points out, this particular SEG is large, but the CT is of a pretty standard clinical size, and the ML tool we are using, TotalSegmentator, turns out to be very robust at generating these segmentations on pretty much any CT I've given it (not always perfect, but pretty remarkable really).
We have started experimenting in OHIF, where we use dcmjs, and in Slicer, where we currently use dcmqi, built on dcmtk. @dclunie is also testing with his Java tools, PixelMed I believe, but we can verify. It's too early to make definitive statements, but the short answer is that they all seem slower than we'd like them to be. As one might expect, each language and its design decisions bring their own tradeoffs, and the optimization approaches may differ. I'd think we can define a common set of operations to compare the implementations, something like: time to read the header, time to unpack the pixel data, time to assemble the segments into a labelmap (a rough harness is sketched below). In addition to timings there are also memory footprint issues depending on how you want to use the pixel data (e.g. one common way to access the segmentation in highdicom returns 100 volumes of 512x512x643). I'm hoping we can use this kind of cross-toolkit research not only to improve performance (and possibly the standard itself) but also to test interoperability of these derived objects.
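As a starting point, a minimal sketch of such a harness on the pydicom side might look like the following (the file name is a placeholder; the labelmap-assembly step is toolkit-specific and omitted here):

import time

import pydicom

def timed(label, fn):
    # run fn once and report wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.2f} s')
    return result

header = timed('read header only',
               lambda: pydicom.dcmread('seg.dcm', stop_before_pixels=True))
ds = timed('full read', lambda: pydicom.dcmread('seg.dcm'))
pixels = timed('unpack pixel data', lambda: ds.pixel_array)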
Thanks @darcymason and @mrbean-bremen for your help in trying to get to the bottom of it. @CPBridge and @pieper already nicely described the use case for which we need to optimize performance. I would just like to add that there are additional use cases. In digital pathology, VL Whole Slide Microscopy Image instances with large numbers of frames run into the same problem. Playing around with a test dataset, I made some interesting observations:

from datetime import datetime
import highdicom as hd
from pydicom.dataset import Dataset
from pydicom.uid import VLWholeSlideMicroscopyImageStorage
if __name__ == '__main__':
    print('create dataset...')
    dataset = Dataset()
    dataset.SOPClassUID = VLWholeSlideMicroscopyImageStorage
    dataset.ImageOrientationSlide = [1, 0, 0, 0, 1, 0]
    dataset.Rows = 256
    dataset.Columns = 256
    dataset.TotalPixelMatrixRows = 10000
    dataset.TotalPixelMatrixColumns = 40000
    ops_item = Dataset()
    dataset.OpticalPathSequence = [ops_item]
    tpmos_item = Dataset()
    tpmos_item.XOffsetInSlideCoordinateSystem = 0.0
    tpmos_item.YOffsetInSlideCoordinateSystem = 0.0
    dataset.TotalPixelMatrixOriginSequence = [tpmos_item]
    sfgs_item = Dataset()
    pms_item = Dataset()
    pms_item.PixelSpacing = [0.0001, 0.0001]
    sfgs_item.PixelMeasuresSequence = [pms_item]
    dataset.SharedFunctionalGroupsSequence = [sfgs_item]
    dataset.PerFrameFunctionalGroupsSequence = []
    for pps in hd.utils.compute_plane_position_slide_per_frame(dataset):
        ppfgs_item = Dataset()
        ppfgs_item.PlanePositionSequence = pps
        dataset.PerFrameFunctionalGroupsSequence.append(ppfgs_item)
    dataset.NumberOfFrames = str(len(dataset.PerFrameFunctionalGroupsSequence))

    print('iterate over PFFG sequence items...')
    start = datetime.now()
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        # accessing the attributes forces any lazy conversion
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
    end = datetime.now()
    diff = end - start
    print(f'iteration took {diff}')

    print('iterate over PFFG sequence item indices...')
    start = datetime.now()
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = dataset.PerFrameFunctionalGroupsSequence[i]
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
    end = datetime.now()
    diff = end - start
    print(f'iteration took {diff}')
It appears that indexing into large sequences is a major performance bottleneck.
Hm, as I wrote, the sequence items are first read as a blob for performance reasons (so that they can just be written back if needed), but at the moment the values in a sequence item are accessed, they are parsed and converted into full data elements.
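A small sketch to illustrate the deferred conversion just described, assuming a large file on disk (the file name is a placeholder, and the exact behavior depends on the pydicom version): the first access of a nested value pays the parsing cost, while repeated access is a plain lookup.

import time

import pydicom

ds = pydicom.dcmread('large_seg.dcm', stop_before_pixels=True)

for attempt in (1, 2):
    start = time.perf_counter()
    # the first access may trigger parsing/conversion; the second should not
    ds.PerFrameFunctionalGroupsSequence[0].PlanePositionSequence
    print(f'access {attempt}: {time.perf_counter() - start:.6f} s')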
Note that in the above example no data is actually read from any file. The dataset is entirely created in memory.
Right, that makes it really interesting then... I'll try to take a closer look at this later today.
I've updated the example to also profile the two for loops:

import profile
import highdicom as hd
from pydicom.dataset import Dataset
from pydicom.uid import VLWholeSlideMicroscopyImageStorage
def iterate(dataset):
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )

def index(dataset):
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = dataset.PerFrameFunctionalGroupsSequence[i]
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item.RowPositionInTotalImagePixelMatrix,
            pps_item.ColumnPositionInTotalImagePixelMatrix,
        )
if __name__ == '__main__':
    print('create dataset...')
    dataset = Dataset()
    dataset.SOPClassUID = VLWholeSlideMicroscopyImageStorage
    dataset.ImageOrientationSlide = [1, 0, 0, 0, 1, 0]
    dataset.Rows = 256
    dataset.Columns = 256
    dataset.TotalPixelMatrixRows = 10000
    dataset.TotalPixelMatrixColumns = 40000
    ops_item = Dataset()
    dataset.OpticalPathSequence = [ops_item]
    tpmos_item = Dataset()
    tpmos_item.XOffsetInSlideCoordinateSystem = 0.0
    tpmos_item.YOffsetInSlideCoordinateSystem = 0.0
    dataset.TotalPixelMatrixOriginSequence = [tpmos_item]
    sfgs_item = Dataset()
    pms_item = Dataset()
    pms_item.PixelSpacing = [0.0001, 0.0001]
    sfgs_item.PixelMeasuresSequence = [pms_item]
    dataset.SharedFunctionalGroupsSequence = [sfgs_item]
    dataset.PerFrameFunctionalGroupsSequence = []
    for pps in hd.utils.compute_plane_position_slide_per_frame(dataset):
        ppfgs_item = Dataset()
        ppfgs_item.PlanePositionSequence = pps
        dataset.PerFrameFunctionalGroupsSequence.append(ppfgs_item)
    dataset.NumberOfFrames = str(len(dataset.PerFrameFunctionalGroupsSequence))

    print('iterate over PFFG sequence items...')
    profile.run('iterate(dataset)')
    print('index PFFG sequence items...')
    profile.run('index(dataset)')
It's interesting that these lines show up (pydicom, lines 120 to 121 at commit a8be738).
One thing that jumps out at me is the tremendous increase in calls to tag_for_keyword. One easy win is to move the sequence lookup out of the loop:

def index(dataset):
    func_group_seq = dataset.PerFrameFunctionalGroupsSequence
    for i in range(int(dataset.NumberOfFrames)):
        pffgs_item = func_group_seq[i]
        ...

And since tag_for_keyword is essentially just return keyword_dict.get(keyword), it could just be in-lined in dataset.py to save function call overhead (see the sketch after this comment). That's only a part of the problem, but on my PC the index run is taking so long, I'm still waiting for it... I'll need a simpler example to work with to dig deeper.
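A sketch of that in-lining idea: go straight to the underlying dictionary rather than through the tag_for_keyword wrapper (keyword_dict is a pydicom internal, so this is illustrative rather than recommended API usage):

from pydicom.datadict import keyword_dict

# a single dict lookup, with no wrapper function call overhead
tag = keyword_dict.get('RowPositionInTotalImagePixelMatrix')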
@darcymason here are a multiframe MR and corresponding SEG that may be good for testing. We can also dig up even smaller ones if that would help.
https://s3.amazonaws.com/IsomicsPublic/SampleData/MRHead/MRHead-multiframe.dcm
https://s3.amazonaws.com/IsomicsPublic/SampleData/MRHead/aseg-t1space.dcm
Probably going to be okay. So far, moving that lookup outside the loop has taken almost all of the time difference away. I'll post results shortly; I'm just double-checking that I didn't change something else. I'm guessing it is related to the parent setting @hackermd mentioned, probably causing a recursive cascade each time through the loop somehow... but I'll keep digging.
Okay, confirmed. I'm getting different times on each run, but the index one is only ~10% longer, sometimes the same or even less, if I add a single line to pydicom to avoid setting the Sequence parent if it is already set. So... definitely a bug in pydicom related to setting the parent when returning a Sequence. In the iterator, the sequence is only looked up once; likewise if the sequence lookup is hoisted out of the loop. I'm not sure the one-line solution I tried is the best answer; I'll think it out a bit more and put in a PR in the next couple of days. I don't know if this helps at all with the performance issues raised earlier in this issue, probably not, but we can keep looking.
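For reference, the shape of the one-line guard described above is roughly the following (names and surrounding structure are illustrative, not pydicom's actual source):

def return_sequence(dataset, seq):
    # only pay the expensive parent assignment when the parent has
    # actually changed; assigning it walks the items and updates refs
    if getattr(seq, '_parent_dataset', None) is not dataset:
        seq.parent = dataset
        seq._parent_dataset = dataset
    return seq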
Thanks so much for looking into the issue @darcymason.
Good point! This is certainly helpful and we will go through the highdicom code base to identify statements that could be improved in this regard.
Not sure either. It would probably be worth carefully reviewing the implementation of the parent setter.
Another thought/observation on @hackermd's recent iterating example: another way to move work out of the loop is to avoid the repeated keyword-to-tag lookup on each attribute access, and instead return the value through a pre-computed tag:

from pydicom.tag import Tag
def iterate_preload_tags(dataset):
    row_pos_tag = Tag("RowPositionInTotalImagePixelMatrix")
    col_pos_tag = Tag("ColumnPositionInTotalImagePixelMatrix")
    for pffgs_item in dataset.PerFrameFunctionalGroupsSequence:
        pps_item = pffgs_item.PlanePositionSequence[0]
        (
            pps_item[row_pos_tag].value,
            pps_item[col_pos_tag].value,
        )

When I profile this against the other iterator, I see a ~20% speed improvement. You lose a little in code readability (although you could use the full keyword for the tag name if you wanted), but in really critical areas where a few tags are used repeatedly, this could help a bit.
Thanks for investigating further @darcymason. I think it's a good idea to rely on the lower-level interface and use the tags directly for repeated access of data elements in loops. Do you think it would also help if the keyword-to-tag lookup were cached?
It would help a little, but not significantly, I don't think. I always feel that when needing speed-ups, one really wants at least a factor of 2, or even better a factor of 10. So I tried something very interesting, just to get an idea of the ultimate speed this little bit of code could reach in Python:

class Dataset:
    pass

class FileMetaDataset:
    pass

Sequence = list

This has the effect of removing pydicom's overhead. (Found a mistake, see next post.)
Of course, this only works in this limited example. These simple structures would not handle private data elements, or writing out the file again, or many other pydicom API capabilities. However, it might be possible to create an 'alternate pydicom' (not fully compatible with the existing one) that started from these kinds of basic structures and worked around those other problems. I have often thought about toying with these ideas (even as I write all this I'm thinking through ways to get back much of pydicom's functionality), but that would be a massive undertaking.
So, a change to this: I realized for that test I had reverted to current pydicom, which still had the bug of setting the Sequence parent on every access.
That's an impressive speedup! It's also good to know that a pure Python implementation can be fast.
What changes did you make? Could you please share those? When I just comment out the following lines (118 to 121 at commit 01ae3fc)
I am observing a drastic speedup:
As you have also stated, the parent setter appears to be the major bottleneck and should be improved or avoided.
@CPBridge and I have thought about ways to reduce the overhead. One idea was to implement __slots__. When I simply change the following lines (824 to 828 at commit 01ae3fc)
to

try:
    tag = tag_for_keyword(name)
    return self[tag].value
except KeyError:
    pass

I am already getting a noticeable additional speedup:
The repeated dictionary lookup operations certainly add up, and there seems to be room for improvement in several places, e.g., in Dataset.__getattr__.
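For a sense of scale, here is a micro-benchmark sketch comparing keyword access (a keyword-to-tag lookup on every call) against a pre-computed tag; absolute numbers will vary by machine and pydicom version:

import timeit

from pydicom.dataset import Dataset
from pydicom.tag import Tag

ds = Dataset()
ds.Rows = 256
tag = Tag('Rows')

# keyword -> tag lookup happens on every access
print(timeit.timeit(lambda: ds.Rows, number=100_000))
# tag resolved once, outside the timed loop
print(timeit.timeit(lambda: ds[tag].value, number=100_000))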
I've thought about that for some time, and we have started to develop a lightweight C library (https://github.com/hackermd/libdicom) that could be used for that purpose. The library doesn't have Python bindings yet, but that's on our roadmap. cc @jcupitt for awareness. We've also been brainstorming about using Nuitka to compile the package into a C extension. However, it's unclear whether and by how much that would increase performance.
Huh, interesting discussion, thank you for pinging me. Yes, we're planning to do a little work on @hackermd's libdicom, then make a Python binding with cffi. In API mode, cffi will generate and compile the CPython/libdicom bridge at setup time, so on Linux and macOS at least, you get direct calls from Python into native code. On Windows you need to generate the link code at runtime in Python, so it's a bit slower, but hopefully still not too bad. It's the approach I've taken with pyvips, and it seems to work well. This should all happen within the next six months, with a little luck around funding.
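For anyone unfamiliar with cffi's API mode, the build step looks roughly like this; the libdicom type and function names below are hypothetical, since the real C API may differ:

from cffi import FFI

ffibuilder = FFI()

# declarations would in reality be copied from the library's header
ffibuilder.cdef("""
    typedef struct _DcmDataSet DcmDataSet;
    DcmDataSet *dcm_file_read(const char *path);
""")

# compiles a real CPython extension module linked against the C library
ffibuilder.set_source(
    "_libdicom_cffi",
    '#include "dicom.h"',
    libraries=["dicom"],
)

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)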
That's the section. I've branched locally and will do a PR on this. Commenting out is fine for this profiling example, but unfortunately code elsewhere does need those parent references.
Yes, I've been thinking perhaps DataElement could be a dataclass with __slots__.
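A minimal sketch of that idea (dataclass slots need Python 3.10+; the field list is illustrative, not DataElement's real attribute set):

from dataclasses import dataclass
from typing import Any

@dataclass(slots=True)
class SlimDataElement:
    # __slots__ avoids a per-instance __dict__, cutting memory use and
    # speeding up attribute access
    tag: int
    VR: str
    value: Any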
I suspect there are many cases where the exception would actually be raised, which could make the try/except version more costly.
I wasn't aware of that - good to know. I toyed with some C code to read DICOM many years ago and never had the time to pursue it further. I think some targeted small parts could be beneficial to add as an optional import to pydicom (somewhat like the old paired Python libs such as StringIO/cStringIO). I've often wondered whether just defining a couple of well-defined objects (e.g. Tag, DataElement) in C could result in significant gains from a very small, highly portable C extension.
Nuitka is also new to me - looks interesting; let us know if anything comes of that. I'm more interested in performance improvements in the basic logic of pydicom, like we have been discussing, but perhaps some changes can be made with an eye to being compatible with compilers like that. Finally, I will point out that pydicom does have a benchmark suite (run via asv).
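For reference, an asv benchmark covering this use case could be as simple as the following (the file path is a placeholder; asv times any method named time_*):

import pydicom

class TimeLargeSegRead:
    def setup(self):
        # path to a large segmentation file, e.g. the one from this issue
        self.path = 'ABD_LYMPH_008_SEG.dcm'

    def time_read_metadata_only(self):
        pydicom.dcmread(self.path, stop_before_pixels=True)

    def time_full_read(self):
        pydicom.dcmread(self.path)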
I think that's a good idea.
I am not familiar with the CPython implementation details, but I had the impression that it can be faster to try and handle the exception than to check for the key first.
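That intuition is easy to check with a quick sketch: the try/except ("EAFP") path is cheap when the key is present, and the exception only costs when it is actually raised:

import timeit

d = {'Rows': 256}

def eafp(key):
    try:
        return d[key]
    except KeyError:
        return None

def lbyl(key):
    return d[key] if key in d else None

print(timeit.timeit(lambda: eafp('Rows'), number=1_000_000))     # hit, no exception
print(timeit.timeit(lambda: eafp('Missing'), number=1_000_000))  # miss, raises each time
print(timeit.timeit(lambda: lbyl('Missing'), number=1_000_000))  # miss, membership test only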
Exactly! It may also help improve the performance of data access methods, because some of the expensive operations (e.g., comparing two tags) could be pushed down into C. @jcupitt I'm not sure how we may want to map the C data structures to Python classes, but it would be nice for the libdicom Python bindings to offer a pydicom-compatible API.
@darcymason it would probably help if we could more formally define the "public" pydicom API and clarify expectations regarding types such as Dataset, DataElement, and Sequence.
I agree, an API compatibility layer would be a good plan. Hopefully we can just paint a thin skin of Python over libdicom and support pydicom programs, though I've not looked into the details yet, of course.
Hi all, I've just merged the parent fix and other changes from #1734 ... some good speed improvements there; in particular, access by indexing is orders of magnitude better. The original file (ABD_LYMPH_008_SEG) read is ~20% faster, I think; the timing varies greatly from run to run. #1734 has some graphs snipped from the asv outputs which give an idea of the speed-up, at least on my laptop. There's probably lots more to be done, as we discussed, but I think this was a (relatively) easy start, and it gets rid of the extreme slow-down when accessing nested sequences by index. Please give it a try if you can, and let us know if any other issues crop up.
I'm going to close this issue in favor of another one (eventually); we solved some of the immediate problems here. Another level would be a re-write of the core file reading code, which I've done in an experimental (currently private) repo. But that has so many knock-on problems to integrate into pydicom, and at the moment I don't have enough time to look into it. I will hopefully come back to it at some point.
Hi,
A group of us in the highdicom community (@hackermd @pieper @fedorov) have been playing around with large DICOM Segmentation files and have found some performance issues with pydicom that are affecting us quite badly. We are happy to contribute to pydicom to try and fix these issues but would appreciate any thoughts/guidance as to what the most likely avenues for improvements may be.
Specifically, we are trying to parse segmentations with large numbers of segments. See for example this file shared by @pieper, which is a segmentation of a CT scan with 98 organs of interest segmented in 643 slices, giving over 12000 frames in the segmentation dataset. So certainly large, but something that may occur relatively often in practice. Simply calling dcmread on this file from the local filesystem takes over 5s on my latest-gen arm64 MacBook, which we feel is much slower than desirable for many applications (obviously times on lower-end hardware are worse; it was taking @pieper around 15s in a Colab notebook). The file is huge, and obviously the bulk of this is pixel data; however, to our surprise, specifying stop_before_pixels=True does not significantly affect the read time (a minimal reproduction is sketched below). We conclude that the bottleneck is related to reading in the highly nested metadata, most likely dominated by the items of PerFrameFunctionalGroupsSequence, which contains over 12000 items, each of which is a nested dataset with a relatively small number of elements within it.

Does anyone in the pydicom community have any idea how/whether we may be able to make improvements to pydicom to bring this down to something more practical?
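The minimal reproduction mentioned above, using the file shared earlier:

import time

import pydicom

for stop in (False, True):
    start = time.perf_counter()
    pydicom.dcmread('ABD_LYMPH_008_SEG.dcm', stop_before_pixels=stop)
    print(f'stop_before_pixels={stop}: {time.perf_counter() - start:.2f} s')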
See also this related highdicom issue (note that we are also exploring other means to improve the general issue of highdicom's slow performance with large segmentations, including improvements to highdicom and possibly changes to the standard, but here we are narrowly interested in what can be done about this slow dcmread).
Thanks in advance for any suggestions!