
Default read_size in read_undefined_length_value() in fileutil.py is too small and slows down reading large DICOM files #436

Closed
mshunshin opened this issue Jul 24, 2017 · 4 comments


@mshunshin
Contributor

read_undefined_length_value() has a default read size of 128 bytes.

This function is called when reading files whose PixelData has undefined length. For a large file (e.g. 70 MB of pixel data), the 128-byte chunks make opening the file take ~2 s. Raising the read size to, say, 1024 or 8192 bytes speeds it up dramatically.

For example, a 69 MB file loaded from a Samsung EVO 830 SSD over USB 3:

```python
import pydicom

file_fl = '/Volumes/Matt/massive_dicom_file'
```

### With read_size at 1024

```
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=False)
224 ms ± 95.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

### With read_size at 128

```
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=False)
2.38 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

What do you think the default should be?

Actually, it is a shame that opening a file in binary mode in Python, e.g.

```python
with open(file_fl, "rb") as f:
    pass
```

doesn't give you a find method, i.e.

```python
f.find()
```

so we have to do the awkward dance of read, check for EOF, read again, append, find, etc.
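That read/append/find dance looks roughly like the following. This is a simplified sketch of the pattern being criticized, not pydicom's actual implementation; the function name and the DICOM sequence-delimiter bytes used in the example are illustrative.

```python
import io

def read_until_delimiter(fp, delimiter, read_size=128):
    """Read from fp in fixed-size chunks until `delimiter` is found;
    return the bytes that precede it. `read_size` is the parameter at
    issue here: tiny chunks mean many read() calls and repeated scans."""
    data = b""
    while True:
        chunk = fp.read(read_size)
        if not chunk:
            raise EOFError("Delimiter not found before end of file")
        data += chunk
        # Searching the whole accumulated buffer also catches a
        # delimiter that straddles a chunk boundary.
        index = data.find(delimiter)
        if index != -1:
            # Rewind so the stream position sits just past the delimiter.
            fp.seek(index + len(delimiter) - len(data), io.SEEK_CUR)
            return data[:index]

fp = io.BytesIO(b"pixel data bytes" + b"\xfe\xff\xdd\xe0" + b"rest")
value = read_until_delimiter(fp, b"\xfe\xff\xdd\xe0")
```

With a 70 MB undefined-length value and read_size=128, this loop runs hundreds of thousands of times, which is where the ~2 s goes.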

If instead you open it via mmap:

```python
import mmap

with open(file_fl, "r+b") as file_f:
    with mmap.mmap(file_f.fileno(), 0) as file_m:
        ...
```

you get a find method on file_m, which saves you all that code and is much faster.
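A small self-contained demonstration of that idea (the temp file and delimiter bytes here are illustrative, not pydicom code): the mapped object exposes find() directly, so the chunked loop disappears.

```python
import mmap
import os
import tempfile

# Write a small file containing a delimiter-like byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header" + b"\xfe\xff\xdd\xe0" + b"trailer")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # One call locates the pattern; no manual chunking or EOF checks.
        offset = m.find(b"\xfe\xff\xdd\xe0")

os.remove(path)
```

Note that mapping a read-only file requires `access=mmap.ACCESS_READ`; the `"r+b"` mode in the snippet above implies a writable mapping by default.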

I have patched my local version to use mmap, but I don't know if this is a reasonable thing for mainline.

Matt

@mrbean-bremen
Member

There is a somewhat related issue (#139), which deals with reading pixel data via numpy.memmap.
I think your change makes a lot of sense (especially for pixel data with unknown length), so I would propose you create a PR for that.
cc @darcymason.

@darcymason
Member

I'm +1. I'm not sure where the original 128 came from; certainly a higher value makes more sense. And mmap sounds good, but is there a performance hit for reading the whole file that way? Or is it reasonable to open a mapping only when reading larger values, and skip to the right offset?

How does an mmap solution handle reading a set of dicom files and dealing with the values after the file is closed?
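One answer to the lifetime question, sketched below under the assumption that extracted values are taken as slices: slicing an mmap returns an independent bytes copy, so the value stays valid after the map and the file are closed (a memoryview over the mmap, by contrast, would not). The file contents here are illustrative.

```python
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"DICM pixel payload")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        value = m[5:10]  # slicing copies into a new bytes object

os.remove(path)
# `value` remains usable even though the map and file are now closed.
```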

@mrbean-bremen
Member

Maybe we could just change the default size for now (4096 would be my pick), and handle the mmap part later together with #139.

@mshunshin
Contributor Author

Experiments using the same SSD as above, this time with a smaller file (15 MB). Larger reads, up to 8–16 KB, continue to improve performance without harming stop_before_pixels.

Can we go for 8 KB?

I will send a PR.

Read: 128

```
%timeit data = pydicom.read_file(file_fl)
446 ms ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit data = pydicom.read_file(file_fl)
469 ms ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.57 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.58 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Read: 1024

```
%timeit data = pydicom.read_file(file_fl)
87.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl)
89.7 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.56 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
6.01 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Read: 4 * 1024

```
%timeit data = pydicom.read_file(file_fl)
52.2 ms ± 6.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit data = pydicom.read_file(file_fl)
55.2 ms ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
5.32 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
5.5 ms ± 853 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Read: 8 * 1024

```
%timeit data = pydicom.read_file(file_fl)
43.3 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit data = pydicom.read_file(file_fl)
45.7 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.58 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.48 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Read: 16 * 1024

```
%timeit data = pydicom.read_file(file_fl)
39.3 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl)
39.7 ms ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.33 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
5.34 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Read: 64 * 1024

```
%timeit data = pydicom.read_file(file_fl)
39.3 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl)
38.7 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
5.04 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pydicom.read_file(file_fl, stop_before_pixels=True)
4.6 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
