
Update filereader.py #227

Closed
wants to merge 1 commit into darcymason:master from parneshr:master

Conversation

parneshr

Added chunked reads when reading an element with fixed length, to overcome issues with large data elements. In particular, elements over 4 GB result in errant reads, while reading the same elements in smaller chunks of, say, 1 GB produces no errors. This should have no performance impact for standard-size data elements.
@landscape-bot

Code Health
Repository health decreased by 0.06% when pulling 39a9908 on parneshr:master into 43f2784 on darcymason:master.

len_to_read = length - buf_size
while len_to_read > 0:  # read the remainder in buf_size chunks
    if len_to_read > buf_size:
        value = value + fp_read(buf_size)
        len_to_read -= buf_size
    else:
        value = value + fp_read(len_to_read)
        len_to_read = 0
cancan101 (Contributor)
Is there something better than a string that can be used here?

@darcymason (Member)

I like this chunked-read idea in principle, but I echo cancan101's question about using a string. String concatenation would probably be very slow. Another problem is that a binary file read in Python 3 returns a bytes object, so I think using a string will generate a type error in Python 3.

There are a couple of other minor things: the landscape-bot new issues (whitespace problems - just style issues), and the setting of buf_size, which should happen at the top of the generator rather than inside the loop, both to optimize speed and to be a little cleaner.
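
For illustration, a rough sketch with those points addressed - accumulating into a bytearray (cheap appends, converts directly to bytes) rather than a str, with buf_size fixed once before the loop. Note that read_in_chunks and fp are placeholder names, not pydicom API:

    def read_in_chunks(fp, length, buf_size=1024 * 1024):
        # Sketch only: read `length` bytes in chunks of at most `buf_size`,
        # returning bytes so the result type matches binary reads on both
        # Python 2 and Python 3.
        value = bytearray()
        remaining = length
        while remaining > 0:
            chunk = fp.read(min(remaining, buf_size))
            if not chunk:  # unexpected end of file
                raise EOFError("%d bytes still unread" % remaining)
            value.extend(chunk)
            remaining -= len(chunk)
        return bytes(value)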

@darcymason (Member)

Is the need for this removed if memmap (#139) is implemented?

@mrbean-bremen (Member)

I think that memmap could help to prevent this problem - you could just do the partial reads and handle the concatenation outside of pydicom, and would avoid any unneeded performance degradation. The only downside is that it still has to be implemented...
Also, I'm not entirely sure about the original problem - AFAIK, the maximum size of a DICOM element is 4 GB, but here it was stated that it was larger than that.
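
As an illustration of the memmap route (a sketch only - numpy.memmap is one possible mechanism, not an existing pydicom feature; the offset and length values here are hypothetical and would come from the element parse):

    import numpy as np

    # Hypothetical: byte offset and length of the large element's value,
    # as recorded while parsing the file.
    offset, length = 0x3000, 6 * 1024 ** 3

    # Map only the element's bytes; nothing is read until a slice is touched.
    element = np.memmap("large_file.dcm", dtype=np.uint8, mode="r",
                        offset=offset, shape=(length,))
    first_chunk = bytes(element[:1024 * 1024])  # pages in just this slice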

@darcymason (Member)

AFAIK, the maximum size of a DICOM element is 4 GB, but here it was stated that it was larger than that.

Agreed. There are at most four bytes stored in a DICOM file for Length, so the maximum possible value length should be 4 GB.

@mrbean-bremen (Member)

I looked at this again, and I would add this only as an option (probably something like chunk_size) which defaults to None, i.e. no chunked read - if it is needed at all, which has to be decided.
The reason is that it seems not to be needed on most (64-bit) systems, and it introduces additional memory and CPU load for large files.
For reading, byte strings should be used, as has already been noted, and instead of concatenation the byte strings could be collected in a list and joined at the end - this seems to be the fastest way according to Stack Overflow (though it probably will double the memory usage during joining - I haven't checked this).
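
For concreteness, the collect-and-join pattern described above might look like this (a sketch; the chunk_size=None default mirrors the opt-in behaviour suggested, and read_value is a placeholder name):

    def read_value(fp, length, chunk_size=None):
        # chunk_size=None keeps the current behaviour: one single read.
        if chunk_size is None:
            return fp.read(length)
        # Otherwise collect byte-string chunks and join them once at the
        # end, avoiding repeated reallocation from incremental concatenation.
        chunks = []
        remaining = length
        while remaining > 0:
            chunk = fp.read(min(remaining, chunk_size))
            if not chunk:  # unexpected end of file
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)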

@darcymason (Member)

Trying to clean up old issues ... this relates to #222. On reading this again, I remembered that length can be "undefined length", so maybe that allows more than 4 GB (although the original post talks about fixed-length elements).

To close this, I think we need a solution to the string question (does BytesIO work more efficiently?), or a memmap solution.
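
On the BytesIO question, a comparable sketch using io.BytesIO as the accumulator (it also avoids quadratic reallocation; whether it actually beats list-and-join would need measuring):

    from io import BytesIO

    def read_value_bytesio(fp, length, chunk_size=1024 * 1024):
        # Sketch: write chunks into an in-memory buffer, then take the
        # assembled value out once with getvalue().
        buf = BytesIO()
        remaining = length
        while remaining > 0:
            chunk = fp.read(min(remaining, chunk_size))
            if not chunk:  # unexpected end of file
                break
            buf.write(chunk)
            remaining -= len(chunk)
        return buf.getvalue()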

@mrbean-bremen (Member)

@darcymason - I think this one can be closed, because the problem can be handled by reading frame by frame, once the respective PR has been merged. A possible memmap feature would also help, as discussed.

@darcymason (Member)

@mrbean-bremen, agreed. Closing.
