
[MRG] Support buffers for OB DataElement Values #1919

Open · wants to merge 26 commits into base: main

Conversation

@gnottobfly commented Oct 25, 2023

Support buffers for OB DataElement Values

Putting up an early PR to get feedback before investing too much time. Comments on the approach or API welcome!

This allows setting a DataElement's value to a buffer. The motivation behind this is to write a dataset to disk without copying all of a large image/video into memory to do so. See #1913 for the discussion that led to this. For now I'm only targeting byte values as those are the ones that are expected to be large.

Describe the changes

  • Allows assigning the DataElement's value to a BufferedIOBase
  • Adds a new method value_generator that yields chunks from the buffer
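A rough sketch of the chunked-read pattern this enables, independent of pydicom (the helper name and defaults here are illustrative, not the final API):

```python
import io
from collections.abc import Iterator

def chunked_reads(buffer: io.BufferedIOBase, chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield successive chunks from the buffer until it is exhausted."""
    while chunk := buffer.read(chunk_size):
        yield chunk

# A BytesIO stands in for a buffered reader over a large on-disk file.
src = io.BytesIO(b"\x00\x01" * 10_000)  # 20,000 bytes of mock pixel data
chunks = list(chunked_reads(src))
print(len(chunks), sum(len(c) for c in chunks))  # 3 20000
```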

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

* Allows assigning the DataElement's value to a BufferedIOBase
* Adds a new method value_generator that yields chunks from the buffer
@gnottobfly gnottobfly changed the title Support buffers for OB DataElement Values [WIP] Support buffers for OB DataElement Values Oct 25, 2023

codecov bot commented Oct 25, 2023

Codecov Report

Merging #1919 (f47c4bf) into main (d59c339) will increase coverage by 0.01%.
Report is 4 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head f47c4bf differs from pull request most recent head 894fff8. Consider uploading reports for the commit 894fff8 to get more accurate results

@@            Coverage Diff             @@
##             main    #1919      +/-   ##
==========================================
+ Coverage   97.74%   97.75%   +0.01%     
==========================================
  Files          63       64       +1     
  Lines       10739    10802      +63     
==========================================
+ Hits        10497    10560      +63     
  Misses        242      242              
Files Coverage Δ
src/pydicom/dataelem.py 98.50% <100.00%> (+0.07%) ⬆️
src/pydicom/filewriter.py 98.11% <100.00%> (+0.10%) ⬆️
src/pydicom/util/buffers.py 100.00% <100.00%> (ø)
src/pydicom/valuerep.py 99.27% <100.00%> (+<0.01%) ⬆️


@darcymason (Member) left a comment
Very quick initial review ... I've posted a few comments but will need to absorb this for a while longer for the bigger structural issues.

I'm trying to think through how this would be used in practice. For example, here the full value is read in if value is accessed, but that seems to defeat the purpose. However, if in practice people just substitute in a generator value for a 'dummy' small value just before writing then maybe that can be avoided.

I thought briefly about suggesting a subclass DataElementBuffered rather than updating a bunch of code within DataElement, but after a little thought I'm thinking one class is probably better.

I'll try to go over a little more in the next couple of days.

"""Input parameter used to store how many bytes have been read from
a buffered value"""
bytes_read: int = 0


def __init__(
self,

Member:
On BufferInfo class, why not just store bytes_read directly in DataElement?

Author:
I was thinking of the value_generator function returning the number of bytes read, so an output parameter made sense here. I think it's most likely unnecessary.

@@ -160,6 +162,20 @@ class DataElement:
maxBytesToDisplay = 16
showVR = True
is_raw = False
is_buffered = False
"""
Tracks whether the data element value is a buffer. If so, calling self.value will read the entire

Member:
These are class attributes meant to apply to all instances of the class (is_raw is used for all instances, because RawDataElement sets its own is_raw).

In any case, I think it better to make it a read-only property is_buffered, which checks the type of _value. Then they can never get out of synch. E.g. see my comment on changing the value property setter.

# handle a buffered value - ie, a buffer pointing to a file on disk
if isinstance(value, BufferedIOBase):
if VR not in VR_.OB:
raise ValueError(f"Invalid VR: {self.VR}. Only the OB VR supports buffered values.")

Member:
Should just be != here instead of not in.

Author:
I updated this with a tuple but good catch!

@@ -449,6 +482,48 @@ def value(self, val: Any) -> None:

self._value = self._convert_value(val)


def value_generator(self, *, buffer_info: BufferInfo = None, chunk_size = 1024) -> Iterator[bytes]:
"""Consume data from a buffered value. The value must be a buffer, the buffer must be

Member:
I think probably at least 8K chunk default is better. We had this come up with pydicom's undefined length reads before (#436).

But I'm also now wondering whether this generator is needed. Any code using buffered reading can just directly use value after checking it is a buffered reader. I need to ponder a little more ...

if self.is_buffered:
self._value = cast(BufferedIOBase, self._value).read()
self.is_buffered = False

return self._value

Member:
I think we need to update the value setter code as well - value could be set to a buffer after initial creation. The logic to check that on __init__ probably belongs in the value setter, actually, where it can handle both cases.

Author:
Good thinking - will update.

@gnottobfly (Author):
Thanks for the initial feedback! I had similar thoughts about subclassing as well, but felt it would end up leading to more complicated code.

The reason I went with the value getter reading the entire buffer is that I wasn't sure whether it made sense to make all code in the codebase buffer-aware, or to allow a more gradual upgrade path instead. For our particular use case, since we only need to write the value to a file, it touches a pretty small amount of code. But what should the expected behavior be if you assign PixelData to a buffer and then try to compress/decompress it or access pixel_array, for instance?

* use @Property for dynamic is_buffered
* use value setter to handle buffered value
* remove BufferInfo class
* remove default buffer read behavior

@gnottobfly (Author):
After thinking about it a bit more, I think you are right on not reading the full buffer by default, as that'll be hard to code against and may inevitably lead to reading in the whole thing by accident. From your perspective as a maintainer, is it okay if certain operations raise an exception when the code isn't set up to deal with a buffered value?

# memory usage is in bytes
limit = MEGABYTE * 400
else:
pytest.skip("This test is not setup to run on this platform")

Author:
TODO: I need to find what the other platform's unit is for this test. I'm also open to other implementations of this test, but it was the only way I could think of to ensure we didn't read the file into memory.
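For reference, the unit discrepancy lives in resource.getrusage: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, and the resource module is absent on Windows. A minimal normalizing helper (a sketch, not the PR's test code):

```python
import sys

def max_rss_bytes() -> int:
    """Peak resident set size of this process, normalized to bytes."""
    import resource  # POSIX-only module; a Windows branch would need another approach
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # getrusage reports ru_maxrss in kilobytes on Linux but in bytes on macOS.
    return rss if sys.platform == "darwin" else rss * 1024

print(max_rss_bytes() > 0)  # True on any POSIX platform
```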

Member:
Still needs to be done?

Comment on lines 657 to 660
curr_pos = fp.tell()
fp.seek(vr_length_seek_pos)
fp.write_UL(elem.bytes_read_from_buffer)
fp.seek(curr_pos)

Author:
This feels pretty hacky - I'd love to get some thoughts on how to handle this. We need to write the length of the data before we know it, and I can't see how to avoid backtracking short of calling stat on the original file the buffer came from.
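The backtracking in question can be shown in isolation: write a 4-byte placeholder where the length field goes, stream the chunks, then seek back and patch the real length (a generic sketch, not pydicom's writer; the little-endian unsigned long stands in for what write_UL emits):

```python
import io
import struct

def write_with_patched_length(fp: io.BufferedIOBase, chunks) -> int:
    """Stream chunks after a 4-byte placeholder, then backfill the real length."""
    length_pos = fp.tell()
    fp.write(struct.pack("<L", 0))      # placeholder for the length field
    total = 0
    for chunk in chunks:
        total += fp.write(chunk)
    end_pos = fp.tell()
    fp.seek(length_pos)
    fp.write(struct.pack("<L", total))  # patch in the real length
    fp.seek(end_pos)                    # restore position for subsequent writes
    return total

fp = io.BytesIO()
n = write_with_patched_length(fp, [b"abc", b"defgh"])
print(n)  # 8
```

This is exactly the pattern that breaks on non-seekable targets such as gzip file-likes, which is the concern raised in the replies.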

Member:
I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer. Also note that we don't want to seek backwards at all as this breaks certain use cases where fp is a stream or a zip file I think (I don't remember what use case it was exactly).

Edit: Ok, the problem was with writing gzip file-like (see #753). There was also an old issue about writing to a stream (#154).

Author:
> I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer

The problem is that I'm not writing the data to the buffer but instead to the file directly. The goal is to avoid reading the entire buffered element into memory and instead to transfer the data directly from the element (which contains a buffered reader) to the dicom file.

So this means that we don't know the length until we read all of the data (and write it to the file).

For a little context, the goal is to avoid reading large uncompressed files into memory.

Author:
> Edit: Ok, the problem was with writing gzip file-like (see #753). There was also an old issue about writing to a stream (#154).

That's quite the conundrum... any thoughts on how to tackle it then without reading the entirety of PixelData into a memory buffer?

Member:
> The problem is that I'm not writing the data to the buffer but instead to the file directly.

Ah right, I should have seen that - thanks!

> That's quite the conundrum... any thoughts on how to tackle it then without reading the entirety of PixelData into a memory buffer?

Right now I have no idea. May have to think about it... How easy would it be to configure that no buffering is used? That would still allow the use case where seeking back is not possible (in case we find no other solution).

@gnottobfly (Author):
@darcymason @mrbean-bremen I was thinking about the seeking issue and I think I have a solution. Instead of supporting any IO buffer, I think it would make sense to limit it to a FileIO buffer. This would allow us to call stat on the file name to get the length. If the goal is to keep large files out of memory (which it is for us), then it doesn't really make sense to support in-memory streams anyway.

If someone would like to read pixel data from a non-seekable stream like a socket, then they can write the file to a temporary file first.

It'd also make the code a little easier to manage as we wouldn't need to monkey around with keeping track of the first 4 bytes to check for encapsulation. I'll make some changes to see how that would look on Monday or Tuesday but let me know what y'all think.
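Restricting values to FileIO would make the length a metadata lookup, e.g. via os.fstat on the open descriptor (a sketch of the idea; note this reports the whole file's size, which matters if the buffer starts part-way through a file):

```python
import os
import tempfile

# Create a mock "large" source file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)
    path = f.name

raw = open(path, "rb", buffering=0)    # buffering=0 returns an io.FileIO
size = os.fstat(raw.fileno()).st_size  # length without reading any data
print(size)  # 4096
raw.close()
os.remove(path)
```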

@darcymason (Member):
> I think it would make sense to limit it to a FileIO buffer. This would allow us to call stat on the file name to get the length. If the goal is to keep large files out of memory (which it is for us), then it doesn't really make sense to support in memory streams anyways.

I was also thinking of saying we just start with a seek-able file handle anyway, it is still a big step ahead from where we are now.

I'm not sure about using stat on the file for the length, though. Someone may be sourcing their info from part-way through another file (e.g. a different DICOM file). Maybe the read should start within the open file handle where it is positioned, and if it doesn't read until end (length can be determined by seeking end, and tell()ing, then seeking back), then the caller would need to provide the length.
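The seek-and-tell measurement described here could look like the following (an illustrative helper, not the PR's final util; it measures from the caller's current position and restores it):

```python
import io

def remaining_length(buffer: io.BufferedIOBase) -> int:
    """Bytes from the buffer's current position to its end, position preserved."""
    start = buffer.tell()
    end = buffer.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    buffer.seek(start)
    return end - start

buf = io.BytesIO(b"0123456789")
buf.seek(3)  # pretend the caller positioned us part-way through another file
print(remaining_length(buf))  # 7
print(buf.tell())  # 3 -- the position is unchanged
```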

@mrbean-bremen (Member):
> I was also thinking of saying we just start with a seek-able file handle anyway, it is still a big step ahead from where we are now.

Completely agree, thanks!

* move buffer read code to a utils file
* create util to reset buffer position
* glean buffered value length from using buffer.tell and seeking
* fix existing tests
# a known tag shall only have the VR 'UN' if it has a length that
# exceeds the size that can be encoded in 16 bit - all other cases
# can be seen as an encoding error and can be corrected
if (
VR == VR_.UN
and not tag.is_private
and config.replace_un_with_known_vr
# we don't know the length and as such can't determine if the length
# exceeds 0xFFFF
and not is_buffered

Author:
question: should I call buffer_length to get the length here?

Member:
Since UN is not included in the 'bufferable' list, at this point I don't think this check needs to be here at all, since it won't make it past the first term ... well, at least if the check of allowed VRs is moved up to before this line

@darcymason (Member):
'main' is the right one to target.

> Can I backport this to a 2.x release once it gets accepted? Just thinking about how to get this into our codebase without having to wait for a 3.0.0 release and then upgrade all of our code.

Maybe I could be persuaded, but I'd like to stay with the philosophy that 2.x is in major-bug-fix-only mode, we have enough on our plates dealing with PRs and issues as it is. Would you not be able to patch 2.X locally and set up pip installs to your own local code-base? (e.g. pip install can be pointed to a file location).

@gnottobfly commented Oct 31, 2023

That's very reasonable - I don't blame you. I don't think it'd be an issue for us to use an internally patched version as long as we know 3.0.0 will contain the same functionality. We were just hoping to avoid maintaining a custom fork.

Have you had a chance to look at the latest code changes? I think it's in a pretty good state now. Also thinking about documentation - do you want me to write any docs for this feature?

@gnottobfly gnottobfly changed the title [WIP] Support buffers for OB DataElement Values [MGR] Support buffers for OB DataElement Values Oct 31, 2023
@gnottobfly gnottobfly changed the title [MGR] Support buffers for OB DataElement Values [MRG] Support buffers for OB DataElement Values Oct 31, 2023

@darcymason (Member) left a comment

Looks like it is getting close. I've got a few small comments.

And yes, we should have documentation -- release note for sure, and something for the 'examples' folder would be ideal -- an example of how to use with a mock large data source. I'd like to have that to 'exercise' the code myself before merging in, to fully understand it and play around with different scenarios.

@@ -702,6 +740,9 @@ def keyword(self) -> str:

def __repr__(self) -> str:
"""Return the representation of the element."""
if self.is_buffered:
return repr(self.value)

Member:
Would be good to have a test to cover this repr line as suggested by codecov.

Author:
Removed it in favor of deferring to str(self) below

* remove redundant code in __repr__
* remove redundant check in __init__

@gnottobfly (Author):
> and something for the 'examples' folder would be ideal -- an example of how to use with a mock large data source. I'd like to have that to 'exercise' the code myself before merging in, to fully understand it and play around with different scenarios.

Working on this next - I'm going to see how to best replicate our use case in a generic example.

@gnottobfly commented Nov 1, 2023

Here is the sample file:
https://github.com/pydicom/pydicom/assets/94397477/dbc3bfdf-d91f-41a2-ad3a-f6fcf32f385f

I see the pydicom-data project but wasn't sure if this was what you were looking for. Figured I'd put up a minimal example first and then see where you wanted me to go from there.

# set PixelData to the buffer - writing the file will copy the pixeldata from the file to the dicom file
# without reading the pixeldata into memory
dataset.PixelData = cine_pixeldata
dataset.save_as("ds.dcm", write_like_original=False)

Member:
With the transfer syntax set, you can use write_like_original = True, which is much better for conformance

Author:
I can dig more into it, but when I set this to True, I end up with an image that is all gray. Any intuition as to why that is?

Author:
Ah, I see in the program I'm using to view it (Horos), I get "unknown UID" - it seems that I'm not setting something.

Member:
You probably need SOP Class UID in the dataset itself, but it depends on the viewer

@@ -9,6 +9,8 @@

import base64
import copy
from dataclasses import dataclass

Member:
Unused import?

Comment on lines +453 to +455
raise ValueError(
f"Invalid VR: {self.VR}. Only the following VRs support buffers: {BUFFERABLE_VRS}."
)

Member:
The following is a bit clearer:
f"Elements with a VR of '{self.VR}' cannot be used with buffered values, supported VRs are: {BUFFERABLE_VRS}"

Utilities to help with reading/writing from python's BufferedIOBase.
"""
from contextlib import contextmanager
from collections.abc import Generator

Member:
from collections.abc import Generator, Iterator and remove import from below

@@ -1219,3 +1222,35 @@ def test_invalid_very_long_value_length(self, vr, value):
DataElement(0x00410001, vr, value, validation_mode=config.WARN)
with pytest.raises(ValueError, match=msg):
DataElement(0x00410001, vr, value, validation_mode=config.RAISE)


class TestBufferedDataElement:

Member:
Can you add tests for pickling/unpickling buffered elements?

Member:
I'd also like to know what happens if you use a buffer and the source is then moved/deleted

Author:
> Can you add tests for pickling/unpickling buffered elements?

Is this an important use case? I think we'd have to read the buffer into memory for that to work, since a BufferedReader isn't pickle-able.

> I'd also like to know what happens if you use a buffer and the source is then moved/deleted

If it's before we read the buffer, then the assertion in "buffer assertions" should raise an error on a closed stream. Otherwise, I think it'd be undefined behavior. I can write tests for these specifically if you'd like.
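What those two failure modes look like at the Python level (a sketch assuming a POSIX platform, not the PR's tests):

```python
import io
import os
import tempfile

# 1. A closed buffer: any read() raises ValueError, which is the error a
#    "buffer assertions" check could wrap with a friendlier message.
buf = io.BytesIO(b"pixel data")
buf.close()
try:
    buf.read()
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# 2. A deleted backing file: on POSIX an already-open handle still reads,
#    so "source deleted" can be silently fine rather than an error.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"pixel data")
    path = f.name
fh = open(path, "rb")
os.remove(path)
data = fh.read()
fh.close()
print(data)  # b'pixel data'
```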

Member:
I think it's more that I'd like to see if there's anywhere to put some nice exception messages about why it failed. Otherwise, don't worry about it.

gnottobfly added a commit to ButterflyNetwork/pydicom that referenced this pull request Nov 8, 2023
* support setting pixel data to a reference to a BufferedReader
* change version to 2.4.3+buffer-support
* See pydicom#1919

@darcymason (Member):
@gnottobfly, I'm just trying to catch up on this issue and see if we can get over the finish line... what remains? Perhaps the one test with a branch for Windows, a home for the example file (I'm okay with pydicom-data if that works), ...?

@gnottobfly commented Apr 16, 2024

@darcymason I'm so sorry that I never responded and this fell off the radar! I was actually just thinking of this again, and maybe a more elegant solution would be a utility function similar to dcmwrite, such as dcmwrite_pixeldata, that takes an existing dataset/dataset file and writes the pixel data to it. Perhaps this would help untangle some of the if/else logic on whether PixelData is an array of bytes or a Buffer.

Otherwise, I'm happy to come back and fix this up to make it mergeable. Let me know what you think and sorry again for dropping off for a while.

@darcymason (Member):
@gnottobfly, have you seen the work @scaramallion has been doing over the last few months? There have been some big changes in handling buffers and reworking the structure of pixel handlers etc. I suspect it should be easier to tie into that new work rather than try to fix this merge.

@gnottobfly (Author):
I haven't but I will take a look! Thanks for the heads up.
