
[MRG] Support buffers for OB DataElement Values #1919

Open · wants to merge 26 commits into base: main

Conversation

@gnottobfly commented Oct 25, 2023

Support buffers for OB DataElement Values

Putting up an early PR to get feedback before investing too much time. Comments on the approach or API welcome!

This allows setting a DataElement's value to a buffer. The motivation behind this is to write a dataset to disk without copying all of a large image/video into memory to do so. See #1913 for the discussion that led to this. For now I'm only targeting byte values as those are the ones that are expected to be large.

Describe the changes

  • Allows assigning the DataElement's value to a BufferedIOBase
  • Adds a new method value_generator that yields chunks from the buffer
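A rough sketch of the chunked-read pattern this enables, independent of pydicom (the helper name and defaults here are illustrative, not the final API):

```python
import io
from collections.abc import Iterator

def chunked_reads(buffer: io.BufferedIOBase, chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield successive chunks from the buffer until it is exhausted."""
    while chunk := buffer.read(chunk_size):
        yield chunk

# A BytesIO stands in for a buffered reader over a large on-disk file.
src = io.BytesIO(b"\x00\x01" * 10_000)  # 20,000 bytes of mock pixel data
chunks = list(chunked_reads(src))
print(len(chunks), sum(len(c) for c in chunks))  # 3 20000
```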

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

* Allows assigning the DataElement's value to a BufferedIOBase
* Adds a new method value_generator that yields chunks from the buffer
@gnottobfly gnottobfly changed the title Support buffers for OB DataElement Values [WIP] Support buffers for OB DataElement Values Oct 25, 2023

codecov bot commented Oct 25, 2023

Codecov Report

Merging #1919 (f47c4bf) into main (d59c339) will increase coverage by 0.01%.
Report is 4 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head f47c4bf differs from pull request most recent head 894fff8. Consider uploading reports for the commit 894fff8 to get more accurate results

@@            Coverage Diff             @@
##             main    #1919      +/-   ##
==========================================
+ Coverage   97.74%   97.75%   +0.01%     
==========================================
  Files          63       64       +1     
  Lines       10739    10802      +63     
==========================================
+ Hits        10497    10560      +63     
  Misses        242      242              
Files Coverage Δ
src/pydicom/dataelem.py 98.50% <100.00%> (+0.07%) ⬆️
src/pydicom/filewriter.py 98.11% <100.00%> (+0.10%) ⬆️
src/pydicom/util/buffers.py 100.00% <100.00%> (ø)
src/pydicom/valuerep.py 99.27% <100.00%> (+<0.01%) ⬆️


@darcymason (Member) left a comment
Very quick initial review ... I've posted a few comments but will need to absorb this for a while longer for the bigger structural issues.

I'm trying to think through how this would be used in practice. For example, here the full value is read in if value is accessed, but that seems to defeat the purpose. However, if in practice people just substitute in a generator value for a 'dummy' small value just before writing then maybe that can be avoided.

I thought briefly about suggesting a subclass DataElementBuffered rather than updating a bunch of code within DataElement, but after a little thought I'm thinking one class is probably better.

I'll try to go over a little more in the next couple of days.

"""Input parameter used to store how many bytes have been read from
a buffered value"""
bytes_read: int = 0


def __init__(
self,

Member:
On BufferInfo class, why not just store bytes_read directly in DataElement?

Author:
I was thinking of the value_generator function returning the number of bytes read, so an output parameter made sense here. I think it's most likely unnecessary.

@@ -160,6 +162,20 @@ class DataElement:
maxBytesToDisplay = 16
showVR = True
is_raw = False
is_buffered = False
"""
Tracks whether the data element value is a buffer. If so, calling self.value will read the entire

Member:
These are class attributes meant to apply to all instances of the class (is_raw is used for all instances, because RawDataElement sets its own is_raw).

In any case, I think it better to make it a read-only property is_buffered, which checks the type of _value. Then they can never get out of synch. E.g. see my comment on changing the value property setter.

# handle a buffered value - ie, a buffer pointing to a file on disk
if isinstance(value, BufferedIOBase):
if VR not in VR_.OB:
raise ValueError(f"Invalid VR: {self.VR}. Only the OB VR supports buffered values.")

Member:
Should just be != here instead of not in.

Author:
I updated this with a tuple but good catch!

@@ -449,6 +482,48 @@ def value(self, val: Any) -> None:

self._value = self._convert_value(val)


def value_generator(self, *, buffer_info: BufferInfo = None, chunk_size = 1024) -> Iterator[bytes]:
"""Consume data from a buffered value. The value must be a buffer, the buffer must be

Member:
I think probably at least 8K chunk default is better. We had this come up with pydicom's undefined length reads before (#436).

But I'm also now wondering whether this generator is needed. Any code using buffered reading can just directly use value after checking it is a buffered reader. I need to ponder a little more ...

if self.is_buffered:
self._value = cast(BufferedIOBase, self._value).read()
self.is_buffered = False

return self._value

Member:
I think we need to update the value setter code as well - value could be set to a buffer after initial creation. The logic to check that on __init__ probably belongs in the value setter, actually, where it can handle both cases.

Author:
Good thinking - will update.

@gnottobfly (Author):
Thanks for the initial feedback! I had similar thoughts about subclassing as well, but felt it would end up leading to more complicated code.

The reason I went with the value getter reading the entire buffer is that I wasn't sure whether it made sense to make all code in the codebase buffer-aware, or to allow a more gradual upgrade path instead. For our particular use case, since we only need to write the value to a file, it touches a pretty small amount of code. But what should the expected behavior be if you assign PixelData to a buffer and then try to compress/decompress it or access pixel_array, for instance?

* use @Property for dynamic is_buffered
* use value setter to handle buffered value
* remove BufferInfo class
* remove default buffer read behavior

@gnottobfly (Author):
After thinking about it a bit more, I think you are right on not reading the full buffer by default, as that'll be hard to code against and may inevitably lead to reading in the whole thing by accident. From your perspective as a maintainer, is it okay if certain operations raise an exception when the code isn't set up to deal with a buffered value?

# memory usage is in bytes
limit = MEGABYTE * 400
else:
pytest.skip("This test is not setup to run on this platform")

Author:
TODO: I need to find what the other platform's unit is for this test. I'm also open to other implementations of this test, but it was the only way I could think of to ensure we didn't read the file into memory.
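For reference, the unit discrepancy lives in resource.getrusage: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, and the resource module is absent on Windows. A minimal normalizing helper (a sketch, not the PR's test code):

```python
import sys

def max_rss_bytes() -> int:
    """Peak resident set size of this process, normalized to bytes."""
    import resource  # POSIX-only module; a Windows branch would need another approach
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # getrusage reports ru_maxrss in kilobytes on Linux but in bytes on macOS.
    return rss if sys.platform == "darwin" else rss * 1024

print(max_rss_bytes() > 0)  # True on any POSIX platform
```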

Member:
Still needs to be done?

Comment on lines 657 to 660
curr_pos = fp.tell()
fp.seek(vr_length_seek_pos)
fp.write_UL(elem.bytes_read_from_buffer)
fp.seek(curr_pos)

Author:
This feels pretty hacky - I'd love to get some thoughts on how to handle this. We need to write the length of the data before we know it, and I can't see how to avoid backtracking short of calling stat on the original file the buffer came from.
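The backtracking in question can be shown in isolation: write a 4-byte placeholder where the length field goes, stream the chunks, then seek back and patch the real length (a generic sketch, not pydicom's writer; the little-endian unsigned long stands in for what write_UL emits):

```python
import io
import struct

def write_with_patched_length(fp: io.BufferedIOBase, chunks) -> int:
    """Stream chunks after a 4-byte placeholder, then backfill the real length."""
    length_pos = fp.tell()
    fp.write(struct.pack("<L", 0))      # placeholder for the length field
    total = 0
    for chunk in chunks:
        total += fp.write(chunk)
    end_pos = fp.tell()
    fp.seek(length_pos)
    fp.write(struct.pack("<L", total))  # patch in the real length
    fp.seek(end_pos)                    # restore position for subsequent writes
    return total

fp = io.BytesIO()
n = write_with_patched_length(fp, [b"abc", b"defgh"])
print(n)  # 8
```

This is exactly the pattern that breaks on non-seekable targets such as gzip file-likes, which is the concern raised in the replies.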

Member:
I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer. Also note that we don't want to seek backwards at all as this breaks certain use cases where fp is a stream or a zip file I think (I don't remember what use case it was exactly).

Edit: Ok, the problem was with writing gzip file-like (see #753). There was also an old issue about writing to a stream (#154).

Author:
> I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer

The problem is that I'm not writing the data to the buffer but instead to the file directly. The goal is to avoid reading the entire buffered element into memory and instead to transfer the data directly from the element (which contains a buffered reader) to the dicom file.

So this means that we don't know the length until we read all of the data (and write it to the file).

For a little context, the goal is to avoid reading large uncompressed files into memory.

Author:
> Edit: Ok, the problem was with writing gzip file-like (see #753). There was also an old issue about writing to a stream (#154).

That's quite the conundrum... any thoughts on how to tackle it then without reading the entirety of PixelData into a memory buffer?

Member:
> The problem is that I'm not writing the data to the buffer but instead to the file directly.

Ah right, I should have seen that - thanks!

> That's quite the conundrum... any thoughts on how to tackle it then without reading the entirety of PixelData into a memory buffer?

Right now I have no idea. May have to think about it... How easy would it be to configure that no buffering is used? That would still allow the use case where seeking back is not possible (in case we find no other solution).

@gnottobfly (Author):
@darcymason @mrbean-bremen I was thinking about the seeking issue and I think I have a solution. Instead of supporting any IO buffer, I think it would make sense to limit it to a FileIO buffer. This would allow us to call stat on the file name to get the length. If the goal is to keep large files out of memory (which it is for us), then it doesn't really make sense to support in-memory streams anyway.

If someone would like to read pixel data from a non-seekable stream like a socket, then they can write the file to a temporary file first.

It'd also make the code a little easier to manage as we wouldn't need to monkey around with keeping track of the first 4 bytes to check for encapsulation. I'll make some changes to see how that would look on Monday or Tuesday but let me know what y'all think.
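Restricting values to FileIO would make the length a metadata lookup, e.g. via os.fstat on the open descriptor (a sketch of the idea; note this reports the whole file's size, which matters if the buffer starts part-way through a file):

```python
import os
import tempfile

# Create a mock "large" source file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)
    path = f.name

raw = open(path, "rb", buffering=0)    # buffering=0 returns an io.FileIO
size = os.fstat(raw.fileno()).st_size  # length without reading any data
print(size)  # 4096
raw.close()
os.remove(path)
```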

@darcymason (Member):
> I think it would make sense to limit it to a FileIO buffer. This would allow us to call stat on the file name to get the length. If the goal is to keep large files out of memory (which it is for us), then it doesn't really make sense to support in memory streams anyways.

I was also thinking of saying we just start with a seek-able file handle anyway, it is still a big step ahead from where we are now.

I'm not sure about using stat on the file for the length, though. Someone may be sourcing their info from part-way through another file (e.g. a different DICOM file). Maybe the read should start within the open file handle where it is positioned, and if it doesn't read until end (length can be determined by seeking end, and tell()ing, then seeking back), then the caller would need to provide the length.
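The seek-and-tell measurement described here could look like the following (an illustrative helper, not the PR's final util; it measures from the caller's current position and restores it):

```python
import io

def remaining_length(buffer: io.BufferedIOBase) -> int:
    """Bytes from the buffer's current position to its end, position preserved."""
    start = buffer.tell()
    end = buffer.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    buffer.seek(start)
    return end - start

buf = io.BytesIO(b"0123456789")
buf.seek(3)  # pretend the caller positioned us part-way through another file
print(remaining_length(buf))  # 7
print(buf.tell())  # 3 -- the position is unchanged
```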

@mrbean-bremen (Member):
> I was also thinking of saying we just start with a seek-able file handle anyway, it is still a big step ahead from where we are now.

Completely agree, thanks!

* move buffer read code to a utils file
* create util to reset buffer position
* glean buffered value length from using buffer.tell and seeking
* fix existing tests
# a known tag shall only have the VR 'UN' if it has a length that
# exceeds the size that can be encoded in 16 bit - all other cases
# can be seen as an encoding error and can be corrected
if (
VR == VR_.UN
and not tag.is_private
and config.replace_un_with_known_vr
# we don't know the length and as such can't determine if the length
# exceeds 0xFFFF
and not is_buffered

Author:
question: should I call buffer_length to get the length here?

Member:
Since UN is not included in the 'bufferable' list, at this point I don't think this check needs to be here at all, since it won't make it past the first term ... well, at least if the check of allowed VRs is moved up to before this line

@darcymason (Member):
'main' is the right one to target.

> Can I backport this to a 2.x release once it gets accepted? Just thinking about how to get this into our codebase without having to wait for a 3.0.0 release and then upgrade all of our code.

Maybe I could be persuaded, but I'd like to stay with the philosophy that 2.x is in major-bug-fix-only mode, we have enough on our plates dealing with PRs and issues as it is. Would you not be able to patch 2.X locally and set up pip installs to your own local code-base? (e.g. pip install can be pointed to a file location).

@gnottobfly commented Oct 31, 2023

That's very reasonable - I don't blame you. I don't think it'd be an issue for us to use an internally patched version as long as we know 3.0.0 will contain the same functionality. We were just hoping to avoid maintaining a custom fork.

Have you had a chance to look at the latest code changes? I think it's in a pretty good state now. Also thinking about documentation - do you want me to write any docs for this feature?

@gnottobfly gnottobfly changed the title [WIP] Support buffers for OB DataElement Values [MGR] Support buffers for OB DataElement Values Oct 31, 2023
@gnottobfly gnottobfly changed the title [MGR] Support buffers for OB DataElement Values [MRG] Support buffers for OB DataElement Values Oct 31, 2023

@darcymason (Member) left a comment

Looks like it is getting close. I've got a few small comments.

And yes, we should have documentation -- release note for sure, and something for the 'examples' folder would be ideal -- an example of how to use with a mock large data source. I'd like to have that to 'exercise' the code myself before merging in, to fully understand it and play around with different scenarios.

@@ -702,6 +740,9 @@ def keyword(self) -> str:

def __repr__(self) -> str:
"""Return the representation of the element."""
if self.is_buffered:
return repr(self.value)

Member:
Would be good to have a test to cover this repr line as suggested by codecov.

Author:
Removed it in favor of deferring to str(self) below

* remove redundant code in __repr__
* remove redundant check in __init__

@gnottobfly (Author):
> and something for the 'examples' folder would be ideal -- an example of how to use with a mock large data source. I'd like to have that to 'exercise' the code myself before merging in, to fully understand it and play around with different scenarios.

Working on this next - I'm going to see how to best replicate our use case in a generic example.

@gnottobfly commented Nov 1, 2023

Here is the sample file:
https://github.com/pydicom/pydicom/assets/94397477/dbc3bfdf-d91f-41a2-ad3a-f6fcf32f385f

I see the pydicom-data project but wasn't sure if this was what you were looking for. Figured I'd put up a minimal example first and then see where you wanted me to go from there.

# set PixelData to the buffer - writing the file will copy the pixeldata from the file to the dicom file
# without reading the pixeldata into memory
dataset.PixelData = cine_pixeldata
dataset.save_as("ds.dcm", write_like_original=False)

Member:
With the transfer syntax set, you can use write_like_original = True, which is much better for conformance

Author:
I can dig more into it, but when I set this to True, I end up with an image that is all gray. Any intuition as to why that is?

Author:
Ah, I see in the program I'm using to view it (Horos), I get "unknown UID" - it seems that I'm not setting something.

Member:
You probably need SOP Class UID in the dataset itself, but it depends on the viewer

@@ -9,6 +9,8 @@

import base64
import copy
from dataclasses import dataclass

Member:
Unused import?

Comment on lines +453 to +455
raise ValueError(
f"Invalid VR: {self.VR}. Only the following VRs support buffers: {BUFFERABLE_VRS}."
)

Member:
The following is a bit clearer:
f"Elements with a VR of '{self.VR}' cannot be used with buffered values, supported VRs are: {BUFFERABLE_VRS}"

Utilities to help with reading/writing from python's BufferedIOBase.
"""
from contextlib import contextmanager
from collections.abc import Generator

Member:
from collections.abc import Generator, Iterator and remove import from below

@@ -1219,3 +1222,35 @@ def test_invalid_very_long_value_length(self, vr, value):
DataElement(0x00410001, vr, value, validation_mode=config.WARN)
with pytest.raises(ValueError, match=msg):
DataElement(0x00410001, vr, value, validation_mode=config.RAISE)


class TestBufferedDataElement:

Member:
Can you add tests for pickling/unpickling buffered elements?

Member:
I'd also like to know what happens if you use a buffer and the source is then moved/deleted

Author:
> Can you add tests for pickling/unpickling buffered elements?

Is this an important use case? I think we'd have to read the buffer into memory for that to work, since a BufferedReader isn't pickle-able.

> I'd also like to know what happens if you use a buffer and the source is then moved/deleted

If it's before we read the buffer, then the assertion in "buffer assertions" should raise an error on a closed stream. Otherwise, I think it'd be undefined behavior. I can write tests for these specifically if you'd like.
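What those two failure modes look like at the Python level (a sketch assuming a POSIX platform, not the PR's tests):

```python
import io
import os
import tempfile

# 1. A closed buffer: any read() raises ValueError, which is the error a
#    "buffer assertions" check could wrap with a friendlier message.
buf = io.BytesIO(b"pixel data")
buf.close()
try:
    buf.read()
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# 2. A deleted backing file: on POSIX an already-open handle still reads,
#    so "source deleted" can be silently fine rather than an error.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"pixel data")
    path = f.name
fh = open(path, "rb")
os.remove(path)
data = fh.read()
fh.close()
print(data)  # b'pixel data'
```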

Member:
I think it's more that I'd like to see if there's anywhere to put some nice exception messages about why it failed. Otherwise, don't worry about it.

gnottobfly added a commit to ButterflyNetwork/pydicom that referenced this pull request Nov 8, 2023
* support setting pixel data to a reference to a BufferedReader
* change version to 2.4.3+buffer-support
* See pydicom#1919

@darcymason (Member):
@gnottobfly, I'm just trying to catch up on this issue and see if we can get over the finish line... what remains? Perhaps the one test with a branch for Windows, a home for the example file (I'm okay with pydicom-data if that works), ...?

@gnottobfly commented Apr 16, 2024

@darcymason I'm so sorry that I never responded and this fell off the radar! I was actually just thinking of this again, and maybe a more elegant solution would be a utility function similar to dcmwrite, such as dcmwrite_pixeldata, that takes an existing dataset/dataset file and writes the pixel data to it. Perhaps this would help untangle some of the if/else logic on whether PixelData is an array of bytes or a Buffer.

Otherwise, I'm happy to come back and fix this up to make it mergeable. Let me know what you think and sorry again for dropping off for a while.

@darcymason (Member):
@gnottobfly, have you seen the work @scaramallion has been doing over the last few months? There have been some big changes in handling buffers and reworking the structure of pixel handlers etc. I suspect it should be easier to tie into that new work rather than try to fix this merge.

@gnottobfly (Author):
I haven't but I will take a look! Thanks for the heads up.
