Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write PixelData to DICOM Dataset on disk without loading the PixelData into memory #1913

Open
gnottobfly opened this issue Oct 18, 2023 · 6 comments

Comments

@gnottobfly
Copy link

gnottobfly commented Oct 18, 2023

I have a Python program that creates a .dcm file. It uses pydicom to write the appropriate tags and save the file. I have long mp4 videos that when uncompressed uses about 10gb memory. I'm trying to avoid loading the entire video into memory.

Does Pydicom support setting the PixelData property to a file pointer or stream?

@darcymason
Copy link
Member

darcymason commented Oct 19, 2023

Does Pydicom support setting the PixelData property to a file pointer or stream?

No, that is beyond the scope of pydicom directly. Because DICOM doesn't have "records" of fixed size, the only way to know where a data element begins in file is to write out all the previous ones.

Here are some thoughts (untested) on a workaround:

Create a new dcmwrite-like method with extra parameters to the call, e.g. a Tag (pixel data in this case, 0x7fe00010) and an Iterable[bytes] generator to give chunks of pixel data at a time. The new method would create a Dataset, called e.g. pre_tag_ds (using slicing, i.e. pre_tag_ds = ds[:tag]) of all the data elements prior to the given Tag, and likewise one for tags after (if any), post_tag_ds = ds[tag+1:] . Then the method would open the output file, call dcmwrite with the opened file and pre_tag_ds. Then the hardest part: the code would have to write the data element tag, VR (if applicable) and length (has to be known ahead, or save the location and write it afterwards), then the pixel data in chunks from the generator, then call dcmwrite again with the (still open file) to write post_tag_ds.

One caution for the generator, is that DICOM, IIUC, would require encapsulation for the data -- if you search you will find some previous issues/discussions about that topic; pydicom has some support for encapsulation.

I think that could work. If someone wrote something like that, I think we could incorporate it into pydicom.

@darcymason
Copy link
Member

A couple of other thoughts:

  • the pre_tag_ds should also have file_meta copied from the original dataset.
  • for writing the one data element, that might be a little easier by just using a one with a blank value (0-length value) and then storing the file_tell to backtrack and write the correct length later.

I'm also musing about a completely different path -- add a callable (generator) to the DataElement class, something like generate_value. The user would have to set the callable to their own generator which returns a chunk of bytes at a time, and raise StopIteration when complete. Then write_data_element would have to be modified a little, to check for the callable and loop to write the .value from a generator rather than just assuming it is bytes in memory. This still leaves writing the correct length in the data element header, but write_data_element could keep track of the file position and seek back to write that after all the total length of the chunks had been tallied.

@gnottobfly
Copy link
Author

gnottobfly commented Oct 19, 2023

Thanks for the quick reply!

I'm also musing about a completely different path -- add a callable (generator) to the DataElement class, something like generate_value. The user would have to set the callable to their own generator which returns a chunk of bytes at a time, and raise StopIteration when complete. Then write_data_element would have to be modified a little, to check for the callable and loop to write the .value from a generator rather than just assuming it is bytes in memory. This still leaves writing the correct length in the data element header, but write_data_element could keep track of the file position and seek back to write that after all the total length of the chunks had been tallied.

This is what I'm attempting currently! We know the length of bytes ahead of time because we can look at the filesize of the pixel data on disk. So far I've made a class that implements __len__ to read the filesize from disk. The part I'm stuck on is the writing:

def write_OBvalue(fp: DicomIO, elem: DataElement) -> None:
    """Write a data_element with VR of 'other byte' (OB)."""
    # len(elem.value) is the filesize of the pixel data file
    if len(elem.value) % 2:
        # Pad odd length values
        fp.write(cast(bytes, elem.value))
        fp.write(b'\x00')
    else:
        fp.write(cast(bytes, elem.value))

It seems this function would need to change a bit as you said to see if there is a value_generator

@gnottobfly
Copy link
Author

Here is the hacked function I came up with:

def write_OBvalue(fp: DicomIO, elem: DataElement) -> None:
    """Write a data_element with VR of 'other byte' (OB)."""
    READ_SIZE = 1024*500
    if isinstance(elem.value, BufferedReader):
        # write chunks at a time
        while (bytes_ := elem.value.read(READ_SIZE)):
            fp.write(bytes_)

            if len(elem.value) % 2:
                fp.write(b'\x00')
    else:
        if len(elem.value) % 2:
            # Pad odd length values
            fp.write(cast(bytes, elem.value))
            fp.write(b'\x00')
        else:
            fp.write(cast(bytes, elem.value))

Do you think there is a way to support this kind of use case or would this be out of scope? Open to API suggestions too!

@darcymason
Copy link
Member

darcymason commented Oct 19, 2023

I think that looks pretty good, and I wouldn't mind incorporating that in pydicom, but perhaps without necessarily knowing the length. I wondered about doing it for any element but practically it should only be OB so I'm okay with that.

There would have to be checks in other places in code, like in __str__ and __repr__ to put some kind of placeholder representation for the value, other than that I don't foresee any problems.

@gnottobfly
Copy link
Author

gnottobfly commented Oct 24, 2023

#1913 (comment)

This is the approach I'm going to go with. I'm thinking we can also make it completely compatible with all existing code by making the getter for value return the entire content of the buffer if the value is a buffer.

Then code can opt-in to the iterator interface by checking DataElement for is_buffered. That should also achieve " I wondered about doing it for any element".

Thanks for being open to the change and for the guidance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants