
Integrated handling of decompression and Dataset state #525

Closed
darcymason opened this issue Dec 10, 2017 · 9 comments

@darcymason
Member

Looking for a solution to one of the last remaining problems before version 1.0. As discussed in #521, decompression will have side-effects that change other data elements. Pydicom should keep track of that state and do some updating and checking on writing files.

A simple use case for pydicom might be to decompress a set of files and write them back out again:

# Proposed
for filename in filenames:
    ds = pydicom.dcmread(filename)
    ds.decompress()
    ds.save_as(filename + "2")

I am submitting a pull request to separate out the decompress() step, which did not previously exist (until now, the only way to decompress was to ask for the pixels via ds.pixel_array).

At the moment, the state of pixel_array and PixelData are not connected, so in the past the user had to use ds.PixelData = pix.tobytes() to set the pixel data before writing the file. Here I propose that pydicom take care of that step implicitly, since decompress() is an in-place modification of the dataset.
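A minimal sketch of the idea, using a hypothetical stand-in for pydicom's Dataset (zlib stands in for a real DICOM codec, and the transfer syntax is modeled as a flat attribute, whereas in pydicom it lives in the file meta information — none of this is pydicom's actual implementation):

```python
import zlib

# Hypothetical Dataset stand-in: an in-place decompress() replaces
# PixelData and updates the transfer syntax together, so a subsequent
# save writes a consistent, uncompressed dataset.
UNCOMPRESSED_UID = "1.2.840.10008.1.2.1"  # Explicit VR Little Endian

class Dataset:
    def __init__(self, pixel_data, transfer_syntax_uid):
        self.PixelData = pixel_data
        self.TransferSyntaxUID = transfer_syntax_uid

    def decompress(self):
        """In place: PixelData and TransferSyntaxUID change as one operation."""
        if self.TransferSyntaxUID != UNCOMPRESSED_UID:
            self.PixelData = zlib.decompress(self.PixelData)
            self.TransferSyntaxUID = UNCOMPRESSED_UID

raw = bytes(range(16))
ds = Dataset(zlib.compress(raw), "1.2.840.10008.1.2.4.70")  # JPEG Lossless
ds.decompress()
assert ds.PixelData == raw
assert ds.TransferSyntaxUID == UNCOMPRESSED_UID
```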

darcymason added a commit that referenced this issue Dec 10, 2017
…ansfer Syntax is uncompressed type. Related to #525.
@darcymason
Member Author

Sorry to belabor this, but I'm stuck on the side-effect thing, and need it solved for version 1.0. To take the above a little further: if code accesses any data element, for example ds.ImagePositionPatient, or even modifies it, there is an expectation that other data elements remain unchanged. I don't think pixel data should be any different. With an explicit Dataset.decompress(), it is an operation on the dataset, not on an individual data element, so having multiple data elements change is more natural, and can be made clear in the documentation for that method.

Currently in the repository, accessing pixel_array does have side-effects, and I was previously encouraging that, but am now questioning it. I think pixel_array should throw an exception if side-effects would be needed, asking the user to first decompress. At least this behavior could be decided by a config option.
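A sketch of what that config-gated behavior could look like, on a hypothetical stand-in class (the attribute names and the config flag are invented for illustration; this is not pydicom's actual API):

```python
# Hypothetical: pixel_array raises if decompression (a side-effect)
# would be needed, unless an imagined config flag explicitly allows it.
class Dataset:
    allow_implicit_decompress = False  # imagined config option

    def __init__(self, compressed):
        self._compressed = compressed

    @property
    def pixel_array(self):
        if self._compressed and not self.allow_implicit_decompress:
            raise RuntimeError(
                "Pixel Data is compressed; call Dataset.decompress() first"
            )
        return [0] * 4  # stand-in for the converted numpy array

ds = Dataset(compressed=True)
try:
    _ = ds.pixel_array
except RuntimeError as exc:
    message = str(exc)
assert "decompress" in message
```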

The PR would need to separate things out a little further to make that happen. But I'm interested in comments first before coding that.

@rhaxton
Contributor

rhaxton commented Dec 12, 2017

@darcymason Yes, I agree that accessing the pixel_array should not change anything else. This keeps previous behavior. I also like that the state of pixel_array and PixelData are not connected.

To me, it is clear that PixelData is just the content of the original DICOM source, and pixel_array is just a method to get a numpy array of the PixelData. I would suggest that decompression should remain implicit, as I doubt that users would like to have to manage this on their client end (I wouldn't). I am already well trained that if I need to change pixel values and save them to a file, I have to do more than just x.pixel_array = numpy.zeros((10, 10)). There are many DICOM tags I need to update, and I don't expect the library to do that for me. That is a job for a higher-level library.

I would encourage a return to the previous behaviour where pixel_array and PixelData were two different things and pixel_array had no effect on PixelData.

@mrbean-bremen
Member

I agree with @rhaxton. I have thought a little about the possible use cases, and there are really not many. The most common is probably just accessing the pixel data, as described by @rhaxton, to do some image processing or similar - in this case handling the data independently of the DICOM file is correct, I think. If the result is to be written back to a DICOM file, this has to be done separately.
The other major case is just the decompression of the DICOM data in the DICOM file, as may be needed for example by a storescp implementation, where Dataset.decompress fits nicely.
I couldn't think of any other use cases where pydicom should be involved directly - as @rhaxton mentioned, this is the job of libraries like pynetdicom.

@darcymason
Member Author

Thanks for your comments.

I'm okay with keeping pixel_array and PixelData separate for the usual case of accessing pixel_array, but @rhaxton, I'm not clear on your opinion of the decompress() method, which does link them, changing the PixelData on writing. Since this is new, it does not change any behavior that anyone is used to. An alternative could be a decompressed() method, which returns a completely new dataset. Since the pixel data is being converted in either case, duplicating the other data elements does not take up much memory.

We'll have to be clear in the pixel_array documentation that for three-plane color it always returns RGB, even though the dataset PhotometricInterpretation says it is YBR. I realized that anyone checking the PI before accessing pixel_array, as opposed to after, would have that problem anyway.
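For reference, the YBR_FULL to RGB conversion in question (JFIF convention) can be written per-pixel in plain Python; pydicom's pixel handlers do this vectorized over whole numpy arrays, so this is for illustration only:

```python
# Per-pixel YBR_FULL -> RGB conversion (JFIF convention), for
# illustration; real handlers operate on whole numpy arrays at once.
def ybr_full_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(r), clamp(g), clamp(b)

# Neutral chroma (Cb = Cr = 128) maps to an equal-valued gray pixel
assert ybr_full_to_rgb(128, 128, 128) == (128, 128, 128)
assert ybr_full_to_rgb(0, 128, 128) == (0, 0, 0)
```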

@mrbean-bremen
Member

We'll have to be clear in the pixel_array documentation that for color three-plane it always returns RGB, even though the dataset PhotometricInterpretation says it is YBR.

Agreed. Anyway, I think this is the wanted behavior.

@rhaxton
Contributor

rhaxton commented Dec 15, 2017

@darcymason
How about this for a thought...

pixel_array always returns a decompressed numpy copy of the data in PixelData (or throws an exception) [this is the 0.9 behavior].
The DICOM PixelData is never changed by anything you do to the pixel_array.
Open question:

x = dataset.pixel_array
assert x[0] == 1
x[:] = 0
assert x[0] == 0
x = dataset.pixel_array
assert x[0] == 1  # or 0?

Right now, it is 0 because the pixel_array is cached with the dataset. If you wanted to get the original data again, you would have to manually delete/clear either dataset._pixel_array or dataset._pixel_id (or re-read the file). I'm OK with this as I do re-read the file if I want the original data. This is different from other attributes that return copies:

s = x.PatientSex
assert s == 'M'
s = 'O'
s = x.PatientSex
assert s == 'O' # fails with s == 'M'

I don't think that PixelData should be changed on writing after the numpy array has been modified (as the other attributes are not like PatientSex above). However, I think it would be a small improvement to allow:

x.PixelData = numpy.ones((10,10))

That just implicitly calls the .tobytes() function for you. But, then again, people will ask why we don't automatically update the whole pixel module to match (Rows, Cols, BitsStored, etc.).
I feel sure people are generally happy with the way things are interface-wise and they are used to handling most such things themselves.
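The implicit-tobytes() idea above could be sketched roughly as follows, with the stdlib array module standing in for numpy (this is a hypothetical setter, not pydicom's actual PixelData handling):

```python
from array import array

# Hypothetical PixelData setter that implicitly calls tobytes() on
# array-likes; stdlib array.array stands in for a numpy array here.
class Dataset:
    @property
    def PixelData(self):
        return self._pixel_data

    @PixelData.setter
    def PixelData(self, value):
        # Accept raw bytes, or anything exposing tobytes()
        # (numpy arrays and array.array both do).
        self._pixel_data = value.tobytes() if hasattr(value, "tobytes") else value

ds = Dataset()
ds.PixelData = array("H", [1, 0, 1])          # implicit tobytes()
assert ds.PixelData == array("H", [1, 0, 1]).tobytes()
ds.PixelData = b"\x01\x00"                    # raw bytes still work
assert ds.PixelData == b"\x01\x00"
```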

@darcymason
Member Author

Right now, it is 0 because the pixel_array is cached with the dataset. If you wanted to get the original data again, you would have to manually delete/clear either dataset._pixel_array or dataset._pixel_id

I was just thinking about that behavior in the last couple of days too. It was cached like that to avoid repeating expensive conversions (e.g. byte-swapping, if needed) every time the pixel data was accessed. That caching could be dropped, with documentation noting that it is better to keep your own reference rather than accessing pixel_array repeatedly.
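The caching behavior in question can be sketched like this, with plain lists standing in for numpy arrays and an invented private attribute playing the role of the cache (not pydicom's actual internals):

```python
# Hypothetical sketch of pixel_array caching: the converted array is
# cached on the dataset, so mutations persist across reads until the
# (hypothetical) cache attribute is cleared.
class Dataset:
    def __init__(self, pixel_bytes):
        self.PixelData = pixel_bytes
        self._pixel_array = None  # the cache

    @property
    def pixel_array(self):
        if self._pixel_array is None:
            # the expensive conversion step (byte-swapping etc.)
            self._pixel_array = list(self.PixelData)
        return self._pixel_array

ds = Dataset(bytes([1, 2, 3]))
x = ds.pixel_array
x[0] = 0
assert ds.pixel_array[0] == 0   # cached: the mutation is still visible
ds._pixel_array = None          # clearing the cache forces re-conversion
assert ds.pixel_array[0] == 1
```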

This is different from other attributes that return copies:

Note that is not true for mutables:

>>> ds.PixelSpacing
['0.661468', '0.661468']
>>> spacing = ds.PixelSpacing
>>> spacing[0] = 42
>>> ds.PixelSpacing
['42.0', '0.661468']

For both cases, that is the way python works for any mutables and immutables. The numpy array would normally fall in the mutable camp, but pixel_array is behaving like an immutable.

Another thought (two related ones, actually) I had recently, just to confuse this even more ;-) The current behavior does not work well for very large pixel data (see the first comment in #139; that user has > 1 GB files), because we are keeping two copies of the pixels. The second, related point is that if you then try to avoid loading pixels into memory using memmap, the only thing in the DICOM file to map to is the original PixelData. How would a separate copy be implemented? So in the memmap scenario, any changes would edit the pixel data in place.

Using the python/numpy buffer protocol, for non-compressed images we could point the numpy array to the pixel bytes directly and have only one copy. Numpy can also transparently deal with byte order without actually moving the bytes around. And then memmap would work in exactly the same way as fully loaded pixels. How to use memmap and deal with compressed formats, I'm not sure.
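The zero-copy idea can be demonstrated with the stdlib alone: a typed memoryview over the raw pixel bytes shares memory with them, so no second copy exists. (numpy's frombuffer with a byte-order-aware dtype such as '<u2' behaves the same way, and is what a real implementation would likely use.)

```python
import struct

# Buffer-protocol demonstration: a typed view over raw pixel bytes
# shares the underlying memory, so uncompressed PixelData would need
# no second copy.
pixel_data = struct.pack("=4H", 10, 20, 30, 40)  # four native-order 16-bit pixels
view = memoryview(pixel_data).cast("H")          # zero-copy typed view
assert list(view) == [10, 20, 30, 40]
assert view.obj is pixel_data                    # same underlying buffer
```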

@mrbean-bremen
Member

mrbean-bremen commented Dec 16, 2017

The current behavior does not work well for very large pixel data

The point about large pixel data is a good one, but I would leave that for a post-1.0 version. Using memmap for the original data is probably a good idea, but for compressed data a memory copy would be needed anyway, in my opinion. Another approach regarding large data sets (which are usually multi-frame) would be the ability to read single frames from multi-frame data - which is trivial in the case of uncompressed data, but certainly more demanding in the case of compressed data. Also something to think about after 1.0...
Update: the last statement was too pessimistic - already handled by @hackermd!

About mutable/immutable pixel data:
I also think that the original data should not be changed in the above scenario, only if using decompress() on the whole data set, though that is not a strong opinion.

Regarding decompress() vs decompressed():
decompressed() feels more natural in this context, but decompress() may be more convenient in the scenario where you just want to replace the original data set, and also needs less memory. Maybe implement both, or implement one and add the second as a convenience later?
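The two variants can be sketched side by side on a hypothetical Dataset stand-in (zlib in place of a DICOM codec; attribute names invented for illustration): decompress() replaces the pixels in place, while decompressed() copies, leaving the original untouched.

```python
import copy
import zlib

# Hypothetical sketch: decompress() mutates in place (cheaper in
# memory); decompressed() returns a new dataset and leaves the
# original compressed.
class Dataset:
    def __init__(self, pixel_data, compressed=True):
        self.PixelData = pixel_data
        self.is_compressed = compressed

    def decompress(self):
        """In place: replaces the original pixels."""
        if self.is_compressed:
            self.PixelData = zlib.decompress(self.PixelData)
            self.is_compressed = False

    def decompressed(self):
        """Return a decompressed copy; the original dataset is untouched."""
        new = copy.deepcopy(self)
        new.decompress()
        return new

raw = b"\x01\x02" * 8
ds = Dataset(zlib.compress(raw))
out = ds.decompressed()
assert out.PixelData == raw and not out.is_compressed
assert ds.is_compressed                      # original left as-is
```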

@mrbean-bremen mrbean-bremen added this to the v1.1 milestone Feb 19, 2018
@mrbean-bremen mrbean-bremen modified the milestones: v1.1, v1.2 Jul 18, 2018
@mrbean-bremen mrbean-bremen removed this from the v1.2 milestone Sep 10, 2018
@darcymason darcymason added this to the v1.5 milestone Jun 29, 2019
@darcymason
Member Author

I'm closing this for now. I think the decompress() function is documented as to its effects, and we are kind of in a stable place with this for a while.
