
Integrated handling of decompression and Dataset state #525

Closed
darcymason opened this issue Dec 10, 2017 · 9 comments

@darcymason
Member

Looking for a solution to one of the last remaining problems before version 1.0. As discussed in #521, decompression will have side-effects that change other data elements. Pydicom should keep track of that state and do some updating and checking on writing files.

A simple use case for pydicom might be to decompress a set of files and write them back out again:

# Proposed
for filename in filenames:
    ds = pydicom.dcmread(filename)
    ds.decompress()
    ds.save_as(filename + "2")

I am submitting a pull request to separate out the decompress() step, which did not previously exist (until now, the only way to decompress was to ask for the pixels via ds.pixel_array).

At the moment, the state of pixel_array and PixelData are not connected, so in the past the user had to use ds.PixelData = pix.tobytes() to set the pixel data before writing the file. Here I propose that pydicom take care of that step implicitly, since decompress() is an in-place modification of the dataset.
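A minimal sketch of the idea, using a hypothetical stand-in for pydicom's Dataset (zlib stands in for a real DICOM codec, and the transfer syntax is modeled as a flat attribute, whereas in pydicom it lives in the file meta information — none of this is pydicom's actual implementation):

```python
import zlib

# Hypothetical Dataset stand-in: an in-place decompress() replaces
# PixelData and updates the transfer syntax together, so a subsequent
# save writes a consistent, uncompressed dataset.
UNCOMPRESSED_UID = "1.2.840.10008.1.2.1"  # Explicit VR Little Endian

class Dataset:
    def __init__(self, pixel_data, transfer_syntax_uid):
        self.PixelData = pixel_data
        self.TransferSyntaxUID = transfer_syntax_uid

    def decompress(self):
        """In place: PixelData and TransferSyntaxUID change as one operation."""
        if self.TransferSyntaxUID != UNCOMPRESSED_UID:
            self.PixelData = zlib.decompress(self.PixelData)
            self.TransferSyntaxUID = UNCOMPRESSED_UID

raw = bytes(range(16))
ds = Dataset(zlib.compress(raw), "1.2.840.10008.1.2.4.70")  # JPEG Lossless
ds.decompress()
assert ds.PixelData == raw
assert ds.TransferSyntaxUID == UNCOMPRESSED_UID
```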

darcymason added a commit that referenced this issue Dec 10, 2017
…ansfer Syntax is uncompressed type. Related to #525.
@darcymason
Member Author

Sorry to belabor this, but I'm stuck on the side-effect thing, and need it solved for version 1.0. To take the above a little further: if code accesses any data element, for example ds.ImagePositionPatient, or even modifies it, there is an expectation that other data elements remain unchanged. I don't think pixel data should be any different. With an explicit Dataset.decompress(), it is an operation on the dataset, not on an individual data element, so having multiple data elements change is more natural, and can be made clear in the documentation for that method.

Currently in the repository, accessing pixel_array does have side-effects, and I was previously encouraging that, but am now questioning it. I think pixel_array should throw an exception if side-effects would be needed, asking the user to first decompress. At least this behavior could be decided by a config option.
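A sketch of what that config-gated behavior could look like, on a hypothetical stand-in class (the attribute names and the config flag are invented for illustration; this is not pydicom's actual API):

```python
# Hypothetical: pixel_array raises if decompression (a side-effect)
# would be needed, unless an imagined config flag explicitly allows it.
class Dataset:
    allow_implicit_decompress = False  # imagined config option

    def __init__(self, compressed):
        self._compressed = compressed

    @property
    def pixel_array(self):
        if self._compressed and not self.allow_implicit_decompress:
            raise RuntimeError(
                "Pixel Data is compressed; call Dataset.decompress() first"
            )
        return [0] * 4  # stand-in for the converted numpy array

ds = Dataset(compressed=True)
try:
    _ = ds.pixel_array
except RuntimeError as exc:
    message = str(exc)
assert "decompress" in message
```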

The PR would need to separate things out a little further to make that happen. But I'm interested in comments first before coding that.

@rhaxton
Contributor

rhaxton commented Dec 12, 2017

@darcymason Yes, I agree that accessing the pixel_array should not change anything else. This keeps previous behavior. I also like that the state of pixel_array and PixelData are not connected.

To me, it is clear that PixelData is just the content of the original DICOM source, and pixel_array is just a method to get a numpy array of the PixelData. I would suggest that decompression should remain implicit, as I doubt that users would like to have to manage this on their client end (I wouldn't). I am already well trained that if I need to change pixel values and save them to a file, I have to do more than just x.pixel_array = numpy.zeros((10, 10)). There are many DICOM tags I need to update, and I don't expect the library to do that for me. That is a job for a higher-level library.

I would encourage a return to the previous behaviour where pixel_array and PixelData were two different things and pixel_array had no effect on PixelData.

@mrbean-bremen
Member

I agree with @rhaxton. I have thought a little about the possible use cases, and there are really not many. The most common is probably just accessing the pixel data, as described by @rhaxton, to do some image processing or similar - in this case handling the data independently of the DICOM file is correct, I think. If the result is to be written back to a DICOM file, this has to be done separately.
The other major case is just the decompression of the DICOM data in the DICOM file, as may be needed for example by a storescp implementation, where Dataset.decompress fits nicely.
I couldn't think of any other use cases where pydicom should be involved directly - as @rhaxton mentioned, this is the job of libraries like pynetdicom.

@darcymason
Member Author

Thanks for your comments.

I'm okay with keeping pixel_array and PixelData separate for the usual case of accessing pixel_array, but @rhaxton, I'm not clear on your opinion of the decompress() method, which does link them, changing the PixelData on writing. Since this is new, it does not change any behavior that anyone is used to. An alternative could be a decompressed() method, which returns a completely new dataset. Since the pixel data is being converted in either case, duplicating the other data elements does not take up much memory.

We'll have to be clear in the pixel_array documentation that for three-plane color it always returns RGB, even though the dataset PhotometricInterpretation says it is YBR. I realized that anyone checking the PI before accessing pixel_array, as opposed to after, would have that problem anyway.
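For reference, the YBR_FULL to RGB conversion in question (JFIF convention) can be written per-pixel in plain Python; pydicom's pixel handlers do this vectorized over whole numpy arrays, so this is for illustration only:

```python
# Per-pixel YBR_FULL -> RGB conversion (JFIF convention), for
# illustration; real handlers operate on whole numpy arrays at once.
def ybr_full_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(r), clamp(g), clamp(b)

# Neutral chroma (Cb = Cr = 128) maps to an equal-valued gray pixel
assert ybr_full_to_rgb(128, 128, 128) == (128, 128, 128)
assert ybr_full_to_rgb(0, 128, 128) == (0, 0, 0)
```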

@mrbean-bremen
Member

We'll have to be clear in the pixel_array documentation that for color three-plane it always returns RGB, even though the dataset PhotometricInterpretation says it is YBR.

Agreed. Anyway, I think this is the wanted behavior.

@rhaxton
Contributor

rhaxton commented Dec 15, 2017

@darcymason
How about this for a thought...

pixel_array always returns a decompressed numpy copy of the data in PixelData (or throws an exception) [this is the 0.9 behavior].
The DICOM PixelData is never changed by anything you do to the pixel_array.
Open question:

x = dataset.pixel_array
assert x[0] == 1
x[:] = 0
assert x[0] == 0
x = dataset.pixel_array
assert x[0] == 1  # or 0?

Right now, it is 0 because the pixel_array is cached with the dataset. If you wanted to get the original data again, you would have to manually delete/clear either dataset._pixel_array or dataset._pixel_id (or re-read the file). I'm OK with this as I do re-read the file if I want the original data. This is different from other attributes that return copies:

s = x.PatientSex
assert s == 'M'
s = 'O'
s = x.PatientSex
assert s == 'O' # fails with s == 'M'

I don't think that PixelData should be changed on writing after the numpy array has been modified (as the other attributes are not like PatientSex above). However, I think it would be a small improvement to allow:

x.PixelData = numpy.ones((10,10))

That just implicitly calls the .tobytes() function for you. But, then again, people will ask why we don't automatically update the whole pixel module to match (Rows, Cols, BitsStored, etc.).
I feel sure people are generally happy with the way things are interface-wise and they are used to handling most such things themselves.
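The implicit-tobytes() idea above could be sketched roughly as follows, with the stdlib array module standing in for numpy (this is a hypothetical setter, not pydicom's actual PixelData handling):

```python
from array import array

# Hypothetical PixelData setter that implicitly calls tobytes() on
# array-likes; stdlib array.array stands in for a numpy array here.
class Dataset:
    @property
    def PixelData(self):
        return self._pixel_data

    @PixelData.setter
    def PixelData(self, value):
        # Accept raw bytes, or anything exposing tobytes()
        # (numpy arrays and array.array both do).
        self._pixel_data = value.tobytes() if hasattr(value, "tobytes") else value

ds = Dataset()
ds.PixelData = array("H", [1, 0, 1])          # implicit tobytes()
assert ds.PixelData == array("H", [1, 0, 1]).tobytes()
ds.PixelData = b"\x01\x00"                    # raw bytes still work
assert ds.PixelData == b"\x01\x00"
```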

@darcymason
Member Author

Right now, it is 0 because the pixel_array is cached with the dataset. If you wanted to get the original data again, you would have to manually delete/clear either dataset._pixel_array or dataset._pixel_id

I was just thinking about that behavior in the last couple of days too. It was cached like that to avoid repeating expensive conversions (e.g. byte-swapping, if needed) every time the pixel data was accessed. That caching could be dropped, with documentation noting that it is better to keep your own reference rather than accessing pixel_array repeatedly.
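The caching behavior in question can be sketched like this, with plain lists standing in for numpy arrays and an invented private attribute playing the role of the cache (not pydicom's actual internals):

```python
# Hypothetical sketch of pixel_array caching: the converted array is
# cached on the dataset, so mutations persist across reads until the
# (hypothetical) cache attribute is cleared.
class Dataset:
    def __init__(self, pixel_bytes):
        self.PixelData = pixel_bytes
        self._pixel_array = None  # the cache

    @property
    def pixel_array(self):
        if self._pixel_array is None:
            # the expensive conversion step (byte-swapping etc.)
            self._pixel_array = list(self.PixelData)
        return self._pixel_array

ds = Dataset(bytes([1, 2, 3]))
x = ds.pixel_array
x[0] = 0
assert ds.pixel_array[0] == 0   # cached: the mutation is still visible
ds._pixel_array = None          # clearing the cache forces re-conversion
assert ds.pixel_array[0] == 1
```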

This is different from other attributes that return copies:

Note that is not true for mutables:

>>> ds.PixelSpacing
['0.661468', '0.661468']
>>> spacing = ds.PixelSpacing
>>> spacing[0] = 42
>>> ds.PixelSpacing
['42.0', '0.661468']

For both cases, that is the way python works for any mutables and immutables. The numpy array would normally fall in the mutable camp, but pixel_array is behaving like an immutable.

Another thought (two related ones, actually) I had recently, just to confuse this even more ;-) The current behavior does not work well for very large pixel data (see the first comment in #139; that user has > 1 GB files), because we are keeping two copies of the pixels. The second, related point is that if you then try to avoid loading pixels into memory using memmap, the only thing in the DICOM file to map to is the original PixelData. How would a separate copy be implemented? So in the memmap scenario, any changes would edit the pixel data in place.

Using the python/numpy buffer protocol, for non-compressed images we could point the numpy array to the pixel bytes directly and have only one copy. Numpy can also transparently deal with byte order without actually moving the bytes around. And then memmap would work in exactly the same way as fully loaded pixels. How to use memmap and deal with compressed formats, I'm not sure.
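The zero-copy idea can be demonstrated with the stdlib alone: a typed memoryview over the raw pixel bytes shares memory with them, so no second copy exists. (numpy's frombuffer with a byte-order-aware dtype such as '<u2' behaves the same way, and is what a real implementation would likely use.)

```python
import struct

# Buffer-protocol demonstration: a typed view over raw pixel bytes
# shares the underlying memory, so uncompressed PixelData would need
# no second copy.
pixel_data = struct.pack("=4H", 10, 20, 30, 40)  # four native-order 16-bit pixels
view = memoryview(pixel_data).cast("H")          # zero-copy typed view
assert list(view) == [10, 20, 30, 40]
assert view.obj is pixel_data                    # same underlying buffer
```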

@mrbean-bremen
Member

mrbean-bremen commented Dec 16, 2017

The current behavior does not work well for very large pixel data

The point about large pixel data is a good one, but I would leave that for a post-1.0 version. Using memmap for the original data is probably a good idea, but for compressed data a memory copy would be needed anyway, in my opinion. Another approach regarding large data sets (which are usually multi-frame) would be the ability to read single frames from multi-frame data - which is trivial in the case of uncompressed data, but certainly more demanding in the case of compressed data. Also something to think about after 1.0...
Update: the last statement was too pessimistic - already handled by @hackermd!

About mutable/immutable pixel data:
I also think that the original data should not be changed in the above scenario, only if using decompress() on the whole data set, though that is not a strong opinion.

Regarding decompress() vs decompressed():
decompressed() feels more natural in this context, but decompress() may be more convenient in the scenario where you just want to replace the original data set, and also needs less memory. Maybe implement both, or implement one and add the second as a convenience later?
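The two variants can be sketched side by side on a hypothetical Dataset stand-in (zlib in place of a DICOM codec; attribute names invented for illustration): decompress() replaces the pixels in place, while decompressed() copies, leaving the original untouched.

```python
import copy
import zlib

# Hypothetical sketch: decompress() mutates in place (cheaper in
# memory); decompressed() returns a new dataset and leaves the
# original compressed.
class Dataset:
    def __init__(self, pixel_data, compressed=True):
        self.PixelData = pixel_data
        self.is_compressed = compressed

    def decompress(self):
        """In place: replaces the original pixels."""
        if self.is_compressed:
            self.PixelData = zlib.decompress(self.PixelData)
            self.is_compressed = False

    def decompressed(self):
        """Return a decompressed copy; the original dataset is untouched."""
        new = copy.deepcopy(self)
        new.decompress()
        return new

raw = b"\x01\x02" * 8
ds = Dataset(zlib.compress(raw))
out = ds.decompressed()
assert out.PixelData == raw and not out.is_compressed
assert ds.is_compressed                      # original left as-is
```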

@mrbean-bremen mrbean-bremen added this to the v1.1 milestone Feb 19, 2018
@mrbean-bremen mrbean-bremen modified the milestones: v1.1, v1.2 Jul 18, 2018
@mrbean-bremen mrbean-bremen removed this from the v1.2 milestone Sep 10, 2018
@darcymason darcymason added this to the v1.5 milestone Jun 29, 2019
@darcymason
Member Author

I'm closing this for now. I think the decompress() function is documented as to its effects, and we are kind of in a stable place with this for a while.
