Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to return Reader for PixelData element #224

Open
ematvey opened this issue Dec 2, 2021 · 5 comments
Open

Add option to return Reader for PixelData element #224

ematvey opened this issue Dec 2, 2021 · 5 comments

Comments

@ematvey
Copy link

ematvey commented Dec 2, 2021

We currently are using DCMTK to

  1. read certain meta tags;
  2. write anonymised image by replacing PatientBirthDate and PatientName with non-identifiable value.

Current issue is 1+ Gb DICOMs which encode a large sequence of frames. It causes an unreasonable amount of CPU and RAM to read PixelDatas from such DICOMs, simply to write them to an anonymised file without any processing.

This library could be a great fit for anonymisation workflow if, for example, it would allow to It would be great if this library supported returning an io.Reader for the whole PixelData element, so that non-image elements could be interpreted but image elements could be streamed to output file without change.

@suyashkumar
Copy link
Owner

suyashkumar commented Dec 8, 2021

Interesting use case! I'm not sure if we could do this safely without leaving the DICOM file/input io.Reader open. E.g. you'd call parse, and under this option parse would finish, but one of the elemnts would have an io.Reader to read in the whole pixeldata? Maybe this could be done on demand (e.g. with a file offset stored in that element, so when you read from the reader, we reopen the file and start fetching the reader the PixelData bytes only--this gets complex when input isn't a file, but a generic io.Reader, which is what the current API is)?

Or maybe with this option, the input io.Reader is open till the user exhausts the PixelData io.Reader?

I think it could be workable, but a bit complex with some weird edge cases I'll need to think about.

One far out there option could be (under this special hypothetical option) to allow the user to register a special processing function for PixelData (or maybe any tag). Maybe something like

RegisterProcessingOption(tag.PixelData, func(r dicomio.Reader) error {
    // You could have a closure over your output file handle here I suppose...but would still be a little weird for your usecase. 
})

Not sure if I love this though, and I don't think we could allow it for just any tag, as some elements depend on prior elements to parse correctly.

Another option that would save you the CPU (but not the memory) is an option that just copies the PixelData bytes into memory without any processing at all, and you could grab it from that special tag during your workflow.

Also, note that this library does support an element by element API with ParseNext() (and a streaming API for PixelData Frames), but still maintains a history of all elements read so far in memory (unfortunately, this is necessary in some cases to "lookback" at previous elements, namely for PixelData and I think some Sequence parsing). It's possible this could be revisited carefully in a "memory light" option / case, but I'll need to think more about what that would look like.

Finally, and I'm sure you know this already, but if you're doing a true anonymization workflow, you probably would want to look at the images, because sometimes identifiable patient information can be in the images.

@suyashkumar
Copy link
Owner

suyashkumar commented Dec 9, 2021

Feel free to suggest an API if you had something specific in mind. Sounds like you were looking for something like I initially mentioned--some API where the returned element from Parser.Next() has an io.Reader to the PixelData you can use to dump the data to your output file. As mentioned, closing the input correctly in a system like this would be something to think about. In situations where the element-by-element Parser API is used directly, I think we could just leave that up to the user. There are some edge cases though, like what if some other thread advances the input reader the PixelData reader was supposed to read from, etc so I'm not sure how safe this is.

Probably easiest thing to do for your case in short term is:

  • at least save some CPU via an option to not parse the PixelData and just read the whole thing into an element buffer
  • explore a memory-light option where we don't hold the full dataset in memory, and only hold the tags that we need to possibly refer back to later for parsing downstream elements. If the user wanted, after calling Parse.Next() they could throw away the element, and move onto the next. (This is something I was thinking about anyway in the past as an optimization).
  • and as mentioned above, if you're doing a sensitive anonymization workflow, you should probably also consider looking at the images as some do have identifiable information "burned in" to the images.

@suyashkumar
Copy link
Owner

I saw there was a PR I think that may be related to this, and put some more thoughts there as well: #223 (review)

@suyashkumar
Copy link
Owner

suyashkumar commented Feb 20, 2023

We did introduce a SkipPixelData option, which skips the PixelData entirely (doesn't even load it into memory).

However, for a use case like this where some tags are modified and others are to be written back identically, we can maybe introduce another option like dicom.NoParseValue(tag1, tag2, tag3) which will still read in the data into the Dataset, but will try to do as little parsing/processing of the value as possible to save CPU. In some cases it may be trickier to not do any parsing at all (e.g. in nested sequences etc), but at least for Native PixelData there would be a speed up.

We can start by implementing support for PixelData first, and then go from there to possibly supporting other tags...

WDYT?

@suyashkumar
Copy link
Owner

Hi there! I have a draft PR #256 that addresses this to some degree--it skips the processing of NativePixelData in favor of just blindly loading the data into memory (so that it can be roundtripped out later).

Ultimately I think the right long term solution might be to instead have an option that indicates only certain elements should be processed, or a lazy load type of situation. However, for now, it seems like this is really most important for PixelData so I figure it is good to start there.

suyashkumar added a commit that referenced this issue Feb 20, 2023
…rving roundtrip read/write (#256)

This addresses _some_ of the discussion in #224. The SkipProcessingPixelDataValue option causes the PixelData bytes to be read into the element, but not processed any further. This option provides an option to save the CPU cycles processing NativePixel data but still gives a roundtrip-able Dataset.

If you want to save both CPU and Memory, you can instead use the SkipPixelData option, which doesn't load the PixelData bytes into memory at all.

In the future, we should consider an option to lazy load/process all Element values if possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants