Old school bit-plane overlay handling #201

Open
moloney opened this issue Mar 14, 2022 · 15 comments

Comments

@moloney

moloney commented Mar 14, 2022

Does this project scrub the high bits in PixelData (above "BitsStored" and below "BitsAllocated") to clear out overlays stored this way? Initially I thought this is what the pixel cleaner code is for, but it looks like this is just for handling "burned in" text overlays where the only option is to blank out a predefined rectangular region.

@vsoch
Member

vsoch commented Mar 14, 2022

That's correct - the pixel cleaner is just for the text overlays.

If there is some other functionality you'd like, we welcome a PR, or providing enough detail so someone else could implement it.

@moloney
Author

moloney commented Mar 14, 2022

I am happy to make a code contribution, but I am still trying to learn my way around the project at this point so any guidance on how to accomplish this would be appreciated.

I am basically looking to remove the old (now deprecated) style of overlays described in the note at the end of page 707 of this document: https://dicom.nema.org/MEDICAL/Dicom/2004/printed/04_03pu3.pdf

These overlays can be cleared by zeroing all the bits of each pixel element above the number of "BitsStored". If "BitsStored" == "BitsAllocated" this can be skipped. In theory there should be an "OverlayBitPosition" element present if these types of overlays are being used, but it is safer to just always zero out these "extra" high bits if they exist.
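To make the arithmetic concrete, here is a minimal sketch in plain Python. It assumes the stored bits sit at the low end of each pixel cell (that is, "HighBit" equals "BitsStored" - 1); the function names are made up for illustration and are not part of deid or pydicom.

```python
def stored_bits_mask(bits_stored, bits_allocated):
    """Return a mask that keeps only the low `bits_stored` bits of a pixel cell."""
    if not 0 < bits_stored <= bits_allocated:
        raise ValueError("expected 0 < bits_stored <= bits_allocated")
    return (1 << bits_stored) - 1


def clear_high_bits(values, bits_stored, bits_allocated):
    """Zero every bit above bits_stored in each pixel value."""
    mask = stored_bits_mask(bits_stored, bits_allocated)
    return [value & mask for value in values]


# 12 bits stored in 16 allocated: bits 12..15 may carry overlay planes.
pixels = [0x0ABC, 0x8ABC, 0xF001]
cleaned = clear_high_bits(pixels, bits_stored=12, bits_allocated=16)
print([hex(value) for value in cleaned])
```

When "BitsStored" == "BitsAllocated" the mask has every bit set, so the values come back unchanged, which is why that case can be skipped.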

So there are basically three types of overlays:

  1. New style using the Overlay Module (https://dicom.nema.org/dicom/2013/output/chtml/part03/sect_C.9.html) which are easy to detect and delete
  2. Old style as described above, which are still relatively easy to delete but are less obvious (can't necessarily be detected without looking through the pixel data)
  3. "Burned in" where the original pixel value is overwritten and the only way to automatically detect it is using OCR, otherwise you are stuck predefining regions to blank out (based on details about the instrument / software / etc. used).

So handling 1 is easy by just deleting the overlay elements in your recipe, and 3 is handled by the current pixel cleaning code. I would like to add support for handling 2.
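For type 1, the elements in question live in the even-numbered repeating groups 0x6000 through 0x601E. The sketch below only illustrates which tags that covers: a plain dict of (group, element) tuples stands in for a real dataset, and `strip_overlay_elements` is a hypothetical helper, not a deid or pydicom function. In deid itself this would normally be expressed in the recipe.

```python
# Overlay Module elements use the even-numbered groups 0x6000..0x601E.
OVERLAY_GROUPS = frozenset(range(0x6000, 0x6020, 2))


def strip_overlay_elements(elements):
    """Return a copy of `elements` without any overlay-group entries.

    `elements` maps (group, element) tag tuples to values; it is a
    stand-in for a real DICOM dataset, used here for illustration only.
    """
    return {
        tag: value
        for tag, value in elements.items()
        if tag[0] not in OVERLAY_GROUPS
    }


dataset = {
    (0x0028, 0x0100): 16,            # BitsAllocated
    (0x0028, 0x0101): 12,            # BitsStored
    (0x6000, 0x0010): 512,           # OverlayRows
    (0x6000, 0x3000): b"\x00\x01",   # OverlayData
    (0x6002, 0x3000): b"\xff",       # OverlayData of a second overlay
}
cleaned = strip_overlay_elements(dataset)
print(sorted(cleaned))
```

Note that this does nothing for type 2: overlay bits packed into PixelData survive even after every 60xx element is gone.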

@vsoch
Member

vsoch commented Mar 15, 2022

Ah gotcha! So if you have used pydicom before, what you'd want to do is write a little script that shows loading a dicom dataset, and then checking and parsing the attributes. Example images would help here that we can add to tests (small and anonymized ideally). If you don't have example images, then minimally it would be good to send me something I can work with (and I won't put anywhere / will delete when I finish - I just need it to test and develop). Once you have that example and can show me, I can figure out the best UI interaction to add. E.g., it might be a different kind of clean, or something we do by default (and disabled with a flag) given that we find that kind of data. I do like how you've laid out those three categories of overlays, and I think we should add something like that to the docs to explain the options (of course when the time comes).

@jstorrs
Contributor

jstorrs commented Oct 17, 2022

https://www.medicalconnections.co.uk/kb/Number-Of-Overlays-In-Image

In general, I think the most paranoid thing should be the default/supported in deid. i.e. we should verify all unused bits in the pixel data are empty on output so that they cannot leak any information. I don't think we'd necessarily want to convert old overlays to new. That seems like something that should be added to pydicom itself if there's any want for it. My recommendation is deid just destroys old-school overlays until or unless pydicom provides other options for handling them.
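A check along those lines could look like the sketch below. It assumes the pixel data has already been decoded into an unsigned NumPy array and that the stored bits occupy the low end of each cell; `unused_bits_are_clear` is a hypothetical name, not an existing deid function.

```python
import numpy as np


def unused_bits_are_clear(pixels, bits_stored):
    """Return True if no pixel has a bit set above `bits_stored`."""
    pixels = np.asarray(pixels)
    if pixels.dtype.kind != "u":
        raise TypeError("expected an unsigned integer pixel array")
    bits_allocated = pixels.dtype.itemsize * 8
    all_bits = (1 << bits_allocated) - 1
    # Bits that are allocated but not stored, e.g. 0xF000 for 12 of 16.
    unused = all_bits & ~((1 << bits_stored) - 1)
    return not np.any(pixels & pixels.dtype.type(unused))


clean = np.array([0x0FFF, 0x0000], dtype=np.uint16)
dirty = np.array([0x0FFF, 0x1000], dtype=np.uint16)
print(unused_bits_are_clear(clean, 12), unused_bits_are_clear(dirty, 12))
```

Running a check like this on output, rather than only on input, would catch any code path that forgets to mask.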

I'll need to confirm but my hunch is old-school overlays were retired when support for compressed/encoded pixel data was added, so we can possibly branch based on whether there are any available bits that can be cleared.

@howff
Contributor

howff commented Nov 10, 2022

Deleting the overlays (or setting the high bits to zero for old-school overlays) will certainly help to deidentify. I would note that even CTP doesn't clear the high bits when deidentifying so this would give pydicom an advantage! However I think it's overkill for the purposes of removing PII.

Can we apply the same rectangle redaction to overlays, as we do to image planes?

DICOMs can have

  • multiple image frames
  • multiple overlays embedded in the high bits
  • multiple overlays
  • multiple frames per overlay

In order to deidentify without damaging anything else we need to redact rectangles where PII text is found, and leave all other parts alone. This means keeping overlays and removing only the sensitive text on them.

I'd like to be able to say "remove rectangle (x,y,w,h) from frame 27 of overlay 13" for example.

Can we do that?

@vsoch
Member

vsoch commented Nov 10, 2022

We can do almost anything if someone can show me a dummy example in code. ;)

@jstorrs
Contributor

jstorrs commented Nov 10, 2022

Basically what happens in the old-school overlays is that the images have "Bits Allocated" and "Bits Stored". This is a pixel-by-pixel setting and Bits Stored can be smaller than Bits Allocated (may not be accurate I'm just working from memory). So, for example, if you have 16 bits allocated per pixel and 14 bits stored, that leaves two extra bits that are wasted and unused and can be used for storing two overlays. I think where they end up also depends on byte-order and high bit settings. I'm just speaking generally here. In these old-school overlays you can just apply masks to the allocated bits to select the image vs selecting the overlay bits.

To remove the overlays you basically set the extra bits that are allocated but not stored to false (or true if you want to be annoying). And you can store the extracted overlays in the new format. This transformation of overlay styles is what I was suggesting belongs in pydicom more generally. Or... not depending on how much they still exist in the real world.
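As a rough sketch of that masking, assuming the pixels are already decoded into an unsigned NumPy array and the image occupies the low bits: shifting and masking selects one overlay plane, and masking with the stored-bits mask selects the image. `split_overlay_plane` is an illustrative name only; writing the extracted plane back out in the new overlay format is not shown.

```python
import numpy as np


def split_overlay_plane(pixels, bit_position, bits_stored):
    """Split pixel cells into (image, overlay) arrays.

    The overlay is the single bit at `bit_position`, returned as 0/1
    values; the image is whatever sits in the low `bits_stored` bits.
    """
    pixels = np.asarray(pixels)
    overlay = ((pixels >> bit_position) & 1).astype(np.uint8)
    image = pixels & pixels.dtype.type((1 << bits_stored) - 1)
    return image, overlay


# 16 bits allocated, 14 stored: bits 14 and 15 are free for two overlays.
cells = np.array([[0x0123, 0x4123],
                  [0x8001, 0xC000]], dtype=np.uint16)
image, overlay14 = split_overlay_plane(cells, bit_position=14, bits_stored=14)
print(image)
print(overlay14)
```

Here only the cells with bit 14 set show up in `overlay14`; the cell 0x8001 belongs to the other overlay (bit 15) and is cleared from the image all the same.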

I suspect the interplay with these extra bit planes with compression/lossy compression is not pretty (particularly if the overlays are modified) which is why this method of overlays was retired. Compressed images don't really have dead space sitting around and the new overlay format just packs the individual bits together (so they have to be decoded). The "new" style overlays can also have higher resolution than the original image.

Anyway deleting the old overlays by masking pixel bits is the easiest approach. Translating old-school overlays to new overlays seems like a utility function that belongs in pydicom itself. Just my two c.

I personally have never seen a multiframe where PHI appears in a rectangle on some frames and that same location (rectangle) contains non-PHI on other frames. They might be blank but clearing blank bits isn't a problem. i.e. I've not encountered cases where anything would be lost by applying the same mask to the entire multiframe image.

See also the discussion at the end of http://www.dclunie.com/medical-image-faq/html/part1.html

@howff
Contributor

howff commented Nov 11, 2022

I think we need four options. The original author of this issue probably wants option 1.

  1. unpack the image pixels, zero-out the high bits, re-write the file. That will ensure the image is safe from having overlays hidden in the high bits that most other deidentification software fails to remove.

  2. unpack the image pixels, burn the overlays hidden in the high bits onto the image pixels, zero-out the high bits, re-write the file. That allows you to redact a rectangle and also ensure the high bits are completely safe.

  3. apply the rectangle-redaction method to a specified overlay. For example we may only want to remove the patient name from overlay 2 which is hidden inside bit 14 of the image pixels. Requires unpacking the image pixels, zero-out the high bits within the rectangle, re-write the file.

  4. like option 2, but for any type of overlay. (ok, it's not exactly the title of this issue, but it's a very related problem)
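Option 3 could be sketched roughly as follows, assuming decoded, unsigned pixel data in a NumPy array shaped (rows, columns) or (frames, rows, columns). `redact_overlay_rect` and its parameters are hypothetical and only illustrate clearing one overlay bit inside a rectangle; unpacking and re-writing the file are left out.

```python
import numpy as np


def redact_overlay_rect(pixels, bit_position, x, y, w, h, frame=None):
    """Return a copy of `pixels` with one overlay bit cleared in a rectangle.

    The rectangle covers columns x..x+w-1 and rows y..y+h-1. With
    frame=None it is applied to every frame, otherwise to that frame only.
    """
    out = np.array(pixels, copy=True)
    if out.dtype.kind != "u":
        raise TypeError("expected an unsigned integer pixel array")
    all_bits = (1 << (out.dtype.itemsize * 8)) - 1
    # Every bit set except the overlay bit being redacted.
    keep = out.dtype.type(all_bits & ~(1 << bit_position))
    rows, cols = slice(y, y + h), slice(x, x + w)
    if frame is None:
        out[..., rows, cols] &= keep
    else:
        out[frame, rows, cols] &= keep
    return out


# Two 4x4 frames with image value 5 and an overlay fully set in bit 14.
frames = np.full((2, 4, 4), 0x4005, dtype=np.uint16)
redacted = redact_overlay_rect(frames, bit_position=14, x=1, y=1, w=2, h=2, frame=1)
print(redacted[1])
```

The image bits and the other overlay planes inside the rectangle are untouched, which is the point of option 3 compared with option 1.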

There are some sample images in the gdcm conformance test file collection which may be useful.

@howff
Contributor

howff commented Nov 23, 2022

This handles option 1, as the original poster wanted.
Does the code handle all cases?
It doesn't recompress, and I'm not sure I'm setting the transfer syntax correctly.

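import logging
import sys

import pydicom

logger = logging.getLogger(__name__)

# Element number of OverlayBitPosition within each 60xx overlay group
elem_OverlayBitPosition = 0x0102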

def remove_overlays_in_high_bits(ds):
    """ Mask off the high-bits of all the pixel values
    just in case there are overlays hidden in there
    which we want to remove (simply removing the 60xx tags
    does not actually remove the overlay pixels).
    """

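    # Sanity check that image has pixels
    if 'PixelData' not in ds:
        logger.debug('no pixel data present')
        return

    # bits_allocated is the physical space used, 1 or a multiple of 8.
    # bits_stored is the number of meaningful bits within those allocated.
    # Without BitsStored no mask can be built (a negative shift count
    # would raise ValueError), so give up early.
    if 'BitsStored' not in ds:
        logger.debug('BitsStored is missing, cannot build a bit mask')
        return
    bits_stored = ds['BitsStored'].value
    bits_allocated = ds['BitsAllocated'].value if 'BitsAllocated' in ds else -1
    # If every allocated bit is in use there are no spare high bits to clear.
    if 0 < bits_allocated <= bits_stored:
        logger.debug('no bits above BitsStored, nothing to clear')
        return
    # Keep only the low bits_stored bits, e.g. 12 -> 0x0fff.
    # This assumes the image occupies the low end (HighBit == BitsStored - 1).
    bit_mask = (1 << bits_stored) - 1
    samples = ds['SamplesPerPixel'].value if 'SamplesPerPixel' in ds else -1
    photometric = ds['PhotometricInterpretation'].value if 'PhotometricInterpretation' in ds else 'MONOCHROME2'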

    # This code calculates a bit mask from the actual overlays in use.
    # It is not used.
    # Instead we simply mask off everything outside the bits_stored bits.
    overlay_bitmask = 0
    for overlay_num in range(16):
        overlay_group_num = 0x6000 + 2 * overlay_num
        if (overlay_group_num, elem_OverlayBitPosition) in ds:
            overlay_bit = ds[overlay_group_num, elem_OverlayBitPosition].value
            # Sometimes it is present but the value is None!
            if not overlay_bit:
                overlay_bit = 0
            # Bit position must be >0 for a high-bit overlay
            if overlay_bit > 0:
                logger.debug('Found overlay in high-bit %d' % overlay_bit)
                overlay_bitmask |= (1 << overlay_bit)
    logger.debug('bits_stored = %d (image bits used)' % bits_stored)
    logger.debug('bits_allocated = %d (physical space needed)' % bits_allocated)
    logger.debug('bit_mask = %x (use & to get only image data)' % bit_mask)
    logger.debug('overlay_bitmask = %x (for overlays in use)' % overlay_bitmask)
    logger.debug('samples = %d' % samples)

    # Can only handle greyscale or palette images
    # XXX would an overlay ever be present in an RGB image? Doesn't make sense?
    if photometric not in ['MONOCHROME1', 'MONOCHROME2', 'PALETTE COLOR']:
        logger.debug('cannot remove overlays from %s' % photometric)
        return

    # Can only handle 1 sample per pixel
    # XXX would an overlay be present if multiple samples per pixel? Doesn't make sense?
    if samples > 1:
        logger.debug('cannot remove overlays from %d samples per pixel' % samples)
        return

    pixel_data = ds.pixel_array # this can raise an exception in some files
    logger.debug('ndim = %d (2 for a single frame, 3 for multiple frames)' % pixel_data.ndim)

    # Use numpy to mask the bits, handles both 8 and 16 bits per pixel.
    masked = (pixel_data & bit_mask)
    # Could also do this?
    #if pixel_data.ndim == 3:
    #    masked = (pixel_data[frame,:,:] & bit_mask)
    #else:
    #    masked = (pixel_data[frame,:,:,:] & bit_mask)
    ds.PixelData = masked.tobytes()

    # XXX does not re-compress
    if sys.byteorder == 'little':
        ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian
    else:
        ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRBigEndian
    # No need to handle ICC Profile, Colour Space, Palette Lookup Table, etc
    # No need to handle 2's complement Pixel Representation.
    return

@vsoch
Member

vsoch commented Nov 23, 2022

@moloney would you care to test this out to see if it fits your use case, and give feedback to @howff ? I think we could likely provide this as a custom deid-provided function if it's useful.

@howff
Contributor

howff commented Nov 24, 2022

Here's a sample program which implements all required options.
https://gist.github.com/howff/bcb104a3486fdc8a2dd6c2134ad4a0f0
I've tested it on a few images (multiple frames, multiple overlays in high-bits, multiple separate overlays).
Please can someone review it?
@vsoch

@vsoch
Member

vsoch commented Nov 24, 2022

@howff I think we need the original issue poster @moloney to test it out - I don't have any dicom data hanging around with this issue to test.

@howff
Contributor

howff commented Nov 28, 2022

A good way of testing pydicom is to use the sample files in https://sourceforge.net/projects/gdcm/files/gdcmData/ and https://sourceforge.net/projects/gdcm/files/gdcmConformanceTests/ Some filenames from these sets are in the script. You can also get sample files, such as multi-frame images, from https://gdcm.sourceforge.net/wiki/index.php/Sample_DataSet (I used the first link on that page).

@vsoch
Member

vsoch commented Nov 28, 2022

Yes, but we do need the original poster as the source of truth to test and report that the issue is fixed or not.

@howff
Contributor

howff commented Jan 17, 2023

@moloney Is the code provided above helpful to you? Does it work ok?
