Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEID dropping UIDs #259

Open
peter-kuzmak opened this issue Dec 12, 2023 · 11 comments
Open

DEID dropping UIDs #259

peter-kuzmak opened this issue Dec 12, 2023 · 11 comments

Comments

@peter-kuzmak
Copy link

Hi Everyone,

I am brand new to DEID and very inexperienced with Python. I do have three decades of experience with DICOM though and have used commercial products to do deidentification in the past.

We would like to use DEID in the future.

My first program is from the Getting Started web tutorial. We I try it, the parser deletes the SOP Instance UID (0008,00118), Study Instance UID (0020,000D), Series Instance UID (0020,000E), Instance Number (0020,0013), Media Storage SOP Instance UID (0002,0003) and Reference SOP Instance UID (0008,1155). It did add Patient Identity Removed (0012,0062). I am using a different DICOM object than the ones provided in deid_data.

I have a minimal recipe:

FORMAT dicom

%header

KEEP SOPInstanceUID
ADD PatientIdentityRemoved YES

The web example retains the same UIDs.

Any idea why the UIDs are been dropped?

Thanks!

Peter Kuzmak, Department of Veterans Affairs

@vsoch
Copy link
Member

vsoch commented Dec 12, 2023

hey @peter-kuzmak ! I have limited bandwidth to help on this, but if you can share a dummy dicom file and the dicom.deid and the exact way to reproduce your use case, I can likely make some time over a weekend. Thanks!

@wetzelj
Copy link
Contributor

wetzelj commented Dec 13, 2023

Hi @peter-kuzmak,

While I have a good amount of experience with this library, I've never tried to follow the getting started guide. Based on the results your seeing, I suspect that your recipe file is getting applied in addition to the "stock" deid.dicom (https://github.com/pydicom/deid/blob/master/deid/data/deid.dicom). In the stock version, you'll see two lines that would be responsible for removing any UIDs from the header (REMOVE endswith:ID and REMOVE contains:UID).

In my application I utilize the replace_identifiers function for header manipulation - and specify the recipes in the parameters to this function.

Like @vsoch, I don't have a great deal of bandwidth but am happy to chime in with input. It's always great to see others using the library!

@peter-kuzmak
Copy link
Author

Hi Venessa and James,

I am still at a loss. I have changed /deid/data/deid.dicom to comment out the REMOVE's. I still get the UIDs removed.

I am using replace_identifiers for header manipulation. I have attached the files.

Thanks for your help!

Peter
kuzmak.zip

@wetzelj
Copy link
Contributor

wetzelj commented Dec 14, 2023

Hi @peter-kuzmak,

Looking at deid.dicom-kuzmak;

FORMAT dicom

%values cookie_names
SPLIT PatientID by="^";minlength=4

%values operator_names
FIELD startswith:Operator

%values patient_name
FIELD PatientName

%values patient_id
FIELD PatientID

%fields instance_fields
FIELD contains:Instance

%header

ADD PatientIdentityRemoved YES
REPLACE values:cookie_names var:id
REPLACE values:operator_names var:source_id
REMOVE fields:instance_fields
ADD SOPInstanceUID 1.2.3.4.5.6.7.8.9.10

Your issue seems to be the "instance_fields" fields list. By creating a fields list in this way you're creating a group of all UID fields (Every DICOM Field where the field name contains "Instance"). By using "contains:Instance" you're capturing StudyInstanceUID, SeriesInstanceUID, InstanceNumber, etc. Further down in your recipe, you have REMOVE fields:instance_fields which is going to then direct deid to remove the fields identified in the fields list.

If you remove the the following lines from your recipe, the UIDs should be retained:

%fields instance_fields
FIELD contains:Instance

REMOVE fields:instance_fields

Hopefully this helps. If not, we'll go from there! :)

@wetzelj
Copy link
Contributor

wetzelj commented Dec 14, 2023

You also may be overusing %values lists... depending on what you're trying to do.

%values operator_names
FIELD startswith:Operator

With this command, the operator_names values list will contain every tag that has a VALUE that is any field that starts with Operator. For example, let's say tag (0008,1070) Operator's Name contains a value of "SIMPSON". With a values list, all DICOM header tags will be scanned for a value of "SIMPSON". If this is contained within the tag's value, it's added to the values list. In other words, if the (0010,0010) Patient Name tag contains "SIMPSON^LISA", it would be identified by the values list.

With this example, your command REPLACE values:operator_names var:source_id would end up replacing both (0008,1070) and (0010,0010) with the value in source_id.

If you're really just trying to replace fields starting with Operator with the value in source_id, you could eliminate the values list all together and just go with single directive:

REPLACE startswith:Operator var:source_id

In my experience values lists are mostly beneficial for removals. My primary use of values lists centers around removing private tags that contain PHI. The way I use a values list is as follows:

%values known_phi
FIELD AccessionNumber
FIELD PatientID
FIELD PatientBirthDate
SPLIT PatientName by="^";minlength=4

The known_phi values list gets us a list of every tag that contains a value that is one of the identifiers in the list for the particular study.

This is followed with:

REMOVE  values:known_phi

@peter-kuzmak
Copy link
Author

peter-kuzmak commented Dec 14, 2023 via email

@peter-kuzmak
Copy link
Author

peter-kuzmak commented Dec 14, 2023 via email

@wetzelj
Copy link
Contributor

wetzelj commented Dec 14, 2023

Glad to help! There's a lot of functionality in the library and takes some time to get used to it all.

@peter-kuzmak
Copy link
Author

peter-kuzmak commented Jan 24, 2024 via email

@wetzelj
Copy link
Contributor

wetzelj commented Jan 24, 2024

Hi Peter,

Happy New Year!
Regarding your questions:

  1. I have a rule that encrypts UIDs. It works on Study Instance UID (0020,000D). It does not work on Media Storage SOP Instance UID(0002,0003), although it is in the recipe.

Can you double check your recipe on this? While I don't have personal experience targeting this field, it was a later addition to the library to enable targeting of meta fields and to that end, there is a specific unit test which covers replacing MediaStorageSOPInstanceUID. Do you have a specific recipe rule that targets MediaStorageSOPInstanceUID by name or are you trying to rely on an expander? Your issue makes me wonder if perhaps the expanders aren't including the meta fields.

  1. I can't figure out how to de-identify sequences. They seem to be passed untouched.

Again, this isn't an area that I've explored in great detail. Typically in my projects we either remove sequences entirely or find that using a values list and then performing a REMOVE based on the values list clears the desired information from the sequence. Can you provide samples of your recipe and the headers?

  1. How do you incorporate the pixel cleaning? I thought it would be done automatically as part of parser.parse().

The pattern that we follow is to first create a DicomCleaner object and then run the detect method on each file being deidentified. The DicomCleaner.detect() will provide an output result which determines which filters apply to the file (blacklist or graylist). If a blacklist filter was detected, you can simply throw away the file. If a greylist filter was detected, then you need to call DicomCleaner.clean() to actually perform the pixel manipulation followed by DicomCleaner.save_dicom() to save the result file.

Only after the above is completed do we call replace_identifiers(). This enables us to ensure that we're only performing replace_identifiers if file actually needs to be included in the result set. We're not processing the identifier replacement on blacklisted files.

There might be better ways to do this now with Parser.parse. The process that I've outlined above is some of our earliest code using deid... honestly, I think it was written before parse was added to the library.

@vsoch - Do you have an thoughts on @peter-kuzmak's questions?

@vsoch
Copy link
Member

vsoch commented Jan 25, 2024

We have around 60k retinal studies

I would want to get the machine information and double check we have it included.

@wetzelj is spot on for his answers. My suggestion would be to get us a dummy image that reproduces your issue so we can test it and either help to demonstrate a solution or put in a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants