Anonymise DICOM pixel data using OCR #252

howff · 2023-05-02T14:42:23Z

I notice you have a request for anonymising pixel data using OCR. I have been working on this, but in a separate code base, not as modifications to deid. It turns out that the hardest part is the evaluation, not the actual OCR. What I can report right now is that easyocr (a Python library) gives really excellent results. There are still a few things to watch out for, but I think it would be quite easy to integrate easyocr into deid.
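
To illustrate what the redaction step after OCR could look like, here is a minimal sketch. It is not code from deid or from the separate code base mentioned above. It assumes detections arrive in the shape easyocr's `readtext` returns (four corner points, the text, and a confidence score), and the function name `redact_detections` and its parameters are made up for this example.

```python
def redact_detections(pixels, detections, min_confidence=0.0, fill=0):
    """Blank the bounding rectangle of each OCR detection, in place.

    pixels     : 2D image indexed as pixels[row][col] (nested lists or an array)
    detections : iterable of (corners, text, confidence), where corners is
                 four [x, y] points (assumed to match easyocr's output shape)
    Returns the number of regions that were blanked.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    redacted = 0
    for corners, _text, confidence in detections:
        if confidence < min_confidence:
            continue  # skipped detections stay visible, so keep this low
        xs = [int(x) for x, _ in corners]
        ys = [int(y) for _, y in corners]
        # Axis-aligned rectangle around the corners, clamped to the image.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                pixels[y][x] = fill
        redacted += 1
    return redacted
```

With easyocr installed, the detections would presumably come from something like `easyocr.Reader(['en']).readtext(image)`. The default `min_confidence=0.0` blanks every detected region, which seems the safer choice for anonymisation, since a missed name matters more than an over-redacted label.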

vsoch · 2023-05-02T17:37:05Z

That sounds great! Let me know what I can do to support you for that.

omri374 · 2023-05-18T12:44:58Z

We (working on Microsoft Presidio) currently have this capability in beta: https://microsoft.github.io/presidio/image-redactor/
We'd be happy to collaborate on this.

howff · 2023-05-18T20:34:30Z

Thanks very much for your contribution @omri374 !
I see that it's using Tesseract for OCR and spaCy for NER/PII.
In my experience Tesseract is dreadful at OCR in the real world (I'm testing on all radiology images for a whole country), needing too much pre-processing and then giving a very poor result.
And in my experience SpaCy is very unreliable at NER in this context (it's ok for sentences, sometimes, but useless for text fragments found by OCR in radiology images).
I'm happy to hear that you seem to have had better success though.

omri374 · 2023-05-19T07:34:26Z

Hi @howff, Presidio is very customizable and allows you to plug in multiple tools. Currently we are using Tesseract, but we are working on a next version that would let you plug in any OCR engine easily: microsoft/presidio#1049

As this is still in design, we'd be very happy to get your feedback on this based on your experience with DICOM de-identification and are open to contributions of all sorts.

For NER, we also support multiple NLP tools such as Huggingface and Flair. In our demo, you can experiment with two BERT-based approaches and a Flair approach: https://huggingface.co/spaces/presidio/presidio_demo

I agree that any NER wouldn't necessarily be accurate on OCR output, so we use hints from the DICOM metadata, and the detection of PHI can be customized using other approaches such as rule-based patterns and deny-lists.
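
As a rough illustration of the deny-list and rule-based idea described above, the sketch below checks OCR text fragments against tokens taken from DICOM metadata values plus a date-like pattern. It is not Presidio's API or implementation. The helper names `build_deny_list` and `looks_like_phi`, the `DATE_LIKE` pattern, and the minimum token length are all assumptions made for this example.

```python
import re

# Hypothetical rule: dd/mm/yyyy-style dates or 8-digit runs (e.g. YYYYMMDD).
DATE_LIKE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b|\b\d{8}\b")


def build_deny_list(metadata_values, min_length=3):
    """Split metadata values (e.g. 'DOE^JOHN') into lowercase tokens."""
    tokens = set()
    for value in metadata_values:
        # DICOM person names separate components with '^'.
        for token in re.split(r"[\^\s,]+", str(value)):
            token = token.strip().lower()
            if len(token) >= min_length:  # very short tokens match too much
                tokens.add(token)
    return tokens


def looks_like_phi(fragment, deny_list, patterns=(DATE_LIKE,)):
    """Return True if an OCR fragment hits the deny-list or any pattern."""
    words = re.findall(r"[a-z0-9]+", fragment.lower())
    if any(word in deny_list for word in words):
        return True
    return any(pattern.search(fragment) for pattern in patterns)
```

For example, a deny-list built from the patient name and ID in the header would flag the fragment `"Doe, J"` but leave `"LEFT LATERAL"` alone. Exact token matching like this will miss OCR misreads, so a real tool would likely need fuzzy matching as well.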

omri374 · 2023-05-19T07:35:09Z

cc @niwilso

howff · 2023-05-19T07:42:24Z

That's exactly the same approach I've taken here (see ocrengine.py and nerengine.py) https://github.com/SMI/dicompixelanon

omri374 · 2023-05-19T09:52:21Z

@howff this looks great!
@vsoch and @howff, if you'd like to collaborate on this, and see how we can integrate all of this into a Presidio+PyDICOM tool, we would be happy to work on this together.

vsoch · 2023-05-19T12:01:39Z

Yeah! I’m happy to help however I can.
