Anonymise DICOM pixel data using OCR #252

Open
howff opened this issue May 2, 2023 · 8 comments

Comments

@howff
Contributor

howff commented May 2, 2023

I notice you have a request for anonymising pixel data using OCR. I have been working on this, but in a separate code base, not as modifications to deid. It turns out that the hardest part is the evaluation, not the actual OCR. What I can report right now is that easyocr (a Python library) gives really excellent results. There are still a few things to watch out for, but I think it would be quite easy to integrate easyocr into deid.
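To illustrate what the redaction step after OCR might look like, here is a minimal sketch, not code from deid or from any existing tool. It assumes each detection arrives as a `(corners, text, confidence)` tuple with four `[x, y]` corner points, which is my understanding of what easyocr's `readtext` returns; the function name `redact_boxes` and the plain nested-list frame are hypothetical stand-ins for a real pixel array.

```python
def redact_boxes(pixels, ocr_results, fill=0):
    """Blank the axis-aligned bounding rectangle of each OCR detection.

    pixels: 2-D list of rows (a stand-in for one greyscale frame).
    ocr_results: iterable of (corners, text, confidence), where corners is
    four [x, y] points. This shape is an assumption about the OCR output.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    redacted = [list(row) for row in pixels]  # copy, leave the input untouched
    for corners, _text, _confidence in ocr_results:
        xs = [int(point[0]) for point in corners]
        ys = [int(point[1]) for point in corners]
        # Clamp the rectangle to the frame so stray coordinates cannot overflow.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                redacted[y][x] = fill
    return redacted


# Hypothetical 6x4 all-white frame with one detected text region.
frame = [[255] * 6 for _ in range(4)]
detections = [([[1, 1], [3, 1], [3, 2], [1, 2]], "SMITH^JOHN", 0.98)]
clean = redact_boxes(frame, detections)
```

This blanks every detection regardless of content, which is the conservative choice; deciding which fragments are actually identifying is a separate step.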

@vsoch
Member

vsoch commented May 2, 2023

That sounds great! Let me know what I can do to support you for that.

@omri374

omri374 commented May 18, 2023

We (working on Microsoft Presidio) currently have this capability in beta: https://microsoft.github.io/presidio/image-redactor/
We'd be happy to collaborate on this.

@howff
Contributor Author

howff commented May 18, 2023

Thanks very much for your contribution @omri374!
I see that it's using Tesseract for OCR and spaCy for NER/PII.
In my experience Tesseract is dreadful at OCR in the real world (I'm testing on all radiology images for a whole country): it needs too much pre-processing and then gives a very poor result.
And in my experience spaCy is very unreliable at NER in this context (it's sometimes OK for sentences, but useless for text fragments found by OCR in radiology images).
I'm happy to hear that you seem to have had better success though.

@omri374

omri374 commented May 19, 2023

Hi @howff, Presidio is very customizable, and allows you to plug in multiple tools. Currently, we are using Tesseract, but we are working on a next version which would allow you to plug in any OCR engine easily: microsoft/presidio#1049

As this is still in design, we'd be very happy to get your feedback on this based on your experience with DICOM de-identification and are open to contributions of all sorts.

For NER, we support multiple NLP tools, including Hugging Face and Flair. In our demo, you can experiment with two BERT-based approaches and a Flair approach: https://huggingface.co/spaces/presidio/presidio_demo

I agree that any NER wouldn't necessarily be accurate on OCR output, so we use hints from the DICOM metadata, and can customize the detection of PHI using other approaches such as rule-based patterns and deny-lists.
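As a rough illustration of the metadata-hint idea, the sketch below builds a deny-list from a few DICOM header values and checks OCR fragments against it, with one rule-based date pattern as a fallback. It is not Presidio's implementation: the helper names `build_deny_list` and `is_phi`, the choice of tags, the date regex, and the plain dict standing in for a parsed DICOM header are all assumptions made for the example.

```python
import re

# Hypothetical rule-based pattern: numeric dates such as 02/05/1961 or 19610502.
DATE_PATTERN = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b|\b\d{8}\b")


def build_deny_list(metadata, keys=("PatientName", "PatientID", "AccessionNumber")):
    """Collect identifying terms from header values (metadata is a plain dict here)."""
    terms = set()
    for key in keys:
        value = str(metadata.get(key, "")).strip()
        if not value:
            continue
        terms.add(value.upper())
        # DICOM person names separate their components with '^'.
        for part in re.split(r"[\^\s,]+", value):
            if len(part) >= 3:  # skip initials and very short tokens
                terms.add(part.upper())
    return terms


def is_phi(fragment, deny_list):
    """Flag an OCR fragment if it contains a deny-list term or looks like a date."""
    text = fragment.upper()
    if any(term in text for term in deny_list):
        return True
    return bool(DATE_PATTERN.search(fragment))


metadata = {"PatientName": "SMITH^JOHN", "PatientID": "1234567"}
deny = build_deny_list(metadata)
fragments = ["Smith John", "L LAT", "DOB 02/05/1961", "ID:1234567"]
flagged = [f for f in fragments if is_phi(f, deny)]
```

Simple substring matching like this works on short fragments where sentence-level NER has little context to go on, though it will miss names that the OCR step has misread.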

@omri374

omri374 commented May 19, 2023

cc @niwilso

@howff
Contributor Author

howff commented May 19, 2023

That's exactly the same approach I've taken here (see ocrengine.py and nerengine.py): https://github.com/SMI/dicompixelanon

@omri374

omri374 commented May 19, 2023

@howff this looks great!
@vsoch and @howff, if you'd like to collaborate on this, and see how we can integrate all of this into a Presidio+PyDICOM tool, we would be happy to work on this together.

@vsoch
Member

vsoch commented May 19, 2023

Yeah! I’m happy to help however I can.
