How to Use Custom SpaCy Model (beki/en_spacy_pii_distilbert) with Anonymize and Sensitive Scanners #112

Open
rakendd opened this issue Mar 22, 2024 · 1 comment

rakendd commented Mar 22, 2024

Hello llm_guard Team,

I've been exploring the use of custom models with the Anonymize and Sensitive scanners within the llm_guard library, as mentioned in the changelog for the latest release. Specifically, I'm interested in integrating the SpaCy model beki/en_spacy_pii_distilbert for PII detection tasks.

Objective
My goal is to leverage the beki/en_spacy_pii_distilbert model, which is not a traditional Hugging Face Transformer model but rather a SpaCy model, for enhanced PII detection accuracy and reduced latency as highlighted in your changelog.

Issue
I encountered difficulties when attempting to load and use this SpaCy model with the Anonymize scanner. Typically, the process for integrating models relies on specifying a model path or configuration that is compatible with Hugging Face's Transformer models. However, given that beki/en_spacy_pii_distilbert is a SpaCy model, the standard approach doesn't seem to apply.

Attempts
Here's an outline of my approach so far, based on the available documentation and examples:

1. Model specification: I tried passing beki/en_spacy_pii_distilbert directly as a model path and through a configuration dictionary.
2. Custom recognizer: I explored creating a custom recognizer that wraps the SpaCy model loading and analysis logic.
3. Adapter pattern: I considered using an adapter to bridge the gap between the input/output formats the llm_guard scanners expect and those of the SpaCy model.

The third approach partially works, but I would like to know the best practice for using this model inside llm_guard:

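from llm_guard.input_scanners import Anonymize
from llm_guard.vault import Vault

# CustomSpacyRecognizer and CustomRecognizerAdapter are my own classes (definitions omitted here).
custom_recognizer = CustomSpacyRecognizer()
adapter = CustomRecognizerAdapter(custom_recognizer=custom_recognizer)

vault = Vault()
scanner = Anonymize(
    vault=vault,
    language="en",
    use_faker=True,
    custom_recognizer=adapter,  # Passing the adapter as the custom recognizer
)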
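
For readers following along, the adapter idea from the third approach can be sketched independently of llm_guard. The sketch below is not the poster's `CustomRecognizerAdapter` and not an llm_guard API: `SpacyNerAdapter`, `DetectedEntity`, `LABEL_MAP`, and the fixed `default_score` are all hypothetical names and values chosen for illustration. It relies only on the standard spaCy attributes `doc.ents`, `ent.label_`, `ent.start_char`, and `ent.end_char`, and it uses a stand-in pipeline so that it runs without spaCy installed. The label mapping is an assumption and should be checked against the model card.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional

# Assumed mapping from the model's NER labels to Presidio-style entity names.
# Verify the actual label set on the model card before relying on it.
LABEL_MAP: Dict[str, str] = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE_TIME": "DATE_TIME",
    "NRP": "NRP",
}


@dataclass(frozen=True)
class DetectedEntity:
    """One detected span, expressed as character offsets into the input text."""
    entity_type: str
    start: int
    end: int
    score: float


class SpacyNerAdapter:
    """Hypothetical adapter around any callable that returns a spaCy-like Doc."""

    def __init__(
        self,
        nlp: Callable[[str], Any],
        label_map: Optional[Dict[str, str]] = None,
        default_score: float = 0.85,  # placeholder: spaCy NER spans carry no confidence
    ) -> None:
        self._nlp = nlp
        self._label_map = LABEL_MAP if label_map is None else label_map
        self._default_score = default_score

    def analyze(self, text: str, entities: Optional[List[str]] = None) -> List[DetectedEntity]:
        """Run the pipeline and return spans, optionally filtered by entity type."""
        doc = self._nlp(text)
        results: List[DetectedEntity] = []
        for ent in doc.ents:
            # Unknown labels pass through unchanged.
            entity_type = self._label_map.get(ent.label_, ent.label_)
            if entities is not None and entity_type not in entities:
                continue
            results.append(
                DetectedEntity(entity_type, ent.start_char, ent.end_char, self._default_score)
            )
        return results


def fake_nlp(text: str) -> SimpleNamespace:
    """Stand-in for a loaded spaCy pipeline; returns two hard-coded entities."""
    return SimpleNamespace(ents=[
        SimpleNamespace(label_="PER", start_char=0, end_char=5),
        SimpleNamespace(label_="LOC", start_char=15, end_char=21),
    ])


adapter = SpacyNerAdapter(fake_nlp)
found = adapter.analyze("Alice lives in Berlin", entities=["PERSON"])
```

With spaCy and the model package installed, `fake_nlp` would be replaced by the pipeline object returned from `spacy.load(...)`. How the resulting spans are then handed to the Anonymize scanner depends on the llm_guard version in use and is not settled in this thread.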

Could you provide guidance or examples on how to correctly integrate a SpaCy model like beki/en_spacy_pii_distilbert with the Anonymize and Sensitive scanners?

Thank you for developing llm_guard and for your support in enhancing its capabilities. I look forward to your advice on integrating SpaCy models for improved PII detection.

Best regards,
Rakend

Collaborator

asofter commented Mar 22, 2024

Hey @rakendd, thanks for reaching out. We used to ship this model, but then realized that it blocked us from updating to the latest transformers because of its dependency on "spacy-transformers>=1.1.8,<1.2.0".

https://llm-guard.com/changelog/#030-2023-10-14
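
To see whether a given environment falls inside the pin quoted above, a rough check along these lines may help. It is only a sketch: `parse_release`, `satisfies_pin`, and `installed_version` are hypothetical helper names, the parser only handles plain numeric versions such as "1.1.9", and a real project would more likely rely on the `packaging` library or on pip's resolver.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Bounds taken from the pin quoted in the thread: spacy-transformers>=1.1.8,<1.2.0
LOWER: Tuple[int, int, int] = (1, 1, 8)
UPPER: Tuple[int, int, int] = (1, 2, 0)


def parse_release(version: str) -> Tuple[int, int, int]:
    """Approximate parser: takes the first three numeric components, pads with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def satisfies_pin(version: str) -> bool:
    """True if the version lies in the half-open range [LOWER, UPPER)."""
    return LOWER <= parse_release(version) < UPPER


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


current = installed_version("spacy-transformers")
if current is None:
    print("spacy-transformers is not installed")
else:
    print(f"spacy-transformers {current} within pin: {satisfies_pin(current)}")
```

Running this next to `pip show transformers` gives a quick picture of whether the model's requirement and the desired transformers release can coexist in one environment.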

I think if this model can be updated, then we could make another custom recognizer or just use the spacy one like we did before.
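
For context, "using the spacy one" would presumably mean pointing Presidio's spaCy NLP engine at the model. A minimal sketch of that configuration is shown below, assuming Presidio's `NlpEngineProvider` configuration format and assuming the model installs under the package name `en_spacy_pii_distilbert`. Neither the package name nor the exact wiring llm_guard used before is confirmed in this thread, and `spacy_nlp_configuration` is a hypothetical helper.

```python
from typing import Any, Dict


def spacy_nlp_configuration(model_name: str, lang_code: str = "en") -> Dict[str, Any]:
    """Build a Presidio-style NLP engine configuration for a single spaCy model."""
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": lang_code, "model_name": model_name}],
    }


# Assumed package name; check what the model wheel actually installs as.
config = spacy_nlp_configuration("en_spacy_pii_distilbert")

# With presidio-analyzer installed, a dict like this is typically consumed as:
#   from presidio_analyzer.nlp_engine import NlpEngineProvider
#   nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
print(config)
```

This only builds the configuration; loading the model still requires a spacy-transformers version that satisfies the model's own requirement, which is the conflict described above.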
