
For sentence classification using BERT, PAD token is used in IG/Deeplift? #1269

Open
lkqnaruto opened this issue Apr 10, 2024 · 1 comment

@lkqnaruto

For a sentence classification task using BERT, is the PAD token used as the baseline in IG/DeepLift? Or the unknown token? Or can it be customized?

@EldadTalShir

The default baseline in IG is a zero scalar for each input tensor (effectively the PAD token for BERT, whose pad_token_id is 0). It can be customized by setting the baselines parameter when calling the attribute function. For example, to use UNK as the reference (assuming seq_len is the number of tokens in your input):

# Custom reference token for IG
import torch
from transformers import AutoTokenizer
from captum.attr import TokenReferenceBase

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')  # Load your model's tokenizer
ref_token_id = tokenizer.unk_token_id  # Pick the id of your desired reference token; tokenizer.all_special_tokens lists all special tokens your model supports
token_reference = TokenReferenceBase(reference_token_idx=ref_token_id)  # Captum builds a reference sequence from the number of tokens in your input
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
ref = token_reference.generate_reference(seq_len, device=device).unsqueeze(0)

Then, when you call attribute, set baselines=ref. You can also follow this tutorial: https://captum.ai/tutorials/IMDB_TorchText_Interpret
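Conceptually, all generate_reference does is build a sequence of seq_len copies of the reference token id, which then stands in for the real input during attribution. A minimal pure-Python sketch of that behavior (no torch needed; the id 100 is BERT's usual unk_token_id, used here purely for illustration):

```python
def generate_reference(seq_len, ref_token_id):
    # Mirrors the shape of Captum's TokenReferenceBase.generate_reference:
    # a baseline sequence of length seq_len filled with the reference token id.
    return [ref_token_id] * seq_len

# A 6-token input with UNK (id 100 in bert-base-uncased) as the reference:
baseline = generate_reference(6, 100)
print(baseline)  # [100, 100, 100, 100, 100, 100]
```

This makes clear why the default zero baseline behaves like PAD for BERT: a baseline of all zeros is exactly this sequence with ref_token_id = 0, which is BERT's pad_token_id.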
