attention masks tokenizer #126

Open · Ch-rode opened this issue Mar 21, 2022 · 2 comments

Ch-rode commented Mar 21, 2022

Hello! I'm trying to implement bert-base, but it's not clear to me how to generate the attention masks with the TAPETokenizer. This is my code:

import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

def preprocessing_for_tape(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode(
            sent,  # Preprocess sentence
            #add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            #max_length=MAX_LEN,                  # Max length to truncate/pad
            #pad_to_max_length=True,         # Pad sentence to max length
            #return_tensors='pt',           # Return PyTorch tensor
            #return_attention_mask=True,
            #truncation=True     # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
      

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')
token_ids

tensor([[ 2, 11,  7, 23, 25,  9,  8, 21,  7, 15, 13, 11, 16, 11,  5, 13, 15, 15,
         17, 11,  7, 25, 13, 11, 22, 11, 22, 15, 25,  5,  5, 11,  5, 15, 13, 23,
         20,  3]])

But my output (for example) contains only token ids: there is no attention mask, and no way to set max_length or padding.
How does it work? Thanks
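
For anyone hitting the same thing: TAPE's encode returns a plain array of token ids rather than an encode_plus-style dictionary, so padding, truncation, and the attention mask have to be built by hand. Below is a minimal sketch of one way to do that; the pad id of 0 for the IUPAC vocab and the input_mask keyword on the model are assumptions worth verifying against the TAPE source.

import numpy as np
import torch

from tape import ProteinBertModel, TAPETokenizer

tokenizer = TAPETokenizer(vocab='iupac')
PAD_ID = 0  # assumption: '<pad>' is id 0 in the IUPAC vocab; check the tokenizer's vocab to confirm

def encode_batch(sequences, max_length=None):
    """Encode sequences with TAPETokenizer and build the padding/attention masks by hand."""
    encoded = [tokenizer.encode(seq) for seq in sequences]   # each entry is a 1-D array of token ids
    if max_length is not None:
        encoded = [ids[:max_length] for ids in encoded]      # simple truncation
    longest = max(len(ids) for ids in encoded)

    input_ids = np.full((len(encoded), longest), PAD_ID, dtype=np.int64)
    attention_masks = np.zeros((len(encoded), longest), dtype=np.int64)
    for i, ids in enumerate(encoded):
        input_ids[i, :len(ids)] = ids
        attention_masks[i, :len(ids)] = 1                    # 1 = real token, 0 = padding

    return torch.from_numpy(input_ids), torch.from_numpy(attention_masks)

input_ids, attention_masks = encode_batch(['GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ',
                                           'MKTAYIAKQR'])
model = ProteinBertModel.from_pretrained('bert-base')
output = model(input_ids, input_mask=attention_masks)        # input_mask keyword is an assumption
sequence_output, pooled_output = output[0], output[1]

The mask simply marks real tokens with 1 and padding positions with 0, which is the form BERT-style models expect.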

rmrao (Collaborator) commented Mar 21, 2022

Hi! Do you specifically want to re-implement bert-base, or just a transformer? I have code to train a version of ESM-1b here. This code scales better and will also result in better performance.

In that repo, the data processing is done in these lines. The masking code is then implemented in this class.

I have a bunch of utilities implemented in github.com/rmrao/evo, if it's helpful.

If you specifically want the masking code from TAPE, it's implemented here.

Hope this helps!
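
Note that the masking referred to above is the masked-language-modeling masking used during pre-training (which tokens to hide and predict), not the padding/attention mask from the question. A generic sketch of the standard BERT 80/10/10 recipe follows; it is not the exact code from either repo, and the special-token ids are assumptions for TAPE's IUPAC vocab.

import numpy as np

# assumed ids for the IUPAC vocab: <pad>=0, <mask>=1, <cls>=2, <sep>=3
MASK_ID, CLS_ID, SEP_ID, PAD_ID = 1, 2, 3, 0

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, rng=np.random):
    """Standard BERT-style masking: 80% <mask>, 10% random token, 10% unchanged."""
    token_ids = token_ids.copy()
    labels = np.full_like(token_ids, -1)      # -1 marks positions ignored by the loss

    special = np.isin(token_ids, [CLS_ID, SEP_ID, PAD_ID])
    selected = (~special) & (rng.rand(*token_ids.shape) < mask_prob)
    labels[selected] = token_ids[selected]    # predict the original token at selected positions

    roll = rng.rand(*token_ids.shape)
    token_ids[selected & (roll < 0.8)] = MASK_ID                      # 80%: replace with <mask>
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)              # 10%: replace with a random token
    token_ids[random_pos] = rng.randint(0, vocab_size, random_pos.sum())
    # remaining 10%: keep the original token
    return token_ids, labels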

Ch-rode (Author) commented Mar 21, 2022

Hello! Thanks for the information. I would like to re-implement bert-base for a sequence classification task.
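
For sequence classification on top of bert-base, one common pattern is a linear head over the model's pooled output. A rough sketch, assuming ProteinBertModel returns (sequence_output, pooled_output) and accepts an input_mask; TAPE also ships task-specific classification models, so those are worth checking before rolling your own.

import torch.nn as nn
from tape import ProteinBertModel

class BertSequenceClassifier(nn.Module):
    """Illustrative linear classification head on top of TAPE's pooled representation."""

    def __init__(self, num_labels):
        super().__init__()
        self.bert = ProteinBertModel.from_pretrained('bert-base')
        hidden_size = self.bert.config.hidden_size   # assumption: the config exposes hidden_size (768 for bert-base)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, input_mask=None):
        outputs = self.bert(input_ids, input_mask=input_mask)   # input_mask keyword is an assumption
        pooled_output = outputs[1]
        return self.classifier(pooled_output)

# usage with the tensors from the (hypothetical) encode_batch helper sketched earlier:
# logits = BertSequenceClassifier(num_labels=2)(input_ids, input_mask=attention_masks)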
