I am an NLP student studying the castorini/hedwig implementation of DocBERT.
I would like to try using RoBERTa instead. My question concerns the implementation of convert_examples_to_features (in abstract_processor.py): I believe RoBERTa adds its special tokens differently than BERT does.
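For reference, the two models do frame a single sequence differently: BERT wraps it as [CLS] tokens [SEP], while RoBERTa wraps it as &lt;s&gt; tokens &lt;/s&gt;. A minimal sketch of the contrast (the helper names are hypothetical, and the ids assume the standard bert-base-uncased and roberta-base vocabularies; verify them against your own tokenizer):

```python
# Special-token ids (assumptions: bert-base-uncased and roberta-base
# vocabularies; check tokenizer.cls_token_id etc. for your checkpoint).
BERT_CLS, BERT_SEP = 101, 102    # [CLS], [SEP]
ROBERTA_BOS, ROBERTA_EOS = 0, 2  # <s>, </s>

def bert_single_sequence(token_ids):
    """BERT's format for one sequence: [CLS] tokens [SEP]."""
    return [BERT_CLS] + token_ids + [BERT_SEP]

def roberta_single_sequence(token_ids):
    """RoBERTa's format for one sequence: <s> tokens </s>."""
    return [ROBERTA_BOS] + token_ids + [ROBERTA_EOS]
```

This is presumably why the warning below checks whether the first id is 0: for roberta-base, 0 is the id of the &lt;s&gt; (CLS-equivalent) token.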
After simply swapping in the RoBERTa classes and model data, the Transformers RoBERTa modeling code warns (see below) that the special tokens have not been applied. The code that raises the warning checks whether the first token id is 0; in my case it is currently 3, the id for [CLS].
Warning: "A sequence with no special tokens has been passed to the RoBERTa model. This model requires special tokens in order to work. Please specify add_special_tokens=True in your tokenizer.encode() or tokenizer.convert_tokens_to_ids()."
If I modify that method, do you think it is simply a matter of prepending 0 to input_ids (after adjusting the sequence length for this additional token) to make it work correctly? I tried that, but it does not reach a good score compared to DocBERT.
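For what it's worth, prepending 0 alone is probably not sufficient: roberta-base also expects a closing &lt;/s&gt; (id 2), and its pad token id is 1 rather than 0, so BERT-style zero padding would look like a stream of &lt;s&gt; tokens. A minimal, hypothetical sketch of the per-example id construction (not hedwig's actual code; function name and ids are assumptions based on roberta-base):

```python
def build_roberta_features(token_ids, max_seq_length,
                           bos_id=0, eos_id=2, pad_id=1):
    """Wrap pre-tokenized ids with RoBERTa special tokens and pad.

    Assumes roberta-base ids: <s>=0, </s>=2, <pad>=1.
    Returns (input_ids, input_mask, segment_ids).
    """
    # Reserve two positions for <s> and </s>.
    token_ids = token_ids[: max_seq_length - 2]
    input_ids = [bos_id] + token_ids + [eos_id]
    input_mask = [1] * len(input_ids)
    # Pad up to max_seq_length; RoBERTa's pad id is 1, not 0.
    padding = [pad_id] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += [0] * len(padding)
    # RoBERTa does not use segment embeddings; keep these all zero.
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids
```

In practice it may be safer to let the tokenizer do this, e.g. tokenizer.encode(text, add_special_tokens=True), since that also applies RoBERTa's byte-level BPE conventions; a mismatch in padding id or a missing &lt;/s&gt; could plausibly explain part of the score gap versus DocBERT.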