DocRoBERTa #51

Open
daleangus opened this issue Feb 27, 2020 · 0 comments
daleangus commented Feb 27, 2020

I am a student of NLP and I am studying the castorini/hedwig implementation of DocBERT.

I would like to try using RoBERTa. My question concerns the implementation of convert_examples_to_features (in abstract_processor.py) for this goal, since I think RoBERTa adds its special tokens differently from BERT.
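For reference, this is my understanding of the difference, sketched with the Hugging Face transformers tokenizers. I am only sure about the special-token ids noted in the comments; the word-piece ids in between will vary:

```python
from transformers import BertTokenizer, RobertaTokenizer

# BERT wraps a single sequence as [CLS] ... [SEP]
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.encode("hello world", add_special_tokens=True))
# -> [101, ..., 102]   (101 = [CLS], 102 = [SEP])

# RoBERTa wraps a single sequence as <s> ... </s>
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")
print(roberta_tok.encode("hello world", add_special_tokens=True))
# -> [0, ..., 2]       (0 = <s>, 2 = </s>; note <pad> is 1, not 0)
```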

After simply swapping in the RoBERTa classes and model data, the Transformers modeling code for RoBERTa warns (see below) that the special tokens have not been applied. The code that raises the warning checks whether the first token id is 0; currently it is 3 (for [CLS]).

Warning: "A sequence with no special tokens has been passed to the RoBERTa model. This model requires special tokens in order to work. Please specify add_special_tokens=True in your tokenize.encode()or tokenizer.convert_tokens_to_ids()."

If I modify that method, do you think it is simply a matter of adding a 0 at the beginning of input_ids (after adjusting the length for the additional token) to make it work correctly? I tried it, but it does not get a good score compared to DocBERT.
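In case it helps frame the question, this is roughly what I have in mind: letting the tokenizer insert its own special tokens instead of hard-coding [CLS]/[SEP] or prepending a 0 by hand. The function name and signature below are only illustrative, not hedwig's actual code, and one thing I noticed is that RoBERTa's pad token id is 1, not 0:

```python
def convert_example_to_features(text, tokenizer, max_seq_length):
    # Let the tokenizer add the model's own special tokens:
    # <s> ... </s> for RoBERTa, [CLS] ... [SEP] for BERT.
    input_ids = tokenizer.encode(text, add_special_tokens=True, max_length=max_seq_length)

    input_mask = [1] * len(input_ids)

    # Pad with the tokenizer's pad token id (1 for RoBERTa, 0 for BERT)
    # rather than assuming 0 is the padding id.
    padding_length = max_seq_length - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * padding_length
    input_mask += [0] * padding_length

    # RoBERTa does not use segment embeddings, so all-zero segment ids are fine.
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids
```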
