This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

AllenNLP biased towards BERT #5711

Open

pvcastro opened this issue Sep 13, 2022 · 12 comments

Comments

@pvcastro
Contributor

Checklist

  • [x] I have verified that the issue exists against the main branch of AllenNLP.
  • [x] I have read the relevant section in the contribution guide on reporting bugs.
  • [x] I have checked the issues list for similar or identical bug reports.
  • [x] I have checked the pull requests list for existing proposed fixes.
  • [x] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
  • [ ] I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • [x] I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • [x] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • [x] I have included in the "Environment" section below the output of pip freeze.
  • [x] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

I've been using AllenNLP since 2018 and have already run thousands of NER benchmarks with it. Since ELMo, and later with transformers, its CrfTagger model has always yielded superior results in every benchmark I've run for this task. However, since my research group trained several RoBERTa models for Portuguese, we have been running benchmarks comparing them against an existing BERT model, and we have been getting results that are inconsistent with those from other frameworks, such as huggingface's transformers.

Sorted results for the AllenNLP grid search on CoNLL2003 using optuna (all BERT results are better than all the RoBERTa results):
[image: sorted AllenNLP grid-search results]
Sorted results for the huggingface transformers grid search on CoNLL2003 (all RoBERTa results are better than all the BERT results):
[image: sorted huggingface transformers grid-search results]

I originally opened this as a question on stackoverflow, as suggested in the issue guidelines (additional details are already provided there), but I have not been able to track down the problem myself. I have run several unit tests from AllenNLP covering the tokenizers and embedders and couldn't spot anything wrong, but I'm betting something is definitely wrong in the training process, since the results are so much worse for non-BERT models.

Although I'm reporting details against the current release version, I'd like to point out that I had already run this CoNLL 2003 benchmark with RoBERTa/AllenNLP a long time ago, so this is not something new. Back then the RoBERTa results were also well below bert-base, but I simply assumed RoBERTa wasn't competitive for NER (which is not true at all).

The expected behavior is that results obtained with AllenNLP are at least as good as those obtained with huggingface's framework.

Related issues or possible duplicates

Environment

OS: Linux

Python version: 3.8.13

Output of pip freeze:

aiohttp==3.8.1
aiosignal==1.2.0
alembic==1.8.1
allennlp==2.10.0
allennlp-models==2.10.0
allennlp-optuna==0.1.7
asttokens==2.0.8
async-timeout==4.0.2
attrs==21.2.0
autopage==0.5.1
backcall==0.2.0
base58==2.1.1
blis==0.7.8
bokeh==2.4.3
boto3==1.24.67
botocore==1.27.67
cached-path==1.1.5
cachetools==5.2.0
catalogue==2.0.8
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.1
click==8.1.3
cliff==4.0.0
cloudpickle==2.2.0
cmaes==0.8.2
cmd2==2.4.2
colorama==0.4.5
colorlog==6.7.0
commonmark==0.9.1
conllu==4.4.2
converters-datalawyer==0.1.10
cvxopt==1.2.7
cvxpy==1.2.1
cycler==0.11.0
cymem==2.0.6
Cython==0.29.32
datasets==2.4.0
debugpy==1.6.3
decorator==5.1.1
deprecation==2.1.0
dill==0.3.5.1
dkpro-cassis==0.7.2
docker-pycreds==0.4.0
ecos==2.0.10
elasticsearch==7.13.0
emoji==2.0.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl
entrypoints==0.4
executing==1.0.0
fairscale==0.4.6
filelock==3.7.1
fire==0.4.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
ftfy==6.1.1
future==0.18.2
gensim==4.2.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.2
google-auth==2.11.0
google-cloud-core==2.3.2
google-cloud-storage==2.5.0
google-crc32c==1.5.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.4
greenlet==1.1.3
h5py==3.7.0
hdbscan==0.8.28
huggingface-hub==0.8.1
hyperopt==0.2.7
idna==3.3
importlib-metadata==4.12.0
importlib-resources==5.4.0
inceptalytics==0.1.0
iniconfig==1.1.1
ipykernel==6.15.2
ipython==8.5.0
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.0
jsonnet==0.18.0
jupyter-core==4.11.1
jupyter_client==7.3.5
kiwisolver==1.4.4
krippendorff==0.5.1
langcodes==3.3.0
llvmlite==0.39.1
lmdb==1.3.0
lxml==4.9.1
Mako==1.2.2
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
more-itertools==8.12.0
multidict==6.0.2
multiprocess==0.70.13
murmurhash==1.0.8
nest-asyncio==1.5.5
networkx==2.8.6
nltk==3.7
numba==0.56.2
numpy==1.23.3
optuna==2.10.1
osqp==0.6.2.post5
overrides==6.2.0
packaging==21.3
pandas==1.4.4
parso==0.8.3
pathtools==0.1.2
pathy==0.6.2
pbr==5.10.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pluggy==1.0.0
preshed==3.0.7
prettytable==3.4.1
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.20.0
psutil==5.9.2
pt-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.3.0/pt_core_news_sm-3.3.0-py3-none-any.whl
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-rouge==1.1
py4j==0.10.9.7
pyannote.core==4.5
pyannote.database==4.1.3
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycaprio==0.2.1
pydantic==1.8.2
pygamma-agreement==0.5.6
Pygments==2.13.0
pympi-ling==1.70.2
pyparsing==3.0.9
pyperclip==1.8.2
pytest==7.1.3
python-dateutil==2.8.2
pytz==2022.2.1
PyYAML==6.0
pyzmq==23.2.1
qdldl==0.1.5.post2
regex==2022.8.17
requests==2.28.1
requests-toolbelt==0.9.1
responses==0.18.0
rich==12.1.0
rsa==4.9
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.1.2
scipy==1.9.1
scs==3.2.0
seaborn==0.12.0
sentence-transformers==2.2.2
sentencepiece==0.1.97
sentry-sdk==1.9.8
seqeval==1.2.2
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.9
simplejson==3.17.6
six==1.16.0
sklearn==0.0
smart-open==5.2.1
smmap==5.0.0
sortedcontainers==2.4.0
spacy==3.3.1
spacy-legacy==3.0.10
spacy-loggers==1.0.3
split-datalawyer==0.1.80
SQLAlchemy==1.4.41
srsly==2.4.4
stack-data==0.5.0
stanza==1.4.0
stevedore==4.0.0
tensorboardX==2.5.1
termcolor==1.1.0
TextGrid==1.5
thinc==8.0.17
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
toposort==1.7
torch==1.13.0.dev20220911+cu117
torchvision==0.14.0.dev20220911+cu117
tornado==6.2
tqdm==4.64.1
traitlets==5.3.0
transformers==4.21.3
typer==0.4.2
typing_extensions==4.3.0
umap==0.1.1
Unidecode==1.3.4
urllib3==1.26.12
wandb==0.12.21
wasabi==0.10.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.8.1
zipp==3.8.1

Steps to reproduce

I'm attaching some parameters I used for running the CoNLL 2003 grid search.

Example source:

export BATCH_SIZE=8
export EPOCHS=10
export gradient_accumulation_steps=4
export dropout=0.2
export weight_decay=0
export seed=42

allennlp tune \
    optuna_conll2003.jsonnet \
    optuna-grid-search-conll2003-hparams.json \
    --optuna-param-path optuna-grid-search-conll2003.json \
    --serialization-dir /models/conll2003/benchmark_allennlp \
    --study-name benchmark-allennlp-models-conll2003 \
    --metrics test_f1-measure-overall \
    --direction maximize \
    --skip-if-exists \
    --n-trials $1

optuna_conll2003.jsonnet
optuna-grid-search-conll2003.json
optuna-grid-search-conll2003-hparams.json

@pvcastro pvcastro added the bug label Sep 13, 2022
@epwalsh
Member

epwalsh commented Sep 23, 2022

Hey @pvcastro, a couple questions:

  1. In all experiments (BERT-AllenNLP, RoBERTa-AllenNLP, BERT-transformers, RoBERTa-transformers) were you using the same optimizer?
  2. When you used transformers directly (for BERT-transformers and RoBERTa-transformers) was that a CRF model as well, or was that just using the (Ro|B)ertaForSequenceClassification models?

@pvcastro
Contributor Author

Hi @epwalsh , thanks for the feedback!

  1. Yes, I was using the huggingface_adamw optimizer.
  2. No, it wasn't an adaptation with CRF; I used the stock run_ner script from the HF examples. But I believe the CRF layer would only improve results, as it usually does with BERT models.

@epwalsh
Member

epwalsh commented Sep 23, 2022

Gotcha! Oh yes, I meant BertForTokenClassification, not BertForSequenceClassification 🤦

So I think the most likely source for a bug would be in the PretrainedTransformerMismatched(Embedder|TokenIndexer). And any differences between BERT and RoBERTa would probably have to do with tokenization. See, for example:

def _estimate_character_indices(
    self, text: str, token_ids: List[int]
) -> List[Optional[Tuple[int, int]]]:
    """
    The huggingface tokenizers produce tokens that may or may not be slices from the
    original text. Differences arise from lowercasing, Unicode normalization, and other
    kinds of normalization, as well as special characters that are included to denote
    various situations, such as "##" in BERT for word pieces from the middle of a word, or
    "Ġ" in RoBERTa for the beginning of words not at the start of a sentence.
    This code attempts to calculate character offsets while being tolerant to these
    differences. It scans through the text and the tokens in parallel, trying to match up
    positions in both. If it gets out of sync, it backs off to not adding any token
    indices, and attempts to catch back up afterwards. This procedure is approximate.
    Don't rely on precise results, especially in non-English languages that are far more
    affected by Unicode normalization.
    """

@pvcastro
Contributor Author

I was assuming that running some unit tests from the AllenNLP repository, to confirm that these embedders/tokenizers produce tokens with the special tokens expected by the RoBERTa architecture, would be enough to rule them out. I ran some tests using RoBERTa and confirmed that it's not relying on CLS. Was this too superficial to reach any conclusions?

@epwalsh
Member

epwalsh commented Sep 23, 2022

I'm not sure. I mean, I thought we did have pretty good test coverage there, but I know for a fact that's one of the most brittle pieces of code in the whole library. It would break all of the time with new releases of transformers. So that's my best guess.

@pvcastro
Contributor Author

Do you think it makes sense for me to run additional tests for the embedder, comparing the embeddings produced by a raw RobertaModel with those produced by the actual PretrainedTransformerMismatchedEmbedder? To see whether they are somehow getting "corrupted" inside the framework.
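
Such a comparison might look roughly like the sketch below (assumptions: the AllenNLP 2.10 API, roberta-base as a stand-in for the Portuguese models, and an arbitrary example sentence). The mismatched embedder averages each word's wordpiece vectors, so the two outputs should be reconcilable if tokenization and offsets line up:

import torch
from transformers import AutoModel, AutoTokenizer
from allennlp.data import Batch, Instance, Token, Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder

model_name = "roberta-base"  # stand-in; swap in the Portuguese models
words = ["Maria", "lives", "in", "Lisbon", "."]

# Word-level vectors as the CrfTagger would see them.
indexer = PretrainedTransformerMismatchedIndexer(model_name=model_name)
field = TextField([Token(w) for w in words], {"tokens": indexer})
instance = Instance({"tokens": field})
instance.index_fields(Vocabulary())
tensors = Batch([instance]).as_tensor_dict()["tokens"]["tokens"]
embedder = PretrainedTransformerMismatchedEmbedder(model_name=model_name)
with torch.no_grad():
    allennlp_vectors = embedder(**tensors)  # shape: (1, num_words, hidden)

# Wordpiece-level vectors straight from the raw HF model.
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModel.from_pretrained(model_name)
encoding = hf_tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hf_wordpieces = hf_model(**encoding).last_hidden_state  # (1, num_wordpieces, hidden)

# The mismatched embedder averages each word's wordpiece vectors, so averaging
# hf_wordpieces over each word's pieces should reproduce allennlp_vectors
# (up to numerical noise) if nothing is getting corrupted along the way.
print(allennlp_vectors.shape, hf_wordpieces.shape)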

@epwalsh
Member

epwalsh commented Sep 23, 2022

I guess I would start by looking very closely at the exact tokens that are being used for each word by the PretrainedTransformerMismatchedEmbedder. Maybe pick out a couple of test instances where the performance gap between the BERT and RoBERTa predictions is largest.
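
A rough sketch of that kind of inspection (assuming the AllenNLP 2.10 API; the sentence and model name are arbitrary examples), which also shows which special tokens the indexer adds:

from transformers import AutoTokenizer
from allennlp.data import Token, Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

model_name = "roberta-base"
words = ["EU", "rejects", "German", "call", "."]

indexer = PretrainedTransformerMismatchedIndexer(model_name=model_name)
indexed = indexer.tokens_to_indices([Token(w) for w in words], Vocabulary())

# Decode the full wordpiece sequence, including the special tokens the indexer
# adds (<s>/</s> for RoBERTa, [CLS]/[SEP] for BERT).
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
wordpieces = hf_tokenizer.convert_ids_to_tokens(indexed["token_ids"])
print(wordpieces)

# "offsets" holds an inclusive (start, end) wordpiece span for each original word.
for word, (start, end) in zip(words, indexed["offsets"]):
    print(f"{word!r:12} -> {wordpieces[start:end + 1]}")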

@pvcastro
Contributor Author

Ok, thanks!
I'll try testing something like this and will report back.

@github-actions

github-actions bot commented Oct 3, 2022

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@github-actions github-actions bot added the stale label Oct 3, 2022
@github-actions github-actions bot closed this as completed Oct 3, 2022
@epwalsh epwalsh added question and removed stale labels Oct 3, 2022
@epwalsh epwalsh reopened this Oct 3, 2022
@github-actions

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@pvcastro
Contributor Author

Sorry, I'll try to get back to this next week, haven't had the time yet 😞

@epwalsh
Member

epwalsh commented Oct 12, 2022

No rush. I thought adding the "question" label would stop the @github-actions bot from closing this, but I guess not.
