
Usage classification for Information Extraction

This module is the second stage of our classification pipeline. For details on the overall approach, see the classification pipeline module.

For our initial assessment, we evaluated four usage classification models:

The baseline model vectorizes each sentence with TF-IDF and classifies it with a Random Forest.
The second model converts an input sentence into a 768-dimensional sentence embedding using a pre-trained SciBERT model [1]. A Random Forest then predicts the usage label from this embedding.
The third model fine-tunes SciBERT end to end for the usage classification task and predicts the usage label directly from the fine-tuned model.
The fourth model combines a pre-trained SciBERT model with a Convolutional Neural Network [2]. First, a 768-dimensional embedding is generated for each token, for up to 512 tokens per sentence. This sequence of token embeddings is then fed into a CNN, which performs the usage classification.
[1] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text.
[2] Kim, Y. (2014). Convolutional neural networks for sentence classification.
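
As a rough illustration of the baseline, the following sketch chains TF-IDF vectorization with a Random Forest using scikit-learn. It is not the code from this module: the function name `build_baseline`, the toy sentences, the label convention (1 = usage, 0 = no usage), and the `n_estimators` and `seed` values are assumptions made for illustration. Only the depth limit of 3 is taken from the results reported in this README.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def build_baseline(max_depth=3, seed=0):
    """Return a TF-IDF -> Random Forest pipeline (hypothetical helper)."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True),
        RandomForestClassifier(
            n_estimators=100,      # assumed value, not from the README
            max_depth=max_depth,   # 3 or 5 in the reported baselines
            random_state=seed,     # fixed only to make the sketch repeatable
        ),
    )


# Hypothetical toy data: 1 = the sentence reports usage, 0 = it does not.
train_sentences = [
    "We train our model on the benchmark corpus.",
    "We use gradient boosting to rank the candidates.",
    "The corpus was first released several years ago.",
    "Gradient boosting is a popular ensemble technique.",
]
train_labels = [1, 1, 0, 0]

baseline = build_baseline()
baseline.fit(train_sentences, train_labels)
predictions = baseline.predict(["We evaluate on the benchmark corpus."])
```

With only four training sentences the prediction itself is meaningless; the point is the shape of the pipeline, which accepts raw sentences and returns one label per sentence.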
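
The CNN stage of the fourth model can be sketched in PyTorch in the style of Kim (2014) [2]: parallel 1-D convolutions over the token embeddings, max-over-time pooling, and a linear classifier. The class name `KimCNNHead` and the hyperparameters (100 filters per kernel size, kernel sizes 3, 4 and 5, dropout 0.5) follow the settings reported by Kim and are assumptions here, not values read from this module. Random tensors stand in for the SciBERT token embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KimCNNHead(nn.Module):
    """Kim-style CNN classifier over precomputed token embeddings (sketch)."""

    def __init__(self, embed_dim=768, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=2, dropout=0.5):
        super().__init__()
        # One convolution per kernel size, each sliding along the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_embeddings):
        # Input: (batch, seq_len, embed_dim); Conv1d expects channels first.
        x = token_embeddings.transpose(1, 2)
        # Max-over-time pooling keeps the strongest response of each filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))


# Stand-in for SciBERT output: 2 sentences, 32 tokens, 768 dimensions each.
# seq_len must be at least the largest kernel size (5 here).
model = KimCNNHead()
model.eval()
fake_embeddings = torch.randn(2, 32, 768)
with torch.no_grad():
    logits = model(fake_embeddings)  # one score per class and sentence
```

In a full pipeline the random tensor would be replaced by the last hidden states of the pre-trained SciBERT encoder, padded or truncated to at most 512 tokens.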

Model (method task)	recall	precision	acc	f1	acc_and_f1
Baseline: Random Forest (Max Depth=3)	0.83	0.56	0.60	0.67	0.64
SciBERT + Random Forest	0.71	0.77	0.73	0.74	0.73
SciBERT (fine-tuned) Sequence-Classification	0.92	0.73	0.80	0.81	0.80
SciBERT + KimCNN	0.79	0.76	0.78	0.77	0.77

Model (method task)	recall	precision	acc	f1	acc_and_f1
Baseline: Random Forest (Max Depth=5)	0.76	0.69	0.71	0.72	0.72
SciBERT + Random Forest	0.74	0.76	0.75	0.75	0.75
SciBERT Sequence-Classification	0.84	0.76	0.79	0.80	0.79
SciBERT + KimCNN	0.91	0.75	0.81	0.83	0.82

Model (dataset task)	recall	precision	acc	f1	acc_and_f1
Baseline: Random Forest (Max Depth=3)	0.83	0.56	0.60	0.67	0.64
SciBERT + Random Forest	0.81	0.71	0.77	0.76	0.77
SciBERT (fine-tuned) Sequence-Classification	0.89	0.76	0.83	0.82	0.83
SciBERT + KimCNN	0.95	0.52	0.59	0.67	0.63

Model (dataset task)	recall	precision	acc	f1	acc_and_f1
Baseline: Random Forest (Max Depth=5)	0.76	0.69	0.71	0.72	0.72
SciBERT + Random Forest	0.84	0.73	0.79	0.78	0.79
SciBERT Sequence-Classification	0.96	0.70	0.80	0.81	0.80
SciBERT + KimCNN	0.92	0.54	0.59	0.67	0.63
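
The reported columns can be reproduced from binary predictions as sketched below. Reading `acc_and_f1` as the arithmetic mean of accuracy and F1 is an assumption: it is consistent with the rows above (for example, 0.81 and 0.83 give 0.82) and with the helper of the same name in Hugging Face Transformers, but the module's own evaluation code is not shown here. The function name `binary_metrics` and the label convention (1 = usage, 0 = no usage) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    recall: float
    precision: float
    acc: float
    f1: float
    acc_and_f1: float


def binary_metrics(labels, preds):
    """Compute the five table columns for 0/1 labels and predictions."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumed definition: plain average of accuracy and F1.
    return BinaryMetrics(recall, precision, acc, f1, (acc + f1) / 2)


# Hypothetical example: 3 true positives, 1 false negative,
# 1 false positive, 3 true negatives.
example = binary_metrics(
    labels=[1, 1, 1, 0, 0, 0, 1, 0],
    preds=[1, 1, 0, 1, 0, 0, 1, 0],
)
```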

See requirements.txt for Python package requirements.

To train or validate the models, use the usage-classificator.sh script. For example:

    ./usage-classificator.sh train method bert False