PyTorch BERT Document Classification

Implementation and pre-trained models for the paper Enriching BERT with Knowledge Graph Embedding for Document Classification (PDF), a submission to the GermEval 2019 shared task on hierarchical text classification. If you encounter any problems, feel free to contact us or submit a GitHub issue.

Content

CLI script to run all experiments
WikiData author embeddings (view on Tensorboard Projector)
Data preparation
Requirements
Trained model weights as release files

Model architecture

Installation

Requirements:

Python 3.6
CUDA GPU
Jupyter Notebook

Install dependencies:

pip install -r requirements.txt

Prepare data

GermEval data

Download from shared-task website: here
Run all steps in Jupyter Notebook: germeval-data.ipynb

Author Embeddings

python wikidata_for_authors.py run ~/datasets/wikidata/index_enwiki-20190420.db \
    ~/datasets/wikidata/index_dewiki-20190420.db \
    ~/datasets/wikidata/torchbiggraph/wikidata_translation_v1.tsv.gz \
    ~/notebooks/bert-text-classification/authors.pickle \
    ~/notebooks/bert-text-classification/author2embedding.pickle

# OPTIONAL: Projector format
python wikidata_for_authors.py convert_for_projector \
    ~/notebooks/bert-text-classification/author2embedding.pickle \
    extras/author2embedding.projector.tsv \
    extras/author2embedding.projector_meta.tsv
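
Before training, it can help to sanity-check the generated embedding file. The following is a minimal sketch, assuming author2embedding.pickle holds a dict that maps author names to fixed-length numeric vectors. That layout is an assumption and is not confirmed by this README; summarize_embeddings is a hypothetical helper, and the demo uses toy data rather than real WikiData embeddings.

```python
import pickle
import tempfile
from pathlib import Path


def summarize_embeddings(path):
    """Return (number of authors, set of vector lengths) for a pickled mapping.

    Assumption: the pickle contains a dict {author_name: sequence of floats}.
    """
    with open(path, "rb") as f:
        author2embedding = pickle.load(f)
    dims = {len(vec) for vec in author2embedding.values()}
    return len(author2embedding), dims


# Self-contained demo on toy data (placeholder authors and vectors).
toy = {"Author A": [0.1, 0.2, 0.3], "Author B": [0.4, 0.5, 0.6]}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "author2embedding.pickle"
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    n_authors, dims = summarize_embeddings(toy_path)

print(n_authors, dims)
```

If the set of vector lengths contains more than one value, the file mixes embeddings of different sizes and is probably worth regenerating.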

Reproduce paper results

Download pre-trained models: GitHub releases

Available experiment settings

Detailed settings for each experiment can be found in cli.py.

task-a__bert-german_full
task-a__bert-german_manual_no-embedding
task-a__bert-german_no-manual_embedding
task-a__bert-german_text-only
task-a__author-only
task-a__bert-multilingual_text-only

task-b__bert-german_full
task-b__bert-german_manual_no-embedding
task-b__bert-german_no-manual_embedding
task-b__bert-german_text-only
task-b__author-only
task-b__bert-multilingual_text-only
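
The names above appear to follow a `<task>__<setting>` pattern. This is inferred from the list itself, not from cli.py. Under that assumption, a short sketch for grouping the settings by task, for example to loop over all experiments of one task, could look like this (split_name is a hypothetical helper):

```python
EXPERIMENTS = [
    "task-a__bert-german_full",
    "task-a__bert-german_manual_no-embedding",
    "task-a__bert-german_no-manual_embedding",
    "task-a__bert-german_text-only",
    "task-a__author-only",
    "task-a__bert-multilingual_text-only",
    "task-b__bert-german_full",
    "task-b__bert-german_manual_no-embedding",
    "task-b__bert-german_no-manual_embedding",
    "task-b__bert-german_text-only",
    "task-b__author-only",
    "task-b__bert-multilingual_text-only",
]


def split_name(name):
    """Split an experiment name at the first double underscore.

    Assumption: everything before "__" is the task, the rest is the setting.
    """
    task, setting = name.split("__", 1)
    return task, setting


# Group the setting names by task, preserving list order.
by_task = {}
for name in EXPERIMENTS:
    task, setting = split_name(name)
    by_task.setdefault(task, []).append(setting)

for task, settings in by_task.items():
    print(task, settings)
```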

Environment variables

TRAIN_DF_PATH: Path to Pandas Dataframe (pickle)
GPU_ID: Run experiments on this GPU (used for CUDA_VISIBLE_DEVICES)
OUTPUT_DIR: Directory to store experiment output
EXTRAS_DIR: Directory where author embeddings and gender data are located
BERT_MODELS_DIR: Directory where pre-trained BERT models are located
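
A long experiment that fails late because a variable was never set is easy to avoid with a quick check up front. The sketch below is not part of the repository. It only tests whether the variables listed above are set and non-empty; missing_env_vars is a hypothetical helper, and the paths in the demo are placeholders.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "TRAIN_DF_PATH",
    "GPU_ID",
    "OUTPUT_DIR",
    "EXTRAS_DIR",
    "BERT_MODELS_DIR",
]


def missing_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Demo with a hypothetical, incomplete environment (placeholder paths).
example_env = {
    "TRAIN_DF_PATH": "/data/train_df.pickle",
    "GPU_ID": "0",
    "OUTPUT_DIR": "/data/output",
}
print(missing_env_vars(example_env))
```

Calling missing_env_vars() without arguments checks the real process environment instead of the demo dict.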

Validation set

python cli.py run_on_val <name> $GPU_ID $EXTRAS_DIR $TRAIN_DF_PATH $VAL_DF_PATH $OUTPUT_DIR --epochs 5

Test set

python cli.py run_on_test <name> $GPU_ID $EXTRAS_DIR $FULL_DF_PATH $TEST_DF_PATH $OUTPUT_DIR --epochs 5
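
When running several experiments in a row, it can be convenient to assemble these commands programmatically. The following sketch mirrors the positional argument order of the two commands above. build_cli_command is a hypothetical helper, the file names in the demo are placeholders, and the code only builds and prints the command rather than executing it.

```python
import shlex


def build_cli_command(mode, name, gpu_id, extras_dir, first_df, second_df,
                      output_dir, epochs=5):
    """Build the argument list for cli.py in the order shown above.

    first_df / second_df are the train and validation DataFrames for
    run_on_val, or the full and test DataFrames for run_on_test.
    """
    if mode not in ("run_on_val", "run_on_test"):
        raise ValueError("mode must be 'run_on_val' or 'run_on_test'")
    return [
        "python", "cli.py", mode, name, str(gpu_id), extras_dir,
        first_df, second_df, output_dir, "--epochs", str(epochs),
    ]


# Demo with placeholder paths; nothing is executed here.
cmd = build_cli_command(
    "run_on_val", "task-a__bert-german_full", 0,
    "extras", "train_df.pickle", "val_df.pickle", "output",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the experiment.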

Evaluation

The scores from the result table can be reproduced with the evaluation.ipynb notebook.

How to cite

If you are using our code, please cite our paper:

@inproceedings{Ostendorff2019,
    address = {Erlangen, Germany},
    author = {Ostendorff, Malte and Bourgonje, Peter and Berger, Maria and Moreno-Schneider, Julian and Rehm, Georg},
    booktitle = {Proceedings of the GermEval 2019 Workshop},
    title = {{Enriching BERT with Knowledge Graph Embedding for Document Classification}},
    year = {2019}
}

License

MIT
