KG-LM-Integration

This is the implementation of Knowledge InteGrated BERT (KIG-BERT), proposed by Aisha and Xiangru for the CS 848 (Knowledge Graph) course project.

Paper

Our paper, with training and evaluation details, is available at link-to-paper.

Abstract: Recent developments in large language modeling have greatly improved the performance of NLP applications. Yet these models remain largely dependent on their training data and are thus prone to being factually inaccurate and socially biased. They are hard to correct after the fact because their large size requires high compute and large amounts of supervised training data. This paper proposes a minimal-compute, no-pretrain framework for improving language model factual accuracy by incorporating knowledge graph information. Unlike human-written text, facts in knowledge graphs like Wikidata are accurate and free from bias. Comparison with baselines shows that our methods have promise in making language models factually accurate while retaining language understanding. We also build a facts dataset from template sentences and Wikidata entities to further evaluate the proposed system.

Datasets

Linked Wikitext-2: A dataset that connects spans of text to Wikidata entities.
Facts Dataset: A dataset consisting of fact-sentences generated using templates and Wikidata entities collected with SPARQL queries.
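
The template-based generation described for the Facts Dataset can be illustrated with a minimal sketch. This is not the code in this repository: the template text, the `generate_fact_sentences` helper, and the hard-coded triples are all hypothetical, and the triples stand in for rows that a SPARQL query against Wikidata might return. The only Wikidata detail relied on is that property P36 means "capital".

```python
import json

# Hypothetical templates keyed by Wikidata property ID (P36 = "capital").
TEMPLATES = {
    "P36": "The capital of {subject} is {object}.",
}

# Hand-written stand-ins for (subject label, property, object label) rows
# that a SPARQL query might return; no network call is made here.
TRIPLES = [
    ("France", "P36", "Paris"),
    ("Japan", "P36", "Tokyo"),
    ("Canada", "P36", "Ottawa"),
]


def generate_fact_sentences(triples, templates):
    """Fill each triple into the template registered for its property."""
    records = []
    for subject, prop, obj in triples:
        template = templates.get(prop)
        if template is None:
            # Skip properties that have no template.
            continue
        records.append(
            {
                "sentence": template.format(subject=subject, object=obj),
                "subject": subject,
                "property": prop,
                "object": obj,
            }
        )
    return records


facts = generate_fact_sentences(TRIPLES, TEMPLATES)
for record in facts:
    # One JSON object per line, i.e. JSON Lines.
    print(json.dumps(record))
```

Keeping templates in a dictionary keyed by property means a new relation only needs a new template entry, not new code.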

Usage

All experimental results can be reproduced with the Jupyter notebook KIG-Bert.ipynb. Detailed documentation and instructions are in the notebook.

Requirements

A GPU is needed for training and evaluation.
The Git Large File Storage package is needed. Installation instructions are in installing-git-large-file-storage.
Python version: 3.10
Install the pip packages with pip3 install -r requirements.txt
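
Before opening the notebook, the requirements above can be checked with a short script. This is a sketch under stated assumptions, not part of the repository: the `check_environment` helper is hypothetical, it assumes Git LFS is exposed as a `git-lfs` executable on the PATH, and it assumes the GPU is used through PyTorch, which the list above does not state.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict of booleans, one per requirement listed in the README."""
    report = {
        # The README asks for Python 3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Assumption: Git LFS is installed as a `git-lfs` executable on PATH.
        "git_lfs": shutil.which("git-lfs") is not None,
        "gpu": False,
    }
    # Assumption: the GPU is accessed through PyTorch.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["gpu"] = bool(torch.cuda.is_available())
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" entry only means the check failed on this machine; the notebook remains the authoritative source for setup details.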
