Modelling Commonsense Properties using Pre-Trained Bi-Encoders
The BiEncoder model for concept property classification consists of two separate encoders, each based on a pre-trained Language Model (LM). The concept encoder is trained on the prompt "concept means [MASK]" and the property encoder on the prompt "property means [MASK]". The vector encoding of the [MASK] token is taken as the representation of the concept or the property. The dot product of the [MASK] embeddings of the concept and the property is passed through a sigmoid activation to get the model prediction.
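As a rough illustration of the scoring step described above, the sketch below takes two already-computed [MASK] vectors and applies a dot product followed by a sigmoid. The function name score_pair and the toy vectors are assumptions made for illustration; they are not taken from the repository code.

```python
import math


def score_pair(concept_vec, property_vec):
    """Return sigmoid(dot(concept_vec, property_vec)) for two equal-length vectors."""
    if len(concept_vec) != len(property_vec):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(c * p for c, p in zip(concept_vec, property_vec))
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if dot >= 0:
        return 1.0 / (1.0 + math.exp(-dot))
    exp_dot = math.exp(dot)
    return exp_dot / (1.0 + exp_dot)


# Toy 3-dimensional vectors standing in for real [MASK] embeddings.
aligned = score_pair([1.0, 2.0, 0.5], [0.5, 1.0, 1.0])
opposed = score_pair([1.0, 2.0, 0.5], [-0.5, -1.0, -1.0])
print(aligned > 0.5, opposed < 0.5)
```

A score above 0.5 would be read as "the property applies to the concept" under the usual sigmoid threshold; the threshold actually used by the model is not stated here.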
The BiEncoder model can generate embeddings for concepts and properties. Run the following scripts to download our pretrained models and generate the embeddings.
sh download_models.sh
# For generating embeddings from BERT base model
python3 get_embedding.py --config_file configs/generate_embeddings/get_concept_property_embeddings.json
# For generating embeddings from BERT large model
python3 get_embedding.py --config_file configs/generate_embeddings/get_concept_property_embeddings_bert_large.json
The download_models.sh script downloads the BERT-base-uncased and BERT-large-uncased models pretrained on ConceptNet data in Generics KB and the has_property relation data in ConceptNet.
The default configuration for generating the concept/property embeddings from the BERT base model is in the configuration file configs/generate_embeddings/get_concept_property_embeddings.json.
For our BERT large model, the default configuration is in the configuration file configs/generate_embeddings/get_concept_property_embeddings_bert_large.json.
- To use the bert-large-uncased model trained on ChatGPT data, specify the pretrained model name as bienc_bert_large_chatgpt100k_pretrain.pt
- To use the bert-large-uncased model trained on ConceptNet premium and ChatGPT data, specify the pretrained model name as entropy_cnetp_chatgpt100k_bert_large_uncased.pt
By default, the above script generates concept embeddings from the downloaded model, because the input_data_type field is set to concept in the configuration file. The concepts are taken from the input file data/generate_embeddding_data/dummy_concepts.txt. The embeddings are saved in the trained_models/embeddings path as a pickled dictionary with the concepts as keys and their embeddings as values.
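Since the output is described as a pickled dictionary, it can presumably be read back with the standard pickle module. The sketch below is a minimal example under that assumption; load_embeddings and the demo file name are hypothetical, and the toy lists stand in for whatever array type the script actually stores.

```python
import os
import pickle
import tempfile


def load_embeddings(pickle_path):
    """Load a pickled {name: embedding} dictionary from disk."""
    with open(pickle_path, "rb") as handle:
        embeddings = pickle.load(handle)
    if not isinstance(embeddings, dict):
        raise TypeError("expected a dictionary of embeddings")
    return embeddings


# Demo with a tiny stand-in file; a real file would be read from
# the trained_models/embeddings directory instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_concept_embeddings.pkl")  # hypothetical name
    with open(demo_path, "wb") as handle:
        pickle.dump({"banana": [0.1, 0.2], "car": [0.3, 0.4]}, handle)
    demo = load_embeddings(demo_path)

print(sorted(demo))
```

Unpickling executes code embedded in the file, so only load embedding files that come from a trusted source.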
To generate the embeddings of your own data, the fields of the configuration file are explained below:
- dataset_name - Name that will be used to save the embedding pickle file in the directory specified in the save_dir field.
- hf_checkpoint_name and hf_tokenizer_name - The Hugging Face pretrained model ID and tokenizer name. For example, bert-base-uncased.
- context_num - Context ID used in pretraining the models. To get the correct embeddings, keep it set to 6.
- pretrained_model_path - Path of the pretrained model. It is trained_models/bb_gkb_cnet_plus_cnet_has_property.pt.
- get_con_prop_embeds - Flag set to true to get concept or property embeddings.
- input_file_name - Path of the input concept, property, or concept and property file.
- input_data_type - Type of the embeddings to generate:
  - concept: for concept embeddings. The input file must contain one concept per line.
  - property: for property embeddings. The input file must contain one property per line.
  - concept_and_property: for the concept and property embeddings. The input file must contain one concept and its associated property per line, separated by a tab.
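The fields above can be put together programmatically. The sketch below writes a tab-separated concept_and_property input file and a matching configuration. It assumes a flat JSON layout with exactly the field names listed above; the real configuration files in configs/generate_embeddings may nest or name things differently, so treat the shipped files as the reference. The helper name build_embedding_config and all demo file names are hypothetical.

```python
import json
import os
import tempfile


def build_embedding_config(input_file, data_type, save_dir, dataset_name):
    """Assemble the documented fields into a dictionary (flat layout is an assumption)."""
    allowed = {"concept", "property", "concept_and_property"}
    if data_type not in allowed:
        raise ValueError(f"input_data_type must be one of {sorted(allowed)}")
    return {
        "dataset_name": dataset_name,
        "save_dir": save_dir,
        "hf_checkpoint_name": "bert-base-uncased",
        "hf_tokenizer_name": "bert-base-uncased",
        "context_num": 6,  # value the text says to keep for correct embeddings
        "pretrained_model_path": "trained_models/bb_gkb_cnet_plus_cnet_has_property.pt",
        "get_con_prop_embeds": True,
        "input_file_name": input_file,
        "input_data_type": data_type,
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    # One concept and its property per line, separated by a tab.
    input_path = os.path.join(tmp_dir, "my_pairs.txt")
    with open(input_path, "w", encoding="utf-8") as handle:
        for concept, prop in [("banana", "yellow"), ("car", "has wheels")]:
            handle.write(f"{concept}\t{prop}\n")

    config = build_embedding_config(input_path, "concept_and_property", tmp_dir, "my_pairs")
    config_path = os.path.join(tmp_dir, "my_config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    with open(config_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(reloaded["input_data_type"])
```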
We also train the BiEncoder model with a contrastive loss, both on its own and jointly with a cross-entropy loss. These models are based on BERT-large and BERT-base. To download the models, run the download_models.sh script.
To generate the embeddings from the BERT-large models, use the configuration file configs/generate_embeddings/get_concept_property_embeddings_contrastive_bert_large.json. To get the embeddings, change the pretrained_model_path in the configuration file to one of the following:
- infonce_cnetp_chatgpt100k_conceptfix_bert_large_uncased.pt - Concept-centric contrastive model.
- infonce_cnetp_chatgpt100k_propertyfix_bert_large_uncased.pt - Property-centric contrastive model.
To generate the embeddings from the BERT-base models, use the configuration file configs/generate_embeddings/get_concept_property_embeddings_contrastive_bert_base.json. To get the embeddings, change the pretrained_model_path in the configuration file to one of the following:
- entropy_infonce_joint_loss_cnetp_pretrain_bb_bienc_bert_base_uncased.pt: Model jointly trained with the contrastive and cross-entropy losses.
- contastive_bienc_cnetp_pretrain_bert_base_uncased.pt: Contrastive model in which a concept is closer in the embedding space to its positive properties than to its negative properties.
- prop_fix_bienc_infonce_bert_base_cnetp_pretrain.pt: Contrastive model in which a property is closer to the concepts it applies to than to the concepts it does not apply to.
- conprop_fix_infonce_cnetp_pretrain_bb_bienc_bert_base_uncased.pt: Contrastive model jointly trained with the above two criteria.
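The checkpoint names contain "infonce", which suggests an InfoNCE-style contrastive objective. The sketch below shows the generic textbook form of that loss for a single anchor, not the repository's training code; the function name, the temperature value, and the toy vectors are all assumptions.

```python
import math


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: negative log-softmax of the positive's score among all candidates."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive candidate is placed first, followed by the negatives.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


concept = [1.0, 0.0]
# Positive close to the anchor, negatives far away: small loss.
good = info_nce_loss(concept, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
# Positive far from the anchor, one negative close to it: large loss.
bad = info_nce_loss(concept, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
print(good < bad)
```

Under this reading, a concept-centric model would use the concept as the anchor and properties as candidates, and a property-centric model would swap the two roles.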
The BiEncoder model is first trained on different types and amounts (100K and 500K) of data from the Microsoft Concept Graph (mscg), Generics KB Properties (gkb) and Prefix Adjectives. The data can be found in the data directory of the repo. The input file is a tsv file with one concept and one property per line, separated by a tab. The model in this configuration uses in-batch negative sampling, so the negatives are sampled from within each batch during model training.
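To make in-batch negative sampling concrete, the sketch below parses tab-separated concept/property lines and, for each pair, treats the other properties in the same batch as negatives. Excluding properties that are known positives of the same concept within the batch is an assumption of this sketch; the function names are hypothetical and the repository's sampler may behave differently.

```python
def parse_tsv_lines(lines):
    """Turn 'concept<TAB>property' lines into a list of (concept, property) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        concept, prop = line.split("\t", 1)
        pairs.append((concept.strip(), prop.strip()))
    return pairs


def in_batch_negatives(batch):
    """For each pair, return the other properties of the batch as negatives."""
    # Properties known to be positive for each concept inside this batch.
    positives = {}
    for concept, prop in batch:
        positives.setdefault(concept, set()).add(prop)
    # Unique properties of the batch, in order of first appearance.
    batch_props = []
    for _, prop in batch:
        if prop not in batch_props:
            batch_props.append(prop)
    result = []
    for concept, prop in batch:
        negatives = [p for p in batch_props if p not in positives[concept]]
        result.append((concept, prop, negatives))
    return result


batch = parse_tsv_lines(["banana\tyellow\n", "banana\tedible\n", "car\thas wheels\n"])
sampled = in_batch_negatives(batch)
for concept, prop, negatives in sampled:
    print(concept, "|", prop, "|", negatives)
```

Larger batches give each concept more negatives at no extra annotation cost, which is the usual motivation for this scheme.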
Following are the steps to train the model:
- Clone the repo and check out the neg_batch_sampling branch:
  - git clone git@github.com:amitgajbhiye/biencoder_concept_property.git
  - cd biencoder_concept_property/
- Create the logs and trained_models directories:
  - mkdir logs trained_models
- The model is trained with a configuration file that contains all the parameters for the datasets, model and training.
- The log files for the experiments are created in the logs directory. This can be changed in the set_logger function in the utils/functions.py module. The name of the log file is of the form log_experiment_name_timestamp. The experiment_name comes from the config file and the timestamp is the current timestamp.
- trained_models is the directory where the trained model will be saved. This can be changed with the export_path parameter of the config file.
- In the config file, change hf_tokenizer_path and hf_model_path to the paths of the downloaded tokenizer and pretrained language model.
- To train the model, execute the run_model.py script with the config file path as an argument.
- For example, to train the model on the 100K mscg data, run the following command:
  python run_model.py --config_file configs/sample_configs/top_100k_mscg_config.json
- The configuration files I used for the experiments are in self-descriptively named directories in the configs directory. The names of the config files and data files are also self-descriptive.
- The best-trained model is saved at the path specified in export_path with the name specified in the model_name parameter of the configuration file.
- The models trained on the different 100k and 500k datasets are saved in OneDrive at the link
The models trained above are fine-tuned on the extended McRae dataset. The processed train file is data/evaluation_data/extended_mcrae/train_mcrae.tsv and the test file is data/evaluation_data/extended_mcrae/test_mcrae.tsv.
On the McRae data, the model is fine-tuned in three splits of the whole data:
- Default - Concept Split
- Property Split
- Concept Property Split
In the Property and Concept Property split settings, the model uses cross-validation.
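The text does not spell out how the folds are built. The sketch below assumes that a property split means the properties in a test fold never occur in the corresponding training fold, and builds cross-validation folds on that basis. The function name, the fold assignment rule, and the toy pairs are hypothetical; the real McRae files may also carry extra columns such as labels.

```python
def property_split_folds(pairs, num_folds):
    """Yield (train, test) lists whose test-fold properties never occur in train."""
    properties = sorted({prop for _, prop in pairs})
    for fold in range(num_folds):
        # Every num_folds-th property, starting at this fold's offset, is held out.
        held_out = set(properties[fold::num_folds])
        train = [pair for pair in pairs if pair[1] not in held_out]
        test = [pair for pair in pairs if pair[1] in held_out]
        yield train, test


pairs = [
    ("banana", "yellow"),
    ("lemon", "yellow"),
    ("car", "has wheels"),
    ("bus", "has wheels"),
    ("banana", "edible"),
]
folds = list(property_split_folds(pairs, 3))
for train, test in folds:
    print(len(train), len(test))
```

A concept property split would, under the same assumption, additionally drop training pairs whose concept appears in the test fold.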
Following are the steps to fine-tune a pretrained model:
- To fine-tune the trained model, use the sample configuration file configs/sample_configs/cv_sample_config_file.json
- In the config file, specify the following parameters:
  - pretrained_model_path - The path of the pre-trained model that needs to be fine-tuned (taken from the 100k and 500k trained model paths specified above).
  - do_cv - true for property split cross-validation and concept property split cross-validation. do_cv is false for fine-tuning on the default concept split.
  - cv_type - one of model_evaluation_property_split and model_evaluation_concept_property_split
- To fine-tune the model, execute the fine_tune.py script with the config file path as an argument.
- For example, to fine-tune the model trained on the 100k mscg data with the Property split, run the following command:
  python3 fine_tune.py --config_file configs/sample_configs/pcv_sample_config_file.json
@inproceedings{gajbhiye2022modelling,
title = "Modelling Commonsense Properties Using Pre-Trained Bi-Encoders",
author = "Gajbhiye, Amit and
Espinosa-Anke, Luis and
Schockaert, Steven",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.349",
pages = "3971--3983"
}