Label-Wise Pre-Training (LW-PT)

This is the code for the NLPCC 2020 paper "Label-Wise Document Pre-Training for Multi-Label Text Classification".

Requirements

  • Ubuntu 16.04
  • Python >= 3.6.0
  • PyTorch >= 1.3.0

Reproducibility

  • --data and --outputs

We provide the preprocessed RMSC and AAPD datasets and pretrained checkpoints of the LW-LSTM+PT+FT and HLW-LSTM+PT+FT models to ensure reproducibility. Please download them from the link and decompress them into the root directory of this repository.

--data
    |--aapd
    	|--label_test
    	|--label_train
    	...
    |--rmsc
    	|--rmsc.data.test.json
    	|--rmsc.data.train.json
    	|--rmsc.data.valid.json
    aapd_word2vec.model
    aapd_word2vec.model.wv.vectors.npy
    aapd.meta.json
    aapd.pkl
    rmsc_word2vec.model
    rmsc_word2vec.model.wv.vectors.npy
    rmsc.meta.json
    rmsc.pkl
--outputs
    |--aapd
    |--rmsc

Note that data/aapd and data/rmsc are the original datasets. Here we provide our split of RMSC (i.e., RMSC-V2).

  • Testing on AAPD
python classification.py -config=aapd.yaml -in=aapd -gpuid [GPU_ID] -test
  • Testing on RMSC
python classification.py -config=rmsc.yaml -in=rmsc -gpuid [GPU_ID] -test

Preprocessing

If you want to preprocess a dataset yourself, run the following command with the name of the dataset (i.e., RMSC or AAPD).

PYTHONHASHSEED=1 python preprocess.py -data=[RMSC/AAPD]

Note that PYTHONHASHSEED is fixed because word2vec relies on Python's string hashing; without it the embeddings would not be reproducible across runs.
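
The sketch below illustrates why the hash seed matters. It assumes the preprocessing builds its embeddings with gensim's Word2Vec (the data/*_word2vec.model and *.wv.vectors.npy files follow gensim's on-disk format); the toy corpus and hyperparameters are purely illustrative and are not the values used by preprocess.py.

# Illustrative sketch only: gensim seeds each word's initial vector with
# Python's built-in hash(), so PYTHONHASHSEED must be fixed (together with
# a fixed seed and a single worker) to get bit-identical embeddings.
import os
from gensim.models import Word2Vec

assert os.environ.get("PYTHONHASHSEED") == "1", "run with PYTHONHASHSEED=1"

toy_corpus = [["multi", "label", "text"], ["document", "pre", "training"]]

model = Word2Vec(
    sentences=toy_corpus,
    vector_size=100,  # called "size" in gensim < 4.0
    seed=1,           # fixed RNG seed for vector initialization
    workers=1,        # more than one worker breaks exact reproducibility
    min_count=1,
)
model.save("toy_word2vec.model")  # large models also write *.wv.vectors.npy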

Pre-Train

Pre-train the LW-PT model.

python pretrain.py -config=[CONFIG_NAME] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
  • CONFIG_NAME: aapd.yaml or rmsc.yaml
  • OUT_INFIX: infix of the outputs directory that holds logs and checkpoints
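
For example, to pre-train on AAPD with GPU 0 (the infix aapd_pt below is an arbitrary run name, not a value required by the code):

python pretrain.py -config=aapd.yaml -out=aapd_pt -gpuid 0 -train -test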

MLTC Task

Train the downstream model for the MLTC task.

python classification.py -config=[CONFIG_NAME] -in=[IN_INFIX] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
  • IN_INFIX: infix of the inputs directory that holds the pre-trained checkpoints
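
Continuing the example above, the following fine-tunes and evaluates the classifier on AAPD, reading pre-trained checkpoints from the aapd_pt run and writing its own outputs under aapd_ft (again an arbitrary name):

python classification.py -config=aapd.yaml -in=aapd_pt -out=aapd_ft -gpuid 0 -train -test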

Others

  • build static document representations to facilitate downstream tasks
python build_doc_rep.py -config=[CONFIG_NAME] -in=[IN_INFIX] -gpuid [GPU_ID]

This step is optional; run it only if you need precomputed document representations.

  • make the RMSC-V2 dataset: tests/make_rmsc.py
  • visualize document embeddings: tests/visual_emb.py
  • visualize per-label F1 scores: tests/visual_label_f1.py
  • case study: tests/case_study.py

Reference

If you find our work useful, please cite the paper:

@inproceedings{liu2020label,
	title="Label-Wise Document Pre-Training for Multi-Label Text Classification",
	author="Han Liu and Caixia Yuan and Xiaojie Wang",
	booktitle="CCF International Conference on Natural Language Processing and Chinese Computing",
	year="2020"
}
