Bayesian Subspace Multinomial Model (BaySMM)

  • Model for learning document embeddings (i-vectors) along with their uncertainties.
  • Gaussian linear classifier exploiting the uncertainties in document embeddings.
  • See the paper: http://arxiv.org/abs/1908.07599

S. Kesiraju, O. Plchot, L. Burget and S. V. Gangashetty, "Learning Document Embeddings Along With Their Uncertainties," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2319-2332, 2020, doi: 10.1109/TASLP.2020.3012062.

Requirements

  • Python >= 3.7

  • PyTorch >= 1.1, <= 1.4

  • scipy >= 1.3

  • numpy >= 1.16.4

  • scikit-learn >= 0.21.2

  • h5py >= 2.9.0

  • See INSTALL.md for detailed instructions.

Data preparation - sample from 20Newsgroups

python src/create_sample_data.py sample_data/
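
To sanity-check the generated statistics, the .mtx files can be inspected with scipy, assuming they are standard Matrix Market sparse matrices (the usual meaning of the extension). A minimal sketch; the orientation of the matrix (documents vs. vocabulary along rows) is an assumption to verify against the vocab size:

    import scipy.io as sio

    # Load the sparse count matrix written by the data preparation step
    # (Matrix Market format). Whether rows correspond to documents or to
    # vocabulary entries is an assumption; compare the shape with the
    # number of lines in sample_data/vocab.
    bow = sio.mmread("sample_data/train.mtx").tocsr()
    print("shape:", bow.shape, "non-zero counts:", bow.nnz)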

Training the model

  • For help:

    python src/run_baysmm.py --help

  • To train on a GPU, set CUDA_VISIBLE_DEVICES=$GPU_ID, where $GPU_ID is the index of a free GPU (a quick sanity check is sketched after this list).

  • The following command trains the model for 1000 VB iterations and saves it in an automatically created sub-directory: exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/

    python src/run_baysmm.py train \
        sample_data/train.mtx \
        sample_data/vocab \
        exp/ \
        -K 50 \
        -trn 1000 \
        -lw 1e+01 \
        -var_p 1e+01 \
        -lt 1e-03
  • The ELBO and KLD for every iteration, the log file, etc. are saved in the same sub-directory.
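
Before launching a long GPU run, it can help to confirm that PyTorch actually sees the device selected through CUDA_VISIBLE_DEVICES. A minimal check, not part of this repository:

    import os
    import torch

    # CUDA_VISIBLE_DEVICES restricts which physical GPUs are visible to
    # PyTorch; once it is set, the selected GPU is addressed as device 0.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))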

Extracting the posterior distributions of embeddings

  • Extract embeddings [mean, log.std.dev] for 1000 iterations for each of the stats files listed in sample_data/mtx.flist.

  • With the -nth 100 argument, embeddings from every 100th iteration are also saved.

    python src/run_baysmm.py extract \
        sample_data/mtx.flist \
        exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/model_T1000.h5 \
        -xtr 1000 \
        -nth 100
  • Extracted embedding posterior distributions are saved in exp/*/ivecs/ sub-directory with appropriate names.
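
The extracted posteriors can also be read back outside the provided scripts with h5py. The dataset key inside the file and the [mean, log.std.dev] column layout below are assumptions based on the description above; inspect the keys and shapes before relying on them:

    import h5py
    import numpy as np

    ivec_file = ("exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/"
                 "ivecs/train_model_T1000_e1000.h5")

    with h5py.File(ivec_file, "r") as fh:
        # The dataset name is an assumption; list the available keys first.
        key = sorted(fh.keys())[0]
        post = np.asarray(fh[key])

    # Assuming each row stacks [mean, log std.dev], the first K columns are
    # the posterior means and the remaining K columns the log std. deviations.
    K = post.shape[1] // 2
    means, log_std = post[:, :K], post[:, K:]
    print(means.shape, log_std.shape)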

Training and testing the classifier

  • Three classifiers can be trained on these embeddings.
  • Use the --final option to train and test the classifier on embeddings from the final iteration.
  1. Gaussian linear classifier - uses only the mean parameter (see the conceptual sketch after this list)

    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels glc

  2. Multi-class logistic regression - uses only the mean parameter

    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels lr

  3. Gaussian linear classifier with uncertainty - uses full posterior distribution

    python src/train_and_clf_cv.py exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/ivecs/train_model_T1000_e1000.h5 sample_data/train.labels glcu

  • All results and predicted classes are saved in exp/*/results/.
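
For intuition only: the mean-only Gaussian linear classifier (glc) corresponds to class-conditional Gaussians with a shared covariance, which is what scikit-learn's LinearDiscriminantAnalysis fits. The sketch below is a conceptual stand-in for train_and_clf_cv.py, not the repository's actual code; the h5 key, the column layout, and the label-file format are assumptions.

    import h5py
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    ivec_file = ("exp/s_1.00_rp_1_lw_1e+01_l1_1e-03_50_adam/"
                 "ivecs/train_model_T1000_e1000.h5")

    # Keep only the posterior means (first half of the columns), as in the
    # loading sketch in the previous section.
    with h5py.File(ivec_file, "r") as fh:
        post = np.asarray(fh[sorted(fh.keys())[0]])
    means = post[:, : post.shape[1] // 2]

    # Assuming one label per line in the labels file.
    labels = np.loadtxt("sample_data/train.labels", dtype=str)

    # Shared-covariance Gaussian classes give linear decision boundaries,
    # which is the idea behind the mean-only `glc` classifier.
    clf = LinearDiscriminantAnalysis()
    print(cross_val_score(clf, means, labels, cv=5).mean())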

Citation

@ARTICLE{Kesiraju:2020:BaySMM,
  author={Kesiraju, Santosh and Plchot, Oldřich and Burget, Lukáš and Gangashetty, Suryakanth V.},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Learning Document Embeddings Along With Their Uncertainties}, 
  year={2020},
  volume={28},
  number={},
  pages={2319-2332},
  doi={10.1109/TASLP.2020.3012062}}
