GitHub - nguyenkh/NeuralDenoising: Neural-based Noise Filtering from Word Embeddings

Neural-based Noise Filtering from Word Embeddings

Kim Anh Nguyen, nguyenkh@ims.uni-stuttgart.de

Code for the paper Neural-based Noise Filtering from Word Embeddings (COLING 2016).

Requirements

Sklearn
Theano

Pre-trained word embeddings

The models can filter noise from any pre-trained word embeddings, such as word2vec or GloVe.
The input embeddings must be in word2vec or GloVe format, either binary or text.
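
As a rough illustration of the text variants, the sketch below parses embeddings from text lines. It assumes the usual conventions: word2vec text files begin with a `<vocab_size> <dim>` header line, while GloVe text files do not. The function name `read_text_embeddings` is hypothetical and not part of this repository, and the binary format is not handled here.

```python
def read_text_embeddings(lines):
    """Parse word vectors from an iterable of text lines.

    Hypothetical helper, not part of this repository. Assumes one word per
    line followed by its space-separated components, with an optional
    word2vec-style "<vocab_size> <dim>" header on the first line.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # A first line with exactly two integer fields is treated as a header.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage with a tiny in-memory sample in word2vec text format.
sample = ["2 3", "cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6"]
embs = read_text_embeddings(sample)
print(len(embs), len(embs["cat"]))
```

To read from disk, pass an open file object instead of a list, since iterating over a text file yields its lines.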

Preprocessing

This step learns the dictionaries for the CompEmb and OverCompEmb models and transforms the complete word embeddings into overcomplete word embeddings.
Command:

python preprocessing.py -input <original_embs_file> -output <overcomp_file> -factor <factor_overcomplete> -bin <format_file>

For example, to transform 100-dimensional input word embeddings into 1000-dimensional overcomplete word embeddings (factor 10) in binary format:

python preprocessing.py -input sgns_100d.bin -output sgns_overcomp_1000d.bin -factor 10 -bin 1

Training models

Training CompEmb model:

```THEANO_FLAGS="mode=FAST_RUN,device=cpu,floatX=float32" python filter_noise_embs.py -input sgns_100d.bin -output sgns_denoising_100d.bin -iter 30 -bsize 100 -bin 1```

This trains the CompEmb model for 30 iterations with a batch size of 100, using the binary format.

Training OverCompEmb model:

```THEANO_FLAGS="mode=FAST_RUN,device=cpu,floatX=float32" python filter_noise_embs.py -input sgns_100d.bin -output sgns_denoising_1000d.bin -over sgns_overcomp_1000d.bin -iter 30 -bsize 100 -bin 1```

This trains the OverCompEmb model for 30 iterations with a batch size of 100, using the binary format; sgns_overcomp_1000d.bin contains the overcomplete word embeddings.
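
The preprocessing and training steps for the OverCompEmb model can be chained from a small driver script. The sketch below only assembles the commands shown above and, if asked, runs them with `subprocess`. The helper names (`overcomplete_dim`, `build_commands`, `run_pipeline`) are hypothetical and not part of this repository; the file names and flag values are the ones from the examples.

```python
import os
import subprocess

# Same Theano settings as in the training commands above.
THEANO_FLAGS = "mode=FAST_RUN,device=cpu,floatX=float32"


def overcomplete_dim(dim, factor):
    """Dimension of the overcomplete embeddings, e.g. 100 * 10 = 1000."""
    return dim * factor


def build_commands(emb, overcomp, denoised, factor=10, iters=30, bsize=100, binary=1):
    """Return the preprocessing and training commands as argument lists."""
    preprocess = [
        "python", "preprocessing.py",
        "-input", emb, "-output", overcomp,
        "-factor", str(factor), "-bin", str(binary),
    ]
    train = [
        "python", "filter_noise_embs.py",
        "-input", emb, "-output", denoised, "-over", overcomp,
        "-iter", str(iters), "-bsize", str(bsize), "-bin", str(binary),
    ]
    return preprocess, train


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    # THEANO_FLAGS is set for every step; only the training step needs it.
    env = dict(os.environ, THEANO_FLAGS=THEANO_FLAGS)
    for cmd in commands:
        subprocess.run(cmd, check=True, env=env)


if __name__ == "__main__":
    cmds = build_commands(
        "sgns_100d.bin", "sgns_overcomp_1000d.bin", "sgns_denoising_1000d.bin"
    )
    for cmd in cmds:
        print(" ".join(cmd))
    # run_pipeline(cmds)  # uncomment to actually run both steps
```

Passing argument lists rather than a single shell string avoids quoting problems when file names contain spaces.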

Reference

@InProceedings{nguyen:2016:denoising,
  author    = {Nguyen, Kim Anh and Schulte im Walde, Sabine and Vu, Ngoc Thang},
  title     = {Neural-based Noise Filtering from Word Embeddings},
  booktitle = {Proceedings of the 26th International Conference on Computational Linguistics (COLING)},
  year      = {2016},
  address   = {Osaka, Japan},
}

