cross_domain_embedding (EMNLP 2017)

To run cre.c, first compile it with any C compiler.
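
A compile command along the lines of the one used for the original word2vec toolkit should work (this is an assumption about cre.c's dependencies, namely the math library and pthreads, not a documented build step):

gcc cre.c -o cre -lm -pthread -O3 -funroll-loops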

Then:

1). Learn word embeddings from the source domain using the word2vec toolkit (https://code.google.com/archive/p/word2vec/). Pass -binary 1 so that the embeddings are written in binary format.
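
For example, a typical word2vec invocation might look like the following (the file names are placeholders and the skip-gram/negative-sampling settings are only illustrative, not the exact settings used in the paper):

./word2vec -train <source_corpus_file.txt> -output <binary_embedding.bin> -size 50 -window 5 -negative 5 -cbow 0 -binary 1 -threads 20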

2). Generate a similarity score file, one line per word in the format '$word$ $score$'. Each word should appear in both the source and the target domain, and its score must be a positive real number. The score file used in the paper was generated from word frequencies in the corpora of the two domains.
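
For illustration, a few lines of such a file might look like this (the words and scores below are made up, not taken from the paper):

the 8.31
movie 5.72
protein 0.94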

3). Run the cre program. Note that the -size parameter (the dimension of the embeddings) must match the dimension of the embeddings from step 1).

Sample:

./cre -train <target_corpus_file.txt> -model <binary_embedding.bin> -similarity <similarity_score.txt> -output <output_name.bin> -size 50 -window 5 -binary 1 -lambda 10 -threads 20
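
The command above writes its output in binary format (-binary 1). Assuming the output follows the standard word2vec binary layout (a header line with the vocabulary size and dimension, followed by each word and its float vector), a minimal C sketch like the one below can be used to check that the stored dimension matches the -size argument. This is a hypothetical helper, not part of the repository:

/* check_dim.c: print the header and the first few vectors of a
 * word2vec-style binary embedding file. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <embedding.bin>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    long long words, size;
    if (fscanf(f, "%lld %lld", &words, &size) != 2) { fprintf(stderr, "bad header\n"); return 1; }
    printf("vocabulary: %lld words, dimension: %lld\n", words, size);
    float *vec = malloc(size * sizeof(float));
    char word[256];
    for (long long i = 0; i < words && i < 5; i++) {   /* first 5 entries only */
        if (fscanf(f, "%255s", word) != 1) break;      /* read the word */
        fgetc(f);                                      /* skip the space after it */
        if (fread(vec, sizeof(float), size, f) != (size_t)size) break;
        printf("%s  %f %f ...\n", word, vec[0], vec[1]);
    }
    free(vec);
    fclose(f);
    return 0;
}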

Please cite the following paper:

@InProceedings{yang-lu-zheng:2017:EMNLP2017,
  author    = {Yang, Wei  and  Lu, Wei  and  Zheng, Vincent},
  title     = {A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {2888--2894},
  url       = {https://www.aclweb.org/anthology/D17-1311}
}

Contact w85yang@uwaterloo.ca if you have further questions about the code.
