
Neural Embeddings for Text Representation Learning

This series of experiments focuses on classifying mail data sets using mono-language and cross-language embeddings. The main purpose of the cross-language experiment is to transfer learning from a resource-rich language (labelled English mail data sets) to a resource-poor language (unlabelled Dutch mail data sets, suspected phishing mails), so that the model can distinguish legitimate from non-legitimate mails across languages.
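As a minimal sketch of the transfer step, assuming the embedding matrix learned on the bilingual task has been exported to a NumPy file (the file name and all hyper-parameters below are illustrative assumptions, not the repository's actual code):

```python
# Minimal sketch (not the repository's exact code): reuse embeddings
# trained on the bilingual translation task to initialise a mail
# classifier. File name and dimensions below are hypothetical.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 128, 100

# Suppose this holds the embedding matrix learned by the
# sequence-to-sequence model (shape: VOCAB_SIZE x EMBED_DIM).
bilingual_embeddings = np.load('weights/bilingual_embeddings.npy')  # hypothetical file

classifier = Sequential()
classifier.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN,
                         weights=[bilingual_embeddings],  # transfer the learned weights
                         trainable=False))                # keep them fixed
classifier.add(LSTM(64))
classifier.add(Dense(1, activation='sigmoid'))  # legitimate vs. non-legitimate
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
```

Freezing the embedding layer (trainable=False) is one common choice here; fine-tuning it on the labelled English mail data is an equally valid alternative.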

Version 1.0:

  • For sequence-to-sequence learning of bilingual documents, a parallel corpus of English and Dutch documents (Europarl, http://www.statmt.org/europarl/) is used. A compact version of the pre-processed data (python dictionary format) will be shared in data/; a loading sketch follows this list.
  • The trained weights from the bilingual sequence model (machine translation task) will be shared in weights/ for further analysis.
  • Labelled mono-language mail data sets (English) used in this experiment: the Enron, Lingspam, and SpamAssassin mail data sets. Compact pre-processed labelled data in python dictionary format will be shared in data/ for reproducible research.
  • The unlabelled mono-language mail data set (Dutch) is sampled from raw mail data suspected to be phishing emails. The data won't be publicly available, but the author will use another sample set for tutorial purposes (if necessary).
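A minimal sketch of loading such a pre-processed dictionary, assuming it is stored as a standard pickle file; the file name and keys are hypothetical:

```python
# Minimal sketch, assuming the pre-processed data in data/ is a pickled
# python dictionary. The file name below is illustrative only.
import cPickle as pickle  # Python 2.7; use `import pickle` on Python 3

with open('data/europarl_en_nl.pkl', 'rb') as f:  # hypothetical file name
    corpus = pickle.load(f)

# Inspect which keys the dictionary provides before using it.
print(corpus.keys())
```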

For tutorial purposes, the code is organised into the following notebooks:

  • Pre-processing: preprocessing.ipynb
  • Sequence-to-sequence learning of bilingual parallel corpora: bilingual_learning.ipynb (a minimal architecture sketch follows this list)
  • Analysing the weights: analyse_weights.ipynb
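As rough orientation for bilingual_learning.ipynb, below is a minimal encoder-decoder sketch, assuming Keras 2's functional API; the vocabulary sizes, layer dimensions, and overall architecture are illustrative assumptions, not the notebook's actual configuration.

```python
# Minimal sequence-to-sequence sketch (illustrative only): English source
# sequences are encoded, and the decoder over Dutch target sequences is
# initialised with the encoder's final states.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

SRC_VOCAB, TGT_VOCAB, EMBED_DIM, HIDDEN = 20000, 20000, 128, 256

# Encoder over the English (source) sentence
enc_in = Input(shape=(None,))
enc_emb = Embedding(SRC_VOCAB, EMBED_DIM)(enc_in)
_, state_h, state_c = LSTM(HIDDEN, return_state=True)(enc_emb)

# Decoder over the Dutch (target) sentence, seeded with encoder states
dec_in = Input(shape=(None,))
dec_emb = Embedding(TGT_VOCAB, EMBED_DIM)(dec_in)
dec_out = LSTM(HIDDEN, return_sequences=True)(dec_emb,
                                              initial_state=[state_h, state_c])
probs = Dense(TGT_VOCAB, activation='softmax')(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
```

Training a model of this kind on the Europarl sentence pairs is what produces the machine-translation weights stored in weights/.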

Requirements (the full list of dependencies and installation instructions can also be found in installation.md):

  • Python 2.7+ (Anaconda2 or Miniconda2)
  • tensorflow / theano (as the Keras backend)
  • keras
  • nltk
  • gensim