Word Embeddings

Using word embeddings to rank answers to questions in an online forum.

master.py is a script that scrapes text from XML files and creates word embeddings from it; it also builds latent semantic indices from the same XML files. It then uses these embeddings and LSI components to compute the similarity between each question and its comments and rank the comments accordingly.

The two scores are then combined as a weighted sum: (1 - α)(score1) + (α)(score2). The coefficient α was tuned on the given development set.

Following this, two enhancements are applied to the combined scores. If a comment has the same user ID as the question asker, the comment's ranking is set to 0. A length-based penalty, weighted by a second coefficient, is also subtracted from each comment's score.
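For illustration, here is a minimal sketch of the mixture and the two enhancements. All names (adjust_scores, alpha, beta) are placeholders rather than the actual identifiers in master.py, and the coefficient values are arbitrary:

    def adjust_scores(question_user, comments, emb_scores, lsi_scores,
                      alpha=0.5, beta=0.01):
        """Combine embedding and LSI scores, then apply the two enhancements.

        comments: list of (user_id, comment_text) pairs, aligned with
        emb_scores and lsi_scores. alpha and beta are illustrative values;
        the real coefficients were tuned on the development set.
        """
        adjusted = []
        for (user, text), s1, s2 in zip(comments, emb_scores, lsi_scores):
            score = (1 - alpha) * s1 + alpha * s2    # weighted mixture
            if user == question_user:                # comment by the questioner
                score = 0.0                          # squash its ranking to 0
            else:
                score -= beta * len(text.split())    # length-based penalty
            adjusted.append(score)
        return adjusted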

The scorer ev.py was modified to run within our script. This was done mainly for convenience.

The process is as follows:

  1. All text is scraped from the training data, including the question subject, question text, and comment text.

  2. HTML character codes in the scraped text are removed, words are lowercased, the text is tokenized, and all punctuation is subsequently filtered out.

  3. The corpus is now ready for the embedding algorithm. A third-party command-line tool, word2vec, developed by Google Inc., is used to build word embeddings from the corpus (see the first sketch after this list). The parameters that can be set are:

  • dimension of word embeddings

  • size of window around each word

  • whether to use the CBOW or Skip-Gram algorithm

  • the number of iterations to use

    We tuned these parameters on the development data set; the resulting defaults for each can be seen in the function definition of 'word2vec_embed'.

  4. Once the embeddings have been built, they are loaded from an external text file into a local dictionary.

  5. This dictionary is then used to calculate the cosine similarity between each question and its 10 related comments (see the second sketch after this list).

  6. The similarity scores are output by the system as a prediction file with one prediction per line.

  7. This prediction file is then fed into a modified version of the ev.py scorer script given to us for this task.

  8. A similar process is run for the Latent Semantic Analysis.

  9. The embeddings and the semantic analysis each produce a prediction file. These files are combined into a mixture file.

  10. Following the creation of the mixture file, the score of any comment written by the questioner is squashed to 0, and a length-based weight (modulated by a coefficient β) is applied to each ranking.

  11. Finally, this modified mixture file is fed into our scorer again to assess the model.
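As a sketch of steps 2 and 3, the cleaning pass and the word2vec call could look like the following. File names and parameter values here are illustrative assumptions (the tuned defaults live in 'word2vec_embed'); the flags themselves (-train, -output, -size, -window, -cbow, -iter, -binary) are the standard options of Google's command-line tool:

    import html
    import re
    import subprocess

    def clean(text):
        """Step 2: strip HTML character codes, lowercase, tokenize,
        and filter out punctuation."""
        text = html.unescape(text)
        tokens = text.lower().split()                 # naive tokenization
        tokens = [re.sub(r'[^\w]', '', t) for t in tokens]
        return ' '.join(t for t in tokens if t)

    # Step 3: run Google's word2vec tool over the cleaned corpus.
    subprocess.run([
        './word2vec',
        '-train', 'corpus.txt',    # the cleaned, scraped text
        '-output', 'vectors.txt',  # where the embeddings are written
        '-size', '100',            # dimension of the word embeddings
        '-window', '5',            # window size around each word
        '-cbow', '1',              # 1 = CBOW, 0 = Skip-Gram
        '-iter', '5',              # number of iterations
        '-binary', '0',            # write the vectors as plain text
    ], check=True)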
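Similarly, a sketch of steps 4 to 6: loading the text-format vectors into a dictionary and scoring each comment against its question by cosine similarity. Representing a whole question or comment as the average of its word vectors is an assumption made for this sketch; the repository's exact composition method may differ:

    import numpy as np

    def load_embeddings(path='vectors.txt'):
        """Step 4: read word2vec's text output (a 'vocab_size dim' header,
        then one 'word v1 v2 ...' line per word) into a dictionary."""
        embeddings = {}
        with open(path) as f:
            next(f)                                # skip the header line
            for line in f:
                word, *values = line.rstrip().split(' ')
                embeddings[word] = np.array(values, dtype=float)
        return embeddings

    def text_vector(text, embeddings, dim=100):
        """Average the word vectors of a text (assumed composition)."""
        vecs = [embeddings[w] for w in text.split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def write_predictions(question, comments, embeddings, path='predictions.txt'):
        """Steps 5 and 6: cosine similarity per comment, one prediction per line."""
        q = text_vector(question, embeddings)
        with open(path, 'w') as f:
            for comment in comments:
                c = text_vector(comment, embeddings)
                denom = np.linalg.norm(q) * np.linalg.norm(c)
                similarity = float(q @ c / denom) if denom else 0.0
                f.write('%.6f\n' % similarity)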

Built by:

Micah Friedland & Robert Bevan


Prerequisites: This software was built to run on Python 3+.

The following packages need to be installed for it to run:

  • regex
  • numpy
  • scipy
  • sklearn

(re is also used, but it is part of the Python standard library and needs no installation.)

The following software needs to be locally installed:

  • word2vec

License

MIT
