Training Machines to Learn about Machine Learning

A neural network language model to show relationships between skills

Which skills are most closely related?

Overview

This project trains a skip-gram Word2vec model to make accurate inferences about technical skills. The training data comes from 106,000 job descriptions scraped from the Internet, along with all Stackoverflow posts with a score greater than 25. The resulting model has many practical uses, including generating a canonical list of technical skills and elucidating the relationships between them. On the analogy tasks that the original Word2vec paper proposed (a set of 1,954 general language analogy questions), the baseline model achieved 18.17% accuracy and the final model achieved 28.08%. However, the accompanying paper proposes a novel evaluation method designed specifically for this data and task. The evaluation process involves three steps (a code sketch follows the list):

  1. Collecting sentences from Wikipedia from articles about technical skills;
  2. “Corrupting” the sentences by replacing the skill words with randomly selected incorrect skill words; and
  3. Comparing how likely the model deems each correct sentence relative to its corrupted counterpart.
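
A minimal sketch of this evaluation, assuming a gensim Word2vec model trained with hierarchical softmax (which gensim's `score()` requires); `skill_sentences` and `skill_vocab` are hypothetical stand-ins for the tokenized Wikipedia sentences and the skill list, not names from this repository:

```python
import random

def corrupt(sentence, skill_vocab, rng=random):
    """Replace each skill word in a tokenized sentence with a
    randomly chosen *different* skill word."""
    corrupted = []
    for token in sentence:
        if token in skill_vocab:
            corrupted.append(rng.choice([s for s in skill_vocab if s != token]))
        else:
            corrupted.append(token)
    return corrupted

def evaluate(model, skill_sentences, skill_vocab):
    """Fraction of sentence pairs where the model assigns a higher
    log-likelihood to the intact sentence than to its corruption.
    Assumes `model` is a gensim Word2Vec trained with sg=1, hs=1."""
    corrupted = [corrupt(s, skill_vocab) for s in skill_sentences]
    orig_scores = model.score(skill_sentences, total_sentences=len(skill_sentences))
    corr_scores = model.score(corrupted, total_sentences=len(corrupted))
    wins = sum(o > c for o, c in zip(orig_scores, corr_scores))
    return wins / len(skill_sentences)
```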

The best-performing model was a skip-gram Word2vec model with 300 embedding dimensions and a window size of 15. Adjusting the window size had the most significant effect on accuracy.
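
For reference, a hedged sketch of how a model with these hyperparameters could be trained and scored on the analogy set using gensim; the corpus file name and all settings other than the stated embedding dimension and window size are assumptions, and `questions-words.txt` is the standard analogy set from the original Word2vec paper:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical corpus file: one tokenized job description or
# Stackoverflow post per line. The file name is an assumption.
corpus = LineSentence("training_corpus.txt")

model = Word2Vec(
    corpus,
    sg=1,              # skip-gram
    vector_size=300,   # embedding dimensions (best-performing value above)
    window=15,         # context window size (best-performing value above)
    hs=1,              # hierarchical softmax, needed for Word2Vec.score()
    min_count=5,       # assumed; gensim default
    workers=4,         # assumed
)

# Analogy evaluation on the question set from the original Word2vec paper.
accuracy, _ = model.wv.evaluate_word_analogies("questions-words.txt")
print(f"Analogy accuracy: {accuracy:.2%}")
```

Once trained, a call such as `model.wv.most_similar("python")` (the query term is illustrative) returns the nearest skills in the embedding space, which is how the relationships between skills can be explored.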

Data and Models

  • Processed training data can be found here
  • The final model (and its associated files) can be found here
  • The Stackoverflow data in its original form can be found here

For more detailed information, check out the paper (the PDF entitled "Goodgame Word2vec for Skills").