This repository contains the code to learn subword embeddings from the arXiv dataset of 1.7M+ scholarly papers.

anr-delices/subword-vectors

subword embeddings trained on arXiv

This repository contains the code to build subword embeddings from the arXiv dataset of 1.7M+ scholarly papers.

Prerequisites

Download the arXiv dataset, decompress archive.zip, and place the file arxiv-metadata-oai-snapshot.json in the data/ directory.

Install required Python modules:

pip3 install -r requirements.txt

Follow the instructions to build and install SentencePiece command line tools from C++ source.

Follow the instructions to build and install GloVe.

Train subword embeddings from the arXiv dataset

We follow the idea of pre-trained subword embeddings from Heinzerling and Strube (2018).

# Extract the textual content from the arXiv dataset.
# This creates a raw corpus file with one sentence per line
# (12,807,583 lines).
python3 src/extract.py data/arxiv-metadata-oai-snapshot.json \
        data/arxiv-metadata-oai-snapshot.txt

# Train a SentencePiece model from the corpus file
spm_train --input=data/arxiv-metadata-oai-snapshot.txt \
          --model_prefix=data/arxiv-metadata-oai-snapshot \
          --vocab_size=10000

# Encode the corpus file using the SentencePiece model
# (spm_train writes <model_prefix>.model and <model_prefix>.vocab)
spm_encode --model=data/arxiv-metadata-oai-snapshot.model \
           --output_format=piece \
           < data/arxiv-metadata-oai-snapshot.txt \
           > data/arxiv-metadata-oai-snapshot.piece

# Train the subword GloVe vectors
# script adapted from https://github.com/stanfordnlp/GloVe/blob/master/demo.sh
./src/train-glove.sh
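
The extraction step above turns the JSON snapshot into a one-sentence-per-line corpus. The following is only a minimal sketch of that idea, not the code in src/extract.py. It assumes the snapshot holds one JSON object per line with `title` and `abstract` fields, and it uses a naive regular-expression sentence splitter. The function name `iter_sentences` and the splitting rule are hypothetical.

```python
import json
import re
from typing import Iterable, Iterator, Sequence

# Assumption: a sentence ends with ., ! or ? followed by whitespace and an
# uppercase letter. This is a rough heuristic, not the repository's splitter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def iter_sentences(
    json_lines: Iterable[str],
    fields: Sequence[str] = ("title", "abstract"),  # assumed field names
) -> Iterator[str]:
    """Yield one whitespace-normalized sentence at a time from JSON lines."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field in fields:
            # Collapse newlines and repeated spaces into single spaces.
            text = " ".join(str(record.get(field, "")).split())
            if not text:
                continue
            for sentence in _SENTENCE_BOUNDARY.split(text):
                if sentence:
                    yield sentence
```

Under these assumptions, writing each yielded sentence on its own line would give a corpus file in the shape that spm_train expects as `--input`.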

Download pre-trained models

Pre-trained models are available in the data/ directory.

  • data/arxiv-metadata-oai-snapshot.model is the SentencePiece model.
  • data/arxiv-metadata-oai-snapshot.vocab is the SentencePiece vocabulary file.
  • data/vectors.txt and data/vectors.bin are the learned GloVe vectors (50 dimensions).
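
As an illustration of how the pre-trained files might be used, here is a minimal sketch that reads data/vectors.txt and averages the vectors of a list of subword pieces. It assumes the usual GloVe text layout (a token followed by space-separated floats on each line). The helper names `load_glove_text` and `average_vector` are hypothetical, and averaging is only one simple way to combine subword vectors, not something this repository prescribes.

```python
from typing import Dict, List, Optional, Sequence


def load_glove_text(path: str) -> Dict[str, List[float]]:
    """Read a GloVe text file into a dict mapping token -> vector."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:] if value]
    return vectors


def average_vector(
    pieces: Sequence[str], vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of the known pieces; return None if none is known."""
    found = [vectors[piece] for piece in pieces if piece in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]
```

The pieces themselves would come from the SentencePiece model. Assuming the `sentencepiece` Python package is installed, `sentencepiece.SentencePieceProcessor(model_file="data/arxiv-metadata-oai-snapshot.model").encode(text, out_type=str)` returns them as strings.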
