
Bert Pretrained Token Embeddings

BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) yields pretrained token (i.e., subword) embeddings. Let's extract them and save them in the word2vec format so that they can be used for downstream tasks.

Requirements

  • pytorch_pretrained_bert
  • NumPy
  • tqdm

Extraction

  • Check extract.py. A minimal sketch of the idea is shown below.
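
A minimal sketch of the extraction step, assuming the pytorch_pretrained_bert package; the model name and output file name below are illustrative, the repo's actual logic lives in extract.py:

```python
from pytorch_pretrained_bert import BertModel, BertTokenizer

model_name = "bert-base-uncased"  # any model listed in the table below
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# The (sub)word embedding matrix has shape (vocab size, hidden dim).
embeddings = model.embeddings.word_embeddings.weight.detach().numpy()

# Write the plain-text word2vec format: a "<vocab size> <dim>" header line,
# then one "<token> <v1> ... <vD>" line per vocabulary entry.
with open("{}.vec".format(model_name), "w", encoding="utf-8") as fout:
    fout.write("{} {}\n".format(*embeddings.shape))
    for token, idx in tokenizer.vocab.items():
        fout.write("{} {}\n".format(token, " ".join(map(str, embeddings[idx]))))
```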

Bert (Pretrained) Token Embeddings in word2vec format

| Models                         | # Vocab | # Dim | Notes           |
|--------------------------------|---------|-------|-----------------|
| bert-base-uncased              | 30,522  | 768   |                 |
| bert-large-uncased             | 30,522  | 1024  |                 |
| bert-base-cased                | 28,996  | 768   |                 |
| bert-large-cased               | 28,996  | 1024  |                 |
| bert-base-multilingual-cased   | 119,547 | 768   | Recommended     |
| bert-base-multilingual-uncased | 30,522  | 768   | Not recommended |
| bert-base-chinese              | 21,128  | 768   |                 |

Example

  • Check example.ipynb to see how to load the (sub-)word vectors with gensim and plot them in 2d space using tSNE. A minimal sketch follows this list.

  • Related tokens to `look` (figure in example.ipynb)

  • Related tokens to `##go` (figure in example.ipynb)
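
A minimal sketch of what the notebook demonstrates, assuming the embeddings were saved as bert-base-uncased.vec; the query token "look" is just an example:

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

kv = KeyedVectors.load_word2vec_format("bert-base-uncased.vec")

# Nearest (sub)word neighbours by cosine similarity.
print(kv.most_similar("look", topn=10))

# Project a token and its neighbours into 2d with tSNE and plot them.
tokens = ["look"] + [w for w, _ in kv.most_similar("look", topn=50)]
vectors = np.array([kv[w] for w in tokens])
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

for (x, y), token in zip(coords, tokens):
    plt.scatter(x, y, s=5)
    plt.annotate(token, (x, y), fontsize=8)
plt.show()
```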
