Text embedding based on a pre-trained BERT model

Embedding a text into a vector using pre-trained BERT word embeddings and pooling layers

Using the embedding vectors of two texts to measure their similarity, by the dot product of the two vectors
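
The pooling step is not spelled out in this README, but one reasonable reading of the 2048-dimensional output is that the 1024-dimensional BERT-Large word embeddings are mean-pooled and max-pooled over the tokens and the two results are concatenated. The sketch below illustrates only that assumed pooling step with numpy; pool_token_vectors and the random stand-in input are hypothetical and not part of this repository.

import numpy

def pool_token_vectors(token_vectors):
    """Hypothetical pooling: concatenate mean- and max-pooling over the tokens."""
    token_vectors = numpy.asarray(token_vectors)          # shape: (num_tokens, 1024)
    mean_pooled = token_vectors.mean(axis=0)              # shape: (1024,)
    max_pooled = token_vectors.max(axis=0)                # shape: (1024,)
    return numpy.concatenate([mean_pooled, max_pooled])   # shape: (2048,)

token_vectors = numpy.random.randn(5, 1024)    # stand-in for BERT word embeddings of 5 tokens
print(len(pool_token_vectors(token_vectors)))  # 2048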

Installation

git clone https://github.com/gaoyuanliang/bert_text_embedding.git
cd bert_text_embedding
pip3 install -r requirements.txt

wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

Usage

With one line of code, you can convert a text into a list of 2048 numbers. We call this list the embedding vector.

>>> import numpy 
>>> from jessica_text_embedding import text_embedding
>>> 
>>> x1 = text_embedding(u"Abu Dhabi Finance")
>>> x2 = text_embedding(u"Dubai Islam Bank")
>>> x3 = text_embedding(u"This is a negative")

Check the embedding vector and its length:

>>> print(x1)
[0.01866205781698227, 0.20125442743301392, -0.2105996310710907, -0.5797083973884583, 0.5044286847114563, -0.00011515617370605469, -0.9871041178703308, 0.45565372705459595,..., 1.3363279104232788]
>>> len(x1)
2048

Check the similarity of (x1, x2) and of (x1, x3), measured by the dot product:

>>> numpy.dot(numpy.array(x1), numpy.array(x2))
626.4380145019333
>>> numpy.dot(numpy.array(x1), numpy.array(x3))
591.97747068839

The similarity between "Abu Dhabi Finance" and "Dubai Islam Bank" is larger than the similarity between "Abu Dhabi Finance" and "This is a negative". These three texts share no words at all, so why is "Abu Dhabi Finance" more similar to "Dubai Islam Bank" than to "This is a negative"? Because the pre-trained BERT word embeddings capture semantic similarity.
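
As a small extension of the example above, the same dot-product score can be used to rank several candidate texts against one query; the snippet below reuses only the text_embedding function and numpy already shown.

import numpy
from jessica_text_embedding import text_embedding

query = text_embedding(u"Abu Dhabi Finance")
candidates = [u"Dubai Islam Bank", u"This is a negative"]

# score every candidate by its dot product with the query embedding
scores = [
    (c, numpy.dot(numpy.array(query), numpy.array(text_embedding(c))))
    for c in candidates
]

# print the candidates from most to least similar
for text, score in sorted(scores, key=lambda pair: pair[1], reverse=True):
    print(text, score)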

REST API Docker of text embedding service

build docker image

git clone https://github.com/gaoyuanliang/bert_text_embedding.git
cd bert_text_embedding
docker build -t jessica_text_embedding:1.0.1 .

run the docker image

docker run -it -p 5573:9000 --memory="256g" [DOCKER IMAGE ID]

Check the service at http://localhost:5573/
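
The request format of the service is not documented here, so the snippet below is only a sketch of how a client might call it: the root endpoint and the JSON payload with a "text" field are assumptions, and the actual route is defined by the app inside the Docker image.

import requests

# hypothetical request; the route and payload shape may differ in the actual service
response = requests.post(
    'http://localhost:5573/',              # port published by the docker run command above
    json={'text': 'Abu Dhabi Finance'},    # assumed payload with the text to embed
)
print(response.status_code)
print(response.text)                       # expected to contain the embedding vector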

TODO

Build more layers on top of the word embeddings and train them with supervision from similar/dissimilar pairs of texts, as sketched below.
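
A minimal sketch of that idea, assuming the inputs are pairs of 2048-dimensional vectors from text_embedding labeled 1 (similar) or 0 (dissimilar); the shared projection layer and the random placeholder data below are illustrative and not part of this repository.

import numpy
import tensorflow as tf

embedding_dim = 2048
projection_dim = 256

# shared projection layer applied to both texts of a pair
project = tf.keras.layers.Dense(projection_dim)

input_a = tf.keras.Input(shape=(embedding_dim,))
input_b = tf.keras.Input(shape=(embedding_dim,))
vec_a = project(input_a)
vec_b = project(input_b)

# cosine similarity of the projected vectors, mapped to a match probability
similarity = tf.keras.layers.Dot(axes=1, normalize=True)([vec_a, vec_b])
probability = tf.keras.layers.Dense(1, activation='sigmoid')(similarity)

model = tf.keras.Model(inputs=[input_a, input_b], outputs=probability)
model.compile(optimizer='adam', loss='binary_crossentropy')

# placeholder data; in practice the inputs come from text_embedding(...) and the
# labels mark similar (1) / dissimilar (0) pairs of texts
x_a = numpy.random.randn(8, embedding_dim)
x_b = numpy.random.randn(8, embedding_dim)
labels = numpy.random.randint(0, 2, size=(8, 1))
model.fit([x_a, x_b], labels, epochs=1, verbose=0)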

If you are working on a similar project, I am happy to help. Contact me at yanliang.kaust@gmail.com