Doarakko/vector-text-similarity-search

Vector text similarity search

Search for similar documents using Elasticsearch and BERT.
The input text is assumed to be Japanese.
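
As a rough illustration of the idea behind vector similarity search, the sketch below ranks documents by the cosine similarity between a query vector and each document vector. The three-dimensional toy vectors and the names `cosine_similarity`, `rank_by_similarity`, and `toy_documents` are assumptions made for this example; in this project the vectors would come from BERT and the scoring would run inside Elasticsearch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings (hypothetical data).
toy_documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_documents)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```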

Requirements

  • docker-compose

Usage

$ docker-compose up --build

Init

  1. Download and set up the model file

Download from here.

Rename the downloaded files so that the model directory looks like this.

$ ls bertserver/model
bert_config.json			bert_model.ckpt.meta			wiki-ja.model
bert_model.ckpt.data-00000-of-00001	graph.pbtxt				wiki-ja.vocab
bert_model.ckpt.index			vocab.txt
  2. Go to JupyterLab (http://0.0.0.0:8888/lab) and open a terminal

  3. Create the Elasticsearch index

$ python create_index.py --index_file index.json --index_name vector_search
  4. Create the Elasticsearch documents
$ python create_documents.py --data contents.csv --save contents.json --index_name vector_search
  5. Index the Elasticsearch documents
$ python index_document.py --data contents.json
  6. Open main.ipynb and run it
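
For orientation, here is a minimal sketch of what an index definition and a similarity query for this kind of setup might look like, written as plain Python dictionaries. It is not taken from `index.json` or `main.ipynb`: the field names `text` and `text_vector`, the dimension 768, and the helper names are all assumptions. The `script_score` query with `cosineSimilarity` is available in Elasticsearch 7.x; older 7.x releases expected `doc['text_vector']` rather than the bare field name, so check the documentation for the version pinned in the docker-compose file.

```python
import json

VECTOR_FIELD = "text_vector"  # hypothetical field name, not from the repository
VECTOR_DIMS = 768             # assumption: hidden size of a BERT-base model


def build_index_body(dims=VECTOR_DIMS):
    """Index definition with a text field and a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                VECTOR_FIELD: {"type": "dense_vector", "dims": dims},
            }
        }
    }


def build_search_body(query_vector, size=10):
    """script_score query that ranks all documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative,
                    # which Elasticsearch requires for script scores.
                    "source": (
                        "cosineSimilarity(params.query_vector, "
                        f"'{VECTOR_FIELD}') + 1.0"
                    ),
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


index_body = build_index_body()
search_body = build_search_body([0.1, 0.2, 0.3], size=3)
print(json.dumps(search_body, indent=2))
```

Both dictionaries can be serialized to JSON and sent to Elasticsearch with whichever client the notebook uses.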

Hints

Expected data format

The input is expected to be a CSV file containing Japanese text.

content
私は仕事中によく居眠りをしてしまいます。眠気を覚ます方法を教えて下さい。
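
A file in this format can be read with Python's standard `csv` module, as the following sketch suggests. The helper name `read_contents` is hypothetical, and the sample is held in memory instead of being loaded from `contents.csv`; the repository's own loading code may differ.

```python
import csv
import io


def read_contents(csv_file):
    """Return the non-empty values of the 'content' column."""
    reader = csv.DictReader(csv_file)
    return [row["content"] for row in reader if row.get("content")]


# In-memory stand-in for contents.csv, using the example row shown above.
sample_csv = io.StringIO(
    "content\n"
    "私は仕事中によく居眠りをしてしまいます。眠気を覚ます方法を教えて下さい。\n"
)

contents = read_contents(sample_csv)
print(contents)
```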

The content column may contain multiple sentences.
It is split into individual sentences during preprocessing.
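
One simple way to do such a split is to cut the text after each Japanese sentence-ending mark. The sketch below is an assumption about how this could work, not the repository's actual preprocessing; the function name `split_sentences` and the choice of delimiters (`。`, `！`, `？`) are made up for the example.

```python
import re

# A sentence is a run of non-terminator characters followed by an
# optional terminator (full-width period, exclamation or question mark).
SENTENCE_PATTERN = re.compile(r"[^。！？]+[。！？]?")


def split_sentences(text):
    """Split Japanese text into sentences, keeping the end punctuation."""
    sentences = (match.strip() for match in SENTENCE_PATTERN.findall(text))
    return [sentence for sentence in sentences if sentence]


sentences = split_sentences(
    "私は仕事中によく居眠りをしてしまいます。眠気を覚ます方法を教えて下さい。"
)
for sentence in sentences:
    print(sentence)
```

Keeping the punctuation attached to each sentence preserves the original text exactly; drop it in a later step if the embedding model works better without it.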

Reference