An implementation of LSA, LDA and BERT for performing semantic search on MS MARCO Dataset

zthsk/semantic_search

Implementation of LSA, LDA, and SBERT for Semantic Search

This project applies Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Sentence-BERT (SBERT) to the MS MARCO dataset and supports semantic search with each model. As a baseline for comparison, it represents documents with GloVe embeddings and compares them with the provided queries using cosine similarity. Model performance is assessed with Precision, Recall, F1-Score, Average Precision, and Mean Average Precision (MAP).
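The evaluation metrics named above can be illustrated with a minimal sketch. This is not the repository's evaluation code: the function names and the toy document IDs are hypothetical, and Average Precision is normalized here by the total number of relevant documents, which is one common convention that the project may or may not follow.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_recall_f1(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one ranked result list."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the number of relevant documents (assumed convention)."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average_precision over (retrieved, relevant) pairs, one per query."""
    scores = [average_precision(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up document IDs.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
p, r, f1 = precision_recall_f1(retrieved, relevant)
ap = average_precision(retrieved, relevant)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} AP={ap:.3f}")
```

MAP is then the mean of the per-query Average Precision values, so a single number summarizes ranking quality across the whole query set.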

Dependencies

-- nltk
-- tqdm
-- gensim
-- scipy
-- numpy
-- scikit-learn
-- sentence-transformers
-- PyTorch
-- GloVe embeddings

Run Locally

Clone the project

    git clone https://github.com/zthsk/semantic_search.git

Go to the project directory

    cd semantic_search

Install dependencies

    pip install nltk
    pip install tqdm
    pip install gensim
    pip install scipy   
    pip install numpy
    pip install scikit-learn
    pip install sentence-transformers
    pip install torch torchvision torchaudio

Train the LSA and LDA models and build the SBERT and GloVe embeddings

    python train_models.py --bert sbert_embeddings.npy
    python train_models.py --lsa lsa_model.npy
    python train_models.py --lda lda_model.npy
    python train_models.py --glove glove_embeddings.npy

Query a model with a single query (choose one of bert, lsa, or lda)

    python query.py --model [bert, lsa, lda] --query "your query"
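
The project description mentions comparing document embeddings with queries using cosine similarity. The ranking step might look roughly like the sketch below. It is not the code in `query.py`: the function name `rank_by_cosine` and the toy vectors are hypothetical, and it assumes that the query and the documents have already been embedded into the same vector space.

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 5):
    """Return (indices, scores) of the top_k rows of doc_matrix most similar to query_vec."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    # After L2 normalization, a dot product equals cosine similarity.
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + eps)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]


# Toy 3-dimensional "embeddings"; real embeddings would have hundreds of dimensions.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.1, 0.0])

idx, sims = rank_by_cosine(query, docs, top_k=2)
print(idx, sims)
```

Normalizing the document matrix once and reusing it across queries avoids recomputing norms for every search, which matters for a collection the size of MS MARCO.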

Query the model with a list of queries

    ./run_queries.sh  # edit queries.txt to choose the queries to run

Results of a query with different models

[Screenshots: results of a query with different models]

Use the analysis.ipynb file to produce the following images:

[Images: plots produced by analysis.ipynb]
