Skip to content

wangcongcong123/insite_retrieve

Repository files navigation

InsiteRetriever: A web app for retrieving contents within a site.

Motivation

This app is inspired by the fact that many bloggers who are skilled in one field/domain or another maintain blogging sites to share their knowledge but there are usually no searching boxes provided on their sites. However, more broadly, this app usage is not limited in blogging sites but can be applied to any other type of sites.

Due to the lack of resources in comparison to big search engines like Google, this functions only for in-site searching while at paragraph level. Hopefully this helps information seekers to get domain-specific knowledge efficiently from an expert's (e.g., the bloggers) site or any other type of sites.

Demo (Notebook)

demo

Features

Quick Start

git clone https://github.com/wangcongcong123/insite_retrieve.git
cd insite_retrieve
pip install -r requirements.txt


apt-get install maven -qq
git clone https://github.com/castorini/anserini.git
cd anserini
mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true
cd ..

streamlit run app.py

Todo ideas

Let me know if any questions, feedback is welcome. Contributions or pull requests are highly encouraged. Below gives some Todo ideas.

  • Now the system only works using sparse retrieval-based BM25 model, so more work can go to extend the system to support dense retrieval-based techniques such as the recent advance: RetriBERT.
  • Now the system only includes a word cloud image indicating what the topics of a site are generally about, so more work can go to add more features such as topic modelling on the site's contents.
  • More work can go to the presentation of retrived results, such as presenting them with more supplementary data (title, paragraph original location), highlighting exactly-matched words, better rendering etc.
  • Now the contents from each site are coarsely extracted, more filtering strategies are expected for improving the quality of the extracted contents.