MikulasZelinka/discworld-hex

Discworld Hex

Hex clusters Discworld's stories.

A clustering and search tool applied to the plots of Discworld novels. Currently, given an input sentence, it finds the most similar parts of Discworld books, based on their plot summaries from Wikipedia.

This is a tiny proof of concept of using FAISS with transformer language models; it could easily be extended to cover much larger datasets.
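
To make the idea concrete, here is a minimal, dependency-free sketch of the underlying technique: embed every passage as a vector, embed the query the same way, and return the k passages with the highest cosine similarity. It is not this project's code. The `embed`, `cosine`, and `search` functions and the sample passages are hypothetical, the bag-of-words `embed` is only a stand-in for a transformer sentence embedding, and the exhaustive loop in `search` stands in for what an exact FAISS index would do far more efficiently.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector, normalised to unit length.

    Stand-in for a transformer sentence embedding (assumption for illustration).
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def search(query, passages, k=2):
    """Exhaustively score every passage and return the top k as (score, passage)."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up sample passages, not taken from the project's data.
passages = [
    "a wizard loses his hat in the city",
    "the city watch investigates a dragon",
    "death takes an apprentice",
]
results = search("dragon in the city", passages, k=2)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

Swapping `embed` for a real sentence encoder and the loop for a vector index changes the quality and speed, but not the shape of the pipeline.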

Setup

Should work out of the box with bash and a couple of prerequisites:

( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install

Usage

TL;DR (when poetry is installed and the discworld-hex conda env is activated):

build
search

To only fetch the data, then build and export the index:

build
# is just a shortcut for:
poetry run build

To use the index to search:

search
# is just a shortcut for:
poetry run search

To run any python script in this project:

poetry run python src/discworld_hex/any_file.py

To run all checks:

poetry run pre-commit

TODO

Functionality

(What the user would notice.)

  • Allow custom Wikipedia queries on the input (and thus custom libraries)
  • Fine-tune the language model (e.g., with standard or masked language modelling) on the specific subdomains
  • Aggregate search results per-book
  • Allow merging libraries
  • Better CLI (allow changing k, passing in multiple sentences, etc.), either:
    • clickify and richify the interface
    • Alternatively, just make it into an API
  • Support other (faster, less accurate) indexes
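
As an illustration of the "aggregate search results per-book" item above, one possible approach is to keep each book's best-scoring passage and rank books by that score. This is a sketch under assumptions, not the project's design: the `aggregate_per_book` helper is hypothetical, the hit format of `(book_title, similarity)` pairs is assumed, and the scores are made-up values.

```python
def aggregate_per_book(hits):
    """Collapse passage-level hits into one entry per book.

    `hits` is an iterable of (book_title, similarity) pairs (assumed format).
    Each book keeps its highest similarity; books are returned best first.
    """
    best = {}
    for book, score in hits:
        if book not in best or score > best[book]:
            best[book] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up passage-level hits, for illustration only.
hits = [
    ("Mort", 0.61),
    ("Guards! Guards!", 0.74),
    ("Mort", 0.80),
    ("Guards! Guards!", 0.52),
]
ranking = aggregate_per_book(hits)
for book, score in ranking:
    print(f"{score:.2f}  {book}")
```

Taking the maximum favours books with one very close passage; summing or averaging the scores instead would favour books with many moderately similar passages.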

Maintenance

(What the user shouldn't notice.)

  • Less redundant library serialization
  • More tests
    • Rebuilding Library and the FAISS index
