FTS helpers for Polish language

zgoda/polish-fts

Full Text Search helpers for Polish language

This is a research project to determine the usability of all available stemmer implementations for the Polish language in the context of augmenting simple search engines that do not support Polish directly. Apart from the search engines based on Lucene (Solr and Elasticsearch), none of the search engines supports full text search in Polish.

The intended usage model is to store only stems and then apply stemmed queries, so there is a non-trivial requirement that the implementation a) does not take too much time to load or b) does not use too much RAM if it stays preloaded in a running process. As the baseline full text search engine, an SQLite FTS5 table with a simple tokeniser will be used. Since neither MySQL nor Postgres supports anything more, the overhead added by the helper code will be the same for both database engines.
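
The "store only stems, query with stems" model can be sketched as follows. This is a minimal illustration, not the project's code: `toy_stem` is a hypothetical placeholder that strips a few endings and is not a real Polish stemmer, the table name `docs` and all function names are assumptions, and the sketch assumes the local SQLite build includes FTS5. The FTS5 default `unicode61` tokeniser stands in for the simple tokeniser mentioned above.

```python
import re
import sqlite3
from typing import Callable, List

# Hypothetical placeholder: strips a few common endings.
# It is NOT a real Polish stemmer, it only stands in for one.
_SUFFIXES = ("ami", "ach", "om", "ów", "ie", "a", "y", "i", "ę", "e", "u")


def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in _SUFFIXES:
        # Keep at least three characters of the word.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text: str, stem: Callable[[str], str]) -> str:
    """Split text into words and replace each word with its stem."""
    return " ".join(stem(token) for token in re.findall(r"\w+", text))


def create_index(conn: sqlite3.Connection) -> None:
    # Only stems are stored; the original text is not kept in the index.
    conn.execute(
        "CREATE VIRTUAL TABLE docs USING fts5(stems, tokenize='unicode61')"
    )


def add_document(conn: sqlite3.Connection, doc_id: int, text: str,
                 stem: Callable[[str], str]) -> None:
    conn.execute(
        "INSERT INTO docs(rowid, stems) VALUES (?, ?)",
        (doc_id, stem_text(text, stem)),
    )


def search(conn: sqlite3.Connection, query: str,
           stem: Callable[[str], str]) -> List[int]:
    """Stem the query the same way as the documents, then run MATCH."""
    stemmed = stem_text(query, stem)
    if not stemmed:
        return []
    # Quote each term so FTS5 reads it as a string, not as query syntax.
    match = " ".join(f'"{term}"' for term in stemmed.split())
    rows = conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rank", (match,)
    )
    return [row[0] for row in rows]
```

Because the stemmer is passed in as a plain callable, any of the evaluated implementations could be swapped in for `toy_stem` without touching the indexing or query code, which keeps the measured overhead limited to the stemmer itself.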

The idea

The code will be modeled after a typical web application that processes requests and returns responses. The application code will stay in memory for some time and will be restarted after every 100 requests (reads and writes alike) to simulate web server worker rotation. Data ingress will be performed by an external background task resembling a queue handler, which is the most commonly used pattern in web applications. Egress will be direct.
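
A rough sketch of this simulation, under stated assumptions: the names `Worker`, `serve`, `drain_ingress` and `load_stemmer` are hypothetical, the worker is "restarted" by simply constructing a new object (so the stemmer load cost is paid again), and the ingress queue is an in-process `queue.Queue` rather than a real message broker.

```python
import queue
from typing import Callable, Iterable, List, Tuple

MAX_REQUESTS = 100  # worker rotation threshold taken from the description

Stem = Callable[[str], str]


class Worker:
    """Simulated web worker that loads its stemmer once per lifetime."""

    def __init__(self, load_stemmer: Callable[[], Stem]) -> None:
        self.stem = load_stemmer()  # load cost is paid on every (re)start
        self.handled = 0

    @property
    def exhausted(self) -> bool:
        return self.handled >= MAX_REQUESTS

    def handle(self, query: str) -> str:
        """Direct egress: stem the query and return it right away."""
        self.handled += 1
        return " ".join(self.stem(word) for word in query.split())


def serve(requests: Iterable[str],
          load_stemmer: Callable[[], Stem]) -> Tuple[List[str], int]:
    """Process requests, replacing the worker after MAX_REQUESTS of them.

    Returns the responses and the number of worker starts.
    """
    worker = Worker(load_stemmer)
    starts = 1
    responses = []
    for request in requests:
        if worker.exhausted:
            worker = Worker(load_stemmer)
            starts += 1
        responses.append(worker.handle(request))
    return responses, starts


def drain_ingress(pending: "queue.Queue[str]", store: List[str],
                  stem: Stem) -> int:
    """Background-style ingress: stem queued texts and store the stems."""
    count = 0
    while True:
        try:
            text = pending.get_nowait()
        except queue.Empty:
            return count
        store.append(" ".join(stem(word) for word in text.split()))
        count += 1
```

Counting worker starts alongside the responses makes it possible to attribute total run time to stemmer loading versus per-request stemming, which is the trade-off between load time and resident memory described earlier.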

Currently evaluated stemmers

  • pystempel, a Python implementation of Stempel, the first fully functional Polish stemmer
  • a Polish implementation of the widely used Porter stemming algorithm

Additionally, Eugenia Oshurko's FST-based stemmer for Polish will be evaluated, but since the implementation does not come with a license, it will not be included in the helper's reference implementation.