fizx/multilingual_tokenizer


Here's a multilingual tokenizer for Lucene and/or Solr. It's not optimal, but it is simple, and it is used in production on many Websolr indexes.

    "巴士阿叔 hello world look arabic: لوحة المفاتيح"
    
    will be tokenized as
    
    "巴", "士", "阿", "叔", "hello", "world", "look", "arabic", "لوحة", "المفاتيح"
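The splitting rule behind that output can be sketched as a small standalone Java program (hypothetical class and method names; the real tokenizer plugs into Lucene's `Tokenizer` API, but the core idea is the same): CJK ideographs each become a single-character token, while letters from other scripts are grouped into words and split on whitespace and punctuation.

```java
import java.util.ArrayList;
import java.util.List;

public class MultilingualTokenizerSketch {

    // Tokenize: emit each CJK ideograph as its own token; for all other
    // scripts, accumulate letters/digits into a word and flush the word
    // whenever whitespace or punctuation is seen.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // A CJK character ends any pending word and stands alone.
                flush(tokens, word);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                // Whitespace or punctuation terminates the current token.
                flush(tokens, word);
            }
        }
        flush(tokens, word);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder word) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("巴士阿叔 hello world look arabic: لوحة المفاتيح"));
    }
}
```

Running this on the example input above yields the same token stream as the README shows: the four ideographs individually, then the whitespace-delimited English and Arabic words (the colon after "arabic" is dropped as punctuation).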
