fizx/multilingual_tokenizer


Here's a multilingual tokenizer for Lucene and/or Solr. It's not optimal, but it is simple, and it is used in production on many Websolr indexes.

    "巴士阿叔 hello world look arabic: لوحة المفاتيح"
    
    will be tokenized as
    
    "巴", "士", "阿", "叔", "hello", "world", "look", "arabic", "لوحة", "المفاتيح"
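The splitting rule behind that output can be sketched as a small standalone Java program (hypothetical class and method names; the real tokenizer plugs into Lucene's `Tokenizer` API, but the core idea is the same): CJK ideographs each become a single-character token, while letters from other scripts are grouped into words and split on whitespace and punctuation.

```java
import java.util.ArrayList;
import java.util.List;

public class MultilingualTokenizerSketch {

    // Tokenize: emit each CJK ideograph as its own token; for all other
    // scripts, accumulate letters/digits into a word and flush the word
    // whenever whitespace or punctuation is seen.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // A CJK character ends any pending word and stands alone.
                flush(tokens, word);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                // Whitespace or punctuation terminates the current token.
                flush(tokens, word);
            }
        }
        flush(tokens, word);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder word) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("巴士阿叔 hello world look arabic: لوحة المفاتيح"));
    }
}
```

Running this on the example input above yields the same token stream as the README shows: the four ideographs individually, then the whitespace-delimited English and Arabic words (the colon after "arabic" is dropped as punctuation).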
