Optimization opportunity in the fst usage. #130

fulmicoton · 2019-09-01T04:47:44Z

Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst).
The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

The text was updated successfully, but these errors were encountered:

fulmicoton · 2019-09-02T00:30:59Z

After further investigation...

The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).

The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)

akkikiki · 2019-10-22T09:28:08Z

@cmoen Might be of your interest.
I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization opportunity in the fst usage. #130

Optimization opportunity in the fst usage. #130

fulmicoton commented Sep 1, 2019 •

edited

fulmicoton commented Sep 2, 2019

akkikiki commented Oct 22, 2019

Optimization opportunity in the fst usage. #130

Optimization opportunity in the fst usage. #130

Comments

fulmicoton commented Sep 1, 2019 • edited

fulmicoton commented Sep 2, 2019

akkikiki commented Oct 22, 2019

fulmicoton commented Sep 1, 2019 •

edited