Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization opportunity in the fst usage. #130

Open
fulmicoton opened this issue Sep 1, 2019 · 2 comments
Open

Optimization opportunity in the fst usage. #130

fulmicoton opened this issue Sep 1, 2019 · 2 comments

Comments

@fulmicoton
Copy link

fulmicoton commented Sep 1, 2019

Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst).
The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

@fulmicoton
Copy link
Author

After further investigation...

The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).

The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)

@akkikiki
Copy link
Contributor

@cmoen Might be of your interest.
I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants