
Loading a large dictionary such as UniDic cwj 2023-02 is slow in comparison to mecab #150

Open
kuroahna opened this issue Apr 20, 2024 · 2 comments


kuroahna commented Apr 20, 2024

Is your feature request related to a problem? Please describe.
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.

I've downloaded the latest UniDic cwj 2023-02 from https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using

cargo run --release -p compile -- -l unidic-cwj-202302_full/lex.csv -m unidic-cwj-202302_full/matrix.def -u unidic-cwj-202302_full/unk.def -c unidic-cwj-202302_full/char.def -o system.dic.zst

and then I tried tokenizing the example sentence from the docs

> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
     Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本      名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー  名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街      名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町  名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,
"3,0",*,*,5174035466035712,18823
へ      助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう    形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ    助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。      補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   13.96 secs    fish           external
   usr time   13.09 secs    0.00 micros   13.09 secs
   sys time    0.86 secs    0.00 micros    0.86 secs

but it takes around 14 seconds to load the dictionary.

In comparison, mecab is near instant

> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本      名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー  名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街      名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町  名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,
"3,0",*,*,5174035466035712,18823
へ      助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう    形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ    助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。      補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   28.32 millis    fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   31.25 millis    0.00 micros   31.25 millis

I looked at the code, and it seems like almost all of the time is spent deserializing the bincode data into the `DictionaryInner` struct, in particular in the `read_common` function:

    fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
    where
        R: Read,
    {
        let mut magic = [0; MODEL_MAGIC.len()];
        rdr.read_exact(&mut magic)?;
        if magic != MODEL_MAGIC {
            return Err(VibratoError::invalid_argument(
                "rdr",
                "The magic number of the input model mismatches.",
            ));
        }
        let config = common::bincode_config();
        let data = bincode::decode_from_std_read(&mut rdr, config)?;
        Ok(data)
    }

The call `let data = bincode::decode_from_std_read(&mut rdr, config)?;` accounts for nearly all of that time, so bincode deserialization appears to be the bottleneck.

How is mecab able to return results so quickly despite not loading everything into memory like vibrato? It seems like it doesn't take any memory when I use mecab, whereas vibrato takes 1 GB of memory to cache everything before being able to tokenize.

Describe the solution you'd like
Could we use a faster serde framework like rkyv? According to the benchmarks, it's a lot faster than bincode.

The rkyv docs say:

It’s similar to other zero-copy deserialization frameworks such as Cap’n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.

I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?

Describe alternatives you've considered
Apparently bincode is slow for structs that use `Vec`s and byte slices, and the recommendation is to use `serde_bytes`.

Features such as

    pub struct UnkEntry {
        pub cate_id: u16,
        pub left_id: u16,
        pub right_id: u16,
        pub word_cost: i16,
        pub feature: String,
    }

    pub struct WordFeatures {
        features: Vec<String>,
    }

are stored as `String`s; maybe they could be stored as `Vec<u8>` instead?
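As a sketch of that idea (a hypothetical layout, not vibrato's current code): flatten every feature string into one shared byte buffer plus an offset table, so deserializing restores two bulk `Vec`s instead of one heap allocation per string:

```rust
/// Hypothetical flattened layout: all feature strings in one byte buffer
/// plus an offset table, instead of Vec<String>. Deserializing this is two
/// bulk copies rather than one allocation per feature string.
struct FlatFeatures {
    bytes: Vec<u8>,
    offsets: Vec<u32>, // offsets.len() == number of features + 1
}

impl FlatFeatures {
    fn from_strings<'a, I: IntoIterator<Item = &'a str>>(features: I) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    fn get(&self, i: usize) -> &str {
        let (start, end) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        // The buffer was built from valid &str, so each slice is valid UTF-8.
        std::str::from_utf8(&self.bytes[start..end]).unwrap()
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

fn main() {
    let f = FlatFeatures::from_strings(["名詞,普通名詞", "助詞,格助詞"]);
    assert_eq!(f.get(1), "助詞,格助詞");
    println!("{} features in {} bytes", f.len(), f.bytes.len());
}
```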

Additional context
I'm using vibrato version 0.5.1

And here are the compiled dictionary sizes

> du -sh system.dic.zst
291M    system.dic.zst

> du -sh system.dic
988M    system.dic
@kampersanda
Member

@kuroahna

Thank you very much for the report and suggestions!

> How is mecab able to return results so quickly despite not loading everything in memory like vibrato? It seems like it doesn't take any memory when I use mecab, whereas vibrato takes 1 GB memory to cache everything before being able to tokenize

MeCab uses mmap to load the dictionary, enabling such fast and memory-efficient deserialization.

If Vibrato used rkyv as you suggested, equivalent performance could be achieved because rkyv supports mmap. I agree it's worth attempting rkyv; I will try it when I have time.

@kuroahna
Author

> MeCab uses mmap to load the dictionary, enabling such fast and memory-efficient deserialization.
>
> If Vibrato used rkyv as you suggested, equivalent performance could be achieved because rkyv supports mmap.

Ah, mmap! No wonder mecab can load it so quickly! And thanks for the quick response! Excited to see where this one goes 🙏
