
Tokenizer is not serializable for Apache Spark #85

Open
lamrongol opened this issue Nov 2, 2015 · 7 comments

@lamrongol

On Apache Spark, instances must be serializable so they can be shipped to worker nodes for parallel processing, but Kuromoji tokenizers are not, so a tokenizer has to be re-initialized for every task.
If tokenizers were serializable, we could reduce processing time.
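
For illustration, here is a minimal sketch of the failure (the `com.atilika.kuromoji.ipadic.Tokenizer` API and the input path are assumptions for the example):

```java
import com.atilika.kuromoji.ipadic.Tokenizer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TokenizeJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("tokenize"));
        JavaRDD<String> lines = sc.textFile("input.txt");

        Tokenizer tokenizer = new Tokenizer(); // created on the driver

        // Spark must serialize this closure, and `tokenizer` with it, to
        // ship the task to the executors. Because Tokenizer is not
        // Serializable, the job fails with
        // "org.apache.spark.SparkException: Task not serializable".
        JavaRDD<Integer> counts =
                lines.map(line -> tokenizer.tokenize(line).size());

        System.out.println(counts.count());
        sc.stop();
    }
}
```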

@cmoen
Member

cmoen commented Nov 2, 2015

Thanks a lot Fujikawa-san.

Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the tokenizers serializable would help with this in the context of Spark?

I just don't know the detailed mechanisms, and I'd appreciate it if you could explain. Thanks!

@lamrongol
Author

Spark serializes the whole closure, including any captured instances, at the start of a job, and then each machine processes the data in parallel.
Therefore, if a non-serializable instance is captured, Spark throws an error, and you must initialize the tokenizer in every task, as described at the following link:
http://www.intellilink.co.jp/article/column/bigdata-kk01.html
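
A common workaround is to build the tokenizer inside the closure, once per partition rather than once per record. A sketch, continuing the example above (assuming the Spark 2.x Java API, where mapPartitions takes a function returning an Iterator):

```java
// Nothing non-serializable is captured from the driver; each partition
// builds its own Tokenizer on the executor, paying the dictionary-loading
// cost once per partition instead of once per line.
JavaRDD<Integer> counts = lines.mapPartitions(iter -> {
    Tokenizer tokenizer = new Tokenizer();
    java.util.List<Integer> result = new java.util.ArrayList<>();
    while (iter.hasNext()) {
        result.add(tokenizer.tokenize(iter.next()).size());
    }
    return result.iterator();
});
```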

@lamrongol
Author

I've tried to make the kuromoji-core classes Serializable, but I was not able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
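
The standard Java trick for such a field is a wrapper with custom writeObject/readObject methods that copy the buffer's contents through a plain byte[]. A sketch (this SerializableBuffer class is hypothetical, not part of kuromoji):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class SerializableBuffer implements Serializable {
    // ByteBuffer is not Serializable, so the field must be transient
    // and written/read by hand below.
    private transient ByteBuffer buffer;

    public SerializableBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    public ByteBuffer buffer() {
        return buffer;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes); // copy without moving the position
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        buffer = ByteBuffer.wrap(bytes);
    }
}
```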

@lamrongol
Author

These are the changes I made (sorry, the diff includes some unnecessary whitespace changes):
https://github.com/lamrongol/kuromoji/commit/415e0fbc242d891e0708aaeacbb7a18ed478fee9

generated using my tool:
https://github.com/lamrongol/MakeJavaClassSerializable

@akkikiki
Contributor

akkikiki commented Nov 4, 2015

I was looking into the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that storing data in serialized form helps reduce memory usage on Spark.
Perhaps Fujikawa-san is trying to do something similar?

Interestingly, the document also notes a downside:

The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.
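
For reference, that section is about caching RDDs in serialized form, which in the Java API looks roughly like this (`lines` is a hypothetical JavaRDD):

```java
import org.apache.spark.storage.StorageLevel;

// Cache the RDD as serialized bytes to reduce memory usage; Spark then
// deserializes each object on access, which is the slower-access downside
// quoted above.
lines.persist(StorageLevel.MEMORY_ONLY_SER());
```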

@lamrongol
Author

@akkikiki
If the instances are not serializable, Spark doesn't work at all:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

By the way, I think everyone here understands Japanese, so wouldn't it be fine to write in Japanese?

@lamrongol
Author

Sorry, I'm not that familiar with Kuromoji's internals, but I think Kuromoji reads the dictionary file while processing, and that design is not well suited to Serializable. If Kuromoji had a mode that holds all the dictionary data in memory, I think it could become Serializable.
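
One way to approximate that from the outside, without serializing the dictionary buffers themselves, is a wrapper that keeps the Tokenizer transient and rebuilds it lazily after deserialization. A sketch (this SerializableTokenizer class is hypothetical):

```java
import java.io.Serializable;
import java.util.List;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class SerializableTokenizer implements Serializable {
    // The only field is transient, so the wrapper serializes trivially;
    // each executor rebuilds the Tokenizer (re-reading the bundled
    // dictionary) the first time it tokenizes after deserialization.
    private transient Tokenizer tokenizer;

    private Tokenizer tokenizer() {
        if (tokenizer == null) {
            tokenizer = new Tokenizer();
        }
        return tokenizer;
    }

    public List<Token> tokenize(String text) {
        return tokenizer().tokenize(text);
    }
}
```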
