
Tokenizer is not serializable for Apache Spark #85

Open
lamrongol opened this issue Nov 2, 2015 · 7 comments

@lamrongol

On Apache Spark, instances must be serializable so they can be shipped to worker nodes for parallel processing, but Kuromoji tokenizers are not, so a tokenizer has to be re-initialized for every task.
If tokenizers were serializable, we could reduce processing time.
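
For illustration, here is a minimal sketch of the failure (the `com.atilika.kuromoji.ipadic.Tokenizer` API and the input path are assumptions for the example):

```java
import com.atilika.kuromoji.ipadic.Tokenizer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TokenizeJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("tokenize"));
        JavaRDD<String> lines = sc.textFile("input.txt");

        Tokenizer tokenizer = new Tokenizer(); // created on the driver

        // Spark must serialize this closure, and `tokenizer` with it, to
        // ship the task to the executors. Because Tokenizer is not
        // Serializable, the job fails with
        // "org.apache.spark.SparkException: Task not serializable".
        JavaRDD<Integer> counts =
                lines.map(line -> tokenizer.tokenize(line).size());

        System.out.println(counts.count());
        sc.stop();
    }
}
```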

@cmoen
Member

cmoen commented Nov 2, 2015

Thanks a lot Fujikawa-san.

Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the tokenizers serializable would help with this in the context of Spark?

I just don't know the detailed mechanisms, and I'd appreciate it if you could explain. Thanks!

@lamrongol
Author

Spark serializes the whole closure, including any captured instances, at the start of a job, and then each machine processes the data in parallel.
Therefore, if a non-serializable instance is captured, Spark throws an error, and you must initialize the tokenizer in every task, as described at the following link:
http://www.intellilink.co.jp/article/column/bigdata-kk01.html
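
A common workaround is to build the tokenizer inside the closure, once per partition rather than once per record. A sketch, continuing the example above (assuming the Spark 2.x Java API, where mapPartitions takes a function returning an Iterator):

```java
// Nothing non-serializable is captured from the driver; each partition
// builds its own Tokenizer on the executor, paying the dictionary-loading
// cost once per partition instead of once per line.
JavaRDD<Integer> counts = lines.mapPartitions(iter -> {
    Tokenizer tokenizer = new Tokenizer();
    java.util.List<Integer> result = new java.util.ArrayList<>();
    while (iter.hasNext()) {
        result.add(tokenizer.tokenize(iter.next()).size());
    }
    return result.iterator();
});
```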

@lamrongol
Author

I've tried to make the kuromoji-core classes Serializable, but I was not able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
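
The standard Java trick for such a field is a wrapper with custom writeObject/readObject methods that copy the buffer's contents through a plain byte[]. A sketch (this SerializableBuffer class is hypothetical, not part of kuromoji):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class SerializableBuffer implements Serializable {
    // ByteBuffer is not Serializable, so the field must be transient
    // and written/read by hand below.
    private transient ByteBuffer buffer;

    public SerializableBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    public ByteBuffer buffer() {
        return buffer;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes); // copy without moving the position
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        buffer = ByteBuffer.wrap(bytes);
    }
}
```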

@lamrongol
Author

These are the changes I made (sorry, the diff includes some unnecessary whitespace changes):
https://github.com/lamrongol/kuromoji/commit/415e0fbc242d891e0708aaeacbb7a18ed478fee9

generated using my tool:
https://github.com/lamrongol/MakeJavaClassSerializable

@akkikiki
Contributor

akkikiki commented Nov 4, 2015

I was looking into the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that storing data in serialized form helps reduce memory usage on Spark.
Perhaps Fujikawa-san is trying to do something similar?

Interestingly, the document also notes a downside:

The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.
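
For reference, that section is about caching RDDs in serialized form, which in the Java API looks roughly like this (`lines` is a hypothetical JavaRDD):

```java
import org.apache.spark.storage.StorageLevel;

// Cache the RDD as serialized bytes to reduce memory usage; Spark then
// deserializes each object on access, which is the slower-access downside
// quoted above.
lines.persist(StorageLevel.MEMORY_ONLY_SER());
```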

@lamrongol
Author

@akkikiki
If the instances are not serializable, Spark doesn't work at all:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

By the way, I think everyone here understands Japanese, so wouldn't it be fine to write in Japanese?

@lamrongol
Author

Sorry, I'm not that familiar with Kuromoji's internals, but I think Kuromoji reads the dictionary file while processing, and that design is not well suited to Serializable. If Kuromoji had a mode that holds all the dictionary data in memory, I think it could become Serializable.
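
One way to approximate that from the outside, without serializing the dictionary buffers themselves, is a wrapper that keeps the Tokenizer transient and rebuilds it lazily after deserialization. A sketch (this SerializableTokenizer class is hypothetical):

```java
import java.io.Serializable;
import java.util.List;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class SerializableTokenizer implements Serializable {
    // The only field is transient, so the wrapper serializes trivially;
    // each executor rebuilds the Tokenizer (re-reading the bundled
    // dictionary) the first time it tokenizes after deserialization.
    private transient Tokenizer tokenizer;

    private Tokenizer tokenizer() {
        if (tokenizer == null) {
            tokenizer = new Tokenizer();
        }
        return tokenizer;
    }

    public List<Token> tokenize(String text) {
        return tokenizer().tokenize(text);
    }
}
```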
