Support request: tokenization #13

nick-magnini · 2016-01-14T18:39:52Z

It would be great to add org.apache.lucene.analysis for smarter tokenization across all languages. That way, processing other languages such as Chinese would be more sensible with your library.

dav009 · 2016-01-15T19:17:29Z

It would be good to make the pipeline more independent.
Are you working with Japanese/Korean/Chinese languages?

nick-magnini · 2016-01-15T19:21:23Z

Yes, I do. I usually use org.apache.lucene.analysis for various languages.

dav009 · 2016-01-15T19:23:58Z

Great to know. Does the current pipeline actually produce something that is not garbage for those languages? (I've only played with a few of the major European languages.)

nick-magnini · 2016-01-15T22:10:13Z

I've got to check it, but for text that mixes Asian languages and English (e.g., Wikipedia), the smart Chinese tokenizer from Lucene usually works well, and it's pretty fast and scalable.

dav009 · 2016-02-05T17:39:06Z

@nick-magnini any recommendation on which tokenizer to use for this particular task:

Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

ChineseAnalyzer (in the analyzers/cn package): Index unigrams (individual Chinese characters) as a token.
CJKAnalyzer (in this package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.
Example phrase： "我是中国人"
ChineseAnalyzer: 我－是－中－国－人
CJKAnalyzer: 我是－是中－中国－国人
SmartChineseAnalyzer: 我－是－中国－人
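
The difference between the first two strategies can be reproduced in a few lines. This is a minimal sketch in plain Python, not Lucene's implementation: the function names are made up for illustration, and it ignores how the real analyzers treat non-CJK text. Word-level segmentation as done by SmartChineseAnalyzer needs a dictionary or statistical model and is not reproduced here.

```python
def unigram_tokens(text):
    """One token per character, as described for ChineseAnalyzer."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text):
    """Overlapping pairs of adjacent characters, as described for CJKAnalyzer."""
    chars = unigram_tokens(text)
    if len(chars) < 2:
        # Assumption for this sketch: a lone character is kept as its own token.
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


phrase = "我是中国人"
print("-".join(unigram_tokens(phrase)))  # 我-是-中-国-人
print("-".join(bigram_tokens(phrase)))   # 我是-是中-中国-国人
```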

nick-magnini · 2016-02-05T17:53:24Z

Based on my experience, for wiki pages, since they mix English and Chinese, SmartChineseAnalyzer works better. In addition, Jieba is one of the best Chinese segmenters (tokenizers). For converting zh from traditional to simplified and vice versa, OpenCC is recommended.

dav009 · 2016-02-05T17:57:01Z

So then the best options are:

Jieba: https://github.com/huaban/jieba-analysis (tokenizing)
OpenCC: https://github.com/BYVoid/OpenCC (simplifying)

I've not done much with Asian languages, but are you suggesting the pipeline should be:

tokenize + transform from traditional to simplified?

dav009 · 2016-02-05T18:01:53Z

Also, I assume we are working on the Chinese Wikipedia in this case (zh).
There are others:

zh-yue (Cantonese?)
wuu
gan

Does zh have any preference on which kind of Chinese to use (traditional/simplified)?

nick-magnini · 2016-02-05T18:13:15Z

The transformation should be optional (future plan). Simplified <-> traditional conversion should be done before tokenization. At this point I don't think you should worry about that step; I don't bother with it.
The Lucene analyzer is faster. Jieba is slower but more precise.

Now the choice is between Lucene and Jieba. In terms of scalability and efficiency, I'll vote for Lucene since it has CJK support.

dav009 · 2016-02-05T19:02:33Z

Alright, let's start with tokenization. I'm going to run the tool on zh.

dav009 · 2016-02-06T00:25:37Z

@nick-magnini generating the model now.

Here is a sample of the tokenization using SmartChineseAnalyzer. It would be good to know whether it looks alright.

DBPEDIA_ID/倫敦珍寶 倫 敦 珍 寶 DBPEDIA_ID/5月1日 5 月 1 日 葡萄牙语 维 基 百科 达到 40 000 条目 DBPEDIA_ID/4月30日 4 月 30 日 加 利 西亚 语 维 基 百科 达到 5 000 条目 爪哇 语 维 基 百科 达到 500 条目 DBPEDIA_ID/4月29日 4 月 29 日 威 尔 斯 语 维 基 百科 达到 3 000 条目 DBPEDIA_ID/4月28日 4 月 28 日 法语 维 基 语录 达到 2 000 条目 法语 维 基 字典 达到 5 000 条目 DBPEDIA_ID/4月27日 4 月 27 日 芬兰 语 维 基 百科 达到 20 000 条目 保加利亚 语 维 基 字典 达到 20 000 条目 维 基 百科 中文版 达到 26 000 条目 第 26 000 条目 是 user peterpan 创建 的 DBPEDIA_ID/骰寶 骰 寶 DBPEDIA_ID/4月22日 4 月 22 日 车臣 语 维 基 百科 誕 生 DBPEDIA_ID/4月18日 4 月 18 日 维 基 百科 中文版 达到 25 000 条目 第 25 000 条目 是 user hamham 创建 的 DBPEDIA_ID/托马斯·吉尔丁 托 马 斯 吉 尔 丁 DBPEDIA_ID/4月9日 4 月 9 日 维 基 百科 中文版 达到 24 000 条目 第 24 000 条目 是 user sl 创建 的 DBPEDIA_ID/太平山_(香港) 香港 山 頂 DBPEDIA_ID/4月7日 4 月 7 日 DBPEDIA_ID/4月6日 4 月 6 日 维 基 百科 中文版 条目 数 超过 丹麦 语 版 按照 条目 数 排名 位居 所有 语言 的 第 11 名 DBPEDIA_ID/4月4日 4 月 4 日 已经 有 50 本 课本 DBPEDIA_ID/3月31日 3 月 31 日 达到 500 词条 DBPEDIA_ID/3月26日 3 月 26 日 维 基 百科 中文版 达到 23 000 条目 第 23 000 条目 是 wangyunfeng 创建 的 DBPEDIA_ID/刀币 刀币 DBPEDIA_ID/3月25日 3 月 25 日 所有 语言 维 基 语录 达到 10 000 条目 DBPEDIA_ID/3月24日 3 月 24 日 塞尔维亚 语 维 基 百科 达到 10 000 条目 DBPEDIA_ID/3月22日 3 月 22 日 荷兰语 维 基 百科 达到 60 000 条目 波兰 语 维 基 百科 达到 60 000 条目 DBPEDIA_ID/3月21日 3 月 21 日 挪威 语 维 基 百科 达到 20 000 条目 DBPEDIA_ID/3月17日 3 月 17 日 英文版 维 基 百科 达到 500 000 条目 DBPEDIA_ID/3月12日 3 月 12 日 维 基 资源 对 是否 要 分设 语言 子 域名 准备 重新 开始 投票 DBPEDIA_ID/3月10日 3 月 10 日 維 基 百科 现在 排名 alexa 参考 网站 50 强 的 第 4 名 維 基 百科 中文版 達 到 22000 條 目 第 22000 条目 是 创建 的 DBPEDIA_ID/锆 锆 DBPEDIA_ID/3月9日 3 月 9 日 目前 按照 内部 链 接 数 排列 中文版 进入 前 10 名 位于 葡萄牙语 之前 意大利 语 之后 DBPEDIA_ID/3月5日 3 月 5 日 台湾 维 基 人 在 台北 聚会 DBPEDIA_ID/2月20日 2 月 20 日 維 基 百科 中文版 達 到 21000 條 目 DBPEDIA_ID/2月16日 2 月 16 日 維 基 百科 中文版 條 目 數 超 過 世界 語 版 DBPEDIA_ID/2月6日 2 月 6 日 維 基 百科 中文版 達 到 20000 條 目 DBPEDIA_ID/2月4日 2 月 4 日 达到 10000 个 页面 第 10000 个 页面 是 日 文 的 DBPEDIA_ID/1月26日 1 月 26 日 維 基 百科 中文版 達 到 19000 條 目 DBPEDIA_ID/1月10日 1 月 10 日 维 基 百科 中文版 达到 18000 条目 第 18000 条目 是 创建 的 DBPEDIA_ID/纳米医学 纳米 医学
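
For anyone checking the sample programmatically, entity tokens and ordinary word tokens can be told apart by the DBPEDIA_ID/ prefix visible above. A small sketch, assuming tokens are whitespace-separated as in the sample; the helper name `split_entities` is hypothetical and not part of the tool:

```python
ENTITY_PREFIX = "DBPEDIA_ID/"


def split_entities(line):
    """Separate entity tokens (prefix stripped) from plain word tokens."""
    entities, words = [], []
    for tok in line.split():
        if tok.startswith(ENTITY_PREFIX):
            entities.append(tok[len(ENTITY_PREFIX):])
        else:
            words.append(tok)
    return entities, words


# A short excerpt copied from the sample above.
sample = "DBPEDIA_ID/刀币 刀币 DBPEDIA_ID/3月25日 3 月 25 日 所有 语言 维 基 语录 达到 10 000 条目"
entities, words = split_entities(sample)
print(entities)   # ['刀币', '3月25日']
print(words[:5])  # ['刀币', '3', '月', '25', '日']
```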

dav009 · 2016-02-06T00:26:48Z

Training: dimensions:300, min threshold: 10, window: 10
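
For readers unfamiliar with these settings: assuming they map to the usual word2vec parameters (vector size, minimum token count, context window), the last two can be illustrated on a toy corpus. This is a plain-Python sketch of what the parameters control, not the training code used here. The function names and the small values are illustrative, and real implementations typically also shrink the window randomly per position.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}


def context_pairs(sentence, vocab, window):
    """Yield (center, context) pairs within `window` positions of each other."""
    kept = [tok for tok in sentence if tok in vocab]  # rare tokens are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs


sentences = [["维", "基", "百科", "达到", "条目"], ["维", "基", "百科", "中文版"]]
vocab = build_vocab(sentences, min_count=2)
print(sorted(vocab))
print(context_pairs(sentences[0], vocab, window=1))
```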

dav009 · 2016-02-06T09:22:29Z

@nick-magnini the model is trained, and some basic examples with entity similarities give what seem to be good results:

positive=[u'DBPEDIA_ID/贝拉克·奥巴马', u'DBPEDIA_ID/俄罗斯'], negative=[u'DBPEDIA_ID/美国']
俄罗斯 -- 0.577268242836
DBPEDIA_ID/吉尔吉斯斯坦总统 -- 0.559932947159
DBPEDIA_ID/俄罗斯国家杜马 -- 0.534712553024
DBPEDIA_ID/乌克兰总统 -- 0.523086071014
哈萨克斯坦 -- 0.523066163063
DBPEDIA_ID/2008年俄罗斯总统选举 -- 0.519150972366
DBPEDIA_ID/纳扎尔巴耶夫 -- 0.518714308739
DBPEDIA_ID/哈萨克斯坦 -- 0.513309001923
DBPEDIA_ID/蒙古国总统 -- 0.513016939163
DBPEDIA_ID/普京 -- 0.512183487415
吉尔吉斯斯坦 -- 0.509196817875
DBPEDIA_ID/哈萨克斯坦总统 -- 0.507659435272
DBPEDIA_ID/梅德韦杰夫 -- 0.506721496582
DBPEDIA_ID/库奇马 -- 0.502971172333
DBPEDIA_ID/2012年俄罗斯总统选举 -- 0.502592504025
DBPEDIA_ID/俄罗斯总统 -- 0.501340091228
DBPEDIA_ID/尼古拉·萨科齐 -- 0.501158356667
DBPEDIA_ID/俄罗斯联邦总统 -- 0.501093864441
DBPEDIA_ID/巴基斯坦总理 -- 0.500897169113
DBPEDIA_ID/烏克蘭總統 -- 0.499838590622

Since it looks like you are trying to build models with several tools, I will share the corpus + the model.
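
The query above follows the usual word2vec analogy pattern: add the positive vectors, subtract the negative ones, and rank the rest of the vocabulary by cosine similarity to the result. A toy sketch with made-up 2-dimensional vectors and hypothetical keys; it is not the toolkit's actual code, though like most toolkits it normalizes the vectors and excludes the query terms from the results:

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=3):
    """Rank keys by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for key in positive:
        query = [q + x for q, x in zip(query, unit(vectors[key]))]
    for key in negative:
        query = [q - x for q, x in zip(query, unit(vectors[key]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [
        (key, sum(q * x for q, x in zip(query, unit(vec))))
        for key, vec in vectors.items()
        if key not in exclude
    ]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]


# Made-up vectors: axis 0 ~ "is a leader", axis 1 ~ "country A vs. country B".
toy = {
    "leader_A": [1.0, 1.0],
    "country_A": [0.0, 1.0],
    "country_B": [0.0, -1.0],
    "leader_B": [1.0, -1.0],
    "unrelated": [-1.0, 0.2],
}
result = most_similar(toy, positive=["leader_A", "country_B"], negative=["country_A"])
print(result)  # "leader_B" ranks first
```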

dav009 · 2016-02-06T12:00:24Z

@nick-magnini

Cleaned Chinese (zh) wiki2vec corpus : https://github.com/idio/wiki2vec/blob/feature/DP-zh-tokenizer-support/torrents/zh_chinese_wiki2vec_cleaned_corpus.torrent
Chinese (zh) Wiki2vec model : https://github.com/idio/wiki2vec/blob/feature/DP-zh-tokenizer-support/torrents/zh_chinese_wiki2vec_model.torrent

tgalery · 2016-02-08T09:51:48Z

Sorry for jumping into this discussion so late, but it might be a good call to implement something more generic, no? The good thing about using Lucene analyzers is that you could just use the analyzer for the corresponding locale and the job would be done. This would work for Chinese, but also for Czech and other languages. ICU would be another possibility: its Chinese tokenizer seems to produce results as good as smartcn, and again it would be more or less universal. Another possibility would be to specify a tokenizer/analyzer interface (if we think that simplification, lemmatization, stemming or morphological analysis would also be desirable operations), so the community can write the respective classes they want.
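
To make the last option concrete, here is one possible shape for such an interface, sketched in Python for brevity. Everything in it is hypothetical (the registry, the function names, the fallback behaviour); the per-character zh tokenizer is only a stand-in for a real Lucene, ICU or Jieba wrapper.

```python
from typing import Callable, Dict, List

# A tokenizer is any function from raw text to a list of tokens.
Tokenizer = Callable[[str], List[str]]

_REGISTRY: Dict[str, Tokenizer] = {}


def register(locale: str, tokenizer: Tokenizer) -> None:
    """Let contributors plug in a tokenizer for a given locale."""
    _REGISTRY[locale] = tokenizer


def whitespace_tokenizer(text: str) -> List[str]:
    """Fallback for locales without a registered tokenizer."""
    return text.lower().split()


def character_tokenizer(text: str) -> List[str]:
    """Placeholder for zh: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def tokenize(text: str, locale: str) -> List[str]:
    """Dispatch to the locale's tokenizer, or fall back to whitespace splitting."""
    return _REGISTRY.get(locale, whitespace_tokenizer)(text)


register("zh", character_tokenizer)
print(tokenize("我是 中国人", "zh"))
print(tokenize("Hello World", "en"))
```

Optional steps such as traditional/simplified conversion could be registered the same way and applied before `tokenize`.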

dav009 · 2016-02-23T15:31:58Z

@nick-magnini any chance you can evaluate the generated model before I jump into a refactor?

dav009 · 2016-05-18T12:56:51Z

@nick-magnini any news on reviewing the given branch? Otherwise I will close this issue.

nick-magnini · 2016-05-18T18:43:01Z

Thanks. Let me discover and explore. Thanks again.

dav009 · 2016-05-26T16:55:24Z

If you are a Chinese speaker, it would be great if you could generate a dataset similar to: https://github.com/arfon/word2vec/blob/master/questions-words.txt
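
For reference, that file groups analogy questions under lines that start with ":" followed by a category name, with four whitespace-separated words per question (a is to b as c is to d). A small sketch of a parser for that layout; the function name is hypothetical and the two Chinese rows are illustrative, not taken from any existing dataset:

```python
def parse_analogies(lines):
    """Parse ': category' headers and 4-word question lines into a dict."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            parts = line.split()
            if current is None or len(parts) != 4:
                raise ValueError("unexpected line: %r" % raw)
            sections[current].append(tuple(parts))
    return sections


example = """\
: capital-common-countries
北京 中国 东京 日本
巴黎 法国 柏林 德国
""".splitlines()

print(parse_analogies(example))
```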

keynmol added the backlog label Jan 27, 2016

jsgriffin added monster and removed monster labels Apr 8, 2016

Lugrin added icebox and removed backlog labels Apr 10, 2017

mal removed the fandango label Jan 10, 2018
