Showing 1 changed file with 127 additions and 0 deletions.
@@ -0,0 +1,127 @@
# gossip_gensim
## Word segmentation analysis of PTT Gossiping board users

Data source: all posts and comments on the PTT Gossiping board from late April 2017

Reference: [Training Chinese word vectors with gensim](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)

## Directory structure
+ res:
. dict.txt.big: see the reference for details
. gossip.corpus.json: previously crawled posts and comments from the PTT Gossiping board, late April 2017
. gossip.corpus.s2tw.json: the corpus converted from Simplified to Traditional Chinese with opencc; see the reference for details
. stopwords.txt: filters out some meaningless words and symbols; see the reference for details

+ models:
. med250.model.bin.*: produced by gossip_word2vec.py, which feeds the segmented text to gensim

+ output:
. jieba_extract.txt: produced by gossip_jieba.py, which segments the text in ./res/gossip.corpus.s2tw.json

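The two scripts listed above are not shown in this README, so the following is only a minimal sketch of what the segmentation and training steps might look like, modeled on the referenced article rather than on the actual `gossip_jieba.py` and `gossip_word2vec.py`. The function names, the one-text-per-item input, and the 250-dimension setting (guessed from the `med250` file name) are all assumptions. `jieba` and `gensim` are imported lazily so the stopword helpers can be used without them.

```python
def parse_stopwords(lines):
    """Turn an iterable of lines into a set of stopwords, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def load_stopwords(path="res/stopwords.txt"):
    """Read one stopword per line from a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return parse_stopwords(f)


def filter_tokens(tokens, stopwords):
    """Drop stopwords and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


def segment_texts(texts, stopwords, dict_path="res/dict.txt.big"):
    """Yield one space-joined line of tokens per input text (hypothetical helper)."""
    import jieba  # third-party, installed in the pre-install step

    jieba.set_dictionary(dict_path)  # switch to the larger dictionary file
    for text in texts:
        tokens = filter_tokens(jieba.cut(text, cut_all=False), stopwords)
        if tokens:
            yield " ".join(tokens)


def train_word2vec(segmented_path, model_path, dim=250):
    """Train word2vec on a file with one segmented sentence per line."""
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence(segmented_path)
    try:
        model = Word2Vec(sentences, vector_size=dim)  # gensim 4.x keyword
    except TypeError:
        model = Word2Vec(sentences, size=dim)  # keyword used by gensim < 4
    model.save(model_path)
    return model
```

How the texts are pulled out of `gossip.corpus.s2tw.json` depends on the JSON layout, which this README does not describe, so that step is left to the caller.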
## pre-install
1. sudo apt-get install opencc
2. python3
3. pip install jieba gensim

## Using the model directly
python demo.py

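For readers who would rather query the model from their own code than through `demo.py`, loading it with gensim might look roughly like the sketch below. The default path `models/med250.model.bin` and the function name `load_word_vectors` are assumptions made for illustration, and a model saved with an older gensim release may not load cleanly in a newer one.

```python
import os


def load_word_vectors(model_path="models/med250.model.bin"):
    """Load a saved gensim Word2Vec model and return its word vectors."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"model file not found: {model_path}")

    from gensim.models import Word2Vec  # third-party, imported lazily

    model = Word2Vec.load(model_path)
    # model.wv is the KeyedVectors object that offers most_similar() and similarity()
    return model.wv
```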
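## demo
> Three test modes are provided:
> Enter one word to find the 100 words most similar to it
> Enter two words to compute the cosine similarity between them
> Enter three words to perform analogical reasoning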
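
The three modes above can be sketched as a small dispatcher. This is not the code of `demo.py`; `run_query` and `TinyVectors` are hypothetical names, and the argument order used for the analogy ("a is to b as c is to ?") is an assumption that may differ from the real script. `TinyVectors` is a pure-Python stand-in with toy vectors so the sketch runs without a trained model. It mimics the two gensim `KeyedVectors` methods the modes rely on, `most_similar(positive, negative, topn)` and `similarity(w1, w2)`, both of which score by cosine similarity.

```python
import math


class TinyVectors:
    """Toy stand-in for gensim KeyedVectors, backed by a dict of word -> vector."""

    def __init__(self, vectors):
        self.vectors = vectors

    @staticmethod
    def _norm(v):
        return math.sqrt(sum(x * x for x in v))

    def _cosine(self, u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (self._norm(u) * self._norm(v))

    def similarity(self, w1, w2):
        """Cosine similarity between two words."""
        return self._cosine(self.vectors[w1], self.vectors[w2])

    def most_similar(self, positive, negative=(), topn=10):
        """Rank words by cosine similarity to sum(positive) - sum(negative)."""
        dim = len(next(iter(self.vectors.values())))
        target = [0.0] * dim
        for sign, words in ((1.0, positive), (-1.0, negative)):
            for w in words:
                vec = self.vectors[w]
                n = self._norm(vec)  # unit-normalize before combining
                target = [t + sign * x / n for t, x in zip(target, vec)]
        skip = set(positive) | set(negative)  # never return the query words
        scored = [(w, self._cosine(target, v))
                  for w, v in self.vectors.items() if w not in skip]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


def run_query(wv, line, topn=10):
    """Dispatch on the number of whitespace-separated words in the input."""
    words = line.split()
    if len(words) == 1:
        return wv.most_similar(positive=[words[0]], topn=topn)
    if len(words) == 2:
        return wv.similarity(words[0], words[1])
    if len(words) == 3:
        a, b, c = words  # assumed order: a is to b as c is to ?
        return wv.most_similar(positive=[b, c], negative=[a], topn=topn)
    raise ValueError("expected one, two, or three words")


# Toy vectors over (royalty, male, female); purely illustrative, not model output.
TOY = TinyVectors({
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 1.0, 0.1],
})
```

With the toy data, `run_query(TOY, "man king woman")` ranks `queen` first. With the trained model, the object passed as `wv` would be the model's `KeyedVectors` instead of `TinyVectors`.
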
> 鄉民
> Top 10 similar words
> 魔人,0.733771800994873
> 版上,0.7332243919372559
> 酸民,0.7311209440231323
> 粉絲,0.6883684396743774
> 公審,0.6815463304519653
> 小超人,0.6783909797668457
> 腦補,0.6774879097938538
> 跟風,0.6676984429359436
> 腦粉,0.6648016571998596
> 團,0.6647731065750122
> ----------------------------
> 5f
> Top 10 similar words
> 肛爆,0.9444855451583862
> 十樓,0.9420956969261169
> 四叉,0.9415199756622314
> 前列腺,0.9385455846786499
> 菊花,0.934794008731842
> 榨甘蔗,0.930679202079773
> 屁屁,0.9291144609451294
> 彈出來,0.9286264181137085
> hank,0.9283958673477173
> 自肛,0.9281743764877319
> ----------------------------
> 四叉貓
> Top 10 similar words
> rr,0.9452340602874756
> hank,0.9432415962219238
> 甘蔗,0.9406627416610718
> 四叉,0.939492404460907
> 偷看,0.9392256736755371
> 超臭,0.9383402466773987
> 彈出來,0.9372508525848389
> 鼻孔,0.9370623230934143
> 鏡子,0.9354630708694458
> 屁屁,0.9345365762710571
> ----------------------------
> 房思琪
> Top 10 similar words
> 樂園,0.9398347735404968
> 初戀,0.9357335567474365
> 筆下,0.8502717614173889
> 林奕含,0.8244627118110657
> 出書,0.7992949485778809
> 作家,0.7864919304847717
> 思琪,0.7848621010780334
> 遺書,0.7836134433746338
> 改編,0.7790213823318481
> 證實,0.7755028009414673
> ----------------------------
> 右肩
> Top 10 similar words
> 誘姦,0.8586050271987915
> 劈,0.7917353510856628
> 外遇,0.7862498164176941
> 強姦,0.7712001204490662
> 仙人跳,0.7703946232795715
> 已婚,0.7672145962715149
> 吉性,0.7592676877975464
> 腿,0.7587223649024963
> 幼女,0.7491370439529419
> 上牀,0.7470999956130981
> ----------------------------
> 補習
> Top 10 similar words
> 學校,0.9058445692062378
> 課,0.8625339865684509
> 家教,0.8609471321105957
> 唸書,0.8561782240867615
> 上課,0.8540636897087097
> 教學,0.8515065312385559
> 讀,0.8471949100494385
> 醫學系,0.8436121344566345
> 國中,0.8434918522834778
> 教書,0.8280016183853149
> ----------------------------
> 肛
> Top 10 similar words
> 甲甲互,0.8861739635467529
> 屁眼,0.877574622631073
> 六樓,0.8503804802894592
> 肛爆,0.8481118679046631
> 肛門,0.8386288285255432
> 肛死,0.8370318412780762
> 獻出,0.8340945839881897
> 肛到,0.8339065909385681
P.S. Late April 2017 coincided with the Lin Yi-han (林奕含) incident, so it was heavily discussed on the Gossiping board.