Create README.md
sporting committed Jul 25, 2017
1 parent d54382f commit 956c08b
Showing 1 changed file with 127 additions and 0 deletions.
127 changes: 127 additions & 0 deletions README.md
@@ -0,0 +1,127 @@
# gossip_gensim
## Word segmentation analysis of PTT Gossiping board users (鄉民)

Data source: all posts and comments from the PTT Gossiping board in late April 2017

Reference: [Training Chinese word vectors with gensim](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)


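## Directory structure:
+ res:
. dict.txt.big: see the explanation in the reference
. gossip.corpus.json: previously crawled posts and comments from the PTT Gossiping board in late April 2017
. gossip.corpus.s2tw.json: converted from Simplified to Traditional Chinese with opencc; see the explanation in the reference
. stopwords.txt: filters out some meaningless words and symbols; see the explanation in the reference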

+ models:
. med250.model.bin.*: produced by gossip_word2vec.py, which converts the segmented text with gensim

+ output:
. jieba_extract.txt: produced by gossip_jieba.py, which segments the text in ./res/gossip.corpus.s2tw.json into words
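
The two scripts listed above are not shown in this README, so the following is only a rough sketch of how such a pipeline could look, not the repository's actual code. It assumes the segmented output stores one post per line with tokens separated by spaces. The function names are hypothetical, and the 250-dimension default is a guess based on the `med250` file name. `vector_size` is the gensim 4.x keyword; older releases call it `size`.

```python
def load_stopwords(path):
    """Read one stopword per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def drop_stopwords(tokens, stopwords):
    """Keep tokens that are non-blank and not in the stopword set."""
    return [t for t in tokens if t.strip() and t not in stopwords]


def segment(text, stopwords):
    """Segment one string with jieba (hypothetical helper; needs jieba)."""
    import jieba  # lazy import so the pure helpers run without jieba

    return drop_stopwords(jieba.cut(text, cut_all=False), stopwords)


def read_segmented(path):
    """Yield one list of tokens per non-empty line of a space-separated file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


def train(segmented_path, model_path, dim=250):
    """Train and save a word2vec model (hypothetical helper; needs gensim)."""
    from gensim.models import Word2Vec  # lazy import, see above

    sentences = list(read_segmented(segmented_path))
    model = Word2Vec(sentences, vector_size=dim)  # gensim >= 4; size=dim on 3.x
    model.save(model_path)
    return model
```

To use the larger dictionary, one would call `jieba.set_dictionary("res/dict.txt.big")` once before segmenting. A training run might then look like `train("output/jieba_extract.txt", "models/med250.model.bin")`, where both paths are assumptions based on the listing above.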


## pre-install
1. sudo apt-get install opencc
2. python3
3. pip install jieba gensim


## Using the model directly
python demo.py
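
demo.py itself is not included in this README. As a hedged sketch, the three query modes described in the demo section could be dispatched as follows. The `answer` function is hypothetical, and the order of the three words in the analogy case is an assumption. `kv` stands for any object with gensim `KeyedVectors`-style `most_similar` and `similarity` methods.

```python
def answer(query, kv, topn=10):
    """Dispatch on the number of whitespace-separated words in the query."""
    words = query.split()
    if len(words) == 1:
        # One word: nearest neighbours ranked by cosine similarity.
        return kv.most_similar(words[0], topn=topn)
    if len(words) == 2:
        # Two words: cosine similarity between them.
        return kv.similarity(words[0], words[1])
    if len(words) == 3:
        # Three words a b c: "a is to b as c is to ?" (assumed ordering).
        a, b, c = words
        return kv.most_similar(positive=[b, c], negative=[a], topn=topn)
    raise ValueError("expected one, two, or three words")
```

With gensim installed, `kv` could be obtained with `Word2Vec.load("models/med250.model.bin").wv`, assuming that is the path of the saved model.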


## demo
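> Three test modes are provided
> Enter one word to find the top 100 words most similar to it
> Enter two words to compute the cosine similarity between them
> Enter three words to perform analogical reasoning
> 鄉民
> Top 10 similar words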
> 魔人,0.733771800994873
> 版上,0.7332243919372559
> 酸民,0.7311209440231323
> 粉絲,0.6883684396743774
> 公審,0.6815463304519653
> 小超人,0.6783909797668457
> 腦補,0.6774879097938538
> 跟風,0.6676984429359436
> 腦粉,0.6648016571998596
> 團,0.6647731065750122
> ----------------------------
> 5f
> Top 10 similar words
> 肛爆,0.9444855451583862
> 十樓,0.9420956969261169
> 四叉,0.9415199756622314
> 前列腺,0.9385455846786499
> 菊花,0.934794008731842
> 榨甘蔗,0.930679202079773
> 屁屁,0.9291144609451294
> 彈出來,0.9286264181137085
> hank,0.9283958673477173
> 自肛,0.9281743764877319
> ----------------------------
> 四叉貓
> Top 10 similar words
> rr,0.9452340602874756
> hank,0.9432415962219238
> 甘蔗,0.9406627416610718
> 四叉,0.939492404460907
> 偷看,0.9392256736755371
> 超臭,0.9383402466773987
> 彈出來,0.9372508525848389
> 鼻孔,0.9370623230934143
> 鏡子,0.9354630708694458
> 屁屁,0.9345365762710571
> ----------------------------
> 房思琪
> Top 10 similar words
> 樂園,0.9398347735404968
> 初戀,0.9357335567474365
> 筆下,0.8502717614173889
> 林奕含,0.8244627118110657
> 出書,0.7992949485778809
> 作家,0.7864919304847717
> 思琪,0.7848621010780334
> 遺書,0.7836134433746338
> 改編,0.7790213823318481
> 證實,0.7755028009414673
> ----------------------------
> 右肩
> Top 10 similar words
> 誘姦,0.8586050271987915
> 劈,0.7917353510856628
> 外遇,0.7862498164176941
> 強姦,0.7712001204490662
> 仙人跳,0.7703946232795715
> 已婚,0.7672145962715149
> 吉性,0.7592676877975464
> 腿,0.7587223649024963
> 幼女,0.7491370439529419
> 上牀,0.7470999956130981
> ----------------------------
> 補習
> Top 10 similar words
> 學校,0.9058445692062378
> 課,0.8625339865684509
> 家教,0.8609471321105957
> 唸書,0.8561782240867615
> 上課,0.8540636897087097
> 教學,0.8515065312385559
> 讀,0.8471949100494385
> 醫學系,0.8436121344566345
> 國中,0.8434918522834778
> 教書,0.8280016183853149
> ----------------------------
>
> Top 10 similar words
> 甲甲互,0.8861739635467529
> 屁眼,0.877574622631073
> 六樓,0.8503804802894592
> 肛爆,0.8481118679046631
> 肛門,0.8386288285255432
> 肛死,0.8370318412780762
> 獻出,0.8340945839881897
> 肛到,0.8339065909385681
P.S. Late April 2017 coincided with the Lin Yi-han (林奕含) incident, so it was discussed heavily on the Gossiping board.
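
The demo description says that the two-word mode computes cosine similarity. For reference, cosine similarity can be computed as in the following minimal sketch. It uses plain Python lists in place of real word vectors, and `cosine_similarity` is an illustrative name, not a function from this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A value close to 1 means the two vectors point in nearly the same direction.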
