The corpus file used for evaluation #10

Open · valdersoul opened this issue Jan 28, 2018 · 13 comments

Comments
@valdersoul

Hi,

When I use the scripts to evaluate the performance, I find that the code needs a corpus file. I downloaded the files from the 20news homepage, but the results do not match the provided result file. Could you share the corpus file?

Best

@dingran

dingran commented Feb 19, 2018

Same issue here

I'm using Jey Han Lau's wiki corpus and his script here: https://github.com/jhlau/topic_interpretability/blob/master/run-oc.sh

Note that he used a 20-word sliding window in his script.

I did a run on the 2nd group of topics here and got results that are substantially different from yours @akashgit

https://github.com/akashgit/autoencoding_vi_for_topic_models/blob/master/coherence_from_paper_script/AVTIM_50

[0.07] ( 0.07; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
[0.26] ( 0.26; ) nhl hockey wings rangers montreal calgary leafs winnipeg angeles detroit
[0.18] ( 0.18; ) bike amp helmet kit honda turbo gear brake rear engine
[0.14] ( 0.14; ) privacy conduct security enforcement electronic policy encryption states americans agency
[0.19] ( 0.19; ) gun crime criminal assault weapon handgun batf violent abuse firearm
[0.07] ( 0.07; ) myers president stephanopoulos decision meeting congress february package secretary community
[0.17] ( 0.17; ) hitter pitcher baseball defensive braves player bike pitch fan career
[0.23] ( 0.23; ) season nhl hockey league player coach puck pitcher team playoff
[0.14] ( 0.14; ) bike honda turbo helmet gear dealer saturn rear engine amp
[0.10] ( 0.10; ) homicide weapon firearm handgun vancouver knife minority shall gun militia
[0.08] ( 0.08; ) cancer firearm committee volume amendment handgun health states united patient
[0.11] ( 0.11; ) windows gateway swap port window printer modem setup hd mouse
[0.23] ( 0.23; ) muslims islam jews genocide israel jewish muslim islamic jew turks
[0.21] ( 0.21; ) armenian armenia armenians village muslim turks israel lebanon azerbaijan turkish
[0.02] ( 0.02; ) entry output file winner variable oname buf abuse char io
[0.23] ( 0.23; ) god jesus christian revelation moral doctrine resurrection principle faith christianity
[0.15] ( 0.15; ) motherboard meg hd mb mw slot simm ram adapter port
[0.09] ( 0.09; ) azerbaijan armenian armenians apartment neighbor troops father soviet town hide
[0.15] ( 0.15; ) xterm menu client font swap workstation resource server unix directory
[0.06] ( 0.06; ) db mov bh cs dos byte ax connector hd adapter
[0.02] ( 0.02; ) db det mov tor bh que wm mw pit byte
[0.04] ( 0.04; ) entry buf output variable io oname winner char stream printf
[0.11] ( 0.11; ) anonymous probe privacy mission lunar electronic cipher satellite solar launch
[0.12] ( 0.12; ) motherboard mhz shipping mw printer simm slot adapter meg quadra
[0.15] ( 0.15; ) encryption enforcement serial chip encrypt wiretap escrow clipper agency device
[0.17] ( 0.17; ) jesus marriage prophet christ islam god scripture marry verse prophecy
[0.19] ( 0.19; ) bike manual helmet brake honda turbo gear engine rear ford
[0.23] ( 0.23; ) belief god existence faith christ christian doctrine truth teaching christianity
[0.19] ( 0.19; ) vga dos printer isa adapter pc windows ram scsus scsi
[0.22] ( 0.22; ) scsus scsi ide controller bio mb interface floppy ram rom
[0.22] ( 0.22; ) braves playoff season nhl puck hockey leafs wings cup coach
[0.06] ( 0.06; ) administration president stephanopoulos secret island congress russia government libertarian escrow
[0.11] ( 0.11; ) font xt toolkit colormap vendor export server xterm widget directory
[0.17] ( 0.17; ) turks israel israeli armenia turkish village arab armenian genocide greek
[0.10] ( 0.10; ) workstation wiring anonymous null server xterm directory platform ftp toolkit
[0.08] ( 0.08; ) tor det nhl hockey pit que winnipeg league leafs detroit
[0.23] ( 0.23; ) doctrine jesus god christ revelation scripture heaven atheist christian religious
[0.18] ( 0.18; ) season defensive playoff puck flyers coach league braves team penalty
[0.22] ( 0.22; ) pitch hitter player pitcher score baseball season defensive team braves
[0.04] ( 0.04; ) pt rg eus pd pp calgary philadelphia winnipeg detroit bhj
[0.15] ( 0.15; ) turkish armenian turks village genocide jews muslim murder israel greek
[0.13] ( 0.13; ) gun criminal knife crime gang cop batf violent insurance weapon
[0.13] ( 0.13; ) privacy enforcement encryption security americans device secure escrow rsa conversation
[0.22] ( 0.22; ) windows font binary os pc rom vga cache server hardware
[0.16] ( 0.16; ) connector ide scsi scsus quadra motherboard hd mb meg isa
[0.19] ( 0.19; ) motherboard quadra slot mhz hd meg adapter processor simm ide
[0.22] ( 0.22; ) ide controller scsi scsus floppy isa mb meg motherboard connector
[0.20] ( 0.20; ) jesus faith belief verse god scripture passage satan eternal interpretation
[0.22] ( 0.22; ) ram cache swap mb pc mac windows vga os scsi
[0.08] ( 0.08; ) anonymous wiring privacy wire outlet protocol unix nec ripem ftp

==========================================================================
Average Topic Coherence = 0.148
Median Topic Coherence = 0.151
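For reference, the 20-word sliding window mentioned above amounts to counting, for each word pair, how many windows contain both words. A minimal sketch (the function name and exact boundary handling are my own assumptions; the actual ComputeWordCount.py script may differ):

```python
def window_cooccurrences(tokens, w1, w2, window=20):
    """Count sliding windows (stride 1) over one document that
    contain both w1 and w2; each such window counts once."""
    count = 0
    # If the document is shorter than the window, use one window over it all.
    for start in range(max(1, len(tokens) - window + 1)):
        win = set(tokens[start:start + window])
        if w1 in win and w2 in win:
            count += 1
    return count
```

With the whole-document setting discussed later in this thread, this collapses to a single window per document, which is why the two settings give different coherence numbers.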

@dingran

dingran commented Feb 19, 2018

BTW, I don't really question the validity of the results from the paper; I really enjoyed reading it. I just want to make sure we are all using the same evaluation method ;)

@AEGISEDGE

If you run the topic model on 20News, the reference corpus for computing topic coherence should accordingly be 20News. You'll get confusing results if you use a different corpus to compute topic coherence.

@YongfeiYan

Hi,
I used the 2nd group of topics extracted from coherence_from_paper_script/AVTIM_50 and then calculated the topic coherence score using Jey Han Lau's wiki corpus. What I got looks like:

[0.01] ( 0.01; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
[0.00] ( 0.00; ) nhl hockey wings rangers montreal calgary leafs winnipeg angeles detroit
[0.00] ( 0.00; ) bike amp helmet kit honda turbo gear brake rear engine
[0.02] ( 0.02; ) privacy conduct security enforcement electronic policy encryption states americans agency
[0.01] ( 0.01; ) gun crime criminal assault weapon handgun batf violent abuse firearm
[0.00] ( 0.00; ) myers president stephanopoulos decision meeting congress february package secretary community
[0.02] ( 0.02; ) hitter pitcher baseball defensive braves player bike pitch fan career
[0.06] ( 0.06; ) season nhl hockey league player coach puck pitcher team playoff
[0.00] ( 0.00; ) bike honda turbo helmet gear dealer saturn rear engine amp
[0.00] ( 0.00; ) homicide weapon firearm handgun vancouver knife minority shall gun militia
...
[0.02] ( 0.02; ) windows font binary os pc rom vga cache server hardware
[0.00] ( 0.00; ) connector ide scsi scsus quadra motherboard hd mb meg isa
[0.00] ( 0.00; ) motherboard quadra slot mhz hd meg adapter processor simm ide
[0.00] ( 0.00; ) ide controller scsi scsus floppy isa mb meg motherboard connector
[0.00] ( 0.00; ) jesus faith belief verse god scripture passage satan eternal interpretation
[0.00] ( 0.00; ) ram cache swap mb pc mac windows vga os scsi
[0.00] ( 0.00; ) anonymous wiring privacy wire outlet protocol unix nec ripem ftp

==========================================================================
Average Topic Coherence = 0.008
Median Topic Coherence = 0.000

@dingran How did you run the topic coherence script? And did you obtain the same result as reported in the paper when using 20news as the reference corpus?

What I got when using 20news as the reference corpus:

[0.03] ( 0.03; ) apartment woman neighbor jesus armenians tear daughter soldier hide afraid
[0.00] ( 0.00; ) nhl hockey wings rangers montreal calgary leafs winnipeg angeles detroit
[0.08] ( 0.08; ) bike amp helmet kit honda turbo gear brake rear engine
[0.08] ( 0.08; ) privacy conduct security enforcement electronic policy encryption states americans agency
[0.11] ( 0.11; ) gun crime criminal assault weapon handgun batf violent abuse firearm
[0.02] ( 0.02; ) myers president stephanopoulos decision meeting congress february package secretary community
...
[0.04] ( 0.04; ) connector ide scsi scsus quadra motherboard hd mb meg isa
[0.05] ( 0.05; ) motherboard quadra slot mhz hd meg adapter processor simm ide
[0.11] ( 0.11; ) ide controller scsi scsus floppy isa mb meg motherboard connector
[0.06] ( 0.06; ) jesus faith belief verse god scripture passage satan eternal interpretation
[0.07] ( 0.07; ) ram cache swap mb pc mac windows vga os scsi
[0.04] ( 0.04; ) anonymous wiring privacy wire outlet protocol unix nec ripem ftp

==========================================================================
Average Topic Coherence = 0.062
Median Topic Coherence = 0.064

@akashgit
Owner

the corpus files are here: autoencoding_vi_for_topic_models/data/20news_clean/

@akashgit
Owner

akashgit commented Sep 10, 2018

UPDATE: We recently found that the TC numbers in the paper are slightly under-reported due to the way the TC script works. Please make sure that you set the window size to -1 (whole document) if you are using the same script as me.

  • Since there is a bit of confusion regarding the reference corpus: if you don't use the same reference corpus as mentioned in the paper, the scores will not be the same. (Thanks for mentioning this @AEGISEDGE.)
  • I will update the scores (using -1) here soon.

@un-lock-me

If you run topic model on 20News, the reference corpus to compute topic coherence is 20News accordingly. You'll get confused result if you use different corpus to get topic coherence.

I am also curious which reference corpus they used.

Based on your comment, you mean that for evaluating the 20 Newsgroups dataset they used 20 Newsgroups itself as the external corpus?

I thought it isn't valid to do that. Am I missing something here?

@un-lock-me

(Quoting @YongfeiYan's results above.)

Hi,

Could you share the 20 Newsgroups corpus that you used as the external source?

Thanks~

@un-lock-me

(Quoting @akashgit's update above.)

Could you share the link to the script that you used for coherence?
I used this one: https://github.com/jhlau/topic_interpretability/blob/master/run-oc.sh and the result is not the same. Could you tell me what I am missing here?
Also, what do you mean by setting document = -1? The whole corpus in the shared link is in one file.

@YongfeiYan

YongfeiYan commented Sep 18, 2019

(Quoting @akashgit's update and @un-lock-me's questions above.)

document = -1 means that if two words appear in the same document, they count as one co-occurrence; these counts are used in the NPMI computation.
Modify window_size to 0 instead: https://github.com/jhlau/topic_interpretability/blob/b7a7cdc556840cc959252085bc80d1e63031473b/ComputeWordCount.py#L26
I reconstructed the tokens of 20NG from the vocabulary and .npy files. Since the whole document is used for co-occurrence, word order does not matter.
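To make the whole-document setting concrete, here is a minimal sketch of document-level NPMI, the quantity the script averages over word pairs. The function name and the -1 convention for pairs that never co-occur are my own assumptions, not the script's exact behavior:

```python
from math import log

def npmi(topic_words, docs):
    """Average pairwise NPMI over a topic's top words, where one
    co-occurrence is one document containing both words."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for i, wi in enumerate(topic_words):
        for wj in topic_words[i + 1:]:
            pij = p(wi, wj)
            if pij == 0:
                scores.append(-1.0)  # assumed convention for zero co-occurrence
                continue
            pmi = log(pij / (p(wi) * p(wj)))
            scores.append(pmi / -log(pij))  # normalize PMI into [-1, 1]
    return sum(scores) / len(scores)
```

Because only document membership matters here, reconstructing the 20NG tokens without word order (as described above) is enough.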

@un-lock-me

(Quoting @YongfeiYan's answer above.)

Thanks, @YongfeiYan, for the quick reply.

You mean I have to set window_size = 0?
Also, I'm not sure how I can get the tokens from the .npy file and the vocab. Would you mind sharing the code with me?

Also, I'm using the same code as you. That source code already ships with the wiki and news reference corpora. I'm not sure whether that news corpus is 20 Newsgroups or some other news collection.

Thanks~

@YongfeiYan

YongfeiYan commented Sep 19, 2019

The .npy files at https://github.com/akashgit/autoencoding_vi_for_topic_models/tree/master/data/20news_clean are in BOW format of shape D x V, where D is the total number of documents and V is the vocab size.
For each row:

from itertools import chain

# yields the document corresponding to this row of the npy file
doc = list(chain(*[[vocab[i]] * v for i, v in enumerate(row)]))

I uploaded the code I wrote at https://github.com/YongfeiYan/Neural-Document-Modeling , with 20NG in the data dir and modified topic-evaluation scripts.
When using window_size=0 with 20NG, the topic coherence should be sound.

@un-lock-me

from itertools import chain
list(chain(*[[vocab[i]]*v for i, v in enumerate(row)]))

Sorry, actually I have not worked with pkl files and am having some difficulty getting it to run.

Should it be like this:

c = np.load('test.txt.npy', encoding='latin1')
print(c)
with open('vocab.pkl', 'rb') as file:
    vocab = pickle.load(file)
    for row in c:
        print(list(chain(*[[vocab[i]] * v for i, v in enumerate(row)])))

Thank you again for taking the time.
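Putting the snippets from this thread together, a self-contained sketch of the BOW-to-tokens reconstruction (bow_to_docs is my own name; whether vocab.pkl maps index to word or word to index may differ in the repo, so invert the dict first if needed):

```python
from itertools import chain

def bow_to_docs(bow, vocab):
    """Expand a D x V bag-of-words matrix (any iterable of count rows,
    e.g. the numpy array loaded from the .npy file) into D token lists.
    vocab maps column index -> word; word order within a document is
    lost, which is fine for whole-document co-occurrence counting."""
    return [list(chain(*[[vocab[i]] * int(v) for i, v in enumerate(row)]))
            for row in bow]

# Usage with the repo's files (paths as discussed in this thread):
# import pickle
# import numpy as np
# bow = np.load('test.txt.npy', encoding='latin1')
# with open('vocab.pkl', 'rb') as f:
#     vocab = pickle.load(f)
# docs = bow_to_docs(bow, vocab)
```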
