A Brief Introduction of the ROUGE Summary Evaluation Package
by Chin-Yew LIN
University of Southern California/Information Sciences Institute
05/26/2005
<<WHAT'S NEW>>
(1) Correct the resampling routine, which ignored the last evaluation
item in the evaluation list; as a result, the average scores reported
by ROUGE were based on only the first N-1 evaluation items.
Thanks to Barry Schiffman at Columbia University for reporting this bug.
This bug only affects ROUGE-1.5.X. For pre-1.5 versions of ROUGE, it
only affects the confidence interval (CI) estimation, i.e. the CI was
estimated from only the first N-1 evaluation items, but it *does not*
affect the average scores.
(2) Correct stemming on multi-token BE heads and modifiers.
Previously, only single-token heads and modifiers were assumed.
(3) Change the read_text and read_text_LCS functions to read the
exact number of words or bytes required by users. Previous versions
carried out whitespace compression and other string cleanup actions
before enforcing the length limit.
(4) Add the capability to score summaries in Basic Element (BE)
format using the option "-3", standing for BE triple. There are 6
different modes in BE scoring. We suggest using *"-3 HMR"* on BEs
extracted from Minipar parse trees, based on our correlation analysis
of BE-based scoring vs. human judgments on DUC 2002 & 2003 automatic
summaries.
(5) ROUGE now generates three scores (recall, precision, and
F-measure) for each evaluation. Previously, only one score (recall)
was generated. Precision and F-measure scores are useful when the
target summary length is not enforced; only recall scores were
necessary when the DUC guidelines dictated a limit on summary length.
For comparison with previous DUC results, please use the recall
scores. The default alpha weighting for computing the F-measure is
0.5. Users can specify a particular alpha weighting that fits their
application scenario using the option "-p alpha-weight", where
*alpha-weight* is a number between 0 and 1 inclusive.
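With recall R and precision P, the F-measure is the alpha-weighted
harmonic mean, so alpha -> 1 favors precision and alpha -> 0 favors
recall. A sketch of the standard formula consistent with that
description (ROUGE's exact implementation may differ in edge cases):

```python
def rouge_f(recall, precision, alpha=0.5):
    """Alpha-weighted harmonic mean of recall and precision.

    alpha = 1 returns precision, alpha = 0 returns recall, and the
    default alpha = 0.5 is the balanced F1 score.
    """
    if recall == 0.0 and precision == 0.0:
        return 0.0
    return (precision * recall) / (alpha * recall + (1 - alpha) * precision)
```

For example, rouge_f(0.4, 0.6) gives the familiar 2PR/(P+R) = 0.48.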
(6) Pre-1.5 versions of ROUGE used the model average to compute the
overall ROUGE scores when there are multiple references. Starting
from v1.5+, ROUGE provides an option to use the best matching score
among the references as the final score. The model average option is
specified using "-f A" (for Average) and the best model option using
"-f B" (for the Best). The "-f A" option is better when using ROUGE
in summarization evaluations, while the "-f B" option is better when
using ROUGE in machine translation (MT) and definition
question-answering (DQA) evaluations, since in a typical MT or DQA
evaluation scenario matching a single reference translation or
definition answer is sufficient, whereas in summarization evaluation
it is very likely that multiple different but equally good summaries
exist.
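The difference between the two scoring formulas can be sketched in a
few lines (a hypothetical helper for illustration, not ROUGE's code):

```python
def combine_reference_scores(per_reference_scores, formula="A"):
    """Combine one peer's scores against multiple references.

    'A' (model average) averages over all references, which suits
    summarization; 'B' (best model) keeps the best-matching
    reference, which suits MT and definition question-answering.
    """
    if formula == "B":
        return max(per_reference_scores)
    return sum(per_reference_scores) / len(per_reference_scores)
```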
(7) ROUGE v1.5+ also provides the option to specify whether model
unit level averaging (macro-average, i.e. treating every model unit
equally) or token level averaging (micro-average, i.e. treating every
token equally) is used. In summarization evaluation, we suggest model
unit level averaging, and this is the default setting in ROUGE. To
specify another averaging mode, use "-t 0" (default) for model unit
level average, "-t 1" for token level average, and "-t 2" to output
raw token counts in models, peers, and matches.
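The distinction between the "-t" modes can be sketched as follows
(a hypothetical helper working on per-unit counts of matched and
total model tokens; the names are assumptions, not ROUGE's internals):

```python
def average_recall(match_counts, model_counts, mode=0):
    """Recall from per-model-unit counts of matched and total tokens.

    mode 0: macro-average -- score each unit, then average the unit
            scores (every model unit counts equally; the default).
    mode 1: micro-average -- pool the counts over all units (every
            token counts equally).
    mode 2: return the raw pooled counts instead of a score.
    """
    if mode == 2:
        return sum(match_counts), sum(model_counts)
    if mode == 1:
        return sum(match_counts) / sum(model_counts)
    per_unit = [m / t for m, t in zip(match_counts, model_counts) if t > 0]
    return sum(per_unit) / len(per_unit)
```

For example, with match counts [1, 9] against model counts [2, 10],
the macro-average is (0.5 + 0.9) / 2 = 0.7 while the micro-average is
10/12, about 0.83, so short units carry more weight under mode 0.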
(8) ROUGE now offers the option to use a file list as the
configuration file. The input format of the summary files is
specified using the "-z INPUT-FORMAT" option, where INPUT-FORMAT can
be SEE, SPL, ISI, or SIMPLE. When "-z" is specified, ROUGE assumes
that the evaluation configuration file is a file list with one
evaluation instance per line in the following format:
peer_path1 model_path1 model_path2 ... model_pathN
peer_path2 model_path1 model_path2 ... model_pathN
...
peer_pathM model_path1 model_path2 ... model_pathN
The first file path is the peer summary (system summary); it is
followed by a list of model summaries (reference summaries) separated
by white space (spaces or tabs).
(9) When stemming is applied, a new WordNet exception database based
on WordNet 2.0 is used. The new database is included in the data
directory.
<<USAGE>>
(1) Use "-h" option to see a list of options.
Summary:
Usage: ROUGE-1.5.4.pl
[-a (evaluate all systems)]
[-c cf]
[-d (print per evaluation scores)]
[-e ROUGE_EVAL_HOME]
[-h (usage)]
[-b n-bytes|-l n-words]
[-m (use Porter stemmer)]
[-n max-ngram]
[-s (remove stopwords)]
[-r number-of-samples (for resampling)]
[-2 max-gap-length (if < 0 then no gap length limit)]
[-3 <H|HM|HMR|HM1|HMR1|HMR2>]
[-u (include unigram in skip-bigram) default no)]
[-U (same as -u but also compute regular skip-bigram)]
[-w weight (weighting factor for WLCS)]
[-v (verbose)]
[-x (do not calculate ROUGE-L)]
[-f A|B (scoring formula)]
[-p alpha (0 <= alpha <=1)]
[-t 0|1|2 (count by token instead of sentence)]
[-z <SEE|SPL|ISI|SIMPLE>]
<ROUGE-eval-config-file> [<systemID>]
ROUGE-eval-config-file: Specify the evaluation setup. Three files that come with the ROUGE
  evaluation package, i.e. ROUGE-test.xml, verify.xml, and verify-spl.xml, are
  good examples.
systemID: Specify which system in the ROUGE-eval-config-file to evaluate.
  If the '-a' option is used, then all systems are evaluated and users do not need to
  provide this argument.
Default:
When running ROUGE without supplying any options (except -a), the following defaults are used:
(1) ROUGE-L is computed;
(2) 95% confidence interval;
(3) No stemming;
(4) Stopwords are included in the calculations;
(5) ROUGE looks for its data directory first through the ROUGE_EVAL_HOME environment variable. If
it is not set, the current directory is used.
(6) Use model average scoring formula.
(7) Assign equal importance of ROUGE recall and precision in computing ROUGE f-measure, i.e. alpha=0.5.
(8) Compute average ROUGE by averaging sentence (unit) ROUGE scores.
Options:
-2: Compute skip-bigram (ROUGE-S) co-occurrence; also specify the maximum gap length allowed between the two words of a skip-bigram.
-u: Compute skip-bigram as with -2 but also include unigrams, i.e. treat each unigram as a "start-sentence-symbol unigram" pair; -2 has to be specified.
-3: Compute BE score.
H    -> head only scoring (does not apply to Minipar-based BEs).
HM -> head and modifier pair scoring.
HMR -> head, modifier and relation triple scoring.
HM1 -> H and HM scoring (same as HM for Minipar-based BEs).
HMR1 -> HM and HMR scoring (same as HMR for Minipar-based BEs).
HMR2 -> H, HM and HMR scoring (same as HMR for Minipar-based BEs).
-a: Evaluate all systems specified in the ROUGE-eval-config-file.
-c: Specify CF% (0 <= CF <= 100), the confidence interval to compute. The default is 95% (i.e. CF=95).
-d: Print per evaluation average score for each system.
-e: Specify ROUGE_EVAL_HOME directory where the ROUGE data files can be found.
This will overwrite the ROUGE_EVAL_HOME specified in the environment variable.
-f: Select scoring formula: 'A' => model average; 'B' => best model
-h: Print usage information.
-b: Only use the first n bytes in the system/peer summary for the evaluation.
-l: Only use the first n words in the system/peer summary for the evaluation.
-m: Stem both model and system summaries using Porter stemmer before computing various statistics.
-n: Compute ROUGE-N up to max-ngram length.
-p: Relative importance of recall and precision ROUGE scores. Alpha -> 1 favors precision, Alpha -> 0 favors recall.
-s: Remove stopwords in model and system summaries before computing various statistics.
-t: Compute average ROUGE by averaging over the whole test corpus instead of sentences (units).
    0: use sentence as counting unit, 1: use token as counting unit, 2: same as 1 but output raw counts
    instead of precision, recall, and f-measure scores. 2 is useful when the final
    precision, recall, and f-measure scores will be computed later.
-r: Specify the number of sampling points in bootstrap resampling (default is 1000).
    A smaller number will speed up the evaluation but yield a less reliable confidence interval.
-w: Compute ROUGE-W that gives consecutive matches of length L in an LCS a weight of 'L^weight' instead of just 'L' as in LCS.
Typically this is set to 1.2 or other number greater than 1.
-v: Print debugging information for diagnostic purposes.
-x: Do not calculate ROUGE-L.
-z: ROUGE-eval-config-file is a file list with one peer-model pair per line, in the specified format (SEE|SPL|ISI|SIMPLE).
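To make the "-2" and "-u" options concrete: a skip-bigram is an
ordered pair of words from the same sentence, with at most max-gap
words in between. The sketch below is illustrative only (the exact
gap convention and the "-u" start-sentence-symbol handling in ROUGE
may differ):

```python
from collections import Counter

def skip_bigrams(tokens, max_gap=-1):
    """Count ordered pairs (tokens[i], tokens[j]), i < j, allowing
    at most max_gap intervening words; max_gap < 0 means no limit."""
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if max_gap >= 0 and j - i - 1 > max_gap:
                break  # all later j are even farther away
            pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_s_recall(peer_tokens, model_tokens, max_gap=-1):
    """Fraction of the model's skip-bigrams that appear in the peer."""
    peer = skip_bigrams(peer_tokens, max_gap)
    model = skip_bigrams(model_tokens, max_gap)
    overlap = sum(min(count, peer[pair]) for pair, count in model.items())
    total = sum(model.values())
    return overlap / total if total else 0.0
```

For "the cat sat" there are three skip-bigrams without a gap limit,
and two with max_gap=0 (only the adjacent pairs).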
(2) Please read RELEASE-NOTE.txt for information about updates from previous versions.
(3) The following files coming with this package in the "sample-output"
directory are the expected output of the evaluation files in the
"sample-test" directory.
(a) use "data" as ROUGE_EVAL_HOME, compute 95% confidence interval,
compute ROUGE-L (longest common subsequence, default),
compute ROUGE-S* (skip bigram) without gap length limit,
compute also ROUGE-SU* (skip bigram with unigram),
run resampling 1000 times,
compute ROUGE-N (N=1 to 4),
compute ROUGE-W (weight = 1.2), and
compute these ROUGE scores for all systems:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml
(b) Same as (a) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -a ROUGE-test.xml
(c) Same as (b) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -s -a ROUGE-test.xml
(d) Same as (a) but apply a summary length limit of 10 words:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -a ROUGE-test.xml
(e) Same as (d) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -a ROUGE-test.xml
(f) Same as (e) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -s -a ROUGE-test.xml
(g) Same as (a) but apply a summary length limit of 75 bytes:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -a ROUGE-test.xml
(h) Same as (g) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -a ROUGE-test.xml
(i) Same as (h) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -s -a ROUGE-test.xml
Sample DUC 2002 data (1 system and 1 model only per DUC 2002 topic),
its BE and ROUGE evaluation configuration files in XML and file list
format, and the expected output are also included for your reference.
(a) Use DUC2002-BE-F.in.26.lst, a BE file list, as the ROUGE
configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-F.in.26.lst 26
output: DUC2002-BE-F.in.26.lst.out
(b) Use DUC2002-BE-F.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-F.in.26.simple.xml 26
output: DUC2002-BE-F.in.26.simple.out
(c) Use DUC2002-BE-L.in.26.lst, a BE file list, as the ROUGE
configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-L.in.26.lst 26
output: DUC2002-BE-L.in.26.lst.out
(d) Use DUC2002-BE-L.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-L.in.26.simple.xml 26
output: DUC2002-BE-L.in.26.simple.out
(e) Use DUC2002-ROUGE.in.26.spl.lst, a sentence-per-line (SPL) file
list, as the ROUGE configuration file:
command> ROUGE-1.5.4.pl -n 4 -z SPL DUC2002-ROUGE.in.26.spl.lst 26
output: DUC2002-ROUGE.in.26.spl.lst.out
(f) Use DUC2002-ROUGE.in.26.spl.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -n 4 DUC2002-ROUGE.in.26.spl.xml 26
output: DUC2002-ROUGE.in.26.spl.out
<<INSTALLATION>>
(1) You need to have DB_File installed. If the Perl script complains
about database version incompatibility, you can create a new
WordNet-2.0.exc.db by running the buildExceptionDB.pl script in
the "data/WordNet-2.0-Exceptions" subdirectory.
(2) You also need to install XML::DOM from http://www.cpan.org.
Direct link: http://www.cpan.org/modules/by-module/XML/XML-DOM-1.43.tar.gz.
You might need to install extra Perl modules that are required by
XML::DOM.
(3) Set up an environment variable ROUGE_EVAL_HOME that points to the
"data" subdirectory. For example, if your "data" subdirectory is
located at "/usr/local/ROUGE-1.5.4/data", then you can set up
ROUGE_EVAL_HOME as follows:
(a) Using csh or tcsh:
$command_prompt>setenv ROUGE_EVAL_HOME /usr/local/ROUGE-1.5.4/data
(b) Using bash
$command_prompt>ROUGE_EVAL_HOME=/usr/local/ROUGE-1.5.4/data
$command_prompt>export ROUGE_EVAL_HOME
(4) Running ROUGE-1.5.4.pl without supplying any arguments will give
you a description of how to use the ROUGE script.
(5) Please look into the included ROUGE-test.xml, verify.xml, and
verify-spl.xml evaluation configuration files when preparing your
own evaluation setup. A more detailed description will be provided
later. ROUGE-test.xml and verify.xml specify that the input from
systems and references is in SEE (Summary Evaluation Environment)
format (http://www.isi.edu/~cyl/SEE), while verify-spl.xml specifies
that inputs are in sentence-per-line format.
<<DOCUMENTATION>>
(1) Please look into the "docs" directory for more information about
ROUGE.
(2) ROUGE-Note-v1.4.2.pdf explains how ROUGE works. It was published
in the Proceedings of the Workshop on Text Summarization Branches Out
(WAS 2004), Barcelona, Spain, 2004.
(3) NAACL2003.pdf presents the initial idea of applying n-gram
co-occurrence statistics to the automatic evaluation of
summarization. It was published in the Proceedings of the 2003
Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, 2003.
(4) NTCIR2004.pdf discusses the effect of sample size on the
reliability of automatic evaluation results using data in the past
Document Understanding Conference (DUC) as examples. It was
published in Proceedings of the 4th NTCIR Meeting, Tokyo, Japan, 2004.
(5) ACL2004.pdf shows how ROUGE can be applied on automatic evaluation
of machine translation. It was published in Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics
(ACL 2004), Barcelona, Spain, 2004.
(6) COLING2004.pdf proposes a new meta-evaluation framework, ORANGE,
for automatic evaluation of automatic evaluation methods. We showed
that ROUGE-S and ROUGE-L were significantly better than the BLEU,
NIST, WER, and PER automatic MT evaluation methods under the
ORANGE framework. It was published in the Proceedings of the 20th
International Conference on Computational Linguistics (COLING 2004),
Geneva, Switzerland, 2004.
(7) For information about BE, please go to http://www.isi.edu/~cyl/BE.
<<NOTE>>
Thanks for using the ROUGE evaluation package. If you have any
questions or comments, please send them to cyl@isi.edu. I will do my
best to answer your questions.