-
Notifications
You must be signed in to change notification settings - Fork 42
/
Petrarch2.tex
568 lines (479 loc) · 35.4 KB
/
Petrarch2.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
\documentclass[11pt]{article}
\usepackage[letterpaper,portrait, margin=1.5in]{geometry}
\usepackage{ifpdf}
\ifpdf
\usepackage[pdftex]{graphicx} % to include graphics
\usepackage{tikz-qtree}
\usepackage{url}
\pdfcompresslevel=9
\usepackage[pdftex, % sets up hyperref to use pdftex driver
plainpages=false, % allows page i and 1 to exist in the same document
breaklinks=true, % link texts can be broken at the end of line
colorlinks=true,
pdftitle=My Document
pdfauthor=My Good Self
]{hyperref}
\usepackage{thumbpdf}
\else
\usepackage{graphicx} % to include graphics
\usepackage{hyperref} % to simplify the use of \href
\fi
\title{PETRARCH 2 : PETRARCHer}
\author{Clayton Norris}
\date{}
\begin{document}
\maketitle
Petrarch 2 is the fourth generation of a series of Event-Data coders stemming from research by
Philip Schrodt and his collaborators.\footnote{Schrodt and Gerner, 1994; \texttt{http://eventdata.parusanalytics.com/KEDS.history.html}} Each iteration has brought new functionality and usability\footnote{Schrodt, 2012; Beieler et al 2016}, and this is no exception.
Petrarch 2 takes much of the power of the original Petrarch's dictionaries and redirects it
into a faster and smarter core logic. Earlier iterations handled sentences
largely as a list of words, incorporating some syntactic information here and
there. Petrarch 2 (henceforth referred to as Petrarch) now views the sentence entirely on the syntactic level. It
receives the syntactic parse of a sentence from the Stanford CoreNLP
software, and stores this data as a tree structure of linked nodes,
where each node is a Phrase object. Prepositional, noun, and verb phrases each
have their own version of this Phrase class, which deals with the logic
particular to those kinds of phrases. Since this is an event coder, the core of
the logic focuses around the verbs: who is acting, who is being acted on, and
what is happening. The theory behind this new structure and its logic is founded
in Generative Grammar, Information Theory, and Lambda-Calculus Semantics.
\section{Tree structure}
In an attempt to take advantage of all the syntactic information provided to us
by the Stanford Parse, Petrarch implements the sentence coding in such a way that
the syntax tree is apparent in the data structure and the logic. It's a simple tree
structure, with every phrase or word being its own node, with pointers to the
parent node and the set of children nodes\footnote{The relationship between nodes
is frequently described using terms of familial relation (most frequently, ``parent'' and ``child''), or direction (``above'' meaning ``closer to the
root,'' and ``below'' meaning ``closer to word-level.'')}.
The syntactic phrases are stored as nested objects of the ``Phrase"
class, which has three subclasses ``NounPhrase," ``VerbPhrase," and ``PrepPhrase."
If a phrase doesn't fall under one of these categories, it is just kept as a
``Phrase," though if eventually enough reason can be shown to add another type (adjective phrases, for example),
it an be done so easily. Each of these phrase types has several methods unique to it
and its own version of a $get\_meaning()$ method, which is what determines the
coding of a node from the meaning of its children. The simplicity of this recursive approach in comparison
with the expensive list-based pattern matching from previous versions of Petrarch yields a
significant speed improvement,
and the theoretically-grounded tree-based semantic searching takes advantage of
the relationships between nouns and verbs encoded in the syntax tree.
\subsection{Syntax trees}
Let's start with a short linguistics lesson. Every sentence in English (and most languages) is made up of several ``constituents''. A constituent can be a single
word or a whole phrase (which is a constituent of constituents), but the defining
characteristic is that each constituent serves a specific syntactic (i.e. grammatical) role.
Constituents of a sentence are associated hierarchically (hence the phrasal constituent-of-constituents),
and so the most convenient way of visualizing or storing syntactic structure
is in a syntax tree. There is an example of a syntax tree and how it is used in
the parse at the end of this document.
The CoreNLP software on which Petrarch relies for syntactic parsing
uses the Penn Treebank II \footnote{\url{https://www.cis.upenn.edu/~treebank/}} syntax notation, which can differ slightly from canonical
generative grammar labeling, but for our purposes they are equally useful.
Constituents have specific type, depending on their ``head'' and distribution. The cases we care most about in this program are Noun, Verb, and
Prepositional phrases. Heads, in this case, are effectively the single word that a phrase
can be reduced to, both semantically and syntactically. They can be predictably located
by navigating the syntax tree, so Petrarch relies on the idea of phrasal headedness
for much of its speed. A head of a phrase can be formalized as the lowest word-level constituent
to which there is an unbroken path of phrase-level similarly-typed constituents from the phrase's root
node. Basically, to find the head of an NP (Noun Phrase), you follow the path of NP's down the
tree until you find a Noun. If there's ever a choice of which path to take, in English you will
take the rightmost\footnote{Many theories of syntax dictate that any node can have at most two children, which would
never yield a situation where you have several choices, but CoreNLP does not follow this binary branching restriction}.
\section{Flow}
The core logic of the semantic parsing is based on the notion that each node in the
tree has a meaning, and the meaning of a node is a combination of the meanings
of its children. That means that in moving up the tree and going from word-level to
sentence-level, words and meanings get combined until you have one noun phrase
meaning and one verb phrase meaning. The meaning of the verb phrase is what captures
most of the meaning of the sentence, and accounts for all the relevant nouns and
verbs below it.
Because of the recursive nature of the meaning determination, one call to
$get\_meaning()$ from the upper most verb phrase will cause a domino effect that
finds the meaning of the rest of the tree. The flow of each specific phrase type
is determined by its $get\_meaning()$ method. While the logical flow can't be
strictly linearized due to the domino effect of recursion, it can be
abstracted to follow these steps:
\begin{enumerate}
\item Read Stanford CoreNLP parse into memory using Phrase classes.
\item Identify coded actors in noun phrases.
\item Identify the usage of the verbs in the verb phrases based on the
dictionary entries.
\item Identify how verbs interact with their constituent verb, prepositional, and noun
phrases.
\item Identify how verbs interact with the noun phrases in their subject
position.
\item Resolve verb+verb interactions.
\item Return the coding of the uppermost VerbPhrase using CAMEO\footnote{\url{http://eventdata.parusanalytics.com/cameo.dir/CAMEO.09b6.pdf}}\footnote{Schrodt, Gerner and Yilmaz } verb and actor codes, if it satisfies the
conditions specified by the user
\end{enumerate}
\section{Classes and class-specific flow}
\subsection{Noun Phrases}
The NounPhrase class only has one unique method, $check\_date()$, which is what
decides which actor code to choose when the code for a specific person or
country changes over time. This is taken almost directly from the older version
of Petrarch.
The $get\_meaning()$ method in the noun phrase both matches the patterns for the
actors and agents of word-level children, and combines the meanings of
constituent PP, NP, and VP children. The priority is given as $Word Patterns > NP > PP
> VP$, and only when actors and agents are not coded will the node finding the
meaning look at a lower-priority phrase. This means that the noun phrase
``American troops in Iraq'' would only code as USAMIL but ``Troops in Iraq'' would code as
IRQMIL.
\subsubsection{Pronouns}
When Petrarch encounters a pronoun, it looks up the tree for an antecedent within the same
sentence. If the pronoun is relfexive (ends with -self or -selves), Petrarch
looks until it finds a noun, or until it finds a verb phrase with a defined
subject, and assigns that as the meaning. However, if the pronoun is not
reflexive, Petrarch moves up until it finds an S-level phrase, then begins its
search. This is based on the binding rules that pronouns follow in Generative Grammar.
Because of the distinction between the two types of pronouns, Petrarch can
correctly identify that ``itself'' in \textit{A said B hurt itself} refers to B,
while ``it'' in \textit{A said B hurt it} refers to A.
Since Petrarch currently has no concept of number or gender, it sometimes makes mistakes
in instances where the pronoun reference depends on the characteristics of the
nouns in the sentence. Such is the case differentiating \textit{Obama told Hillary
that he should run for President again} from \textit{Obama told Hillary that she should run
for President again}. Both of these would be interpreted by Petrarch as \textit{Obama told Hillary that Hillary should run
for President again}
\subsection{Prepositional Phrases}
The $get\_meaning()$ method of PrepPhrase objects returns the meaning of their
non-preposition constituent. This makes it easier for the actor searching to
pass through prepositional phrases. The preposition is stored as an attribute of
the object and is used in some cases to determine whether or not a certain PrepPhrase should
be considered.
\subsection{Verb Phrases}
Verb phrases drive most of the complex logic of the program. They play the
largest role in all three parts of finding ``who did what to whom'', assigning verb
codes and finding the appropriate noun phrases to fit. The $get\_meaning()$
method of verb phrases relies on three other verb-specific methods:
\subsubsection{$get\_upper()$}
This method is fairly simple. If the VP has an NP specifier \footnote{In Syntax, two phrase-level
siblings are called specifiers. These occur most frequently between VP's,NP's,and PP's. The NP specifier
of a VP is the phrase that contains the grammatical subject of the verb.} with a coded
actor, it returns this. Otherwise, this returns nothing.
\subsubsection{$get\_lower()$}
This is slightly more complicated. In most cases, the verb $get\_lower()$ method
behaves very similarly to the NounPhrase $get\_meaning()$ method. It looks for
some coded actor in noun or prepositional phrases, and returns this.
However, if
a VP has a VP as a child, it returns the meaning of only that phrase, as well as looking
for some sort of negation word. The only
VP's with VP children are modals (could, might, will, etc.) or helping verbs (has, is,
do, etc.)\footnote{If it intuitively seems like a verb would have another verb phrase as a child, but it does not fall into one of these categories,
it most likely takes a sentence as a child, rather than a verb phrase. }. These won't have other NP, or PP children that are relevant to
this verb, but can have ``not'' as a child, so this is where negation is
flagged.
\subsubsection{$get\_code()$}
This is where the program looks to see if the verb follows a pattern
specified in the dictionary. The patterns consist of four parts: \begin{enumerate}
\item Pre-verb noun phrases \item Pre-verb prepositional phrases \footnote{I can't think of a scenario where
this would actually be necessary, but the option is there for consistency's sake.} \item
Post-verb noun phrases \item Post-verb prepositional phrases \end{enumerate}
The process from this point differs for active and passive verbs, but only in
where each search takes place. Active verbs look for (1) at the closest S-level (Sentence node) above the verb, i.e. the
nearest point where there will be an NP specifier. It first finds this level via
the $get\_S()$ method, and looks to see if the head of the NP specifier is part
of a pattern. If a head is found and there is more to the pattern's noun phrase,
then the program begins to look for the
rest of the pattern phrase in the noun phrase from which the head came.
The verbs find (2) in the same place, but in PP's instead of NP's. Since we
almost never see patterns with this format in English, this hasn't been fine
tuned. Then the search begins for (3). This involves checking if any of the
heads of child NP's are part of a pattern. Then Petrarch follows the same process of looking for
longer noun patterns within the phrases of the respective heads.
Part (4) looks at child PP's for matches, then matches nouns within the phrases
if necessary by the same methods it matched child NP's.
For passive verbs, the process for prepositions is exactly the same. However, it
looks for (1) inside the NP's of child PP's with the preposition ``by,'' ``from,'' or
``in,''. If no such phrase is found, the verb is left without a subject. This is simply a specific case to deal with how English deals with the
party that is performing the action in a passive sentence. (3) is found in the
same place that (1) is found in the active sentences.
As an illustration, consider the active and passive forms of a simple sentence that would match the pattern
$$ protesters * monument \hspace{4mm} [145]\hspace{4mm} \# DESTROY $$
\begin{enumerate} \item The protesters destroyed the
monument.
\vspace{10mm}
\begin{tikzpicture}[scale= .75]
\Tree [.\textcolor{red}{S} [.\textcolor{red}{NP} [.\textcolor{red}{NN} {\bf \textcolor{orange}{PROTESTERS}} ] ]
[.VP [.VBD {\bf \emph{ DESTROYED}} ] [.\textcolor{blue}{NP} [.DT {\bf THE} ] [.\textcolor{blue}{NN} {\bf \textcolor{green}{MONUMENT}} ] ]
]
[.. {\bf .} ] ] \end{tikzpicture}
\vspace{10mm}
\item The monument was destroyed by protesters.
%\vspace{10mm}
\begin{tikzpicture}[scale= .75]
\Tree [.\textcolor{blue}{S} [.\textcolor{blue}{NP} [.DT {\bf THE} ] [.\textcolor{blue}{NN} {\bf \textcolor{green}{MONUMENT}} ] ]
[.VP [.VBD {\bf WAS} ] [.VP [.VBN {\bf \emph{DESTROYED} } ] [.\textcolor{red}{PP} [.IN {\bf \textcolor{red}{BY}} ]
[.\textcolor{red}{NP} [.\textcolor{red}{NN} {\bf \textcolor{orange}{PROTESTERS}} ] ]
] ]
]
[.. {\bf .} ] ] \end{tikzpicture}
\vspace{10mm}
\end{enumerate}
Key: \textcolor{red}{(1) Location} \textcolor{orange}{(1) Match}
\textcolor{blue}{(3) Location} \textcolor{green}{(3) Match}
Note that this is only for matching patterns entered in the dictionary, not Source and Target matching.
That happens within the $get\_meaning()$ method, based on the outcomes of $get\_upper()$
and $get\_lower()$.
\subsubsection{$get\_meaning()$}
The $get\_meaning()$ method of the Verb class first combines the values of the
previous three methods in one of a number of ways, depending on what those
methods find. In most cases, this method returns a list of events that are described by the
subtree of which the verb phrase is the root. Sometimes, however, if there isn't
enough verb information available, it will simply return the list of actors described by
the subtree. In deciding what to do, the verb has several things to consider:
\begin{itemize}
\item Do I have a source actor? (from $get\_upper()$)
\item Do I have a code? (from $get\_code()$)
\item Do I have a S-level or VP child? (from $get\_lower()$)
\item If so, does that child code an event?
\item If so, how does the event that I code relate the event that it codes?
\end{itemize}
\subsubsection{$match\_transform()$}
This method accounts for the fact that ontologies don't always
line up exactly with how words work. For example, there are times when you get
a sentence like ``A says it will attack B,'' but what you're looking to code is ``A threatens B.''
$Match\_transform()$ reads transformations from the Verb dictionary and checks
to see if any of the events match the transformation format. If that's the
case, then the event is converted into a simple (S,T,V) format. The entry in
the dictionary for that example would be $$ a \hspace{3mm}(a \hspace{3mm} b \hspace{3mm} WILL\_ATTACK)\hspace{3mm} SAY = a\hspace{3mm} b\hspace{3mm} 138 $$
which is basically post-fix notation. This is described in more detail in the dictionary
specifications.
\subsubsection{$is\_valid()$}
This method is used to catch a consistent mistake that happens in CoreNLP when a
past participle is used as an adjective in front of a noun, but is instead coded
as a verb.
\subsection{Event extraction}
One call to the $get\_meaning()$ method of the uppermost VP will cause the rest
of the tree to be parsed, and return the event coding of that VP, which is the
event coding of the whole tree. Since not all events of the sentence at this
point might not be complete, the Sentence object which contains the Phrase tree
will call $get\_meaning()$ in its $get\_events()$ method, and check to see if
the event is satisfactory. If the event that is
returned by $get\_meaning()$ is a complete coding (has all three parts), it is assigned to the
sentence and the process is complete.
\section{Dictionaries}
Petrarch uses the same Actor, Agent, Discard, and Issue dictionaries as it
always has, but the newest version has brought changes to the format and
structure of the Verb Dictionary. The sets of synonymous nouns (synsets) remain
the same, as well as how the base verbs are organized and stored. The two
biggest differences are the transformations, which $match\_transform()$ looks at,
and the patterns for matching phrases.
\subsection{Patterns}
The patterns in the dictionaries should now follow a few simple rules:
\begin{enumerate}
\item The intended pattern should contain exactly one verb: the verb being
matched
\item The pattern entries should be minimal, i.e. the smallest
amount of information necessary to capture the intended phrases. This is just
to keep the dictionary small but effective.
\item The pattern has up to four parts: Pre-verb nouns, Pre-verb Prepositions,
Post-verb nouns, Post-verb prepositions.
\end{enumerate}
The patterns themselves also contain additional annotative symbols to provide
the parser with more syntactic information:
\begin{itemize}
\item Unmarked words are nouns or particles. These nouns are phrase heads.
\item \{Bracketed phrases\} are for specifying things that can't be covered by
a single noun, e.g. (necessary) adjectives, complex nouns, etc.. The last word in the brackets should be the head.
\item Prepositional phrases are (in parentheses). The first word is the
preposition, the rest is considered as nouns are.
\item Note that these prepositional markers can be combined (with \{Noun Phrases\})
\end{itemize}
\subsection{Verb+Verb interaction}
\subsubsection{Combinations}
Verbs can interact with each other in one of two ways. The first is what we
call a combination. This is what happens when the meaning of the two verbs
together is literally the meanings of the two verbs individually. These occur
mostly frequently to specify the subcode of somewhat vague or high-level
CAMEO categories, like \textit{appeal, intend, refuse} or \textit{demand}.
This is handled using an internal translation of CAMEO codes into a system that
expands the hierarchy of CAMEO beyond the basic top-level/subcode classification
system. This allows for more controllable processing of verb combinations that
are inherent in CAMEO. So rather than a system where``Intend [030] + Help [070] =Intend to help[033],''
we get ``Intend [3000] + Help [0040] =Intend to help[3040].'' The full
conversion schema can be found in the \textit{utilities.py} file under
$convert\_code()$.
Codes are converted and stored as four-digit hex (base 16) codes.
The general principle behind it is in the table below. The
first three columns encompass the top-level codes, the fourth position is a specifier. For the
most part they follow the descriptions here, but some top-level codes have unique subclasses,
which don't follow these specifically. Notice that not all
combinations refer to CAMEO codes. This is intentional, and means that if we
wanted to code things beyond CAMEO we could. The strength of this is
predictability and the possibility of semantic addition. When returning the
event code, Petrarch converts back to CAMEO for the sake of
reverse compatibility.
\vspace{12mm}
\begin{tabular}{llll}
0 & 0 & 0 & 0 \\
1 Say & 1 Reduce & 1 Meet & 1 Leadership \\
2 Appeal & 2 Yield & 2 Settle & 2 Policy\\
3 Intend &~ & 3 Mediate & 3 Rights\\
4 Demand &~ & 4 Aid & 4 Regime\\
5 Protest &~ & 5 Expel & 5 Econ\\
6 Threaten &~ & 6 Pol. Change & 6 Military\\
7 Disapprove &~ & 7 Mat. Coop & 7 Humanitarian\\
8 Posture &~ & 8 Dip. Coop & 8 Judicial\\
9 Coerce &~ & 9 Assault & 9 Peacekeeping\\
A Investigate &~ & A Fight & A Intelligence\\
B Consult &~ & B Mass violence & B Admin. Sanctions\\
~ &~ & ~ & C Dissent\\
~ &~ & ~ & D Release\\
~ &~& ~ & E Int'l Involvement\\
~ &~& ~ & F De-escalation\\
\end{tabular}
\vspace{12mm}
The one class not present here is 120, which classifies rejections and refusal
to cooperate. Because the action of ``refusing to do X'' is so often the same as
``not doing X,'' these are simply categorized as the value of their
cooperative version minus 0xFFFF. So, since ``provide aid'' is 0040, ``refuse to provide
aid'' is 0040-FFFF = -FFBF. This is functionally equivalent to the negative,
since there is no positive FFFF code, the subtraction always yields a negative value. This allows us
to convert negations such as ``WILL NOT HELP'' = $0 - FFFF + 0040 = -FFBF $ = ``REFUSE TO HELP.''
\subsubsection{Transformations}
Sometimes this is insufficient, like when the meaning of the verb interaction
depends also on the relationships between the nouns that are acting and being
acted upon. The difference between ``A says B attacked C'' and ``A says A
attacked B'' is such a case. The first is equivalent to ``A blames B for an
attack,'' and the second ``A takes credit for an attack on B.'' Since this
depends on the nouns involved, we must consider them in the transformation
category and not the combination category. The specification on how these
are formatted is in the documentation.
\section{Example}
Consider the sentence
\begin{itemize} \item``Israel said a mortar bomb was launched at it from the
Gaza strip on Tuesday''\end{itemize}
Petrarch would code this sentence as \textit{ISR PSEGZA 112} with the following tree:
\footnote{For those unfamiliar with CAMEO verb codes, 190 is an organized attack, while 112 is an accusation of aggression}\\
\begin{tikzpicture}[scale= .5]
\Tree [.S [.NP [.\color{red}{+ISR} [.NNP {\bf ISRAEL} ] ] ]
[.VP [.{\it \color{red}{ISR} \color{blue}{PSEGZA} \color{black}{ 112}} [.VBD {\bf SAID} ] [.SBAR [.S [.NP [.{} [.DT {\bf A} ] [.JJ {\bf MORTAR} ] [.NN {\bf BOMB} ] ] ]
[.VP [.{\it \color{blue}{PSEGZA} \color{orange}{ISR} \color{black}{190}} [.VBD {\bf WAS} ] [.VP [.{\it \color{blue}{PSEGZA} \color{orange}{ISR} \color{black}{190}} [.VBN {\bf LAUNCHED} ] [.PP [.IN {\bf AT} ] [.NP [.\color{orange}{+ISR} [.PRP {\bf IT} ] ] ]
] [.PP [.IN {\bf FROM} ] [.NP [.\color{blue}{+PSEGZA} [.NP [.{\color{blue}{+PSEGZA}} [.DT {\bf THE} ] [.NNP {\bf GAZA} ] [.NNP {\bf STRIP} ] ] ]
[.PP [.IN {\bf ON} ] [.NP [.NNP {\bf TUESDAY} ] ] ]
] ]
] ] ]
] ] ]
] ]
]
[.. {\bf .} ] ] \end{tikzpicture}
The color coding shows where the actor codes come from. The significant steps taken in this parse involve the verbs ``said'' and
``flaunched,'' and the pronoun ``it.'' The pronoun coreference follows the
non-reflexive matching process described above. When Petrarch is analyzing
``launched,'' it \begin{enumerate}
\item Identifies the verb as passive
\item Finds the patterns for this verb
\item Finds the target under the prepositional phrase with ``it''
\item Identifies the antecedent of ``it'' to be ``ISR''
\item Finds the source under the prepositional phrase with ``from''
\end{enumerate}
Then, the analysis of ``said'' follows the process:
\begin{enumerate}
\item Finds the lower event (PSEGZA ISR 190)
\item Identifies the subject of ``said'' as ISR
\item Matches this with the dictionary-specified verb transformation\\ \textit{a (b . ATTACK) SAY}
= a b 112
\item Transforms this into \textit{ISR PSEGZA 112}.
\end{enumerate}
\section{New options for Identifying actors and verb phrases\\{\normalsize Philip Schrodt, Parus Analytics LLC (\texttt{schrodt735@gmail.com}) \\Last update: 28 June 2016}}
Since Clayton's work in Summer-2015, I've added some additional new options in conjunction with various projects. As PETRARCH still doesn't really have a manual, I'll use this document to, well, document those.
\subsection{\texttt{--nullverbs} and \texttt{--nullactors}: Identifying verb and noun phrases which are not in the dictionaries}
These are complementary command line options intended for use in dictionary development:
\begin{itemize}
\item \texttt{--nullverbs} (or \texttt{-nv}) produces a file of verb phrases which have a valid source and target but which are not in the dictionary. This can also be invoked using the option \texttt{null\_verbs = True} in the \textit{config.ini} file.
\item \texttt{--nullactors} (or \texttt{-na}) produces a file of noun phrases which are associated with a codable verb phrase but which are not in the dictionary. This can also be invoked using the option \texttt{null\_actors = True} and setting \texttt{new\_actor\_length} \textgreater 0 in the \textit{config.ini} file.
\end{itemize}
In other words, \texttt{--nullverbs} shows unrecorded actions that are occurring between known actors, and \texttt{--nullactors} shows unrecorded actors that are engaging in known behaviors. With both options, no events are generated and instead the output file is a series of JSON records which can be read as Python dictionaries: examples are given below.
These options come \textit{before} the \texttt{batch} or text{parse} command. Only one of the two options can be used in a given run.\footnote{otherwise you are simply producing a list of all sentences that don't generate a complete event, which can be done just by removing the sentences in the event list from the complete corpus.}
\subsubsection{\texttt{--nullverbs}}
\noindent \textbf{Command example: }
\texttt{python petrarch2.py -nv batch -i nullverb.test.txt -o test\_output.txt}
\bigskip
\noindent \textbf{Output example:}
\begin{verbatim}
{
"id": "11077096-26e7-4f4b-8636-eca78311f2d0",
"sentence": "TOKYO HAS ASKED THE CHINESE AUTHORITIES TO CLARIFY THOSE ISSUES
BUT BEIJING HAS NOT YET STRAIGHTENED THEM OUT , HE SAID , ADDING
THAT JAPAN WOULD CONTINUE TO TALK TO CHINA ABOUT THIS .",
"phrase": "HAS NOT YET STRAIGHTENED THEM OUT ",
"parse": "(VP (VBZ HAS) (RB NOT) (ADVP (RB YET)) (VP (VBD STRAIGHTENED)
(NP (PRP THEM)) (PRT (RP OUT)))",
"source": "CHN", "target": "JPN"
}
\end{verbatim}
\noindent \textbf{Implementation notes:}
\begin{itemize}
\setlength{\itemsep}{4pt}
\item In a typical news story, verb phrases are deeply nested, so it is not uncommon for the phrase to extend to the end of the story. In order to identify what a human coder would consider the relevant verb and information immediately following it, the phrase is truncated at the first \texttt{`(S'} following the \texttt{`(VP'} which marks the beginning of the phrase. This rule can be modified in the routine \texttt{PETRWriter.write\_nullverbs()}: the entire parse string is available to this routine and truncation is only done during the output procedure.
\item \texttt{"phrase"} in the output gives this phrase without any markup; \texttt{"parse"} shows the phrase with the original Treebank markup
\item \texttt{"source"} and \texttt{"target"} show the codes for the source and target associated with this phrase; these can be agent codes.
\end{itemize}
\subsubsection{\texttt{--nullactors}}
Command example: \\ \texttt{python petrarch2.py -na 8 batch -i nullactor.test.txt -o test\_output.txt}
\bigskip
\noindent Output example:
\begin{verbatim}
{
"id": "11077096-26e7-4f4b-8636-eca78311f2d0",
"sentence": " Tokyo has asked the Chinese authorities to clarify those
issues but Beijing has not yet straightened them out, he said,
adding that Japan would continue to talk to China about this . ",
"source": "Tokyo [JPN]",
"target": "this",
"evtcode": "010",
"evttext": "would ... talk",
}
{
"id": "NULL-11077096-26e7-4f4b-8636-eca78311f2d0",
"sentence": "Winterfell has asked the Lannister families to clarify those
issues but Beijing has not yet straightened them out, he said,
adding that Japan would continue to talk to China about this . ",
"source": "Winterfell",
"target": "the Lannister families",
"evtcode": "020",
"evttext": "has asked",
}
\end{verbatim}
\noindent \textbf{Implementation notes:}
\begin{itemize}
\setlength{\itemsep}{4pt}
\item If set in the command line, the required integer following the command sets the \texttt{new\_actor\_length} parameter---the maximum number of words that a new actor can have---and overrides any settings in the \texttt{config.ini} file (if set in the \texttt{config.ini} file, that value will be used). Large values of this will give add a lot of extended noun phrases that probably aren't named-entities into the output; small values run the risk that a named entity preceded by several adjectives---e.g. \textit{``The beleaguered Japanese Prime Minister Shinzo Abe''}---will be missed. So, adjust this value based on your source texts, experience, and post-filtering programs: values of 4 to 8 will usually work for typical English-language names.
\item If a source or target is in the dictionary, the phrase will be followed by the code in brackets \texttt{[\ldots]}
\item \texttt{"evtcode"} is the CAMEO code for the verb phrase; \texttt{"evttext"} is the text which generated this code.
\item The actors not in the dictionary are assigned temporary codes of the form \texttt{"*i*"} where $i$ is an integer. These will show up in agent code such as \texttt{"*4*GOV"} (instead of \texttt{"- - -GOV"}) and in the source and target markers in the verb text, e.g. \texttt{"<*2*> ... contradicted by <*3*>"}. See Section \ref{sssec:writeeventtext} on the construction of event texts.
\end{itemize}
\subsection{Extracting text used in coding: \texttt{write\_actor\_text}, \texttt{write\_actor\_text} and \texttt{write\_event\_text} configuration options}
These three options in the configure file are used to try to extract the text which generated an event code: they are not flawless\footnote{Yeah, right, unlike the rest of the system. But in particular, compound nouns will often generate problems: this is on the GitHub Issues list.} but work most of the time.
In the output, anything in angle-brackets \texttt{<\ldots>} corresponds to something in the matched text that couldn't be found in the original text. This is usually due to one or more of the following:
\begin{itemize}
\setlength{\itemsep}{4pt}
\item A verb pattern which contained a synonym set, in which case the synonym set name is used, e.g. \texttt{mobilized <\&MILITARY>}
\item A verb pattern which contained a source (\$) or target (+) token, in which case the code for the actor is used , e.g. \texttt{<---REB> ... overwhelmed <---COP>}
\item The system couldn't find text that was, for some reason, duplicated in the system---there are still a few bugs here---in which case it puts what it is looking for in the brackets
\item Situations where Stanford Core NLP has taken a word apart, e.g. ``\texttt{didn't}'' converts to \texttt{DID N'T}
\end{itemize}
Ellipses \texttt{\ldots} indicate words were skipped over in a multi-word phrase.
\bigskip
The texts are added to the end of the event record as tab-delimited fields in the order \texttt{[actor\_text, event\_text, actor\_root]} : the code for this can be changed in \texttt{PETRWriter.write\_events()}.
\subsubsection{\texttt{write\_actor\_root:}}
If True, the event record will include the text of the actor root: The root is the text at the head of the actor synonym set in the dictionary. Default is False
\subsubsection{\texttt{write\_actor\_text:}}
If True, the event record will include include the complete text of the noun phrase that was used to identify the actor. Default is False
\subsubsection{\texttt{write\_event\_text:}} \label{sssec:writeeventtext}
If True, the event record will include include the complete text of the verb phrase that was used to identify the event: this is supposed to be the text that was used to generate the event and that worked most of the time. Items in \texttt{<\ldots>} indicate actors or synonyms sets in the verb patterns. Default is False
\section{Additional updates on the project\\{\normalsize Philip Schrodt, Parus Analytics LLC (\texttt{schrodt735@gmail.com}) \\Last update: 28 June 2016}}
PETRARCH and various ancillary programs are now supported in part by the U.S. National Science Foundation Resource Implementations for Data Intensive Research in the Social Behavioral and Economic Sciences (RIDIR) Program through a grant titled ``
Modernizing Political Event Data for Big Data Social Science Research.'' The lead institution is the University of Texas at Dallas (PI: Patrick Brandt) with additional participation by investigators at the University of Oklahoma, University of Minnesota, University of Delaware, and John Jay College, with roughly equal participation by political science and computer science departments.
The kickoff for the project was early December 2015 and the grant will provide three to four years of funding. The project still doesn't have a name, logo or t-shirts, or even a web site, but we will provide this information as it becomes available. At present, the GitHub site for the project remains \texttt{https://github.com/openeventdata}
\section{Works Cited}
\begin{thebibliography}{9}
\bibitem{lamport94}
Beieler, John , Patrick Brandt, Andrew Halterman, Philip Schrodt and Erin Simpson. 2016. Generating Political Event Data in Near Real Time: Opportunities and Challenges.'' In Michael Alvarez, ed. \textit{Computational Social Science: Discovery and Prediction.} Cambridge: Cambridge University Press.
\bibitem{lamport94}
Schrodt, Philip A. and Deborah J. Gerner. 1994. Validity Assessment of a Machine-Coded Event Data Set for the Middle East, 1982-1992. \textit{American Journal of Political Science} 38:825-854.
\bibitem{lamport94}
Schrodt, Philip A. 2012. "Precedents, Progress and Prospects in Political Event Data." \textit{International Interactions} 38(5):546-569.
\bibitem{lamport94}
Schrodt, Philip A., Deborah J. Gerner and and \"{O}m\"{u}r Yilmaz. 2008. Conflict and Mediation Event Observations (CAMEO): An Event Data Framework for a Post Cold War World. In Jacob Bercovitch and Scott Gartner. \textit{International Conflict Mediation: New Approaches and Findings}. New York: Routledge.
\end{thebibliography}
\end{document}