Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect PENMAN encoding oof triple list with duplicate :instance triple #113

Open
jheinecke opened this issue Mar 29, 2023 · 6 comments
Open
Labels
enhancement New feature or request

Comments

@jheinecke
Copy link

I found an (possible) issue with doubled instances (e.g. twice n / name instead a simple n in the second case) in a penman notation. Parsing the following graph
with

g = penman.decode(amr)
for x in g.triples:
    print(x)
(a2 / and
         :op1 (c2 / country
                 :name (n / name))
         :op2 (c3 / country
                 :name (n / name))
)

returns

('a2', ':instance', 'and')
('a2', ':op1', 'c2')
('c2', ':instance', 'country')
('c2', ':name', 'n')
('n', ':instance', 'name')
('a2', ':op2', 'c3')
('c3', ':instance', 'country')
('c3', ':name', 'n')
('n', ':instance', 'name')

with ('n', ':instance', 'name') twice
If I want to re-encode this triples into penman with pm = penman.encode(penman.Graph(tr)); print(pm) I get the incorrect penman

(a2 / and
    :op1 (c2 / country
             :name (n / name
                      / name))

I wonder where the bug is. I agree that the initial penman is at least suboptimal since n / name occurs twice ant should have only n in the second place. since n's concept is name in both cases that sould not be an fatal error.
Or is it the fact that ('n', ':instance', 'name') occurs twice which creates an incorrect penman.
So if the initial penman graph is acceptable, than there is clearly a problem in the decode/encode methods ?

@goodmami
Copy link
Owner

Thanks for the report! I agree that the output is bad. However, the input is not a valid PENMAN string, either. I'm not certain if the reuse of the n variable is an accident for two named nodes (e.g., where the sentence it represents is something like "Argentina and Chile"), in which case the second one should be n2, or if it is meant to be a distributed node (a feature which isn't really supported in AMR), in which case the second / name is at least redundant but probably just incorrect.

So I'm not fully convinced that there is a bug here that should be fixed, because the input is already bad. Do you have more context about where this AMR came from?

@jheinecke
Copy link
Author

jheinecke commented Mar 30, 2023

The incorrect initial Penman is an annotation error. It should have been n2 / name. Usually I use your lib to check whether the Penman is OK, so I thought it would catch it :-) . In my annotation tool (public very soon) I transform penman into triples to display and back to penman to store them. I didn't check for double triples (done now) so I came across this error. But some other tools may create duplicate triples and in this case penman.encode() produces an invalid penman. So maybe we can correct this?
BTW, I wonder whether in AMR (or indeed Penman) reused variables can or cannot have a second (identical) instantiation.
So for
the boy wants to believe the normal Penman AMR would be

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (b2 / believe-01
             :ARG0 b))

but is the following really invalid? I couldn't find any documentation on this)

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (b2 / believe-01
             :ARG0 (b / boy) ))

@goodmami
Copy link
Owner

Ok, thanks for explaining. Yes, I now agree that it would be useful for Penman to be able to detect such bad graphs, and the current methods are insufficient (a warning is currently issued in the above case, and you could cause that warning to become an error, but that is not a good solution). Do you have a proposal of how this check might be implemented?

BTW regarding your question of duplicate instantiations, my opinion is that it is incorrect. Information should not be repeated, even if the information is consistent. There is the idea of distributed nodes where new information can be added at a later point, so, e.g., if the sentence were instead the young boy wants to believe, it could be presented as:

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (b2 / believe-01
             :ARG0 (b :mod (y / young))))

However I don't think I've seen any AMR example demonstrating this, so I assume it is invalid (pre-AMR PENMAN may have allowed it, but I can't turn up the reference just now).

In any case, for our current purposes we need an idea of how the check is implemented and where it is exposed in the API.

@goodmami goodmami added the enhancement New feature or request label Mar 31, 2023
@jheinecke
Copy link
Author

Hi, I agree that we should consider a second instantiation of the same variable (even with the same concept) is an error.

for the duplicate triple, I do not see any warning (with version 1.2.2) , so

tr = [ ('n', ':instance', 'name'),
          ('n', ':instance', 'name'),]
pm = penman.encode(penman.Graph(tr))
print(pm)

outputs the incorrect

(n / name
   / name)

However I see a warning when decoding (the incorrect)

 (a2 / and
               :op1 (c2 / country
                        :name (n / name))
               :op2 (c3 / country
                        :name (n / name))
g = penman.decode(amr)

which gives ignoring epigraph data for duplicate triple: ('n', ':instance', 'name'). However it does not ignore the triple in the triple (it is included in g.triples and in g.instances()`). Maybe I miss something here and should use something different?

So if penman.encode() gave the same warning (and ignoring the duplicate would be fine

@goodmami
Copy link
Owner

goodmami commented Apr 1, 2023

@jheinecke yes, I meant the warning appears when decoding the erroneous AMR, not when encoding it. The epigraph is metadata about the graph to help it be configured into the same tree-like structure that was read in. The metadata is stored as a mapping of triples to epigraphical markers, so duplicate triples would not work with such a mapping. The triples themselves, however, are stored as a list, so it is possible to have duplicate triples. Upon configuring the graph to a tree (or encoding it to a string), the duplicates will be inserted at the same place as it assumes singular instantiation of nodes, leading to the behavior you see. It is somewhat convenient to allow the graph to store the duplicate triples, however, because then we can use that data in scripts that, e.g., detect and/or clean up such errors.

There is actually no error or warning when parsing the string into the tree, it's only when interpreting the tree into a graph that we get a warning:

>>> import penman
>>> t = penman.parse('(a / alpha :ARG0 (b / beta) :ARG1 (b / beta))')
>>> t
Tree(('a', [('/', 'alpha'), (':ARG0', ('b', [('/', 'beta')])), (':ARG1', ('b', [('/', 'beta')]))]))
>>> g = penman.interpret(t)
ignoring epigraph data for duplicate triple: ('b', ':instance', 'beta')

It would be tempting, then, to implement the check for duplicate edges on the tree structure, but the problem is that the tree is agnostic about inverted edges, so it would miss things like the following, where :ARG0(a, b) is duplicated:

(a / alpha :ARG0 (b / beta :ARG0-of a))

I think this check would therefore be good to add to penman.model.Model.errors(), which works on the graph (link to code).

Something like this:

+           triple_counts = Counter(graph.triples)
            g: Dict[Variable, List[BasicTriple]] = {}
            for triple in graph.triples:
+               if triple_counts[triple] > 1:
+                   err[triple].append('duplicated')

@jheinecke
Copy link
Author

I admit not to have delved too much in the code :-), but that looks OK to me if the user can retrieve this error somehow before the encoding the triple list. I also thought of case where the triple list comes from somewhere else, so no epigraph or whatsoever is present

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants